| java.lang.Object it.unimi.dsi.mg4j.tool.Combine
All known Subclasses: it.unimi.dsi.mg4j.tool.Concatenate, it.unimi.dsi.mg4j.tool.Paste, it.unimi.dsi.mg4j.tool.Merge
Combine | abstract public class Combine (Code) | | Combines several indices.
Indices may be combined in several different ways. This abstract class
contains code that is common to classes such as
it.unimi.dsi.mg4j.tool.Merge or
it.unimi.dsi.mg4j.tool.Concatenate : essentially, command line parsing,
index opening, and term list fusion are taken care of. Then, the template method
Combine.combine(int) must write into
Combine.indexWriter the combined inverted
list, returning the resulting frequency.
Note that by combining a single index into a new one you can recompress an index
with different compression parameters (which includes the possibility of eliminating
positions or counts).
The subclasses of this class must implement
Combine.combine(int) so that indices
with different sets of features are combined keeping the largest set of features requested
by the user. For instance, combining an index with positions and an index with counts, but
no positions, should generate an index with counts but no positions.
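The feature rule above can be read as a set intersection: a feature survives only if every input index provides it and the user asked for it. The following is a minimal sketch of that reading, not MG4J code. The `Feature` enum, the `FeatureSets` class and the `combined` method are hypothetical names introduced only for illustration.

```java
import java.util.EnumSet;
import java.util.List;

// Hypothetical feature flags; illustrative names, not MG4J types.
enum Feature { COUNTS, POSITIONS }

final class FeatureSets {
    /** Returns the requested features that are available in every input index. */
    static EnumSet<Feature> combined(List<EnumSet<Feature>> inputs, EnumSet<Feature> requested) {
        EnumSet<Feature> result = EnumSet.copyOf(requested);
        // A feature missing from any input cannot be written to the output.
        for (EnumSet<Feature> input : inputs) result.retainAll(input);
        return result;
    }

    public static void main(String[] args) {
        EnumSet<Feature> withPositions = EnumSet.of(Feature.COUNTS, Feature.POSITIONS);
        EnumSet<Feature> countsOnly = EnumSet.of(Feature.COUNTS);
        // The user requests everything; only counts are common to both inputs.
        System.out.println(combined(List.of(withPositions, countsOnly), EnumSet.allOf(Feature.class)));
        // prints [COUNTS]
    }
}
```

Requesting fewer features than the inputs provide shrinks the result further, which matches the recompression use case mentioned earlier (for instance, dropping positions from a single index).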
Warning: a combination requires opening three files per input index,
plus a few more files for the output index. If the combination process is interrupted by
an exception claiming that there are too many open files, check how to increase the
number of files you can open (usually, for instance on UN*X, there is a global and a per-process limit,
so be sure to set both).
Read-once indices, readers, and distributed index combination
If the indices and readers involved in the
combination are read-once (i.e., opening an index and reading its contents sequentially once
causes each file composing the index to be read exactly once),
then
it.unimi.dsi.mg4j.tool.Combine implementations should also be read-once (
it.unimi.dsi.mg4j.tool.Concatenate ,
it.unimi.dsi.mg4j.tool.Merge and
it.unimi.dsi.mg4j.tool.Paste are).
This means, in particular, that index combination can be performed from pipes, which in
turn can be filled, for instance, with data coming from the network. In other words, although this
class is theoretically based on a number of indices existing on a local disk, those indices can be
substituted with suitable pipes filled with remote data without affecting the combination process.
For instance, the following bash code creates three sets of pipes:
    for i in 0 1 2; do
        for e in frequencies globcounts index offsets properties sizes terms; do
            mkfifo pipe$i.$e
        done
    done
Each pipe should then be filled with suitable data, for instance obtained from the network (assuming
you have indices index0, index1 and index2 on example.com):
    for i in 0 1 2; do
        for e in frequencies globcounts index offsets properties sizes terms; do
            (ssh -x example.com cat index$i.$e >pipe$i.$e &)
        done
    done
Now all pipes will be filled with data from the corresponding remote files, and
combining the indices pipe0, pipe1 and pipe2
will give the same result as combining index0, index1 and index2
on the remote system.
author: Sebastiano Vigna since: 1.0 |
Inner Class: final protected static class GammaCodedIntIterator extends AbstractIntIterator implements Closeable | |
Constructor Summary | |
public | Combine(String outputBasename, String[] inputBasename, boolean metadataOnly, int bufferSize, Map<Component, Coding> writerFlags, boolean interleaved, boolean skips, int quantum, int height, int skipBufferSize, long logInterval) |
DEFAULT_BUFFER_SIZE | final public static int DEFAULT_BUFFER_SIZE(Code) | | The default buffer size.
|
frequency | final protected int[] frequency(Code) | | For each index, the frequency of the current term (given that it is present).
|
indexWriter | protected IndexWriter indexWriter(Code) | | The index writer for the merged index.
|
inputBasename | final protected String[] inputBasename(Code) | | The array of input basenames.
|
maxCount | protected int maxCount(Code) | | The maximum count in the merged index.
|
numIndices | final protected int numIndices(Code) | | The number of indices to be merged.
|
numberOfDocuments | final protected int numberOfDocuments(Code) | | The overall number of documents.
|
numberOfOccurrences | protected long numberOfOccurrences(Code) | | The overall number of occurrences.
|
position | protected int[] position(Code) | | A cache for positions.
|
size | protected int[] size(Code) | | The size of each document.
|
termQueue | protected ObjectHeapSemiIndirectPriorityQueue<MutableString> termQueue(Code) | | The queue containing terms.
|
usedIndex | protected int[] usedIndex(Code) | | An array partially filled with the indices (as offsets in
Combine.index ) participating in the merge process for the current term.
|
Combine | public Combine(String outputBasename, String[] inputBasename, boolean metadataOnly, int bufferSize, Map<Component, Coding> writerFlags, boolean interleaved, boolean skips, int quantum, int height, int skipBufferSize, long logInterval) throws IOException, ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException(Code) | | |
combine | abstract protected int combine(int numUsedIndices) throws IOException(Code) | | Combines several indices.
When this method is called, exactly numUsedIndices entries
of
Combine.usedIndex contain, in increasing order, the indices containing
inverted lists for the current term. Implementations of this method must
combine the inverted lists, save the total global count for the current
term, and return the resulting frequency.
Parameters: numUsedIndices - the number of valid entries in Combine.usedIndex. Returns: the frequency of the combined lists. |
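To illustrate the contract of combine(int), here is a toy, in-memory sketch of the template-method structure: a driver fuses sorted term lists with a priority queue, fills `usedIndex` in increasing order, and delegates to an abstract `combine`. Everything in it (`ToyCombine`, `ToyConcatenate`, the array-based "indices", the `result` list) is hypothetical and is not how MG4J stores or reads indices. The concatenation-style subclass assumes the inputs cover disjoint document sets, so frequencies simply add up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy stand-in for the template-method structure; not MG4J code. */
abstract class ToyCombine {
    protected final String[][] terms;        // terms[i]: sorted terms of toy index i
    protected final int[][] termFrequency;   // termFrequency[i][k]: frequency of terms[i][k]
    protected final int[] frequency;         // per index, frequency of the current term
    protected final int[] usedIndex;         // indices containing the current term
    final List<String> result = new ArrayList<>();

    ToyCombine(String[][] terms, int[][] termFrequency) {
        this.terms = terms;
        this.termFrequency = termFrequency;
        this.frequency = new int[terms.length];
        this.usedIndex = new int[terms.length];
    }

    /** Template method: combines the lists of the current term, returns the frequency. */
    protected abstract int combine(int numUsedIndices);

    final void run() {
        final int[] cursor = new int[terms.length];
        // Index numbers ordered by their current term; ties are broken by index
        // number, so usedIndex ends up filled in increasing order.
        PriorityQueue<Integer> queue = new PriorityQueue<>((a, b) -> {
            int c = terms[a][cursor[a]].compareTo(terms[b][cursor[b]]);
            return c != 0 ? c : Integer.compare(a, b);
        });
        for (int i = 0; i < terms.length; i++) if (terms[i].length > 0) queue.add(i);

        while (!queue.isEmpty()) {
            String term = terms[queue.peek()][cursor[queue.peek()]];
            int numUsedIndices = 0;
            // Collect every index whose current term equals the smallest one.
            while (!queue.isEmpty() && terms[queue.peek()][cursor[queue.peek()]].equals(term)) {
                int i = queue.poll();
                usedIndex[numUsedIndices++] = i;
                frequency[i] = termFrequency[i][cursor[i]];
            }
            result.add(term + "=" + combine(numUsedIndices));
            // Advance the indices just used and put them back in the queue.
            for (int k = 0; k < numUsedIndices; k++) {
                int i = usedIndex[k];
                if (++cursor[i] < terms[i].length) queue.add(i);
            }
        }
    }
}

/** Concatenation-style combination, assuming disjoint document sets. */
final class ToyConcatenate extends ToyCombine {
    ToyConcatenate(String[][] terms, int[][] termFrequency) { super(terms, termFrequency); }

    @Override
    protected int combine(int numUsedIndices) {
        int total = 0;
        for (int k = 0; k < numUsedIndices; k++) total += frequency[usedIndex[k]];
        return total;
    }
}
```

With the three toy term lists {a, c}, {b, c} and {a}, the driver calls `combine(2)` for a, `combine(1)` for b and `combine(2)` for c. Only the first `numUsedIndices` entries of `usedIndex` are meaningful on each call.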
combineNumberOfDocuments | abstract protected int combineNumberOfDocuments()(Code) | | Combines the number of documents.
Returns: the number of documents of the combined index. |
combineSizes | abstract protected int combineSizes() throws IOException(Code) | | Combines size lists.
Returns: the maximum size of a document in the combined index. throws: IOException - |
main | public static void main(String[] arg, Class<? extends Combine> combineClass) throws JSAPException, ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException(Code) | | |
sizes | protected IntIterator sizes(int numIndex) throws FileNotFoundException(Code) | | Returns an iterator on sizes.
The purpose of this method is to provide
Combine.combineSizes() implementations with
a way to access the size list from a disk file or from
BitStreamIndex.sizes transparently.
This mechanism is essential to ensure that size files are read exactly once.
The caller should check whether the returned object implements
Closeable ,
and, in this case, invoke
Closeable.close after usage.
Parameters: numIndex - the number of an index. Returns: an iterator on the sizes of the index. |
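The Closeable check described for sizes(int) might look like the sketch below. It uses `java.util.PrimitiveIterator.OfInt` as a stand-in for the iterator type so that the example is self-contained. `SizesUsage`, `maxSize` and `ClosableSizes` are hypothetical names, and the actual iterator returned by MG4J is a different type.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.PrimitiveIterator;

final class SizesUsage {
    /** Scans the sizes once, then closes the iterator if it happens to be Closeable. */
    static int maxSize(PrimitiveIterator.OfInt sizes) {
        try {
            int max = 0;
            while (sizes.hasNext()) max = Math.max(max, sizes.nextInt());
            return max;
        } finally {
            // An in-memory iterator has nothing to release; a file-backed one does.
            if (sizes instanceof Closeable) {
                try {
                    ((Closeable) sizes).close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
    }
}

/** Hypothetical file-like iterator that records whether it was closed. */
final class ClosableSizes implements PrimitiveIterator.OfInt, Closeable {
    private final int[] data;
    private int pos;
    boolean closed;

    ClosableSizes(int[] data) { this.data = data; }

    @Override public boolean hasNext() { return pos < data.length; }
    @Override public int nextInt() { return data[pos++]; }
    @Override public void close() { closed = true; }
}
```

Placing the check in a `finally` block means the underlying file is released even if reading fails midway, while callers remain unaware of whether the sizes came from memory or from disk.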
|
|