Partitions an index lexically.
A global index is partitioned lexically by providing a
LexicalPartitioningStrategy that specifies, for each term, a destination local index and a local term number. The global index
is read directly at the bit level, and the posting lists are divided among the
local indices using the provided strategy. For instance,
a ContiguousLexicalStrategy divides an index into
contiguous blocks (of terms) specified by the given strategy.
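As an illustration of the mapping a contiguous strategy performs, here is a minimal sketch. The class ContiguousLexicalSketch, its cut-point array and its method names are hypothetical stand-ins written for this example; they are not the MG4J classes and make no claim about the MG4J API.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a contiguous lexical strategy; not the MG4J class. */
final class ContiguousLexicalSketch {
    /** cutPoints[i] is the first global term number of local index i;
     *  the last entry is the total number of terms. */
    private final int[] cutPoints;

    ContiguousLexicalSketch(int... cutPoints) {
        this.cutPoints = cutPoints.clone();
    }

    /** Returns the local index that receives the given global term number
     *  (which must be smaller than the last cut point). */
    int localIndex(int globalNumber) {
        int pos = Arrays.binarySearch(cutPoints, globalNumber);
        // On a miss, binarySearch returns -(insertionPoint) - 1;
        // the block containing the term is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Returns the number of the term inside its local index. */
    int localNumber(int globalNumber) {
        return globalNumber - cutPoints[localIndex(globalNumber)];
    }

    public static void main(String[] args) {
        // Ten terms split into three blocks: [0..4), [4..7), [7..10).
        ContiguousLexicalSketch s = new ContiguousLexicalSketch(0, 4, 7, 10);
        for (int t = 0; t < 10; t++)
            System.out.println("term " + t + " -> local index " + s.localIndex(t)
                    + ", local number " + s.localNumber(t));
    }
}
```

Because the blocks are contiguous, a sorted array of cut points and a binary search are enough to resolve both the destination index and the local term number.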
By choice, document pointers are not remapped. Thus, it may happen that one of the local indices
contains no posting with a certain document. However, computing the subset of documents contained
in each local index to remap them in a contiguous interval is not a good idea, as usually the subset
of documents appearing in the postings of each local index is large.
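The effect of leaving document pointers untouched can be seen on a toy example. The sketch below uses made-up posting lists and a hypothetical PointerSketch class (neither comes from MG4J): terms are split into contiguous blocks, pointers are copied unchanged, and the set of documents seen by each local index is then computed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical toy model of lexical partitioning; not an MG4J class. */
final class PointerSketch {
    /** Splits posting lists (one per global term, in term order) into contiguous
     *  blocks of blockSize terms. Document pointers are copied unchanged. */
    static List<List<int[]>> partition(int[][] postings, int blockSize) {
        List<List<int[]>> local = new ArrayList<>();
        for (int t = 0; t < postings.length; t++) {
            if (t % blockSize == 0) local.add(new ArrayList<>());
            local.get(t / blockSize).add(postings[t].clone());
        }
        return local;
    }

    /** Returns the set of document pointers appearing in a local index. */
    static SortedSet<Integer> documents(List<int[]> localIndex) {
        SortedSet<Integer> docs = new TreeSet<>();
        for (int[] list : localIndex)
            for (int d : list) docs.add(d);
        return docs;
    }

    public static void main(String[] args) {
        // Made-up posting lists for four terms over documents 0..3.
        int[][] postings = { { 0, 1, 2, 3 }, { 1, 3 }, { 0, 2 }, { 0, 1, 2 } };
        List<List<int[]>> local = partition(postings, 2);
        for (int k = 0; k < local.size(); k++)
            System.out.println("local index " + k + " touches documents " + documents(local.get(k)));
    }
}
```

In this toy data the second local index has no posting for document 3, yet it still touches most of the collection, which is the situation the paragraph above describes.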
To speed up the search for the right local index of a not-so-frequent term,
after partitioning an index you can create filters
that will be used to try to avoid
inquiring indices that do not contain a term. The filters will be automatically loaded
by
it.unimi.dsi.mg4j.index.cluster.IndexCluster.getInstance(CharSequence, boolean, boolean).
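The idea of consulting a per-index filter before querying a local index can be sketched as follows. The text above does not say which kind of filter is meant, so this example assumes a simple Bloom-style filter; the class TermFilterSketch, its sizes and its hash scheme are all hypothetical and unrelated to the files MG4J actually loads.

```java
import java.util.BitSet;

/** Hypothetical per-index term filter (Bloom-style); not an MG4J class. */
final class TermFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    TermFilterSketch(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    /** i-th bit position for a term, derived from two hash values (double hashing). */
    private int position(CharSequence term, int i) {
        int h1 = term.toString().hashCode();
        int h2 = (Integer.reverse(h1) * 0x9E3779B9) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(CharSequence term) {
        for (int i = 0; i < numHashes; i++) bits.set(position(term, i));
    }

    /** False means the term is certainly absent; true means it may be present. */
    boolean mayContain(CharSequence term) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(position(term, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        // Made-up term sets, one per local index.
        String[][] localTerms = { { "apple", "banana" }, { "cherry", "date" } };
        TermFilterSketch[] filters = new TermFilterSketch[localTerms.length];
        for (int k = 0; k < localTerms.length; k++) {
            filters[k] = new TermFilterSketch(1024, 3);
            for (String term : localTerms[k]) filters[k].add(term);
        }
        String query = "cherry";
        // Only local indices whose filter does not rule the term out are queried.
        for (int k = 0; k < filters.length; k++)
            if (filters[k].mayContain(query))
                System.out.println("query local index " + k + " for " + query);
    }
}
```

A filter of this kind never reports a stored term as absent, so skipping an index on a negative answer is safe; a positive answer only means the index has to be queried.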
Note that the size file is the same for each local index and is not copied. Please use
standard operating system features such as symbolic links to provide size files to
local indices.
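If you prefer to create the links programmatically rather than from a shell, a sketch along these lines may help. It assumes that the size file of an index with basename X is named X.sizes and that local indices are named basename-0, basename-1, and so on; the class SizeLinksSketch and the basenames used are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that links one size file to several local indices. */
final class SizeLinksSketch {
    /** Creates localBasename-i.sizes links in dir, all pointing at globalBasename.sizes. */
    static void linkSizes(Path dir, String globalBasename, String localBasename, int numIndices) {
        // Relative target: resolved against the directory containing the link.
        Path target = Paths.get(globalBasename + ".sizes");
        try {
            for (int i = 0; i < numIndices; i++) {
                Path link = dir.resolve(localBasename + "-" + i + ".sizes");
                Files.deleteIfExists(link); // removes a stale link, never the target
                Files.createSymbolicLink(link, target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the sketch in a temporary directory; returns how many links resolve to the original file. */
    static int demo() {
        try {
            Path dir = Files.createTempDirectory("sizes-sketch");
            Path sizes = dir.resolve("global.sizes");
            Files.write(sizes, new byte[] { 1, 2, 3 }); // dummy content
            linkSizes(dir, "global", "local", 3);
            int ok = 0;
            for (int i = 0; i < 3; i++) {
                Path link = dir.resolve("local-" + i + ".sizes");
                if (Files.isSymbolicLink(link) && Files.isSameFile(link, sizes)) ok++;
            }
            return ok;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo() + " links point at the shared size file");
    }
}
```

Symbolic links require support from the underlying file system and, on some platforms, special privileges.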
If you plan to cluster the partitioned indices and you need document sizes,
you can use the index property
it.unimi.dsi.mg4j.index.Index.UriKeys.SIZES to load the original size file.
If you plan on partitioning an index requiring
document sizes, you should consider a custom index loading scheme
that shares the document sizes among all local indices.
Important: this class just partitions the index. No auxiliary files (most notably,
term maps or prefix maps) will be generated. Please refer to a
StringMap implementation (e.g.,
ShiftAddXorSignedStringMap or
ImmutableExternalPrefixMap).
Write-once output and distributed index partitioning
The partitioning process writes each index file sequentially exactly once, so index partitioning
can output its results to pipes, which in
turn can spill their content, for instance, through the network. In other words, albeit this
class theoretically creates a number of local indices on disk, those indices can be
substituted with suitable pipes creating remote local indices without affecting the partitioning process.
For instance, the following bash code creates three sets of pipes:
for i in 0 1 2; do
    for e in frequencies globcounts index offsets properties sizes terms; do
        mkfifo pipe-$i.$e
    done
done
Each pipe must be emptied elsewhere, for instance (assuming
you want local indices index-0, index-1 and index-2 on example.com):
for i in 0 1 2; do
    for e in frequencies globcounts index offsets properties sizes terms; do
        (cat pipe-$i.$e | ssh -x example.com "cat >index-$i.$e" &)
    done
done
If we now start a partitioning process generating three local indices named pipe-0,
pipe-1 and pipe-2,
all pipes will be written to by the process, and the data will create the
indices index-0, index-1 and index-2 remotely.
author: Sebastiano Vigna
since: 1.0.1