it.unimi.dsi.mg4j.index |
MG4J: Managing Gigabytes for Java
Index generation and access.
This package contains the classes that handle index generation and
access. The interval iterators defined in {@link it.unimi.dsi.mg4j.search}
build upon the classes of this package to provide answer to queries using interval semantics,
but it is also possible to access an index directly.
You can easily build indices using the tools in {@link it.unimi.dsi.mg4j.tool}. Once an index
has been built, it can be opened using an {@link it.unimi.dsi.mg4j.index.Index} object, which
gathers metadata that is necessary to access the index. You do not create an {@link it.unimi.dsi.mg4j.index.Index}
with a constructor: rather, you use the static factory {@link
it.unimi.dsi.mg4j.index.Index#getInstance(CharSequence)} (or one of its variants) to create an instance.
This is necessary so that different kind of indices can be treated transparently: for example, the factory
may return a {@link it.unimi.dsi.mg4j.index.cluster.IndexCluster} if the index is actually a cluster,
but you do not need to know that.
From an {@link it.unimi.dsi.mg4j.index.Index},
you can easily obtain either an {@link it.unimi.dsi.mg4j.index.IndexReader}, which allows to
scan sequentially or randomly the index. In turn from an {@link it.unimi.dsi.mg4j.index.IndexReader}
you can obtain a {@link it.unimi.dsi.mg4j.index.IndexIterator}
returning the documents containing a certain term and the position of the term within the document.
But there is more: an {@link it.unimi.dsi.mg4j.index.IndexIterator}
is a kind of {@link it.unimi.dsi.mg4j.search.DocumentIterator}, and
{@link it.unimi.dsi.mg4j.search.DocumentIterator}s can be combined in several ways
using the classes of the package {@link it.unimi.dsi.mg4j.search}: for instance, you can combine
document iterators using AND/OR. Note that you can combine document iterators on different
indices, but of course the operation is meaningful only if the two indices contain different information
about the same document collection (e.g., title and main text).
More importantly, if the index is full text (the default) for each document containing the term you can get
interval iterators that return intervals representing extents of text satisfying the query: for
instance, in case of an AND of two terms, the intervals will contain both terms.
Structure of an inverted index
An inverted index is made by a sequence of inverted lists (one inverted
list for each term). Inverted lists are made by document records: each
document record contains information about the occurrences of the term
within a certain document.
More precisely, each inverted list starts with a suitably encoded
integer, called the frequency, which is the number of document
records that will follow (i.e., the number of documents in which the term
appears). After that, there are exactly as many document records as the
frequency.
Each document record is made by two parts:
- a suitably encoded integer, called the (document) pointer,
which identifies the document within the collection;
- a (possibly empty) sequence of bits, called the data; the
data have no special structure per se: the only assumption is that they
are a self-delimiting bit sequence (i.e., one knows when the sequence is
over).
As a basic and fundamental implementation, the classes of this package provide methods
that write and read document data in a default form. In this default
structure, each document data is a suitable coding of a (strictly
increasing) sequence of integers, that correspond to the positions
where the term occurs within the document. The length of the sequence
(i.e., the number of positions in at which the term appears) is called the
count (it is also common to call it “within-document frequency”, but we find this
usage confusing).
|
Java Source File Name | Type | Comment |
AbstractBitStreamIndexWriter.java | Class | An abstract bitstream-based index writer, providing common variables and a basic
AbstractBitStreamIndexWriter.printStats(PrintStream) implementation. |
AbstractIndexIterator.java | Class | A very basic abstract implementation of an index interator,
providing an obvious implementation of
AbstractIndexIterator.term() ,
AbstractIndexIterator.id() ,
and of the
. |
AbstractIndexReader.java | Class | An abstract,
implementation of an index reader. |
AbstractPrefixMap.java | Class | An abstract implementation of a map from prefixes to term intervals. |
AbstractTermMap.java | Class | An abstract implementation of a map from term to term indices. |
BitStreamHPIndex.java | Class | A
index.
Implementing subclasses must provide access to the index bitstream (as it
happens for a
BitStreamIndex ) but also to the positions stream,
both at
and
level.
Wired implementations
The standard readers associated to an instance of this class are of type
BitStreamHPIndexReader .
Nonetheless, it is possible to generate automatically sources for wired classes that
work only for a particular set of codings and flags. |
BitStreamHPIndexReader.java | Class | A bitstream-based
for
. |
BitStreamHPIndexWriter.java | Class | Writes a bitstream-based high-performance index. |
BitStreamIndex.java | Class | A
index. |
BitStreamIndexReader.java | Class | A bitstream-based
. |
BitStreamIndexWriter.java | Class | Writes a bitstream-based interleaved index.
Offsets bit stream
An inverted index may have an associated
OutputBitStream of
offsets: this file contains T+1 integers, where T
is the number of inverted lists (i.e., the number of terms), and the
i -th entry is a suitable coding of the position in bits where
the i -th inverted list starts (the last entry is actually the
length, in bytes, of the inverted index file itself). |
CachingOutputBitStream.java | Class | A special output bit stream with an additional
method
CachingOutputBitStream.buffer() that returns the internal buffer
if the internal buffer contains all that has been written since
the last call to
OutputBitStream.position(long) position(0) .
This bit stream is used every time that it is necessary to cache quickly a bit stream.
By sizing the buffer size appropriately, most of the times data written to the stream
will be readable directly from the buffer. |
CompressionFlags.java | Class | A container for constants and enums related to index compression. |
DiskBasedIndex.java | Class | A static container providing facilities to load an index based on data stored on disk.
This class contains several useful static methods
such as
DiskBasedIndex.readOffsets(InputBitStream,int) and
DiskBasedIndex.readSizes(InputBitStream,int) ,
and static factor methods such as
DiskBasedIndex.getInstance(CharSequence,boolean,boolean,boolean,EnumMap) that take care of reading the properties associated to the index, identify
the correct
it.unimi.dsi.mg4j.index.Index implementation that
should be used to load the index, and load the necessary data into memory. |
DowncaseTermProcessor.java | Class | A term processor downcasing all characters. |
FileHPIndex.java | Class | A file-based high-performance index. |
FileIndex.java | Class | A file-based index. |
Index.java | Class | An abstract representation of an index.
Concrete subclasses of this class represent abstract index access
information: for instance, the basename or IP address/port,
flags, etc. |
IndexIterator.java | Interface | An iterator over an inverted list.
An index iterator scans the inverted list of an indexed term. |
IndexIterators.java | Class | A class providing static methods and objects that do useful things with index iterators. |
IndexReader.java | Interface | Provides access to an inverted index.
An
it.unimi.dsi.mg4j.index.Index contains global read-only metadata. |
IndexWriter.java | Interface | An interface for classes that generate indices.
Implementations of this interface are used to write inverted lists in
sequential order, as follows:
- to create a new inverted list, you must call
IndexWriter.newInvertedList() ;
- then, you must specified the frequency using
IndexWriter.writeFrequency(int) ;
- the document records follow; before writing a new document record, you must call
IndexWriter.newDocumentRecord() ;
note that, all in all, the number of calls to
IndexWriter.newDocumentRecord() must be equal to the frequency;
- for each document record, you must supply the information needed for the index you are building
(
,
,
, and
, in this order).
IndexWriter.newDocumentRecord() returns an
OutputBitStream that must be used to write the document-record data. |
InMemoryHPIndex.java | Class | A
BitStreamHPIndex index loaded in memory. |
InMemoryIndex.java | Class | A local bitstream index loaded in memory. |
MemoryMappedHPIndex.java | Class | A memory-mapped
BitStreamHPIndex . |
MemoryMappedIndex.java | Class | A local memory-mapped bistream index.
Memory-mapped indices are created by mapping the index file into memory
using a
MappedByteBuffer . |
MultiTermIndexIterator.java | Class | A virtual
that merges several component index iterators.
This class adds to
it.unimi.dsi.mg4j.search.AbstractUnionDocumentIterator an interval interator generating the OR of the intervals returned for each of the documents involved.
The main difference with an
OrDocumentIterator built on the same array of component iterators
is that this class implements
IndexIterator and hence provides a
MultiTermIndexIterator.count() (the sum
of counts of those component iterators positioned on the current document) and a
MultiTermIndexIterator.frequency() . |
NullTermProcessor.java | Class | A term processor that accepts all terms and does not do any processing. |
PrefixMap.java | Interface | A map from prefixes to term intervals (and possibly viceversa).
Given a list of terms in lexicographic order numbered from 0, we can ask, given a prefix,
which interval of terms starts with the given prefix. |
SkipBitStreamIndexWriter.java | Class | Provides facilities to write skip inverted indices,
that is, inverted indices with an additional skip structure. |
TermMap.java | Interface | A map from terms to numbers (and possibly viceversa). |
TermMaps.java | Class | A class providing static methods and objects that do useful things with
and
. |
TermProcessor.java | Interface | A term processor, implementing term/prefix transformation and possibly term/prefix filtering.
Index contruction requires sometimes modifications of
the given terms: downcasing, stemming, and so on. |
TooManyTermsException.java | Class | Thrown to indicate that a prefix query generated too many terms. |