| java.lang.Object it.unimi.dsi.mg4j.index.Index
All known Subclasses: it.unimi.dsi.mg4j.index.BitStreamIndex, it.unimi.dsi.mg4j.index.cluster.IndexCluster, it.unimi.dsi.mg4j.index.remote.RemoteIndex
Index | abstract public class Index implements Serializable(Code) | | An abstract representation of an index.
Concrete subclasses of this class represent abstract index access
information: for instance, the basename or IP address/port,
flags, etc. This class makes it easy to build index readers
over the index; in turn, index readers provide
index iterators over the documents containing a term.
In principle, this class should contain just method declarations
and attributes for all data that is common to any form of index.
Note that we use an abstract class, rather than an interface, because
interfaces cannot declare instance attributes.
This class provides static factory methods (e.g.,
Index.getInstance(CharSequence) )
that return an index given a suitable URI string. If the scheme part is mg4j, then
the URI is assumed to point at a remote index. Otherwise, it is assumed to be the
basename of a local index. In both cases, a query part introduced by ? can
specify additional parameters (key=value pairs separated
by ;). For instance, the URI example?inmemory=1 will load
the index with basename example, caching its content in core memory.
Please have a look at constants in
Index.UriKeys
(and analogous enums in subclasses) for additional parameters.
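The URI convention described above can be illustrated with a minimal sketch. IndexUriSketch is a hypothetical helper written for this page, not part of MG4J, and the way MG4J itself parses URIs may differ; in particular, detecting a remote index by testing for the prefix mg4j: is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of MG4J): splits an index URI into location and parameters. */
final class IndexUriSketch {
    /** Everything before the '?': a local basename, or a remote location. */
    final String location;
    /** Assumption: a URI is remote when it starts with the scheme "mg4j:". */
    final boolean remote;
    /** The key=value pairs of the query part, in order of appearance. */
    final Map<String, String> parameters;

    private IndexUriSketch(String location, boolean remote, Map<String, String> parameters) {
        this.location = location;
        this.remote = remote;
        this.parameters = parameters;
    }

    static IndexUriSketch parse(CharSequence uri) {
        String s = uri.toString();
        int q = s.indexOf('?');
        String location = q < 0 ? s : s.substring(0, q);
        Map<String, String> parameters = new LinkedHashMap<>();
        if (q >= 0) {
            // Parameters are separated by ';' and each one has the form key=value.
            for (String pair : s.substring(q + 1).split(";")) {
                if (pair.isEmpty()) continue;
                int eq = pair.indexOf('=');
                if (eq < 0) throw new IllegalArgumentException("Missing '=' in parameter: " + pair);
                parameters.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return new IndexUriSketch(location, s.startsWith("mg4j:"), parameters);
    }

    public static void main(String[] args) {
        IndexUriSketch parsed = parse("example?inmemory=1");
        System.out.println(parsed.location + " " + parsed.remote + " " + parsed.parameters);
    }
}
```

For the URI example?inmemory=1 used above, the sketch yields the location example, a non-remote index, and the single parameter inmemory with value 1.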
Thread safety
Indices are a natural candidate for multithreaded access. An instance of this class
must be thread safe as long as external data structures provided to its
constructors are. For instance, the tool
it.unimi.dsi.mg4j.tool.IndexBuilder generates
an
ImmutableExternalPrefixMap so that by default the resulting index is thread safe.
As another example, a
it.unimi.dsi.mg4j.index.DiskBasedIndex requires a list of
term offsets, term maps, etc. As long as all these data structures are thread safe, the
same is true of the index. Data structures created by static factory methods such as
it.unimi.dsi.mg4j.index.DiskBasedIndex.getInstance(CharSequence) are thread safe.
Note that
it.unimi.dsi.mg4j.index.IndexReader instances returned by
Index.getReader() are not thread safe (even if the method
Index.getReader() is). The logic behind
this arrangement is that you create as many readers as you need, and then
close them (see java.io.Closeable.close()). In a multithreaded
environment, a pool of index readers can be created, and a custom
it.unimi.dsi.mg4j.query.nodes.QueryBuilderVisitor can be used to build
it.unimi.dsi.mg4j.search.DocumentIterator instances using the given pool of readers. In
this case readers are not closed, but rather reused.
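The pooling idea above can be sketched as follows. ReaderPoolSketch is a hypothetical class, not part of MG4J; the type parameter R stands in for a non-thread-safe reader such as IndexReader, and the factory is assumed to wrap a call like Index.getReader(). How a custom QueryBuilderVisitor would draw readers from such a pool is not shown.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;

/** Hypothetical fixed-size pool (not part of MG4J); R stands in for a non-thread-safe reader. */
final class ReaderPoolSketch<R> {
    private final BlockingQueue<R> idle;

    /** Creates all readers up front using the given factory (e.g., one wrapping Index.getReader()). */
    ReaderPoolSketch(int size, Callable<R> factory) throws Exception {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) idle.add(factory.call());
    }

    /** Blocks until a reader is free; the calling thread owns it exclusively until release. */
    R acquire() throws InterruptedException {
        return idle.take();
    }

    /** Returns a reader to the pool instead of closing it; fails if the pool is already full. */
    void release(R reader) {
        idle.add(reader);
    }

    /** The number of readers currently idle. */
    int available() {
        return idle.size();
    }
}
```

Each thread calls acquire(), uses the reader, and calls release() in a finally block, so no reader is ever shared by two threads at once. The readers would be closed only when the pool itself is discarded.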
Read-once load
Implementations of this class are strongly encouraged to offer read-once constructors
and factory methods: property files and other data related to the index (but not to an
it.unimi.dsi.mg4j.index.IndexReader) should be read exactly once, and sequentially. This feature is very useful when
.
author: Paolo Boldi author: Sebastiano Vigna since: 0.9 |
Inner Class :public static enum PropertyKeys | |
Inner Class :public static enum UriKeys | |
Field Summary | |
final public EmptyIndexIterator | emptyIndexIterator A singleton for an iterator returning no documents based on this index. | final public String | field The field indexed by this index, or null . | final public boolean | hasCounts Whether this index contains counts. | final public boolean | hasPayloads Whether this index contains payloads; if true,
Index.payload is non-null . | final public boolean | hasPositions Whether this index contains positions. | public Index | keyIndex The index used as a key to retrieve intervals. | final public int | maxCount The maximum number of positions in a position list, or -1 if it is unknown. | final public int | numberOfDocuments The number of documents of the collection. | final public long | numberOfOccurrences The number of occurrences of the collection. | final public long | numberOfPostings The number of postings (pairs term/document) of the collection. | final public int | numberOfTerms The number of terms of the collection. | final public Payload | payload The payload for this index, or null . | final public Properties | properties The properties of this index. | public ReferenceSet<Index> | singletonSet An immutable singleton set containing just
Index.keyIndex . | final public IntList | sizes The size of each document, or null if sizes are not necessary or not loaded in this index. | final public TermProcessor | termProcessor The term processor used to build this index. |
Constructor Summary | |
protected | Index(int numberOfDocuments, int numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, boolean hasCounts, boolean hasPositions, TermProcessor termProcessor, String field, IntList sizes, Properties properties) Creates a new instance, initialising all fields. |
Method Summary | |
public IndexIterator | documents(int term) Creates a new
IndexReader for this index and uses it to return
an index iterator over the documents containing a term.
Since the reader is created from scratch, it is essential
to dispose of the returned iterator after usage. | public IndexIterator | documents(CharSequence term) Creates a new
IndexReader for this index and uses it to return
an index iterator over the documents containing a term; the term is
given explicitly, and the index term map is used, if present.
Since the reader is created from scratch, it is essential
to dispose of the returned iterator after usage. | abstract public IndexIterator | documents(CharSequence prefix, int limit) Creates a number of instances of
IndexReader for this index and uses them to return
a document iterator over the documents containing a set of terms defined
by a prefix; the prefix is given explicitly, and unless the index has a
prefix map, an
UnsupportedOperationException will be thrown. | public static Index | getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps) Returns a new index using the given URI.
If uri has scheme mg4j, the index is considered to be remote
and index creation is delegated to
IndexServer.getIndex(String, int, boolean, boolean). | public static Index | getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes) Returns a new index using the given URI, searching dynamically for term and prefix maps. | public static Index | getInstance(CharSequence uri, boolean randomAccess) Returns a new index using the given URI, searching dynamically for term and prefix maps and loading
document sizes only if it is necessary. | public static Index | getInstance(CharSequence uri) Returns a new index using the given URI, searching dynamically for term and prefix maps, loading offsets but loading
document sizes only if it is necessary. | public IndexReader | getReader() Creates and returns a new
IndexReader based on this index, using
the default buffer size. | abstract public IndexReader | getReader(int bufferSize) Creates and returns a new
IndexReader based on this index. | protected static TermProcessor | getTermProcessor(Properties properties) | public void | keyIndex(Index newKeyIndex) Set the index used as a key to retrieve intervals from iterators generated from this index.
This setter is a compromise between clarity of design and efficiency.
Each index iterator is based on an index, and when that index is passed
to
DocumentIterator.intervalIterator(Index) , intervals corresponding
to the positions of the term in the current document are returned. |
emptyIndexIterator | final public EmptyIndexIterator emptyIndexIterator(Code) | | A singleton for an iterator returning no documents based on this index.
|
field | final public String field(Code) | | The field indexed by this index, or null .
|
hasCounts | final public boolean hasCounts(Code) | | Whether this index contains counts.
|
hasPayloads | final public boolean hasPayloads(Code) | | Whether this index contains payloads; if true,
Index.payload is non-null .
|
hasPositions | final public boolean hasPositions(Code) | | Whether this index contains positions.
|
keyIndex | public Index keyIndex(Code) | | The index used as a key to retrieve intervals. Usually equal to this , but it is
.
|
maxCount | final public int maxCount(Code) | | The maximum number of positions in a position list, or -1 if it is unknown.
|
numberOfDocuments | final public int numberOfDocuments(Code) | | The number of documents of the collection.
|
numberOfOccurrences | final public long numberOfOccurrences(Code) | | The number of occurrences of the collection.
|
numberOfPostings | final public long numberOfPostings(Code) | | The number of postings (pairs term/document) of the collection.
|
numberOfTerms | final public int numberOfTerms(Code) | | The number of terms of the collection. This field might be set to -1 in some cases
(for instance, in certain documental clusters).
|
payload | final public Payload payload(Code) | | The payload for this index, or null .
|
properties | final public Properties properties(Code) | | The properties of this index. It is stored here for convenience (for instance,
if custom keys are added to the property file), but it may be null .
|
sizes | final public IntList sizes(Code) | | The size of each document, or null if sizes are not necessary or not loaded in this index.
|
termProcessor | final public TermProcessor termProcessor(Code) | | The term processor used to build this index.
|
Index | protected Index(int numberOfDocuments, int numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, boolean hasCounts, boolean hasPositions, TermProcessor termProcessor, String field, IntList sizes, Properties properties)(Code) | | Creates a new instance, initialising all fields.
|
documents | public IndexIterator documents(CharSequence term) throws IOException(Code) | | Creates a new
IndexReader for this index and uses it to return
an index iterator over the documents containing a term; the term is
given explicitly, and the index term map is used, if present.
Since the reader is created from scratch, it is essential
to dispose of the returned iterator after usage. See
IndexReader.documents(int) for a method with the same semantics, but making reader reuse possible.
Unless the term processor of
this index is null, words coming from a query will
have to be processed before being used with this method.
Parameters: term - a term. throws: IOException - if an exception occurred while accessing the index. throws: UnsupportedOperationException - if the term map is not available for this index. See Also: IndexReader.documents(CharSequence) |
documents | abstract public IndexIterator documents(CharSequence prefix, int limit) throws IOException, TooManyTermsException(Code) | | Creates a number of instances of
IndexReader for this index and uses them to return
a document iterator over the documents containing a set of terms defined
by a prefix; the prefix is given explicitly, and unless the index has a
prefix map, an
UnsupportedOperationException will be thrown.
This method is not provided by
IndexReader because it requires the
creation of several index readers at the same time. These readers must be
closed afterwards.
Parameters: prefix - a prefix. Parameters: limit - a limit on the number of terms that will be used to resolve the prefix query; if there are more than limit terms starting with prefix, a TooManyTermsException will be thrown. throws: IOException - if an exception occurred while accessing the index. throws: UnsupportedOperationException - if this index cannot resolve prefixes. throws: TooManyTermsException - if there are more than limit terms starting with prefix. |
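The contract of the limit parameter can be illustrated with a minimal sketch over a sorted set of terms. PrefixResolutionSketch and TooManyTermsSketchException are hypothetical names standing in for the real prefix map and TooManyTermsException; the actual resolution inside an MG4J index is likely implemented differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Hypothetical illustration (not MG4J code) of resolving a prefix with a limit on matching terms. */
final class PrefixResolutionSketch {
    /** Hypothetical stand-in for MG4J's TooManyTermsException. */
    static final class TooManyTermsSketchException extends Exception {
        TooManyTermsSketchException(String message) {
            super(message);
        }
    }

    /** Returns the terms starting with prefix, or throws if there are more than limit of them. */
    static List<String> resolve(SortedSet<String> terms, String prefix, int limit)
            throws TooManyTermsSketchException {
        List<String> matches = new ArrayList<>();
        // In natural String order, all terms sharing a prefix are contiguous and start at tailSet(prefix).
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) break;
            if (matches.size() == limit)
                throw new TooManyTermsSketchException("More than " + limit + " terms start with " + prefix);
            matches.add(term);
        }
        return matches;
    }
}
```

In an index, each resolved term would then need its own reader, which is why the method lives in Index rather than in IndexReader.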
getInstance | public static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps) throws IOException, ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException(Code) | | Returns a new index using the given URI.
If uri has scheme mg4j, the index is considered to be remote
and index creation is delegated to
IndexServer.getIndex(String, int, boolean, boolean). Otherwise,
we delegate to
DiskBasedIndex.getInstance(CharSequence, boolean, boolean, boolean, EnumMap).
Parameters: uri - the URI defining the index. Parameters: randomAccess - whether the index should be accessible randomly. Parameters: documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it). Parameters: maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index). |
getInstance | public static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes) throws IOException, ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException(Code) | | Returns a new index using the given URI, searching dynamically for term and prefix maps.
Parameters: uri - the URI defining the index. Parameters: randomAccess - whether the index should be accessible randomly. Parameters: documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it). See Also: Index.getInstance(CharSequence,boolean,boolean,boolean) |
getInstance | public static Index getInstance(CharSequence uri, boolean randomAccess) throws ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException(Code) | | Returns a new index using the given URI, searching dynamically for term and prefix maps and loading
document sizes only if it is necessary.
Parameters: uri - the URI defining the index. Parameters: randomAccess - whether the index should be accessible randomly. See Also: Index.getInstance(CharSequence,boolean,boolean) |
getInstance | public static Index getInstance(CharSequence uri) throws ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException(Code) | | Returns a new index using the given URI, searching dynamically for term and prefix maps, loading offsets but loading
document sizes only if it is necessary.
Parameters: uri - the URI defining the index. See Also: Index.getInstance(CharSequence,boolean) |
getReader | abstract public IndexReader getReader(int bufferSize) throws IOException(Code) | | Creates and returns a new
IndexReader based on this index. After that, you
can use the reader to read this index.
Parameters: bufferSize - the size of the buffer to be used accessing the reader, or -1 for a default buffer size. Returns: a new IndexReader to read this index. |
keyIndex | public void keyIndex(Index newKeyIndex)(Code) | | Set the index used as a key to retrieve intervals from iterators generated from this index.
This setter is a compromise between clarity of design and efficiency.
Each index iterator is based on an index, and when that index is passed
to
DocumentIterator.intervalIterator(Index) , intervals corresponding
to the positions of the term in the current document are returned. Analogously,
it.unimi.dsi.mg4j.search.DocumentIterator.indices returns a singleton
set containing the index. However, when composing indices into clusters,
often iterators generated by a local index must act as if they really belong
to the global index. This method sets the index that is used as
a key to return intervals, and that is contained in
Index.singletonSet .
Note that setting this value will only influence
created afterwards.
Parameters: newKeyIndex - the new index to be used as a key for interval retrieval. |