| |
Package Name | Comment | it.unimi.dsi.mg4j.compression | | it.unimi.dsi.mg4j.document |
This package contains all the logics related to and useful for managing documents, document collections and such.
Warning: We are still working on the document infrastructure. It should be pretty stable, but
changes should not be unexpected. Suggestions are welcome. Note also that most of the classes in this
package should be considered examples and suggestions: while a casual user will find them
invaluable in indexing data, a custom, large-scale application will usually require writing your own
{@link it.unimi.dsi.mg4j.document.DocumentCollection}.
Basic interfaces
The {@link it.unimi.dsi.mg4j.document.Document} interface
MG4J aims at indexing the content of entities called documents. The main
classes that describe documents and sets of documents are included in the it.unimi.dsi.mg4j.document
package. In particular, documents are instances of the {@link it.unimi.dsi.mg4j.document.Document}
interface. A document is characterized abstractly by the following data:
- a title, a character sequence that represents the document; the document title is
returned by the {@link it.unimi.dsi.mg4j.document.Document#title()} method;
- a URI, that somehow characterizes the document uniquely; the document URI is returned by
the {@link it.unimi.dsi.mg4j.document.Document#uri()} method;
- a number of fields; every field is abstractly represented by a number, the field index;
fields are numbered from 0, but the user should know how many fields a document exhibits, because this
information is not available in the document itself. For every field, the document exhibits the following data:
- the field content, that is an
Object returned by the
{@link it.unimi.dsi.mg4j.document.Document#content(int)} method; the type of object that
this method returns must be known by the calling class in advance; in particular, for
textual fields (see below), the content will be a {@link java.io.Reader};
- for textual fields only, a {@linkplain it.unimi.dsi.mg4j.io.WordReader word reader}:
an object that is able to split the field
content (in this case: a sequence of characters) into a sequence of words; the word reader
is returned by the {@link it.unimi.dsi.mg4j.document.Document#wordReader(int)}
method (which must be called only for textual fields).
Users should always close a document after usage by calling the {@link it.unimi.dsi.mg4j.document.Document#close()}
method: the method is responsible for relinquishing all resources that a document instantiated for its very existence.
The {@link it.unimi.dsi.mg4j.document.DocumentFactory} interface
Documents usually do not come alone, but they are grouped into collections: documents within a collection are
of the same type, and this fact explains why the document structure (number and type of fields) are not contained in
the document itself. Indeed, documents are produced by document factories.
A document factory is an instance of the {@link it.unimi.dsi.mg4j.document.DocumentFactory} interface, that in
particular is able to produce a document. All documents produced by the same factory are of the same kind, and exhibit
the same number and type of fields. A factory gives information about the documents it produces through the following methods:
- {@link it.unimi.dsi.mg4j.document.DocumentFactory#numberOfFields()}: returns the number of fields contained
in each document produced by this factory; recall that fields are indexed starting from 0;
- {@link it.unimi.dsi.mg4j.document.DocumentFactory#fieldName(int)}: returns a mnemonic explanatory name
for the given field;
- {@link it.unimi.dsi.mg4j.document.DocumentFactory#fieldIndex(String)}: returns the index of a field, given its mnemonic
name;
- {@link it.unimi.dsi.mg4j.document.DocumentFactory#fieldType(int)}: returns (an integer representing)
the type of the given field; possible types are static constants declared in the {@link it.unimi.dsi.mg4j.document.DocumentFactory.FieldType}
interface; one of the possible types is {@link it.unimi.dsi.mg4j.document.DocumentFactory.FieldType#TEXT}, used for textual
fields; note that the type of objects returned by the {@link it.unimi.dsi.mg4j.document.Document#content(int)}
method of the {@link it.unimi.dsi.mg4j.document.Document} interface depend on the type of the field.
The abovementioned methods provide information about documents produced by the factory. The actual documents are
produced by the {@link it.unimi.dsi.mg4j.document.DocumentFactory#getDocument(java.io.InputStream,it.unimi.dsi.fastutil.objects.Reference2ObjectMap) getDocument(rawContent,metadata)}
method.
This method returns a new document from the factory. The rawContent parameter is
the most important one: it is a stream
of bytes that the factory uses to produce the document.
The factory knows how the sequence of bytes should be interpreted
to produce a document of the desired kind.
Note that even though the interpretation of the sequence of bytes representing the raw
document content is entirely left to implementors, often you
might prefer to think of the input byte sequence as of a list of consecutive self-delimiting byte
subsequences, one for each field: in this case, the
{@link java.io.InputStream#reset()} method of the {@link java.io.InputStream} class is used to divide the
subsequences from one another.
The metadata parameter is a map providing some basic
data about the document as derived by the collection. The map is a reference map with
suitable {@link java.util.Enum} keys, and as such must be
queried using the keys in {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys}, or other
similar factory-specific keys. For instance, the key {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys#TITLE}
gives a suggested title to the document (but the factory may ignore it, if it has a better way to determine
a title for the document), whereas {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys#URI} specifies a suggested URI for
the document (which, once more, may be ignored by the factory).
Usually, a factory is built using a list of properties that define default values for metadata such as charset
encoding or MIME types. There properties can be passed in several ways, and usually the main method of a collection
provides an option (typically, -p) that let the user specify default metadata for the factory. The
property resolution algorithm is explained in the documentation for
{@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory}.
The {@link it.unimi.dsi.mg4j.document.DocumentSequence} interface
Up to this point, we have interpreted documents and document factories in a very abstract manner, but we gave no
importance on the way the byte sequences representing the raw data are produced.
Basically, a source of documents is a {@link it.unimi.dsi.mg4j.document.DocumentSequence}. More precisely, instances of
this class represent sources that are able to generate documents. Typically, a document sequence is able to produce a
stream of documents after one another, through a special kind of iterator, called
{@link it.unimi.dsi.mg4j.document.DocumentIterator}, returned by the {@link it.unimi.dsi.mg4j.document.DocumentSequence#iterator()} method.
A document iterator is not really a Java {@link java.util.Iterator}: it is simply a class that
exposes a method {@link it.unimi.dsi.mg4j.document.DocumentIterator#nextDocument()} that returns the next document,
if any, or null when there are no more documents. Thus, document iterators can be lazy, which
is preferrable in several circumstance (e.g., documents coming from an input stream).
The {@link it.unimi.dsi.mg4j.document.DocumentCollection} interface
In some cases, the documents only appear as an uninterrupted stream and applications do not have direct accesses to
single documents; in particular, it might be the case that documents just disappear after being enumerated (as it happens
when the document source is standard input). In all such situations, a {@link it.unimi.dsi.mg4j.document.DocumentSequence}
provides the only way to get the documents, because it only guarantees sequential access.
Nonetheless, there are other cases where documents can be easily accessed in a direct fashion, and can be read many
times (for example, when documents are files in the file system). In such cases, {@link it.unimi.dsi.mg4j.document.DocumentCollection}
(an extension of {@link it.unimi.dsi.mg4j.document.DocumentSequence})
can be used.
Apart for the methods of a document sequence, a document collection provides the following additional access methods
to the documents:
- {@link it.unimi.dsi.mg4j.document.DocumentCollection#size()}, that returns the collection size, i.e., the number
of documents in the collection;
- {@link it.unimi.dsi.mg4j.document.DocumentCollection#document(int)}, that returns the document with given pointer;
the document pointer is an integer representing uniquely a document within the collection: the i-th
document produced by the collection's {@linkplain it.unimi.dsi.mg4j.document.DocumentCollection#iterator() document iterator}
has pointer i−1 (so, document pointers range from 0, inclusive, to
size() , exclusive);
- {@link it.unimi.dsi.mg4j.document.DocumentCollection#stream(int)}, that returns the {@link java.io.InputStream} of raw data
that this collection would use to produce the document with given pointer.
After a document collection has been created, for example, starting from a set of files in the filesystem, it can usually
be saved (serialized) on a file: the extension used for the filename is, by default, {@link it.unimi.dsi.mg4j.document.DocumentCollection#DEFAULT_EXTENSION}.
Implementors of this interface should always specify explicitly which assumptions on the existence of
external data are made for the consistency of a collection to be preserved. For example, a collection produced from a set of
files will be consistent until no file has been changed or deleted; if the latter situation happens, the collection usually
becomes inconsistent, and in any case you might expect that the indices thereof produced will no longer match the content
of the collection.
Note that a document collection has very weak requirements (and thus
very weak obligations) on the concurrent creation of several objects (documents, iterators, etc.). Please read carefully the
{@link it.unimi.dsi.mg4j.document.DocumentCollection class description}.
Relations between document sequences/collections and document factories
As we explained above, document sequences/collections extract raw data (byte sequences) from some source and use
a specified document factory to turn such data into documents. Hence, there exists a tight connection between each document
sequence and the document factory it uses.
Typically, the document factory is provided to the document sequence at construction time, and this fact provides a form of
flexibility, because different sources (e.g., the file system and the standard input) may be coupled with the same document
factory (e.g., a document factory parsing HTML documents into text), or, conversely, the same document source may be
used to produce documents with different formats.
Users should always be careful, however. Often, document sequences make assumptions about the factory they use, which
reduces the number of possible combinations the user may adopt. Implementors of the {@link it.unimi.dsi.mg4j.document.DocumentSequence}
interface should always clarify all the assumptions they make about the factories that can be used for the sequence.
Document factories
Recall that a document factory is an object that is capable of producing homogeneous documents (documents with the
same number/type of fields). Every document is produced starting from a raw bytestream.
The {@link it.unimi.dsi.mg4j.document.IdentityDocumentFactory}
The simplest possible document factory is the {@link it.unimi.dsi.mg4j.document.IdentityDocumentFactory}: this factory
produces documents with a single textual field, called text, that is actually obtained by transforming the
byte sequence into a sequence of characters, using some default encoding. Actually, a
document factory must also provide a way to break the text into words. With this aim, the identity factory may be provided
with a {@link java.util.Locale} that is used to determine how words are best broken in the given locale's language.
Other examples of document factories
Other implementations of document factories that are provided with MG4J are:
- {@link it.unimi.dsi.mg4j.document.HtmlDocumentFactory}: a factory used to parse HTML documents; this factory
produces documents by parsing HTML streams. Bytes are converted into characters using a specified encoding;
the resulting HTML character sequence is parsed to extract text (that is returned as
a textual field named text) and title (the HTML title element, returned as a textual field named
title). The title, if present, is also used as document title (otherwise, the suggested title is used).
Note that a document collection might have information about the charset encoding (e.g., by means of HTTP headers):
in this case the
metadata field of {@link it.unimi.dsi.mg4j.document.DocumentFactory#getDocument(java.io.InputStream,it.unimi.dsi.fastutil.objects.Reference2ObjectMap)}
should pass this information.
- {@link it.unimi.dsi.mg4j.document.PdfDocumentFactory}: a factory used to parse PDF documents. Currently, a single
textual field named text is exported.
Composing document factories
As we said, many document factories interpret the raw content data (an {@link java.io.InputStream}, i.e., a sequence
of bytes) as if it is really made by a concatenation of many {@link java.io.InputStream}s, where each stream is typically
parsed to a field; to pass from one stream to the next, the {@link java.io.InputStream#reset()} method is called.
Suppose you have n document factories D1, …, Dn,
with f1, …, fn fields, respectively. One may want to
build a new factory with f1+…+fn fields, where each document
is produced by composing the document factories D1, …, Dn sequentially:
in other words, the raw data are first passed to the first factory (that extracts f1 fields, typically
resetting the stream as many times), then it is passed to the second factory (that extracts f2 fields)
etc.
The {@link it.unimi.dsi.mg4j.document.CompositeDocumentFactory} does the job, and also allows one to change the field
names (that are otherwise named as they were in the subfactories).
The class {@link it.unimi.dsi.mg4j.io.MultipleInputStream} is a useful tool to produce raw data for composite factories:
it allows one to convert an array of input streams into a single input stream: each time the resulting stream is reset,
the multiple input stream will offer you the next stream in the array.
A special form of composite document factory is obtained using {@link it.unimi.dsi.mg4j.document.ReplicatedDocumentFactory},
that allows one to compose sequentially the same document factory with itself a certain number of times.
Document collections and sequences
The {@link it.unimi.dsi.mg4j.document.InputStreamDocumentSequence}
This is the simplest kind of document sequence: it just breaks a single {@link java.io.InputStream} on the basis of a given
separator character; each piece of the stream is interpreted as the raw data corresponding to a document, and it is passed
to a factory (specified at construction time) for converting it into a {@link it.unimi.dsi.mg4j.document.Document}.
The {@link it.unimi.dsi.mg4j.document.FileSetDocumentCollection}
This kind of collection is built starting from a set of files in the file system. Each file is interpreted as a document, and
passed to a factory (specified at construction time). The suggested title for a document is the corresponding filename, and the
suggested URI is the URI of the file.
The {@link it.unimi.dsi.mg4j.document.ZipDocumentCollection} facility
There are cases in which one would like to turn a document sequence into a document collection. This may happen for one of the
following reasons:
- the sequence is, by its very nature, volatile (e.g., it is coming from standard input, and cannot be re-produced),
but we would like to make it into a resident non-volatile collection;
- the sequence is not amenable to be accessed at random;
- the documents in the sequence are difficult to parse, and it is not advisable to repeat the parsing process every
time they are accessed.
In all such cases, it may be advisable to produce a compact copy of the sequence that is easily and efficiently accessible at random.
To do this, one may use the {@link it.unimi.dsi.mg4j.document.ZipDocumentCollectionBuilder}, that takes a document sequence
and produces a "zipped clone" of the documents in the sequence: there are some mild limitations to the sequences that can
be used in this context, and the resulting collection is only a partial copy of the original one, but in most cases this is
sufficient for all indexing purposes. The builder will save two files: one contains the essential data concerning the zipped
collection, and the other contains the zipped version of the documents.
After this, the produced {@link it.unimi.dsi.mg4j.document.ZipDocumentCollection} may be used as any other collection.
| it.unimi.dsi.mg4j.index |
MG4J: Managing Gigabytes for Java
Index generation and access.
This package contains the classes that handle index generation and
access. The interval iterators defined in {@link it.unimi.dsi.mg4j.search}
build upon the classes of this package to provide answer to queries using interval semantics,
but it is also possible to access an index directly.
You can easily build indices using the tools in {@link it.unimi.dsi.mg4j.tool}. Once an index
has been built, it can be opened using an {@link it.unimi.dsi.mg4j.index.Index} object, which
gathers metadata that is necessary to access the index. You do not create an {@link it.unimi.dsi.mg4j.index.Index}
with a constructor: rather, you use the static factory {@link
it.unimi.dsi.mg4j.index.Index#getInstance(CharSequence)} (or one of its variants) to create an instance.
This is necessary so that different kind of indices can be treated transparently: for example, the factory
may return a {@link it.unimi.dsi.mg4j.index.cluster.IndexCluster} if the index is actually a cluster,
but you do not need to know that.
From an {@link it.unimi.dsi.mg4j.index.Index},
you can easily obtain either an {@link it.unimi.dsi.mg4j.index.IndexReader}, which allows to
scan sequentially or randomly the index. In turn from an {@link it.unimi.dsi.mg4j.index.IndexReader}
you can obtain a {@link it.unimi.dsi.mg4j.index.IndexIterator}
returning the documents containing a certain term and the position of the term within the document.
But there is more: an {@link it.unimi.dsi.mg4j.index.IndexIterator}
is a kind of {@link it.unimi.dsi.mg4j.search.DocumentIterator}, and
{@link it.unimi.dsi.mg4j.search.DocumentIterator}s can be combined in several ways
using the classes of the package {@link it.unimi.dsi.mg4j.search}: for instance, you can combine
document iterators using AND/OR. Note that you can combine document iterators on different
indices, but of course the operation is meaningful only if the two indices contain different information
about the same document collection (e.g., title and main text).
More importantly, if the index is full text (the default) for each document containing the term you can get
interval iterators that return intervals representing extents of text satisfying the query: for
instance, in case of an AND of two terms, the intervals will contain both terms.
Structure of an inverted index
An inverted index is made by a sequence of inverted lists (one inverted
list for each term). Inverted lists are made by document records: each
document record contains information about the occurrences of the term
within a certain document.
More precisely, each inverted list starts with a suitably encoded
integer, called the frequency, which is the number of document
records that will follow (i.e., the number of documents in which the term
appears). After that, there are exactly as many document records as the
frequency.
Each document record is made by two parts:
- a suitably encoded integer, called the (document) pointer,
which identifies the document within the collection;
- a (possibly empty) sequence of bits, called the data; the
data have no special structure per se: the only assumption is that they
are a self-delimiting bit sequence (i.e., one knows when the sequence is
over).
As a basic and fundamental implementation, the classes of this package provide methods
that write and read document data in a default form. In this default
structure, each document data is a suitable coding of a (strictly
increasing) sequence of integers, that correspond to the positions
where the term occurs within the document. The length of the sequence
(i.e., the number of positions in at which the term appears) is called the
count (it is also common to call it “within-document frequency”, but we find this
usage confusing).
| it.unimi.dsi.mg4j.index.cluster |
MG4J: Managing Gigabytes for Java
Index partitioning and clustering.
This package contains the classes that provide the infrastructure for index partitioning
and clustering. The tools the actually perform partitioning can be found in {@link it.unimi.dsi.mg4j.tool}.
An index cluster is a set of local indices that are viewed as a single index. In
a {@linkplain it.unimi.dsi.mg4j.index.cluster.LexicalCluster lexical cluster} each local index has a disjoint set of terms, but the document pointers contained
in each local index refer to the same documents. In a {@linkplain it.unimi.dsi.mg4j.index.cluster.DocumentalCluster documental cluster}
each index contains postings referring to a disjoint subset of a collection.
Clustering indices requires mapping term number and document pointers back and forth between the global
index and local indices. This mapping is provided by {@linkplain it.unimi.dsi.mg4j.index.cluster.DocumentalClusteringStrategy documental clustering strategies}
and {@linkplain it.unimi.dsi.mg4j.index.cluster.LexicalClusteringStrategy lexical clustering strategies}.
Clusters are often generated by partitioning an index (albeit, for instance,
{@link it.unimi.dsi.mg4j.tool.Scan} produces a cluster as output of the indexing process). In this case, a
{@linkplain it.unimi.dsi.mg4j.index.cluster.DocumentalPartitioningStrategy documental partitioning strategy}
or a {@linkplain it.unimi.dsi.mg4j.index.cluster.LexicalPartitioningStrategy lexical partitioning strategy}
explain how to divide and remap term numbers and document pointers. Of course, the clustering and partitioning
strategy must be suitably matched.
| it.unimi.dsi.mg4j.index.payload | | it.unimi.dsi.mg4j.index.remote |
MG4J: Managing Gigabytes for Java
Remote index classes.
Package Specification
The classes in this package are extremely experimental
and need considerable rework and optimisation to be actually useful. They
are distributed with MG4J because excising them from the MG4J structure is
rather complicated, and because they are fully usable: feedback is more
than welcome.
| it.unimi.dsi.mg4j.index.snowball |
MG4J: Managing Gigabytes for Java
Snowball-based {@linkplain it.unimi.dsi.mg4j.index.TermProcessor term processors}.
Package Specification
This package gathers a number of algorithmic stemmer (most notably,
the famous {@link it.unimi.dsi.mg4j.index.snowball.PorterStemmer})
automatically generated from scripts written in Snowball—a string
language developped by Martin Porter to describe easily algorithmic stemmers.
Please refer to the documentation on the Snowball site
for information about the stemmers.
The code in this package is distributed under the BSD License.
It is Copyright (C) 2001 Martin Porter, and (for the Java developments) Copyright (C) 2002 Richard Boulton.
| it.unimi.dsi.mg4j.index.wired | | it.unimi.dsi.mg4j.io |
MG4J: Managing Gigabytes for Java
Bit-level I/O classes.
Package Specification
The standard Java API lacks bit-level I/O classes: to this purpose, MG4J
provides {@link it.unimi.dsi.mg4j.io.InputBitStream} and {@link
it.unimi.dsi.mg4j.io.OutputBitStream}, which can wrap any standard Java
corresponding stream and make it work at the bit level; moreover, they
provide support for several useful formats (such as unary, binary, minimal
binary, γ, δ and Golomb encoding).
Compression can be achieved using self-delimiting formats supported by
the classes above, or also by arithmetic coding, using the
classes {@link it.unimi.dsi.mg4j.io.ArithmeticCoder} and {@link
it.unimi.dsi.mg4j.io.ArithmeticDecoder}. Note that arithmetic coding is
not very efficient in the present implementation, as it does not allow a
varying number of symbols.
Bit input and output streams offer also efficient buffering and a way to
reposition the bit stream in case the underlying byte stream is a
file-based stream or a {@link it.unimi.dsi.fastutil.io.RepositionableStream}.
Conventions
All coding methods work on natural numbers. The
encoding of zero is very natural for some techniques, and much less natural
for others. To keep methods rationally organised, all methods are able to
encode any natural number. If, for instance, you want to write positive
numbers in unary encoding and you do not want to waste a bit, you have to
decrement them first (i.e., instead of p you must encode
p−1).
| it.unimi.dsi.mg4j.query |
MG4J: Managing Gigabytes for Java
User interfaces for querying indices.
Most classes in this package are just suggestions or experiments. The
{@link it.unimi.dsi.mg4j.query.Query} tool lets you query one or more indices on
the command line or possibly using a browser. The displayed results can be linked
to actual data (files, collections, etc.) using items such as
{@link it.unimi.dsi.mg4j.query.GenericItem} or {@link it.unimi.dsi.mg4j.query.FileSystemItem}.
| it.unimi.dsi.mg4j.query.nodes |
MG4J: Managing Gigabytes for Java
Composite representation for queries
This package contains the classes that represent queries as
an abstract syntax tree, or, in design-pattern jargon, as a composite.
Warning: Before reading this overview, the reader is encouraged to consult the first part
of the {@link it.unimi.dsi.mg4j.search} package
documentation, where the notion of query is briefly presented.
The basic idea is to perform a complete decoupling of the query construction and
resolution into three levels:
- String level: the query is a string, written in a suitable language where, for example, every operator has a
certain syntax, uses a certain keyword etc.
- Tree level: the query is an abstract tree, that contains one internal node for each operator, but that has
nothing to do with the actual way the query was written; for example, one may want to give the user the possibility
to write the query by using a visual editor, in which case there will be some way to obtain the tree representation
of the query without having any string-level representation for it.
- Iterator level: the document iterator that lazily solves the query over a certain index or set of indices;
this is the semantic counterpart of a query, and it is discussed at large in the overview of the {@link it.unimi.dsi.mg4j.search}
package.
This package overview shows how to pass from the string to the tree level using the built-in
{@link it.unimi.dsi.mg4j.query.parser.QueryParser}, and explains what is needed to implement the tree level, as well as the basic instruments
needed to convert the tree level into the iterator level.
In MG4J, a query parser is an implementation of {@link it.unimi.dsi.mg4j.query.parser.QueryParser}:
essentially, an object with a method that transforms strings in {@linkplain it.unimi.dsi.mg4j.query.nodes.Query queries}.
You can use any parser you like, or build your queries programmatically.
The parser provided with MG4J and whose syntax is
described in the manual is currently generated using JavaCC: unfortunately, this tool
produces some public classes that somehow clutter the package
documentation. All the parsing logic is contained in the
{@link it.unimi.dsi.mg4j.query.parser.SimpleParser} class, which is generated by the
JavaCC source SimpleParser.jj.
Building a query tree out of a query string
MG4J allows one to perform queries on multiple indices; by saying so, we mean that
you may have many indices concerning the same document collection, and you want to perform
a query that may search on some of them. Think, for example, of a collection of emails.
You might have generated a number of indices, to index their subjects, sender, recipient(s),
body etc. All these indices can be thought of as individual entities, their only relation being
that they index collections with the same number of documents, and that same-numbered documents
conceptually "come" from the same source.
To get a query tree from a string representation, you must first build a {@link it.unimi.dsi.mg4j.query.parser.QueryParser} object.
For instance, if you use the built-in {@link it.unimi.dsi.mg4j.query.parser.SimpleParser} you have
to provide a map whose keys are index names (called index aliases) and
with each index alias mapped to the corresponding {@link it.unimi.dsi.mg4j.index.Index}.
Moreover, one of the aliases is taken to be the default index alias used for the queries.
In our example, we will assume that we have three index aliases (say, subject,
from and body), and that subject is the
default index alias.
The parser would then be created as follows (here, we are stipulating that index with alias
x has basename mail-x):
Object2ReferenceMap<String,Index> indexAlias2Index = new Object2ReferenceOpenHashMap<String,Index>();
Index defaultIndex;
indexAlias2Index.put( "subject", defaultIndex = Index.getInstance( "mail-subject" ) );
indexAlias2Index.put( "from", Index.getInstance( "mail-from" ) );
indexAlias2Index.put( "body", Index.getInstance( "mail-body" ) );
QueryParser parser = new QueryParser( indexAlias2Index.keySet(), "subject" );
after which you can use it for example as follows:
Query query = parser.parse( "meeting AND body:(schedule OR urgent)" );
DocumentIteratorBuilderVisitor visitor = new DocumentIteratorBuilderVisitor( indexAlias2Index, defaultIndex, 0 );
DocumentIterator it = query.accept( visitor );
while ( it.hasNext() ) {
System.out.println( "Document #: " + it.nextDocument() );
System.out.print( "\tPositions:" );
for ( Interval interval: it )
System.out.print( " " + interval );
System.out.println();
}
Structure of a {@link it.unimi.dsi.mg4j.query.nodes.Query}
As explained above, in this package a {@link it.unimi.dsi.mg4j.query.nodes.Query} is simply an abstract tree: its leaves will correspond to ground queries, whereas
internal nodes correspond to query operators.
Ground queries can be {@linkplain it.unimi.dsi.mg4j.query.nodes.Term term queries}, {@linkplain it.unimi.dsi.mg4j.query.nodes.Prefix prefix queries}, and
{@linkplain it.unimi.dsi.mg4j.query.nodes.MultiTerm multiterm queries} (the latter usually generated by some query-expansion mechanism).
Any other query is either a {@linkplain it.unimi.dsi.mg4j.query.nodes.Composite composite query} (e.g.,
a {@linkplain it.unimi.dsi.mg4j.query.nodes.And conjunctive query}, a {@linkplain it.unimi.dsi.mg4j.query.nodes.Or disjunctive query} etc.),
that is, a query composed by other subqueries, or it is obtained by some other operator (e.g.,
a {@linkplain it.unimi.dsi.mg4j.query.nodes.LowPass low-pass query} etc.).
Every query has a method {@link it.unimi.dsi.mg4j.query.nodes.Query#accept(it.unimi.dsi.mg4j.query.nodes.QueryBuilderVisitor)}
a {@link it.unimi.dsi.mg4j.query.nodes.QueryBuilderVisitor} object, through which you can visit the query tree.
More precisely, when the accept method is invoked on a query, a recursive visit of the query tree is performed: at every
node n of type Q (Q being any class implementing {@link it.unimi.dsi.mg4j.query.nodes.Query}),
the following steps are performed:
- the visitor's
visitPre(Q n) method is called; if the method returns false, the visit is
interrupted; otherwise…
- the subtrees are recursively visited, obtaining an array of results
T[] x ; if at a certain
point during this visit some call decides that the visit should be interrupted, we stop; otherwise…
- the visitor's
visitPost(Q n, T[] x) method is called, and the result (of type T )
is returned.
To make the idea of a visitor easier to understand, consider the following simple example of a visitor:
static class PrinterVisitor extends AbstractQueryBuilderVisitor<String> {
private static void appendArray( MutableString res, String t[], char sep ) {
for ( int i = 0; i < t.length - 1; i++ ) res.append( t[ i ] + sep );
res.append( t[ t.length - 1 ] );
}
public String[] newArray( int len ) { return new String[ len ]; }
public String visitPost( OrderedAnd node, String[] t ) {
MutableString res = new MutableString();
res.append( "OAND(" );
appendArray( res, t, ',' );
res.append( ")" );
return res.toString();
}
public String visitPost( Or node, String[] t ) {
MutableString res = new MutableString();
res.append( "OR(" );
appendArray( res, t, ',' );
res.append( ")" );
return res.toString();
}
public String visitPost( And node, String[] t ) {
MutableString res = new MutableString();
res.append( "AND(" );
appendArray( res, t, ',' );
res.append( ")" );
return res.toString();
}
public String visitPost( Consecutive node, String[] t ) {
MutableString res = new MutableString();
res.append( "\"" );
appendArray( res, t, ' ' );
res.append( "\"" );
return res.toString();
}
public String visitPost( Not node, String t ) {
return "NOT(" + t + ")";
}
public String visitPost( LowPass node, String t ) {
return t + "~" + node.k;
}
public String visitPost( Select node, String t ) {
return node.index + ":" + t;
}
public String visit( Term node ) { return node.term.toString(); }
public String visit( Prefix node ) { return node.prefix + "*"; }
}
Now, you can pass an instance of this visitor to a query, and as a result get a string (linear) representation
of the query itself. (Of course, there is a simple way to get the same result, that is, calling the toString()
method directly, but for this example is easy enough for illustrative purposes.)
Here is an example of how the above visitor can be used:
public static void main( String arg[] ) throws Exception {
Query qa = new Term( "a" );
Query qb = new Prefix( "b" );
Query qc = new Term( "c" );
Query qd = new Term( "d" );
Query q = new LowPass( new And( qa, new Select( "another_index", qb ), new Or( qc, qd ) ), 20 );
System.out.println( q.accept( new PrinterVisitor() ) );
}
that produces AND(a,another_index:b*,OR(c,d))~20.
Building a document iterator out of a query
The interface {@link it.unimi.dsi.mg4j.query.nodes.Query} and its implementations provide the basic classes
for the tree-level representation of a query. Building an iterator out of it can be thought of as a process of recursive
instantiation of a query tree into an iterator tree; this is actually a sort of a copy visit, and it is indeed implemented
as a special kind of visitor: the {@link it.unimi.dsi.mg4j.search.DocumentIteratorBuilderVisitor}.
You can create one such visitor and use it to visit a query tree, obtaining as a result a document iterator
that you can use to get the results.
To be more precise, let us recall the example presented in the last part of the {@linkplain it.unimi.dsi.mg4j.search overview of the search package}:
Index subjectIndex = Index.getInstance( "mail-subject" );
DocumentIterator it = AndDocumentIterator.getInstance(
subjectIndex.documents( "meeting" ),
subjectIndex.documents( "schedule" ),
subjectIndex.documents( "monday" )
);
while ( it.hasNext() ) {
System.out.println( "Document #: " + it.nextDocument() );
System.out.print( "\tPositions:" );
for ( Interval interval: it )
System.out.print( " " + interval );
System.out.println();
}
Here is an equivalent piece of code that uses the idea of decoupling the tree level from the iterator level.
Index subjectIndex = Index.getInstance( "mail-subject" );
Query q = new And( new Term( "meeting" ), new Term( "schedule" ), new Term( "monday" ) );
DocumentIteratorBuilderVisitor visitor = new DocumentIteratorBuilderVisitor( null, subjectIndex, 0 );
DocumentIterator it = q.accept( visitor );
while ( it.hasNext() ) {
System.out.println( "Document #: " + it.nextDocument() );
System.out.print( "\tPositions:" );
for ( Interval interval: it )
System.out.print( " " + interval );
System.out.println();
}
| it.unimi.dsi.mg4j.query.parser |
MG4J: Managing Gigabytes for Java
A simple JavaCC-generated parser used
by the {@link it.unimi.dsi.mg4j.query.Query} class. For details about the syntax,
please consult the {@link it.unimi.dsi.mg4j.search} package documentation. If you need
to know the relation between queries, syntax trees and document iterators please consult the
{@link it.unimi.dsi.mg4j.query.nodes} package documentation.
| it.unimi.dsi.mg4j.search |
MG4J: Managing Gigabytes for Java
Classes that compose {@linkplain
it.unimi.dsi.mg4j.search.DocumentIterator iterators over
documents}. Such iterators are returned, for instance, by {@link
it.unimi.dsi.mg4j.index.IndexReader#documents(int)}.
Minimal-interval semantics
MG4J provides minimal-interval semantics. That is, if the index
is full-text, a {@link it.unimi.dsi.mg4j.search.DocumentIterator} will provide a list of documents and, for
each document, a list of minimal intervals. This intervals denote ranges of
positions in the document that satisfy the iterator: for instance, if you
compose two documents iterators using an {@link
it.unimi.dsi.mg4j.search.AndDocumentIterator}, you will get as a result the
intersection of the document lists of the underlying iterators. Moreover,
for each document you will get the minimal set of intervals that contain
one interval both from the first iterators and from the second one.
This information is of course very useful if you're going to assign a
score to the document, as smaller intervals mean a more precise match. At
the basic level (e.g., iterators returned by an index), the intervals
returned upon a document are intervals of length one containing the term
that was used to generate the iterator. Intervals for compound iterators
are built in a natural way, preserving minimality. More details can be
found in Charles L. A. Clarke and Gordon V. Cormack, Shortest-Substring
Retrieval and Ranking (ACM Transactions on Information Systems,
vol. 18, no. 1, Jan 2000, pages 44−78). Scorers for documents may be
found in the {@link it.unimi.dsi.mg4j.search.score} package.
The algorithms used by classes in this package to compute minimal-interval
operators are new: details can be found
here.
Note that MG4J provides minimal-interval semantics for a set of
indices. This extension is a significant improvement over single-index
semantics. However, defining the exact meaning of a query is a nontrivial
problem that will be fully dealt with in a forthcoming paper.
Searching with minimal-interval semantics
The aim of this section is to provide a minimal insight of how minimal-interval semantics works,
and explain the basic syntax used by the {@link it.unimi.dsi.mg4j.query.Query} command-line tool.
In this section we shall try to discuss this issue only through examples; we shall later explain
how you can actually perform searches of this kind using MG4J.
Note that you do not have to understand the details of minimal-interval semantics to fruitfully
use MG4J. Several natural operators (ordered conjunction, proximity limitation, etc.) are computed
by MG4J very efficiently using minimal-interval semantics just under the hood.
MG4J solves queries on multiple indices; by saying so, we mean that
you may have many indices concerning the same document collection, and you want to perform
a query that may search on some of them. Think, for example, of a collection of emails:
you might have generated a number of indices, to index their subjects, sender, recipient(s),
body etc. All these indices can be thought of as individual entities, their only relation being
that they index collections with the same number of documents, and that same-numbered documents
conceptually "come" from the same source. The notion of multiple indices should not be new to
the reader that is familiar with the {@link it.unimi.dsi.mg4j.document} package.
In our examples, we will assume that we have three indices (say, subject,
from and body), and that subject is the
used as default. Be warned that the actual syntax of queries in this section is immaterial (even though
we shall stick to the syntax of {@link it.unimi.dsi.mg4j.query.parser.SimpleParser}).
Two different aspects should be taken into consideration when trying to determine which document actually match (i.e., satisfy)
the query:
- first, one can consider this in a purely Boolean (true/false) setting: thus a document may either satisfy the query or not; this is actually the
only information you can get for indices that do not contain positions;
- second, one can consider, for a document that matches the query in the above sense, which intervals (i.e., minimal
sequences of consecutive words within the document) actually witness the match; this information will be available
if the index contains positions.
In the following subsections, we shall give information about both kind of satisfiability.
Queries available on all indices
Simple queries
The simplest possible query consists in a single search term. The documents matching such a query
are exactly those that contain the given term, with respect to the default index. In our example, the query
meeting
will be matched by the documents (emails) that contain the term "meeting" in their subject. If you want,
you can perform the query on another index (different from the default one); thus, for example, the query
body: meeting
will be matched by the documents that contain the term "meeting" in their body. In both cases, the intervals
witnessing the match will be the single occurrences of the term "meeting" in the subject and in the body field, respectively.
Conjunctive queries
You can specify that more than one condition should be met in conjunction by using the
AND operator. For example:
meeting AND schedule
will be matched by those document whose subject contains both the term "meeting" and the term "schedule" (not
necessarily in this order). The witnesses will be minimal intervals in the subjects that contain both terms. For
example, if the subject was
schedule the meeting (should we schedule this meeting or not)?
then the above query will have three witnesses: "schedule the meeting", "meeting (should we schedule" and
"schedule this meeting".
The keyword AND can be subsituted with the symbol & or can even be
omitted. So the above query is equivalent to:
meeting & schedule
and to
meeting schedule
Also in this case, you can select a different index for the query to be matched. For example:
body: meeting schedule
(or, equivalently, body: meeting AND schedule or body: meeting & schedule)
will be matched by documents that contain the term "meeting" in their body and the term "schedule" in
their subject. In this case, witnesses come from different sources: a witness will be any single occurrence of
the word "meeting" in the body (there should be at least one to make the document match the query) and
any single occurrence of the word "schedule" in the subject (again, there should be at least one to make the
document match the query).
If you want both terms to be searched for in the body index, you can use:
body: meeting body: schedule
or, simply,
body: (meeting schedule)
Disjunctive queries
You can also introduce a disjunctive (OR) query, like in
meeting OR schedule
that will be matched by the documents that contain the term "meeting" or the term "schedule" (or both) in their
subject. A witness will then be every single occurrence of either word in the subject. The keyword OR can
be substituted with |, hence the previous query is equivalent to
meeting | schedule
Conjunctive and disjunctive operators can appear in the same query, with the rule that AND has higher priority
than OR. So, for example:
meeting AND schedule OR time
will be matched by documents whose subject contains both "meeting" and "schedule", and by documents
whose subject contains "time". In this case, a witness will either be a (possibly long) interval containing both the words
"meeting" and "schedule", or a one-word interval containing the word "time". If you want to change this behaviour, you should use parenthesis, like:
meeting AND (schedule OR time)
Again, you can use index selectors, like in:
body:meeting AND (schedule OR time)
that will be matched by documents containing "meeting" in their body (a witness being every single occurrence of the word in the body),
and "schedule" or "time" in their subject (a witness being every single occurrence of either word in the subject). Similarly:
body:(meeting AND (schedule OR time))
will match documents that contain "meeting" and either "schedule" or "time" (or both) in their body.
Negative (NOT) queries
You can specify that you want to exclude documents containing a certain term, or, more in general,
satisfying a certain query, by using the (unary, prefix) operator NOT. For example:
body:(meeting AND NOT tomorrow) AND subject:schedule
will be satistied by the emails that contain the term "schedule" in their subject, and the term "meeting"
but not the term "tomorrow" in their body. The operator NOT can be substituted with !, like in:
body:(meeting !tomorrow) subject:schedule
Negative queries are easily understood in a Boolean context, but may be more difficult as far as witnesses are
concerned. Basically, the implementation of NOT works in such a way that NOT is actually used only for the
Boolean match, but does not influence witnesses. In more detail, the only witness associated to a true
NOT query is an empty interval.
Prefix and multiterm queries
A prefix query is a simple query that is matched by all terms starting from the same nonempty prefix. For example:
govern*
is matched by all documents containing any word starting with "govern". For the prefix operator * to work, you
have to endow your index with a {@link it.unimi.dsi.mg4j.index.PrefixMap}. What really happens in this case is that the query
is essentially expanded into a disjunction that contains all the words in the dictionary that start with "govern".
To be true, the expansion of a prefix query does not really lead to an OR, but rather to what we call a multiterm query:
a multiterm query is like an OR, but it can only contain terms as subquery, and behaves under many respects like a single term.
It is unusual to specify manually a multiterm query—rather, some query expansion mechanism like prefixes should
be used, but if you want to try manually, a multiterm query can be obtained using the
+ operator. For example:
house + houses + housing
is a correct multiterm query, and it is loosely equivalent to
house OR houses OR housing
Note, however, that trying to use + instead of OR does not work if the subqueries are not simple queries, or if they
concern different indices. For example:
house + title:meeting
would produce an error.
You may wonder why multiterm queries are needed, if they are essentially the same as OR queries. The first answer is efficiency: a multiterm
query should be more efficient than an OR query.
The second answer is more subtle, and has to do with scorers. A scorer is a way to assign a score to a document that satisfies a query. Many
scorers actually work by summing up suitable partial scores that depend on the document and on one of the terms in the query. Such
partial scores are often function of the count (number of times the term appears in the document) and on the frequency (number of
documents where the term appears), and they are often really high when the term has a low frequency. The idea behind this is that
if I write:
computer OR methacrylic
a document that satisfies the query because it contains "methacrylic" is more valuable than one that contains the word "computer", being the former
much more infrequent.
Nonetheless, trying to use these scorers on automatically expanded queries may lead to many problems. For example, suppose you expanded
govern*
as
government OR governance OR governor OR governing
(we are here assuming that the four terms above are the only ones that appear in the dictionary and start with "govern"). Now, since
"governance" is presumably much rarer than "government", we expect all documents containing only "governance" to be given a high score.
using
government + governance + governor + governing
the scorer acts on this bunch of words as a whole, and the frequency is assumed to be the maximum frequency (hence, it is the same
for all words), avoiding the "governance"-prevalence problem.
Queries available on indices with positions
Ordered conjunctive queries
The operator of ordered conjunction < works like AND, but requires the subqueries to be satisfied in
the exact order in which they are specified, even though not necessarily consecutively. For example:
meeting < schedule
will only be matched by documents that contain in their subject at least one occurrence of the word "meeting" followed (maybe not immediately)
by at least one occurrence of the word schedule. Again, for
example, if the subject was
schedule the meeting (should we schedule this meeting or not)?
then the above query will have only one witness: "meeting (should we schedule"; the
other two minimal intervals that contain both words ("schedule the meeting" and
"schedule this meeting") are not witnesses because words appear in the wrong order.
Note that the ordering between witnesses is strict: for instance, the query
meeting < meeting
has as only witness "meeting (should we schedule meeting". The single word "meeting"
alone is not a witness for the query.
In this case, it makes no sense (and it is indeed forbidden) to select a different index for the subqueries to be matched.
Consecutivity (phrasal queries)
You can specify that you want that some terms appear consecutively by using " (quotes). For example:
"meeting schedule"
will be matched if the terms "meeting" and "schedule" appear in this order, and consecutively, in the subject.
Inside quotes, you can also use subqueries, surrounding them with parenthesis, like in:
"meeting (schedule OR time)"
that is matched by documents whose subject contains the term "meeting" followed by either "schedule" or
"time". A witness will this time be necessarily an interval of exactly two words (the first being "meeting"
and the second being either "schedule" or "time").
More precisely, the quotes operators is satisfied if there is a sequence
of consecutive witnesses, with each witness coming from a different
subquery, in the same order in which the queries appear.
Note that
"meeting schedule OR time"
would be invalid: if you want to use operators within quotes, you should do so between parenthesis.
Moreover, within quotes you cannot change index. So you can say
body:"meeting schedule"
but you cannot use
"body:meeting subject:schedule"
The symbol $ (dollar) can be used to specify an arbitrary word in a consecutive query. For instance,
meeting $ schedule
will match "meeting our schedule" as well as "meeting my schedule". You can add dollars also at the start of
a phrase, but not at the end (in the latter case, they will be ignored).
Proximity limit
As we have discussed, when a document matches a given query, there will be one or more witnesses within the
document. Each such witness is a consecutive sequence of positions in the document that witness the matching. For
example, consider the query
body:((meeting schedule) OR "John Smith") OR subject:alarm
This query will be matched by documents that contain the term "alarm" in their subject, and by documents that contain
either the terms "meeting" and "schedule" or the (exact) sentence "John Smith" in their body.
For every document that matches the query, there will be two sets of matching intervals, one about the body and
the other about the subject; at least one of these two sets will be nonempty (because of the OR keyword).
Intervals concerning the subject will simply be intervals of length one that correspond to the positions where the term
"alarm" appears in the subject.
Intervals concerning the body will be either intervals of length two corresponding to the positions where the sentence
"John Smith" appears in the body, or intervals of length two or more where both "meeting" and "schedule" appear.
You might want to accept only matching intervals up to a certain length; for example, suppose you don't want
to take into considerations intervals that contain "meeting" and "schedule" too far apart, say at a distance greater than
10 words. You can do this by using the proximity limit operator ~. Just rewrite the previous query as
body:((meeting schedule)~10 OR "John Smith") OR subject:alarm
This way, you are simply discarding the matching intervals that contain the terms "meeting" and "schedule" if their
length (number of words) is greater than 10 (i.e., if "meeting" and "schedule" are separated by more than 8 words).
The proximity limit operator can be used at any point, and limits the length of all matching intervals of the
query it is applied to. Note, however, that it may only be used on full-text indices.
Difference
The Brouwerian difference operator is specified using - (minus). It is a rather
esoteric operator that is rarely met by the end user, and that, given two subqueries, kills the
witnesses of the first query (the minuend) that contain one or more witnesses of the second query (the subtrahend).
By definition, for documents that satisfy the minuend, but not the subtrahend, the witnesses are unchanged. For
instance, the following query
schedule < meeting - this
will be matched only if the term "schedule" and the term "meeting" appear in this order without the term "this" inbetween.
If the subject is
schedule the meeting (should we schedule this meeting or not)?
the only valid witness is "schedule this meeting", and indeed, the following query
schedule < meeting - (this | the)
will not match at all the subject above, as all witnesses of the minuend are killed by witnesses of the subtrahend.
As an additional feature, you can specify a left and a right margin that will be used to enlarge the intervals of the subtrahend. For instance,
"schedule < meeting - [[1,2]] this"
will kill intervals of the minuend that contain the whole fragment "schedule this meeting or" (so no interval will be killed at all).
Queries available on payload-based indices
Actually, the atomic queries discussed above (term, prefix, etc.) can be used with standard indices, that is,
indices of fields containing text. For payload-based indices, which represent document metadata such as dates,
the standard query available in MG4J is a range query in which the first and last valid values are specified by
the user. The resulting query is satisfying by all documents whose field is in the range. Both the first and the last value can be omitted.
for instance, the following query
date:[ 20/2/2007 .. 23/2/2007 ]
will search for documents between 20 February and 23 February 2007, inclusive, whereas the query
date:[ .. 23/2/2007 ]
will search for documents up to 23 February 2007. Note that in the built-in parser
spaces are necessary.
They make it possible to separate the different tokens composing the query.
Range queries must not be used as a generic query mechanism, but rather to refine the result of
a query over document content: a ranked query composed uniquely by a range query will have to scan the whole payload-based index
just to return a few results.
Building and composing document iterators
The {@link it.unimi.dsi.mg4j.search} package contains all the classes needed to build a query
and to match it against a certain collection of indices. This is actually only the semantic
counterpart to a query; for the syntactic aspects, please refer to the {@link it.unimi.dsi.mg4j.query.nodes} package.
Basic classes
An {@link it.unimi.dsi.mg4j.search.Interval} represents a consecutive set of natural numbers, that is, a witness within a document
(in this case, numbers represent the positions within a document: 0 is the position of the first word,
1 is the position of the second and so on). An {@link it.unimi.dsi.mg4j.search.IntervalIterator} is an
iterator that returns intervals: typically, an interval iterator will return all intervals witnessing
a certain query for a certain document (and a certain index).
For example, the query
body:((meeting schedule)~10 OR "John Smith") OR subject:alarm
will give rise to an interval iterator for the body and an interval iterator for the subject: the former
will return intervals within the body witnessing the first part of the query, and the latter will return the intervals
the intervals witnessing the second part of the query. Note that even upon a matching document either iterator
may actually return no interval (because the overall query is disjunctive); nonetheless, the two iterators
cannot be both empty.
It is always understood that intervals are returned in increasing order (of their left, or equivalently right, extreme).
A {@link it.unimi.dsi.mg4j.search.DocumentIterator} is used to scan a whole collection of indices
for a query. At every given moment, the iterator will be able to return the next document matching the query,
and, for full-text indices, you will also be able to {@linkplain it.unimi.dsi.mg4j.search.DocumentIterator#intervalIterator(it.unimi.dsi.mg4j.index.Index)
get the interval iterators of the witnesses for that document and for a specific index}.
Obtaining and composing document iterators
The simplest kind of {@link it.unimi.dsi.mg4j.search.DocumentIterator} you can build is an {@link it.unimi.dsi.mg4j.index.IndexIterator}: it
is a document iterator that scans a specific index for a specific term. You don't actually build an index iterator directly, but
you rather obtain one by calling the {@link it.unimi.dsi.mg4j.index.Index#documents(CharSequence)} (or, equivalently,
{@link it.unimi.dsi.mg4j.index.IndexReader#documents(CharSequence)}) method, that returns the
set of documents containing a given term (and witnesses will be the single occurrences of such term).
Hence, for example, the following snippet opens a full-text index whose basename is mail-subject, and
prints out all documents containing the word "meeting", each with the sequence of positions where the word
appears (all intervals will be actually singletons). (Note that a document iterator over a single index
is itself iterable, and the {@link it.unimi.dsi.mg4j.search.DocumentIterator#iterator()} method is actually an alias
for {@link it.unimi.dsi.mg4j.search.DocumentIterator#intervalIterator()}).
Index subjectIndex = Index.getInstance( "mail-subject" );
DocumentIterator it = subjectIndex.documents( "meeting" );
while ( it.hasNext() ) {
System.out.println( "Document #: " + it.nextDocument() );
System.out.print( "\tPositions:" );
for ( Interval interval: it )
System.out.print( " " + interval );
System.out.println();
}
A number of classes in this package can be used to compose iterators; more precisely, for each
query operator discussed above there is a corresponding class in this package. Each such class has
a factory method that allows one to build new document iterators by composing existing iterators.
For example, the following snippet shows how to search for mails containing the words "meeting",
"schedule" and "monday".
Index subjectIndex = Index.getInstance( "mail-subject" );
DocumentIterator it = AndDocumentIterator.getInstance(
subjectIndex.documents( "meeting" ),
subjectIndex.documents( "schedule" ),
subjectIndex.documents( "monday" )
);
while ( it.hasNext() ) {
System.out.println( "Document #: " + it.nextDocument() );
System.out.print( "\tPositions:" );
for ( Interval interval: it )
System.out.print( " " + interval );
System.out.println();
}
The following table shows the correspondence between query operators and composition classes:
Operator | Class |
AND & (conjunction) | {@link it.unimi.dsi.mg4j.search.AndDocumentIterator} |
OR | (disjunction) | {@link it.unimi.dsi.mg4j.search.OrDocumentIterator} |
NOT ! (negation) | {@link it.unimi.dsi.mg4j.search.NotDocumentIterator} |
+ (multiterm) | {@link it.unimi.dsi.mg4j.index.MultiTermIndexIterator} |
"..." (phrase) | {@link it.unimi.dsi.mg4j.search.ConsecutiveDocumentIterator} |
< (ordered conjunction) | {@link it.unimi.dsi.mg4j.search.OrderedAndDocumentIterator} |
~ (proximity) | {@link it.unimi.dsi.mg4j.search.LowPassDocumentIterator} |
- (difference) | {@link it.unimi.dsi.mg4j.search.DifferenceDocumentIterator} |
[ .. ] (range) | {@link it.unimi.dsi.mg4j.search.PayloadPredicateDocumentIterator} |
Note, however, that {@link it.unimi.dsi.mg4j.search.PayloadPredicateDocumentIterator} is actually a completely generic
predicate-based class that just returns documents whose payload satisfis a predicate.
Queries and document iterators
Even though it is perfectly legal to build document iterators by using these classes directly, this is not the
natural way to do that. One should rather build a syntactic object corresponding to a query, and then make it
into a document iterator that is, in some sense, the semantic counterpart of the query itself. To have more
information about how this works exaclty, please consult the overview of the {@link it.unimi.dsi.mg4j.query.nodes} package.
| it.unimi.dsi.mg4j.search.score |
MG4J: Managing Gigabytes for Java
Classes for assigning scores to documents.
The content of this package has changed significantly in MG4J 1.1 (hopefully for the better).
A {@link it.unimi.dsi.mg4j.search.score.Scorer} is an
object that wraps an underlying {@link it.unimi.dsi.mg4j.search.DocumentIterator} and assigns scores to the documents
returned by the underlying iterator. In general, once a scorer has
{@linkplain it.unimi.dsi.mg4j.search.score.Scorer#wrap(it.unimi.dsi.mg4j.search.DocumentIterator) wrapped a document iterator}
one just calls {@link it.unimi.dsi.fastutil.ints.IntIterator#nextInt() nextInt()} and
{@link it.unimi.dsi.mg4j.search.score.Scorer#score()} to get scored documents (some iterators
might support {@linkplain it.unimi.dsi.mg4j.search.score.Scorer#score(it.unimi.dsi.mg4j.index.Index) index-restricted scoring},
but this is optional).
If the scorer is a {@link it.unimi.dsi.mg4j.search.score.DelegatingScorer}, then by contract
it just delegates all {@link it.unimi.dsi.fastutil.ints.IntIterator}'s
methods to the underlying iterator.
In this case, it is possible to advance manually the underlying iterator and call
{@link it.unimi.dsi.mg4j.search.score.Scorer#score()}. While this behaviour is useless for
general users, it is essential for {@linkplain it.unimi.dsi.mg4j.search.score.AbstractAggregator aggregated scorers},
which combine several delegating scorers and provide services such as equalisation
and interval caching (in case more than one component scorer uses intervals). See, for instance,
{@link it.unimi.dsi.mg4j.search.score.LinearAggregator}.
| it.unimi.dsi.mg4j.search.visitor |
MG4J: Managing Gigabytes for Java
Visitors for composite {@linkplain it.unimi.dsi.mg4j.search.DocumentIterator document iterators}.
Composites and visitors
A {@link it.unimi.dsi.mg4j.search.DocumentIterator}
(in particular, those provided by MG4J in the package {@link it.unimi.dsi.mg4j.search})
is usually structured as a composite,
with operators as internal nodes and {@link it.unimi.dsi.mg4j.index.IndexIterator}s
as leaves. A composite can be explored using a visitor: thus,
the {@link it.unimi.dsi.mg4j.search.DocumentIterator} interface provides two methods,
{@link it.unimi.dsi.mg4j.search.DocumentIterator#accept(it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor) accept(DocumentIteratorVisitor)} and
{@link it.unimi.dsi.mg4j.search.DocumentIterator#acceptOnTruePaths(it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor) acceptOnTruePaths(DocumentIteratorVisitor)},
that let a {@link it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor} visit the composite structure.
A {@link it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor} provides methods
for visiting in {@linkplain it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor#visitPre(it.unimi.dsi.mg4j.search.DocumentIterator) preorder}
and in {@linkplain it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor#visitPost(it.unimi.dsi.mg4j.search.DocumentIterator) postorder} all internal nodes.
Leaves have a {@linkplain it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor#visit(it.unimi.dsi.mg4j.index.IndexIterator) single visit method} instead. Note that a {@link it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor}
must be (re)usable after each call
to {@link it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor#prepare() prepare()}.
The abstract class {@link it.unimi.dsi.mg4j.search.visitor.AbstractDocumentIteratorVisitor} provides
stubs implementing internal visits and {@link it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor#prepare() prepare()}
as no-ops.
Counting term occurrences
An example of the utility of visitors for document iterators is given by term counting:
using a number of coordinated visitors, it is possible to compute
a count for each term appearing in a (no matter how complex) query. The count can be used as
an input for counting-based scoring schemes, such as BM25 or cosine-based measures. For more information,
please read the documentation of {@link it.unimi.dsi.mg4j.search.visitor.CounterCollectionVisitor}.
| it.unimi.dsi.mg4j.test | | it.unimi.dsi.mg4j.tool |
MG4J: Managing Gigabytes for Java
Line-command tools for index construction.
The classes in this package
contain a main method, and can be used to build indices starting from a
document sequence. Please have a look at the MG4J manual to learn
how to build an index.
| it.unimi.dsi.mg4j.util |
MG4J: Managing Gigabytes for Java
General-purpose utility classes.
Package Specification
The classes in this package are completely general utility classes that
may be useful outside of MG4J. In particular, {@link
it.unimi.dsi.mg4j.util.MutableString} is a highly-tuned, versatile class
for content-based mutable strings.
Another general-purpose important component is {@link
it.unimi.dsi.mg4j.util.MinimalPerfectHash}, which implements
state-of-the-art, hypergraph-based algorithms for the construction of
ordered minimal perfect hash tables.
If minimal perfect hash tables are too large, a good alternative is
an {@link it.unimi.dsi.mg4j.util.ImmutableExternalPrefixDictionary}, which keeps most data in compressed
form on disk, and uses a small in-memory index to access those data.
| it.unimi.dsi.mg4j.util.parser |
MG4J: Managing Gigabytes for Java
A fast, lightweight, on-demand (X)HTML parser.
| it.unimi.dsi.mg4j.util.parser.callback |
MG4J: Managing Gigabytes for Java
Callbacks for the {@link it.unimi.dsi.mg4j.util.parser.BulletParser}.
| test.it.unimi.dsi.mg4j | | test.it.unimi.dsi.mg4j.document | | test.it.unimi.dsi.mg4j.index | | test.it.unimi.dsi.mg4j.index.cluster | | test.it.unimi.dsi.mg4j.index.payload | | test.it.unimi.dsi.mg4j.query.nodes | | test.it.unimi.dsi.mg4j.search | | test.it.unimi.dsi.mg4j.search.score | | test.it.unimi.dsi.mg4j.tool | | test.it.unimi.dsi.mg4j.util | |
|