MG4J: Managing Gigabytes for Java
Classes that compose {@linkplain
it.unimi.dsi.mg4j.search.DocumentIterator iterators over
documents}. Such iterators are returned, for instance, by {@link
it.unimi.dsi.mg4j.index.IndexReader#documents(int)}.
Minimal-interval semantics
MG4J provides minimal-interval semantics. That is, if the index
is full-text, a {@link it.unimi.dsi.mg4j.search.DocumentIterator} will provide a list of documents and, for
each document, a list of minimal intervals. These intervals denote ranges of
positions in the document that satisfy the iterator: for instance, if you
compose two document iterators using an {@link
it.unimi.dsi.mg4j.search.AndDocumentIterator}, you will get as a result the
intersection of the document lists of the underlying iterators. Moreover,
for each document you will get the minimal set of intervals that contain
one interval both from the first iterator and from the second one.
This information is of course very useful if you're going to assign a
score to the document, as smaller intervals mean a more precise match. At
the basic level (e.g., iterators returned by an index), the intervals
returned upon a document are intervals of length one containing the term
that was used to generate the iterator. Intervals for compound iterators
are built in a natural way, preserving minimality. More details can be
found in Charles L. A. Clarke and Gordon V. Cormack, Shortest-Substring
Retrieval and Ranking (ACM Transactions on Information Systems,
vol. 18, no. 1, Jan 2000, pages 44−78). Scorers for documents may be
found in the {@link it.unimi.dsi.mg4j.search.score} package.
The algorithms used by classes in this package to compute minimal-interval
operators are new: details can be found
here.
Note that MG4J provides minimal-interval semantics for a set of
indices. This extension is a significant improvement over single-index
semantics. However, defining the exact meaning of a query is a nontrivial
problem that will be fully dealt with in a forthcoming paper.
Searching with minimal-interval semantics
The aim of this section is to provide some insight into how minimal-interval semantics works,
and explain the basic syntax used by the {@link it.unimi.dsi.mg4j.query.Query} command-line tool.
In this section we shall try to discuss this issue only through examples; we shall later explain
how you can actually perform searches of this kind using MG4J.
Note that you do not have to understand the details of minimal-interval semantics to fruitfully
use MG4J. Several natural operators (ordered conjunction, proximity limitation, etc.) are computed
by MG4J very efficiently using minimal-interval semantics under the hood.
MG4J solves queries on multiple indices; by saying so, we mean that
you may have many indices concerning the same document collection, and you want to perform
a query that may search on some of them. Think, for example, of a collection of emails:
you might have generated a number of indices, to index their subjects, sender, recipient(s),
body etc. All these indices can be thought of as individual entities, their only relation being
that they index collections with the same number of documents, and that same-numbered documents
conceptually "come" from the same source. The notion of multiple indices should not be new to
the reader that is familiar with the {@link it.unimi.dsi.mg4j.document} package.
In our examples, we will assume that we have three indices (say, subject,
from and body), and that subject is the
one used as default. Be warned that the actual syntax of queries in this section is immaterial (even though
we shall stick to the syntax of {@link it.unimi.dsi.mg4j.query.parser.SimpleParser}).
Two different aspects should be taken into consideration when trying to determine which documents actually match (i.e., satisfy)
the query:
- first, one can consider this in a purely Boolean (true/false) setting: thus a document may either satisfy the query or not; this is actually the
only information you can get for indices that do not contain positions;
- second, one can consider, for a document that matches the query in the above sense, which intervals (i.e., minimal
sequences of consecutive words within the document) actually witness the match; this information will be available
if the index contains positions.
In the following subsections, we shall give information about both kinds of satisfiability.
Queries available on all indices
Simple queries
The simplest possible query consists in a single search term. The documents matching such a query
are exactly those that contain the given term, with respect to the default index. In our example, the query
meeting
will be matched by the documents (emails) that contain the term "meeting" in their subject. If you want,
you can perform the query on another index (different from the default one); thus, for example, the query
body: meeting
will be matched by the documents that contain the term "meeting" in their body. In both cases, the intervals
witnessing the match will be the single occurrences of the term "meeting" in the subject and in the body field, respectively.
Conjunctive queries
You can specify that more than one condition should be met in conjunction by using the
AND operator. For example:
meeting AND schedule
will be matched by those documents whose subject contains both the term "meeting" and the term "schedule" (not
necessarily in this order). The witnesses will be minimal intervals in the subjects that contain both terms. For
example, if the subject was
schedule the meeting (should we schedule this meeting or not)?
then the above query will have three witnesses: "schedule the meeting", "meeting (should we schedule" and
"schedule this meeting".
The keyword AND can be substituted with the symbol & or can even be
omitted. So the above query is equivalent to:
meeting & schedule
and to
meeting schedule
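To make the notion of witness for a conjunction more concrete, the following sketch computes the minimal intervals containing two distinct terms in a subject that has already been split into words. It is a plain-Java illustration and not the algorithm used by MG4J: the class MinimalAndIntervals and its method are hypothetical names, and positions are numbered from zero.

```java
import java.util.ArrayList;
import java.util.List;

class MinimalAndIntervals {
    /**
     * Returns the minimal intervals {left, right} (inclusive word positions)
     * that contain both terms. The two terms are assumed to be different.
     */
    static List<int[]> minimalIntervals(String[] words, String a, String b) {
        List<int[]> result = new ArrayList<>();
        int last = -1; // position of the most recent occurrence of either term
        for (int i = 0; i < words.length; i++) {
            if (!words[i].equals(a) && !words[i].equals(b)) continue;
            // Two adjacent occurrences of different terms delimit a minimal interval:
            // no occurrence of either term lies strictly between them.
            if (last != -1 && !words[last].equals(words[i])) result.add(new int[] { last, i });
            last = i;
        }
        return result;
    }

    public static void main(String[] args) {
        // The example subject, with punctuation removed.
        String[] subject = "schedule the meeting should we schedule this meeting or not".split(" ");
        for (int[] w : minimalIntervals(subject, "meeting", "schedule"))
            System.out.println(w[0] + ".." + w[1]);
    }
}
```

On the example subject the method returns the intervals 0..2, 2..5 and 5..7, which correspond to the three witnesses listed above.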
Also in this case, you can select a different index for the query to be matched. For example:
body: meeting schedule
(or, equivalently, body: meeting AND schedule or body: meeting & schedule)
will be matched by documents that contain the term "meeting" in their body and the term "schedule" in
their subject. In this case, witnesses come from different sources: a witness will be any single occurrence of
the word "meeting" in the body (there should be at least one to make the document match the query) and
any single occurrence of the word "schedule" in the subject (again, there should be at least one to make the
document match the query).
If you want both terms to be searched for in the body index, you can use:
body: meeting body: schedule
or, simply,
body: (meeting schedule)
Disjunctive queries
You can also introduce a disjunctive (OR) query, like in
meeting OR schedule
that will be matched by the documents that contain the term "meeting" or the term "schedule" (or both) in their
subject. A witness will then be every single occurrence of either word in the subject. The keyword OR can
be substituted with |, hence the previous query is equivalent to
meeting | schedule
Conjunctive and disjunctive operators can appear in the same query, with the rule that AND has higher priority
than OR. So, for example:
meeting AND schedule OR time
will be matched by documents whose subject contains both "meeting" and "schedule", and by documents
whose subject contains "time". In this case, a witness will either be a (possibly long) interval containing both the words
"meeting" and "schedule", or a one-word interval containing the word "time". If you want to change this behaviour, you should use parentheses, like:
meeting AND (schedule OR time)
Again, you can use index selectors, like in:
body:meeting AND (schedule OR time)
that will be matched by documents containing "meeting" in their body (a witness being every single occurrence of the word in the body),
and "schedule" or "time" in their subject (a witness being every single occurrence of either word in the subject). Similarly:
body:(meeting AND (schedule OR time))
will match documents that contain "meeting" and either "schedule" or "time" (or both) in their body.
Negative (NOT) queries
You can specify that you want to exclude documents containing a certain term, or, more generally,
satisfying a certain query, by using the (unary, prefix) operator NOT. For example:
body:(meeting AND NOT tomorrow) AND subject:schedule
will be satisfied by the emails that contain the term "schedule" in their subject, and the term "meeting"
but not the term "tomorrow" in their body. The operator NOT can be substituted with !, like in:
body:(meeting !tomorrow) subject:schedule
Negative queries are easily understood in a Boolean context, but may be more difficult as far as witnesses are
concerned. Basically, the implementation of NOT works in such a way that NOT is actually used only for the
Boolean match, but does not influence witnesses. In more detail, the only witness associated to a true
NOT query is an empty interval.
Prefix and multiterm queries
A prefix query is a simple query that is matched by all terms starting with the same nonempty prefix. For example:
govern*
is matched by all documents containing any word starting with "govern". For the prefix operator * to work, you
have to endow your index with a {@link it.unimi.dsi.mg4j.index.PrefixMap}. What really happens in this case is that the query
is essentially expanded into a disjunction that contains all the words in the dictionary that start with "govern".
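The expansion step can be pictured with a sorted dictionary. The sketch below is only an illustration under our own assumptions: it uses a java.util.TreeSet as the dictionary rather than a {@link it.unimi.dsi.mg4j.index.PrefixMap}, and the class PrefixExpansion is a hypothetical name.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

class PrefixExpansion {
    /** Returns the dictionary terms that start with the given prefix, in sorted order. */
    static SortedSet<String> expand(TreeSet<String> dictionary, String prefix) {
        // In a sorted set, the terms starting with the prefix form the contiguous
        // range [prefix, prefix + '\uffff'). A term containing '\uffff' right after
        // the prefix would be missed; we ignore that corner case here.
        return dictionary.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeSet<String> dictionary = new TreeSet<>(Arrays.asList(
            "goat", "governance", "governing", "government", "governor", "house"));
        // Prints the terms joined as a disjunction.
        System.out.println(String.join(" OR ", expand(dictionary, "govern")));
    }
}
```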
To be precise, the expansion of a prefix query does not really lead to an OR, but rather to what we call a multiterm query:
a multiterm query is like an OR, but it can only contain terms as subqueries, and behaves in many respects like a single term.
It is unusual to specify manually a multiterm query—rather, some query expansion mechanism like prefixes should
be used, but if you want to try manually, a multiterm query can be obtained using the
+ operator. For example:
house + houses + housing
is a correct multiterm query, and it is loosely equivalent to
house OR houses OR housing
Note, however, that trying to use + instead of OR does not work if the subqueries are not simple queries, or if they
concern different indices. For example:
house + title:meeting
would produce an error.
You may wonder why multiterm queries are needed, if they are essentially the same as OR queries. The first answer is efficiency: a multiterm
query should be more efficient than an OR query.
The second answer is more subtle, and has to do with scorers. A scorer is a way to assign a score to a document that satisfies a query. Many
scorers actually work by summing up suitable partial scores that depend on the document and on one of the terms in the query. Such
partial scores are often functions of the count (number of times the term appears in the document) and of the frequency (number of
documents where the term appears), and they are often really high when the term has a low frequency. The idea behind this is that
if I write:
computer OR methacrylic
a document that satisfies the query because it contains "methacrylic" is more valuable than one that contains the word "computer", since the former
term is much rarer.
Nonetheless, trying to use these scorers on automatically expanded queries may lead to many problems. For example, suppose you expanded
govern*
as
government OR governance OR governor OR governing
(we are here assuming that the four terms above are the only ones that appear in the dictionary and start with "govern"). Now, since
"governance" is presumably much rarer than "government", we expect all documents containing only "governance" to be given a high score.
Instead, using
government + governance + governor + governing
the scorer acts on this bunch of words as a whole, and the frequency is assumed to be the maximum frequency (hence, it is the same
for all words), avoiding the "governance"-prevalence problem.
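The effect can be seen with a toy computation. The sketch below is not one of the scorers of the {@link it.unimi.dsi.mg4j.search.score} package: the weight function (a logarithmic inverse document frequency), the class name MultitermFrequency and all the numbers are assumptions made up for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class MultitermFrequency {
    /** A hypothetical frequency-based weight: the rarer the term, the higher the weight. */
    static double weight(int numberOfDocuments, int frequency) {
        return Math.log((double) numberOfDocuments / frequency);
    }

    /** The frequency attributed to a multiterm query as a whole: the maximum over its terms. */
    static int multitermFrequency(Map<String, Integer> frequencies) {
        return Collections.max(frequencies.values());
    }

    public static void main(String[] args) {
        int numberOfDocuments = 10000; // invented collection size
        Map<String, Integer> frequencies = new LinkedHashMap<>(); // invented frequencies
        frequencies.put("government", 5000);
        frequencies.put("governance", 50);
        frequencies.put("governor", 800);
        frequencies.put("governing", 300);

        // With OR, each term is weighted by its own frequency.
        for (Map.Entry<String, Integer> e : frequencies.entrySet())
            System.out.println(e.getKey() + ": " + weight(numberOfDocuments, e.getValue()));

        // With +, all terms share one frequency, hence one weight.
        System.out.println("multiterm: " + weight(numberOfDocuments, multitermFrequency(frequencies)));
    }
}
```

With these numbers "governance" alone would weigh far more than "government" under OR, whereas the multiterm query gives every term the weight computed from the largest frequency.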
Queries available on indices with positions
Ordered conjunctive queries
The operator of ordered conjunction < works like AND, but requires the subqueries to be satisfied in
the exact order in which they are specified, even though not necessarily consecutively. For example:
meeting < schedule
will only be matched by documents that contain in their subject at least one occurrence of the word "meeting" followed (maybe not immediately)
by at least one occurrence of the word "schedule". Again, for
example, if the subject was
schedule the meeting (should we schedule this meeting or not)?
then the above query will have only one witness: "meeting (should we schedule"; the
other two minimal intervals that contain both words ("schedule the meeting" and
"schedule this meeting") are not witnesses because words appear in the wrong order.
Note that the ordering between witnesses is strict: for instance, the query
meeting < meeting
has as only witness "meeting (should we schedule this meeting". The single word "meeting"
alone is not a witness for the query.
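Both examples can be reproduced with a small sketch that scans a tokenized subject once. It is a plain-Java illustration restricted to two single terms, not the algorithm implemented in this package; the class OrderedAndWitnesses is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

class OrderedAndWitnesses {
    /**
     * Returns the minimal intervals {left, right} (inclusive word positions) in which
     * an occurrence of first strictly precedes an occurrence of second.
     */
    static List<int[]> witnesses(String[] words, String first, String second) {
        List<int[]> result = new ArrayList<>();
        int lastFirst = -1; // latest occurrence of first not yet closed by an occurrence of second
        for (int i = 0; i < words.length; i++) {
            // Test for second before recording first, so that the order is strict
            // even when the two terms coincide.
            if (words[i].equals(second) && lastFirst != -1) {
                result.add(new int[] { lastFirst, i });
                lastFirst = -1; // a later occurrence of second would give a non-minimal interval
            }
            if (words[i].equals(first)) lastFirst = i;
        }
        return result;
    }

    public static void main(String[] args) {
        String[] subject = "schedule the meeting should we schedule this meeting or not".split(" ");
        for (int[] w : witnesses(subject, "meeting", "schedule")) System.out.println(w[0] + ".." + w[1]);
        for (int[] w : witnesses(subject, "meeting", "meeting")) System.out.println(w[0] + ".." + w[1]);
    }
}
```

On the example subject the first call returns the single interval 2..5 and the second the single interval 2..7.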
In this case, it makes no sense (and it is indeed forbidden) to select a different index for the subqueries to be matched.
Consecutivity (phrasal queries)
You can specify that you want that some terms appear consecutively by using " (quotes). For example:
"meeting schedule"
will be matched if the terms "meeting" and "schedule" appear in this order, and consecutively, in the subject.
Inside quotes, you can also use subqueries, surrounding them with parentheses, like in:
"meeting (schedule OR time)"
that is matched by documents whose subject contains the term "meeting" followed by either "schedule" or
"time". A witness will this time be necessarily an interval of exactly two words (the first being "meeting"
and the second being either "schedule" or "time").
More precisely, the quotes operator is satisfied if there is a sequence
of consecutive witnesses, with each witness coming from a different
subquery, in the same order in which the queries appear.
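As a rough illustration of this definition, the sketch below handles the simplified case in which every subquery is a term or a disjunction of terms, so that each of its witnesses has length one. Each subquery is represented by the set of terms it accepts. This is our own simplification, and the class PhraseWitnesses is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PhraseWitnesses {
    /**
     * Returns the intervals {left, right} (inclusive word positions) in which the
     * words match the given slots one after the other, in order.
     */
    static List<int[]> witnesses(String[] words, List<Set<String>> slots) {
        List<int[]> result = new ArrayList<>();
        if (slots.isEmpty()) return result;
        for (int start = 0; start + slots.size() <= words.length; start++) {
            boolean match = true;
            // The j-th word of the candidate interval must satisfy the j-th subquery.
            for (int j = 0; j < slots.size() && match; j++)
                match = slots.get(j).contains(words[start + j]);
            if (match) result.add(new int[] { start, start + slots.size() - 1 });
        }
        return result;
    }

    public static void main(String[] args) {
        // A made-up subject for the query "meeting (schedule OR time)".
        String[] subject = "the meeting time and the meeting schedule are fixed".split(" ");
        List<Set<String>> slots = List.of(Set.of("meeting"), Set.of("schedule", "time"));
        for (int[] w : witnesses(subject, slots)) System.out.println(w[0] + ".." + w[1]);
    }
}
```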
Note that
"meeting schedule OR time"
would be invalid: if you want to use operators within quotes, you should do so within parentheses.
Moreover, within quotes you cannot change index. So you can say
body:"meeting schedule"
but you cannot use
"body:meeting subject:schedule"
The symbol $ (dollar) can be used to specify an arbitrary word in a consecutive query. For instance,
meeting $ schedule
will match "meeting our schedule" as well as "meeting my schedule". You can add dollars also at the start of
a phrase, but not at the end (in the latter case, they will be ignored).
Proximity limit
As we have discussed, when a document matches a given query, there will be one or more witnesses within the
document. Each such witness is a consecutive sequence of positions in the document that witness the matching. For
example, consider the query
body:((meeting schedule) OR "John Smith") OR subject:alarm
This query will be matched by documents that contain the term "alarm" in their subject, and by documents that contain
either the terms "meeting" and "schedule" or the (exact) sentence "John Smith" in their body.
For every document that matches the query, there will be two sets of matching intervals, one about the body and
the other about the subject; at least one of these two sets will be nonempty (because of the OR keyword).
Intervals concerning the subject will simply be intervals of length one that correspond to the positions where the term
"alarm" appears in the subject.
Intervals concerning the body will be either intervals of length two corresponding to the positions where the sentence
"John Smith" appears in the body, or intervals of length two or more where both "meeting" and "schedule" appear.
You might want to accept only matching intervals up to a certain length; for example, suppose you don't want
to take into consideration intervals that contain "meeting" and "schedule" too far apart, say at a distance greater than
10 words. You can do this by using the proximity limit operator ~. Just rewrite the previous query as
body:((meeting schedule)~10 OR "John Smith") OR subject:alarm
This way, you are simply discarding the matching intervals that contain the terms "meeting" and "schedule" if their
length (number of words) is greater than 10 (i.e., if "meeting" and "schedule" are separated by more than 8 words).
The proximity limit operator can be used at any point, and limits the length of all matching intervals of the
query it is applied to. Note, however, that it may only be used on full-text indices.
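In terms of intervals, the proximity limit is just a filter on length. The following sketch shows the idea on a list of intervals given as pairs of inclusive positions; it is an illustration under our own assumptions (the class ProximityLimit is a hypothetical name), not the code of {@link it.unimi.dsi.mg4j.search.LowPassDocumentIterator}.

```java
import java.util.ArrayList;
import java.util.List;

class ProximityLimit {
    /** Keeps only the intervals {left, right} whose length in words does not exceed maxLength. */
    static List<int[]> lowPass(List<int[]> intervals, int maxLength) {
        List<int[]> result = new ArrayList<>();
        for (int[] interval : intervals) {
            int length = interval[1] - interval[0] + 1; // both extremes are included
            if (length <= maxLength) result.add(interval);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up witnesses of lengths 3, 12 and 10.
        List<int[]> witnesses = List.of(new int[] { 0, 2 }, new int[] { 2, 13 }, new int[] { 20, 29 });
        for (int[] w : lowPass(witnesses, 10)) System.out.println(w[0] + ".." + w[1]);
    }
}
```

An interval of length exactly 10 is kept: its two extremes are separated by 8 words.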
Difference
The Brouwerian difference operator is specified using - (minus). It is a rather
esoteric operator that is rarely met by the end user, and that, given two subqueries, kills the
witnesses of the first query (the minuend) that contain one or more witnesses of the second query (the subtrahend).
By definition, for documents that satisfy the minuend, but not the subtrahend, the witnesses are unchanged. For
instance, the following query
schedule < meeting - this
will be matched only if the term "schedule" and the term "meeting" appear in this order without the term "this" in between.
If the subject is
schedule the meeting (should we schedule this meeting or not)?
the only valid witness is "schedule the meeting", and indeed, the following query
schedule < meeting - (this | the)
will not match at all the subject above, as all witnesses of the minuend are killed by witnesses of the subtrahend.
As an additional feature, you can specify a left and a right margin that will be used to enlarge the intervals of the subtrahend. For instance,
"schedule < meeting - [[1,2]] this"
will kill intervals of the minuend that contain the whole fragment "schedule this meeting or" (so no interval will be killed at all).
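The three cases above can be checked with a small sketch that works directly on lists of intervals, given as pairs of inclusive positions. It is a plain-Java illustration and not the implementation of {@link it.unimi.dsi.mg4j.search.DifferenceDocumentIterator}; the class DifferenceWitnesses is a hypothetical name, and enlarged intervals are not clipped at the document boundaries.

```java
import java.util.ArrayList;
import java.util.List;

class DifferenceWitnesses {
    /**
     * Returns the minuend intervals that do not contain any subtrahend interval,
     * after enlarging each subtrahend interval by the given margins.
     */
    static List<int[]> difference(List<int[]> minuend, List<int[]> subtrahend, int leftMargin, int rightMargin) {
        List<int[]> result = new ArrayList<>();
        for (int[] m : minuend) {
            boolean killed = false;
            for (int[] s : subtrahend) {
                // m is killed if it contains the whole enlarged subtrahend interval.
                if (m[0] <= s[0] - leftMargin && s[1] + rightMargin <= m[1]) {
                    killed = true;
                    break;
                }
            }
            if (!killed) result.add(m);
        }
        return result;
    }

    public static void main(String[] args) {
        // Positions in "schedule the meeting should we schedule this meeting or not".
        List<int[]> scheduleThenMeeting = List.of(new int[] { 0, 2 }, new int[] { 5, 7 });
        List<int[]> occurrencesOfThis = List.of(new int[] { 6, 6 });
        for (int[] w : difference(scheduleThenMeeting, occurrencesOfThis, 0, 0))
            System.out.println(w[0] + ".." + w[1]);
    }
}
```

With no margins only the interval 0..2 survives; with margins 1 and 2 the occurrence of "this" is enlarged to 5..8, which no witness of the minuend contains, so both witnesses survive.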
Queries available on payload-based indices
Actually, the atomic queries discussed above (term, prefix, etc.) can be used with standard indices, that is,
indices of fields containing text. For payload-based indices, which represent document metadata such as dates,
the standard query available in MG4J is a range query in which the first and last valid values are specified by
the user. The resulting query is satisfied by all documents whose field is in the range. Both the first and the last value can be omitted.
For instance, the following query
date:[ 20/2/2007 .. 23/2/2007 ]
will search for documents between 20 February and 23 February 2007, inclusive, whereas the query
date:[ .. 23/2/2007 ]
will search for documents up to 23 February 2007. Note that in the built-in parser
spaces are necessary.
They make it possible to separate the different tokens composing the query.
Range queries must not be used as a generic query mechanism, but rather to refine the result of
a query over document content: a ranked query consisting solely of a range query will have to scan the whole payload-based index
just to return a few results.
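The meaning of a range with optional extremes can be written down as a simple predicate. The sketch below uses java.time.LocalDate and represents an omitted extreme with null; both choices, like the class name DateRange, are assumptions made for illustration and say nothing about how MG4J stores or compares payloads.

```java
import java.time.LocalDate;

class DateRange {
    /**
     * Tells whether value lies in the range [first .. last], extremes included.
     * A null extreme stands for an omitted one, that is, no bound on that side.
     */
    static boolean inRange(LocalDate value, LocalDate first, LocalDate last) {
        return (first == null || !value.isBefore(first))
            && (last == null || !value.isAfter(last));
    }

    public static void main(String[] args) {
        LocalDate first = LocalDate.of(2007, 2, 20);
        LocalDate last = LocalDate.of(2007, 2, 23);
        // date:[ 20/2/2007 .. 23/2/2007 ]
        System.out.println(inRange(LocalDate.of(2007, 2, 21), first, last));
        // date:[ .. 23/2/2007 ]
        System.out.println(inRange(LocalDate.of(2000, 1, 1), null, last));
    }
}
```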
Building and composing document iterators
The {@link it.unimi.dsi.mg4j.search} package contains all the classes needed to build a query
and to match it against a certain collection of indices. This is actually only the semantic
counterpart to a query; for the syntactic aspects, please refer to the {@link it.unimi.dsi.mg4j.query.nodes} package.
Basic classes
An {@link it.unimi.dsi.mg4j.search.Interval} represents a consecutive set of natural numbers, that is, a witness within a document
(in this case, numbers represent the positions within a document: 0 is the position of the first word,
1 is the position of the second and so on). An {@link it.unimi.dsi.mg4j.search.IntervalIterator} is an
iterator that returns intervals: typically, an interval iterator will return all intervals witnessing
a certain query for a certain document (and a certain index).
For example, the query
body:((meeting schedule)~10 OR "John Smith") OR subject:alarm
will give rise to an interval iterator for the body and an interval iterator for the subject: the former
will return intervals within the body witnessing the first part of the query, and the latter will return
the intervals witnessing the second part of the query. Note that even upon a matching document either iterator
may actually return no interval (because the overall query is disjunctive); nonetheless, the two iterators
cannot be both empty.
It is always understood that intervals are returned in increasing order (of their left, or equivalently right, extreme).
A {@link it.unimi.dsi.mg4j.search.DocumentIterator} is used to scan a whole collection of indices
for a query. At every given moment, the iterator will be able to return the next document matching the query,
and, for full-text indices, you will also be able to {@linkplain it.unimi.dsi.mg4j.search.DocumentIterator#intervalIterator(it.unimi.dsi.mg4j.index.Index)
get the interval iterators of the witnesses for that document and for a specific index}.
Obtaining and composing document iterators
The simplest kind of {@link it.unimi.dsi.mg4j.search.DocumentIterator} you can build is an {@link it.unimi.dsi.mg4j.index.IndexIterator}: it
is a document iterator that scans a specific index for a specific term. You don't actually build an index iterator directly, but
you rather obtain one by calling the {@link it.unimi.dsi.mg4j.index.Index#documents(CharSequence)} (or, equivalently,
{@link it.unimi.dsi.mg4j.index.IndexReader#documents(CharSequence)}) method, that returns the
set of documents containing a given term (and witnesses will be the single occurrences of such term).
Hence, for example, the following snippet opens a full-text index whose basename is mail-subject, and
prints out all documents containing the word "meeting", each with the sequence of positions where the word
appears (all intervals will be actually singletons). (Note that a document iterator over a single index
is itself iterable, and the {@link it.unimi.dsi.mg4j.search.DocumentIterator#iterator()} method is actually an alias
for {@link it.unimi.dsi.mg4j.search.DocumentIterator#intervalIterator()}).
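// Open the full-text index with basename "mail-subject".
Index subjectIndex = Index.getInstance( "mail-subject" );
// Iterate over the documents containing the term "meeting".
DocumentIterator it = subjectIndex.documents( "meeting" );
while ( it.hasNext() ) {
	System.out.println( "Document #: " + it.nextDocument() );
	System.out.print( "\tPositions:" );
	// Each interval is a singleton: one occurrence of the term.
	for ( Interval interval: it )
		System.out.print( " " + interval );
	System.out.println();
}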
A number of classes in this package can be used to compose iterators; more precisely, for each
query operator discussed above there is a corresponding class in this package. Each such class has
a factory method that allows one to build new document iterators by composing existing iterators.
For example, the following snippet shows how to search for mails containing the words "meeting",
"schedule" and "monday".
Index subjectIndex = Index.getInstance( "mail-subject" );
// Compose three term iterators into a conjunction.
DocumentIterator it = AndDocumentIterator.getInstance(
	subjectIndex.documents( "meeting" ),
	subjectIndex.documents( "schedule" ),
	subjectIndex.documents( "monday" )
);
while ( it.hasNext() ) {
	System.out.println( "Document #: " + it.nextDocument() );
	System.out.print( "\tPositions:" );
	// Each interval is a minimal interval containing all three terms.
	for ( Interval interval: it )
		System.out.print( " " + interval );
	System.out.println();
}
The following table shows the correspondence between query operators and composition classes:
Operator | Class |
AND & (conjunction) | {@link it.unimi.dsi.mg4j.search.AndDocumentIterator} |
OR | (disjunction) | {@link it.unimi.dsi.mg4j.search.OrDocumentIterator} |
NOT ! (negation) | {@link it.unimi.dsi.mg4j.search.NotDocumentIterator} |
+ (multiterm) | {@link it.unimi.dsi.mg4j.index.MultiTermIndexIterator} |
"..." (phrase) | {@link it.unimi.dsi.mg4j.search.ConsecutiveDocumentIterator} |
< (ordered conjunction) | {@link it.unimi.dsi.mg4j.search.OrderedAndDocumentIterator} |
~ (proximity) | {@link it.unimi.dsi.mg4j.search.LowPassDocumentIterator} |
- (difference) | {@link it.unimi.dsi.mg4j.search.DifferenceDocumentIterator} |
[ .. ] (range) | {@link it.unimi.dsi.mg4j.search.PayloadPredicateDocumentIterator} |
Note, however, that {@link it.unimi.dsi.mg4j.search.PayloadPredicateDocumentIterator} is actually a completely generic
predicate-based class that just returns documents whose payload satisfies a predicate.
Queries and document iterators
Even though it is perfectly legal to build document iterators by using these classes directly, this is not the
natural way to do that. One should rather build a syntactic object corresponding to a query, and then make it
into a document iterator that is, in some sense, the semantic counterpart of the query itself. To have more
information about how this works exactly, please consult the overview of the {@link it.unimi.dsi.mg4j.query.nodes} package.