MG4J: Managing Gigabytes for Java
Composite representation for queries
This package contains the classes that represent queries as
an abstract syntax tree, or, in design-pattern jargon, as a composite.
Warning: Before reading this overview, the reader is encouraged to consult the first part
of the {@link it.unimi.dsi.mg4j.search} package
documentation, where the notion of query is briefly presented.
The basic idea is to perform a complete decoupling of the query construction and
resolution into three levels:
- String level: the query is a string, written in a suitable language where, for example, every operator has a
certain syntax, uses a certain keyword etc.
- Tree level: the query is an abstract tree, that contains one internal node for each operator, but that has
nothing to do with the actual way the query was written; for example, one may want to give the user the possibility
to write the query by using a visual editor, in which case there will be some way to obtain the tree representation
of the query without having any string-level representation for it.
- Iterator level: the document iterator that lazily solves the query over a certain index or set of indices;
this is the semantic counterpart of a query, and it is discussed at length in the overview of the {@link it.unimi.dsi.mg4j.search}
package.
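As a rough illustration of why the tree level is independent of the string level, the following self-contained sketch parses two made-up concrete syntaxes into the same tree, which can also be built programmatically. The class ThreeLevelsSketch, its Node type and the toy parse method are hypothetical stand-ins introduced only for this example; they are not part of MG4J.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-ins for illustration only; none of these are MG4J classes.
class ThreeLevelsSketch {
    /** Tree level: a term (no children) or a conjunction of its children. */
    static final class Node {
        final String label;        // the term text, or "AND" for the operator node
        final List<Node> children; // empty for terms

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Node)) return false;
            Node n = (Node) o;
            return label.equals(n.label) && children.equals(n.children);
        }

        @Override public int hashCode() { return 31 * label.hashCode() + children.hashCode(); }
    }

    /** String level: a toy syntax in which terms are joined by a single keyword. */
    static Node parse(String query, String keyword) {
        List<Node> terms = new ArrayList<Node>();
        for (String t : query.split(Pattern.quote(keyword))) terms.add(new Node(t.trim()));
        return new Node("AND", terms.toArray(new Node[0]));
    }

    public static void main(String[] args) {
        Node spelled = parse("meeting AND schedule", "AND");
        Node symbolic = parse("meeting & schedule", "&");
        // What a visual editor might build, with no string involved.
        Node programmatic = new Node("AND", new Node("meeting"), new Node("schedule"));
        System.out.println(spelled.equals(symbolic) && symbolic.equals(programmatic));
    }
}
```

All three trees compare equal, so anything that consumes the tree never needs to know which front end produced it.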
This package overview shows how to pass from the string to the tree level using the built-in
{@link it.unimi.dsi.mg4j.query.parser.QueryParser}, and explains what is needed to implement the tree level, as well as the basic instruments
needed to convert the tree level into the iterator level.
In MG4J, a query parser is an implementation of {@link it.unimi.dsi.mg4j.query.parser.QueryParser}:
essentially, an object with a method that transforms strings into {@linkplain it.unimi.dsi.mg4j.query.nodes.Query queries}.
You can use any parser you like, or build your queries programmatically.
The parser provided with MG4J, whose syntax is
described in the manual, is currently generated using JavaCC; unfortunately, this tool
produces some public classes that clutter the package
documentation. All the parsing logic is contained in the
{@link it.unimi.dsi.mg4j.query.parser.SimpleParser} class, which is generated from the
JavaCC source SimpleParser.jj.
Building a query tree out of a query string
MG4J allows one to perform queries on multiple indices; by this we mean that
you may have many indices concerning the same document collection, and you want to perform
a query that may search on some of them. Think, for example, of a collection of emails.
You might have generated a number of indices, to index their subjects, sender, recipient(s),
body etc. All these indices can be thought of as individual entities, their only relation being
that they index collections with the same number of documents, and that same-numbered documents
conceptually "come" from the same source.
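The role of the shared document numbering can be illustrated with a small toy model. In the sketch below, which is not MG4J code, each "index" is just a map from a term to the set of documents containing it; the class AlignedIndicesSketch, its helper methods and the sample data are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Toy model, not MG4J code: each "index" maps a term to the documents containing it.
class AlignedIndicesSketch {
    static void post(Map<String, TreeSet<Integer>> index, String term, int... docs) {
        TreeSet<Integer> set = new TreeSet<Integer>();
        for (int d : docs) set.add(d);
        index.put(term, set);
    }

    /** Documents matching termA in one index and termB in another. */
    static TreeSet<Integer> inBoth(Map<String, TreeSet<Integer>> a, String termA,
                                   Map<String, TreeSet<Integer>> b, String termB) {
        TreeSet<Integer> result = new TreeSet<Integer>(a.get(termA));
        result.retainAll(b.get(termB)); // meaningful only because numbering is shared
        return result;
    }

    static TreeSet<Integer> demo() {
        Map<String, TreeSet<Integer>> subject = new HashMap<String, TreeSet<Integer>>();
        Map<String, TreeSet<Integer>> from = new HashMap<String, TreeSet<Integer>>();
        post(subject, "meeting", 0, 2); // emails 0 and 2 have "meeting" in the subject
        post(from, "alice", 1, 2);      // emails 1 and 2 were sent by "alice"
        return inBoth(subject, "meeting", from, "alice");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // emails about a meeting sent by alice
    }
}
```

Intersecting results from two separate indices makes sense only because document 2 denotes the same email in both of them.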
To get a query tree from a string representation, you must first build a {@link it.unimi.dsi.mg4j.query.parser.QueryParser} object.
For instance, if you use the built-in {@link it.unimi.dsi.mg4j.query.parser.SimpleParser} you have
to provide a map whose keys are index names (called index aliases) and
with each index alias mapped to the corresponding {@link it.unimi.dsi.mg4j.index.Index}.
Moreover, one of the aliases is taken to be the default index alias used for the queries.
In our example, we will assume that we have three index aliases (say, subject,
from and body), and that subject is the
default index alias.
The parser would then be created as follows (here, we are assuming that the index with alias
x has basename mail-x):
Object2ReferenceMap<String,Index> indexAlias2Index = new Object2ReferenceOpenHashMap<String,Index>();
Index defaultIndex;
indexAlias2Index.put( "subject", defaultIndex = Index.getInstance( "mail-subject" ) );
indexAlias2Index.put( "from", Index.getInstance( "mail-from" ) );
indexAlias2Index.put( "body", Index.getInstance( "mail-body" ) );
QueryParser parser = new SimpleParser( indexAlias2Index.keySet(), "subject" );
after which you can use it for example as follows:
Query query = parser.parse( "meeting AND body:(schedule OR urgent)" );
DocumentIteratorBuilderVisitor visitor = new DocumentIteratorBuilderVisitor( indexAlias2Index, defaultIndex, 0 );
DocumentIterator it = query.accept( visitor );
while ( it.hasNext() ) {
	System.out.println( "Document #: " + it.nextDocument() );
	System.out.print( "\tPositions:" );
	for ( Interval interval: it )
		System.out.print( " " + interval );
	System.out.println();
}
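To make the role of the default index alias more concrete, here is a minimal sketch of how each term in the query above could be associated with an index alias. It assumes that a selection such as body:(...) applies to its whole subtree and that unqualified terms fall back to the default alias; the Node class and the resolve method are hypothetical and do not reproduce the MG4J node classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree model for illustration; these are not the MG4J node classes.
class AliasResolutionSketch {
    static final class Node {
        final String kind;     // "TERM", "SELECT", or an operator such as "AND" / "OR"
        final String value;    // term text for TERM, index alias for SELECT, null otherwise
        final Node[] children;

        Node(String kind, String value, Node... children) {
            this.kind = kind;
            this.value = value;
            this.children = children;
        }
    }

    /** Appends "alias:term" for every term, in left-to-right order. */
    static void resolve(Node n, String currentAlias, List<String> out) {
        if (n.kind.equals("TERM")) {
            out.add(currentAlias + ":" + n.value);
            return;
        }
        // A SELECT node switches the alias for its whole subtree.
        String alias = n.kind.equals("SELECT") ? n.value : currentAlias;
        for (Node c : n.children) resolve(c, alias, out);
    }

    static List<String> demo() {
        // Tree for: meeting AND body:(schedule OR urgent)
        Node query = new Node("AND", null,
            new Node("TERM", "meeting"),
            new Node("SELECT", "body",
                new Node("OR", null, new Node("TERM", "schedule"), new Node("TERM", "urgent"))));
        List<String> out = new ArrayList<String>();
        resolve(query, "subject", out); // "subject" is the default alias
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Under these assumptions, meeting is looked up in the subject index, while schedule and urgent are looked up in the body index.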
Structure of a {@link it.unimi.dsi.mg4j.query.nodes.Query}
As explained above, in this package a {@link it.unimi.dsi.mg4j.query.nodes.Query} is simply an abstract tree: its leaves will correspond to ground queries, whereas
internal nodes correspond to query operators.
Ground queries can be {@linkplain it.unimi.dsi.mg4j.query.nodes.Term term queries}, {@linkplain it.unimi.dsi.mg4j.query.nodes.Prefix prefix queries}, and
{@linkplain it.unimi.dsi.mg4j.query.nodes.MultiTerm multiterm queries} (the latter usually generated by some query-expansion mechanism).
Any other query is either a {@linkplain it.unimi.dsi.mg4j.query.nodes.Composite composite query} (e.g.,
a {@linkplain it.unimi.dsi.mg4j.query.nodes.And conjunctive query}, a {@linkplain it.unimi.dsi.mg4j.query.nodes.Or disjunctive query} etc.),
that is, a query composed of other subqueries, or it is obtained by some other operator (e.g.,
a {@linkplain it.unimi.dsi.mg4j.query.nodes.LowPass low-pass query} etc.).
Every query has a method {@link it.unimi.dsi.mg4j.query.nodes.Query#accept(it.unimi.dsi.mg4j.query.nodes.QueryBuilderVisitor)} that accepts
a {@link it.unimi.dsi.mg4j.query.nodes.QueryBuilderVisitor} object, through which you can visit the query tree.
More precisely, when the accept method is invoked on a query, a recursive visit of the query tree is performed: at every
node n of type Q (Q being any class implementing {@link it.unimi.dsi.mg4j.query.nodes.Query}),
the following steps are performed:
- the visitor's
visitPre(Q n) method is called; if the method returns false, the visit is
interrupted; otherwise…
- the subtrees are recursively visited, obtaining an array of results
T[] x ; if at a certain
point during this visit some call decides that the visit should be interrupted, we stop; otherwise…
- the visitor's
visitPost(Q n, T[] x) method is called, and the result (of type T )
is returned.
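The three steps above can be sketched in miniature as follows. This is a simplified, hypothetical model rather than the MG4J implementation: it uses a single Node class instead of one class per operator, collects results in a List instead of an array T[], and signals an interrupted visit by returning null, which is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the protocol; names mirror the text but this is not MG4J code.
class VisitProtocolSketch {
    interface Visitor<T> {
        boolean visitPre(Node n);       // returning false interrupts the visit
        T visitPost(Node n, List<T> x); // combines the results of the subtrees
        T visit(Node leaf);             // result for a ground query
    }

    static final class Node {
        final String label;
        final Node[] children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = children;
        }

        /** Returns the visitor's result, or null if the visit was interrupted (a choice of this sketch). */
        <T> T accept(Visitor<T> v) {
            if (children.length == 0) return v.visit(this);
            if (!v.visitPre(this)) return null;
            List<T> x = new ArrayList<T>();
            for (Node c : children) {
                T r = c.accept(v);
                if (r == null) return null; // some call below interrupted the visit
                x.add(r);
            }
            return v.visitPost(this, x);
        }
    }

    /** Counts terms, but refuses to enter NOT nodes. */
    static final class TermCounter implements Visitor<Integer> {
        public boolean visitPre(Node n) { return !n.label.equals("NOT"); }
        public Integer visitPost(Node n, List<Integer> x) {
            int sum = 0;
            for (int k : x) sum += k;
            return sum;
        }
        public Integer visit(Node leaf) { return 1; }
    }

    public static void main(String[] args) {
        Node plain = new Node("AND", new Node("a"), new Node("OR", new Node("b"), new Node("c")));
        Node negated = new Node("AND", new Node("a"), new Node("NOT", new Node("b")));
        System.out.println(plain.accept(new TermCounter()));   // completes normally
        System.out.println(negated.accept(new TermCounter())); // interrupted at NOT
    }
}
```

The first visit runs to completion and yields the number of terms; the second is interrupted as soon as visitPre rejects the NOT node, so no visitPost call is made on the enclosing AND.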
To make the idea of a visitor easier to understand, consider the following simple example of a visitor:
static class PrinterVisitor extends AbstractQueryBuilderVisitor<String> {
	private static void appendArray( MutableString res, String[] t, char sep ) {
		for ( int i = 0; i < t.length - 1; i++ ) res.append( t[ i ] + sep );
		res.append( t[ t.length - 1 ] );
	}

	public String[] newArray( int len ) { return new String[ len ]; }

	public String visitPost( OrderedAnd node, String[] t ) {
		MutableString res = new MutableString();
		res.append( "OAND(" );
		appendArray( res, t, ',' );
		res.append( ")" );
		return res.toString();
	}

	public String visitPost( Or node, String[] t ) {
		MutableString res = new MutableString();
		res.append( "OR(" );
		appendArray( res, t, ',' );
		res.append( ")" );
		return res.toString();
	}

	public String visitPost( And node, String[] t ) {
		MutableString res = new MutableString();
		res.append( "AND(" );
		appendArray( res, t, ',' );
		res.append( ")" );
		return res.toString();
	}

	public String visitPost( Consecutive node, String[] t ) {
		MutableString res = new MutableString();
		res.append( "\"" );
		appendArray( res, t, ' ' );
		res.append( "\"" );
		return res.toString();
	}

	public String visitPost( Not node, String t ) {
		return "NOT(" + t + ")";
	}

	public String visitPost( LowPass node, String t ) {
		return t + "~" + node.k;
	}

	public String visitPost( Select node, String t ) {
		return node.index + ":" + t;
	}

	public String visit( Term node ) { return node.term.toString(); }

	public String visit( Prefix node ) { return node.prefix + "*"; }
}
Now, you can pass an instance of this visitor to a query, and as a result get a string (linear) representation
of the query itself. (Of course, there is a simpler way to get the same result, that is, calling the toString()
method directly, but this example is simple enough for illustrative purposes.)
Here is an example of how the above visitor can be used:
public static void main( String[] arg ) throws Exception {
	Query qa = new Term( "a" );
	Query qb = new Prefix( "b" );
	Query qc = new Term( "c" );
	Query qd = new Term( "d" );
	Query q = new LowPass( new And( qa, new Select( "another_index", qb ), new Or( qc, qd ) ), 20 );
	System.out.println( q.accept( new PrinterVisitor() ) );
}
that produces AND(a,another_index:b*,OR(c,d))~20.
Building a document iterator out of a query
The interface {@link it.unimi.dsi.mg4j.query.nodes.Query} and its implementations provide the basic classes
for the tree-level representation of a query. Building an iterator out of it can be thought of as a process of recursive
instantiation of a query tree into an iterator tree; this is actually a sort of copy visit, and it is indeed implemented
as a special kind of visitor: the {@link it.unimi.dsi.mg4j.search.DocumentIteratorBuilderVisitor}.
You can create one such visitor and use it to visit a query tree, obtaining as a result a document iterator
that you can use to get the results.
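The idea of instantiating a query tree into a tree of lazy iterators can be sketched with a toy model. Everything below is an assumption made for illustration: DocIterator, Node, the in-memory postings and the plain recursion that stands in for the visitor are hypothetical, and the intersection is a simple merge rather than the algorithm MG4J uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model for illustration; DocIterator and Node are hypothetical, not MG4J types.
class IteratorBuildingSketch {
    /** Iterator level: yields document numbers in increasing order, then -1 forever. */
    interface DocIterator { int nextDocument(); }

    static final class Node {
        final String label;     // a term, or "AND" for an internal node
        final Node[] children;
        Node(String label, Node... children) { this.label = label; this.children = children; }
    }

    static DocIterator term(final int[] postings) {
        return new DocIterator() {
            private int i = 0;
            public int nextDocument() { return i < postings.length ? postings[i++] : -1; }
        };
    }

    /** Lazy intersection: advances whichever side is behind until both agree. */
    static DocIterator and(final DocIterator a, final DocIterator b) {
        return new DocIterator() {
            public int nextDocument() {
                int x = a.nextDocument(), y = b.nextDocument();
                while (x != -1 && y != -1 && x != y) {
                    if (x < y) x = a.nextDocument(); else y = b.nextDocument();
                }
                return x == -1 || y == -1 ? -1 : x;
            }
        };
    }

    /** The "copy visit": each query node is instantiated as an iterator over the given index. */
    static DocIterator build(Node n, Map<String, int[]> index) {
        if (n.children.length == 0) {
            int[] postings = index.get(n.label);
            return term(postings != null ? postings : new int[0]);
        }
        DocIterator it = build(n.children[0], index);
        for (int i = 1; i < n.children.length; i++) it = and(it, build(n.children[i], index));
        return it;
    }

    static List<Integer> demo() {
        Map<String, int[]> subject = new HashMap<String, int[]>();
        subject.put("meeting", new int[] { 1, 3, 5, 8 });
        subject.put("schedule", new int[] { 3, 4, 5, 9 });
        subject.put("monday", new int[] { 5, 8, 9 });
        Node q = new Node("AND", new Node("meeting"), new Node("schedule"), new Node("monday"));
        List<Integer> result = new ArrayList<Integer>();
        DocIterator it = build(q, subject);
        for (int d = it.nextDocument(); d != -1; d = it.nextDocument()) result.add(d);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that build does no searching at all: it only assembles iterators whose shape mirrors the query tree, and documents are computed one at a time as nextDocument() is called.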
To be more precise, let us recall the example presented in the last part of the {@linkplain it.unimi.dsi.mg4j.search overview of the search package}:
Index subjectIndex = Index.getInstance( "mail-subject" );
DocumentIterator it = AndDocumentIterator.getInstance(
	subjectIndex.documents( "meeting" ),
	subjectIndex.documents( "schedule" ),
	subjectIndex.documents( "monday" )
);
while ( it.hasNext() ) {
	System.out.println( "Document #: " + it.nextDocument() );
	System.out.print( "\tPositions:" );
	for ( Interval interval: it )
		System.out.print( " " + interval );
	System.out.println();
}
Here is an equivalent piece of code that uses the idea of decoupling the tree level from the iterator level.
Index subjectIndex = Index.getInstance( "mail-subject" );
Query q = new And( new Term( "meeting" ), new Term( "schedule" ), new Term( "monday" ) );
DocumentIteratorBuilderVisitor visitor = new DocumentIteratorBuilderVisitor( null, subjectIndex, 0 );
DocumentIterator it = q.accept( visitor );
while ( it.hasNext() ) {
	System.out.println( "Document #: " + it.nextDocument() );
	System.out.print( "\tPositions:" );
	for ( Interval interval: it )
		System.out.print( " " + interval );
	System.out.println();
}