it.unimi.dsi.mg4j.search

MG4J: Managing Gigabytes for Java

Classes that compose {@linkplain it.unimi.dsi.mg4j.search.DocumentIterator iterators over documents}. Such iterators are returned, for instance, by {@link it.unimi.dsi.mg4j.index.IndexReader#documents(int)}.

Minimal-interval semantics

MG4J provides minimal-interval semantics. That is, if the index is full-text, a {@link it.unimi.dsi.mg4j.search.DocumentIterator} will provide a list of documents and, for each document, a list of minimal intervals. These intervals denote ranges of positions in the document that satisfy the iterator: for instance, if you compose two document iterators using an {@link it.unimi.dsi.mg4j.search.AndDocumentIterator}, you will get as a result the intersection of the document lists of the underlying iterators. Moreover, for each document you will get the minimal set of intervals that contain one interval both from the first iterator and from the second one.

This information is of course very useful if you're going to assign a score to the document, as smaller intervals mean a more precise match. At the basic level (e.g., iterators returned by an index), the intervals returned upon a document are intervals of length one containing the term that was used to generate the iterator. Intervals for compound iterators are built in a natural way, preserving minimality. More details can be found in Charles L. A. Clarke and Gordon V. Cormack, Shortest-Substring Retrieval and Ranking (ACM Transactions on Information Systems, vol. 18, no. 1, Jan 2000, pages 44−78). Scorers for documents may be found in the {@link it.unimi.dsi.mg4j.search.score} package.

The algorithms used by classes in this package to compute minimal-interval operators are new: details can be found here.

Note that MG4J provides minimal-interval semantics for a set of indices. This extension is a significant improvement over single-index semantics. However, defining the exact meaning of a query is a nontrivial problem that will be fully dealt with in a forthcoming paper.

Searching with minimal-interval semantics

The aim of this section is to provide a minimal insight into how minimal-interval semantics works, and to explain the basic syntax used by the {@link it.unimi.dsi.mg4j.query.Query} command-line tool. In this section we shall try to discuss this issue only through examples; we shall later explain how you can actually perform searches of this kind using MG4J.

Note that you do not have to understand the details of minimal-interval semantics to fruitfully use MG4J. Several natural operators (ordered conjunction, proximity limitation, etc.) are computed by MG4J very efficiently using minimal-interval semantics just under the hood.

MG4J solves queries on multiple indices; by saying so, we mean that you may have many indices concerning the same document collection, and you want to perform a query that may search on some of them. Think, for example, of a collection of emails: you might have generated a number of indices, to index their subjects, sender, recipient(s), body etc. All these indices can be thought of as individual entities, their only relation being that they index collections with the same number of documents, and that same-numbered documents conceptually "come" from the same source. The notion of multiple indices should not be new to the reader that is familiar with the {@link it.unimi.dsi.mg4j.document} package.

In our examples, we will assume that we have three indices (say, subject, from and body), and that subject is the one used as default. Be warned that the actual syntax of queries in this section is immaterial (even though we shall stick to the syntax of {@link it.unimi.dsi.mg4j.query.parser.SimpleParser}).

Two different aspects should be taken into consideration when trying to determine which documents actually match (i.e., satisfy) the query:

  • first, one can consider this in a purely Boolean (true/false) setting: thus a document may either satisfy the query or not; this is actually the only information you can get for indices that do not contain positions;
  • second, one can consider, for a document that matches the query in the above sense, which intervals (i.e., minimal sequences of consecutive words within the document) actually witness the match; this information will be available if the index contains positions.

In the following subsections, we shall give information about both kinds of satisfiability.

Queries available on all indices

Simple queries

The simplest possible query consists in a single search term. The documents matching such a query are exactly those that contain the given term, with respect to the default index. In our example, the query

		 meeting
		 
will be matched by the documents (emails) that contain the term "meeting" in their subject. If you want, you can perform the query on another index (different from the default one); thus, for example, the query
		body: meeting
		 
will be matched by the documents that contain the term "meeting" in their body. In both cases, the intervals witnessing the match will be the single occurrences of the term "meeting" in the subject and in the body field, respectively.

Conjunctive queries

You can specify that more than one condition should be met in conjunction by using the AND operator. For example:

			meeting AND schedule
			
will be matched by those documents whose subject contains both the term "meeting" and the term "schedule" (not necessarily in this order). The witnesses will be minimal intervals in the subjects that contain both terms. For example, if the subject was
			schedule the meeting (should we schedule this meeting or not)?
			
then the above query will have three witnesses: "schedule the meeting", "meeting (should we schedule" and "schedule this meeting".
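The three witnesses above can be reproduced with a small stand-alone sketch. It does not use MG4J: the class `MinimalAndSketch`, its tokenizer and the `{left, right}` array representation are assumptions made only for illustration. It merges the sorted positions of the two terms and emits an interval for every pair of neighbouring occurrences that belong to different terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy model of the witnesses of a two-term AND query; this is not MG4J code. */
class MinimalAndSketch {

    /** Lowercases the text and splits it into words on non-letter characters. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z]+");
    }

    /** Returns the sorted 0-based positions at which the term occurs. */
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++)
            if (tokens[i].equals(term)) result.add(i);
        return result;
    }

    /**
     * Returns the minimal intervals {left, right} that contain at least one
     * position from each list. Both lists must be sorted and disjoint.
     */
    static List<int[]> minimalIntervals(List<Integer> a, List<Integer> b) {
        List<int[]> result = new ArrayList<>();
        int i = 0, j = 0, previous = -1;
        boolean previousFromA = false;
        // Merge the two lists; two neighbours in the merged sequence that come
        // from different lists delimit a minimal interval.
        while (i < a.size() || j < b.size()) {
            boolean fromA = j == b.size() || (i < a.size() && a.get(i) < b.get(j));
            int position = fromA ? a.get(i++) : b.get(j++);
            if (previous >= 0 && previousFromA != fromA)
                result.add(new int[] { previous, position });
            previous = position;
            previousFromA = fromA;
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("schedule the meeting (should we schedule this meeting or not)?");
        List<int[]> witnesses =
            minimalIntervals(positions(tokens, "meeting"), positions(tokens, "schedule"));
        for (int[] w : witnesses)
            System.out.println("[" + w[0] + ".." + w[1] + "] "
                + String.join(" ", Arrays.copyOfRange(tokens, w[0], w[1] + 1)));
    }
}
```

With word positions counted from 0, the sketch yields the intervals 0..2, 2..5 and 5..7, which correspond to the three witnesses listed above (punctuation is dropped by the toy tokenizer).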

The keyword AND can be substituted with the symbol & or can even be omitted. So the above query is equivalent to:

			meeting & schedule
			
and to
			meeting schedule
			
Also in this case, you can select a different index for the query to be matched. For example:
			body: meeting schedule
			
(or, equivalently, body: meeting AND schedule or body: meeting & schedule) will be matched by documents that contain the term "meeting" in their body and the term "schedule" in their subject. In this case, witnesses come from different sources: a witness will be any single occurrence of the word "meeting" in the body (there should be at least one to make the document match the query) and any single occurrence of the word "schedule" in the subject (again, there should be at least one to make the document match the query). If you want both terms to be searched for in the body index, you can use:
			body: meeting body: schedule
			
or, simply,
			body: (meeting schedule)
			

Disjunctive queries

You can also introduce a disjunctive (OR) query, like in

			meeting OR schedule
			
that will be matched by the documents that contain the term "meeting" or the term "schedule" (or both) in their subject. A witness will then be every single occurrence of either word in the subject. The keyword OR can be substituted with |, hence the previous query is equivalent to
			meeting | schedule
			

Conjunctive and disjunctive operators can appear in the same query, with the rule that AND has higher priority than OR. So, for example:

			meeting AND schedule OR time
			
will be matched by documents whose subject contains both "meeting" and "schedule", and by documents whose subject contains "time". In this case, a witness will either be a (possibly long) interval containing both the words "meeting" and "schedule", or a one-word interval containing the word "time". If you want to change this behaviour, you should use parentheses, like:
			meeting AND (schedule OR time)
			

Again, you can use index selectors, like in:

			body:meeting AND (schedule OR time)
			
that will be matched by documents containing "meeting" in their body (a witness being every single occurrence of the word in the body), and "schedule" or "time" in their subject (a witness being every single occurrence of either word in the subject). Similarly:
			body:(meeting AND (schedule OR time))
			
will match documents that contain "meeting" and either "schedule" or "time" (or both) in their body.

Negative (NOT) queries

You can specify that you want to exclude documents containing a certain term, or, more generally, satisfying a certain query, by using the (unary, prefix) operator NOT. For example:

				body:(meeting AND NOT tomorrow) AND subject:schedule
				
will be satisfied by the emails that contain the term "schedule" in their subject, and the term "meeting" but not the term "tomorrow" in their body. The operator NOT can be substituted with !, like in:
				body:(meeting !tomorrow) subject:schedule
				

Negative queries are easily understood in a Boolean context, but may be more difficult as far as witnesses are concerned. Basically, the implementation of NOT works in such a way that NOT is actually used only for the Boolean match, but does not influence witnesses. In more detail, the only witness associated to a true NOT query is an empty interval.

Prefix and multiterm queries

A prefix query is a simple query that is matched by all terms starting with the same nonempty prefix. For example:

			   govern*
			   
is matched by all documents containing any word starting with "govern". For the prefix operator * to work, you have to endow your index with a {@link it.unimi.dsi.mg4j.index.PrefixMap}. What really happens in this case is that the query is essentially expanded into a disjunction that contains all the words in the dictionary that start with "govern".
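The expansion step can be pictured with a sorted dictionary. The sketch below is an illustration only: it uses a plain `java.util.TreeSet` rather than a {@link it.unimi.dsi.mg4j.index.PrefixMap}, and the class name `PrefixExpansionSketch` and the sample dictionary are assumptions.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy prefix expansion over a sorted dictionary; this is not MG4J code. */
class PrefixExpansionSketch {

    /** Returns, in sorted order, the dictionary terms starting with a nonempty prefix. */
    static SortedSet<String> expand(TreeSet<String> dictionary, String prefix) {
        // Every term with this prefix sorts at or after the prefix itself and
        // before the prefix followed by the largest char value (terms that
        // contain '\uffff' right after the prefix are ignored by this sketch).
        return dictionary.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Hypothetical dictionary, chosen only for the example.
        TreeSet<String> dictionary = new TreeSet<>(Arrays.asList(
            "go", "governance", "governing", "government", "governor", "gown", "house"));
        // Print the expansion as a disjunction of terms.
        System.out.println(String.join(" OR ", expand(dictionary, "govern")));
    }
}
```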

To tell the truth, the expansion of a prefix query does not really lead to an OR, but rather to what we call a multiterm query: a multiterm query is like an OR, but it can only contain terms as subqueries, and behaves in many respects like a single term. It is unusual to specify a multiterm query manually; rather, some query expansion mechanism like prefixes should be used. If you want to try manually, however, a multiterm query can be obtained using the + operator. For example:

			   house + houses + housing
			   
is a correct multiterm query, and it is loosely equivalent to
			   house OR houses OR housing
			   

Note, however, that trying to use + instead of OR does not work if the subqueries are not simple queries, or if they concern different indices. For example:

			   house + title:meeting
			   
would produce an error.

You may wonder why multiterm queries are needed, if they are essentially the same as OR queries. The first answer is efficiency: a multiterm query should be more efficient than an OR query.

The second answer is more subtle, and has to do with scorers. A scorer is a way to assign a score to a document that satisfies a query. Many scorers actually work by summing up suitable partial scores that depend on the document and on one of the terms in the query. Such partial scores are often a function of the count (number of times the term appears in the document) and of the frequency (number of documents where the term appears), and they are often really high when the term has a low frequency. The idea behind this is that if I write:

			   computer OR methacrylic
			   
a document that satisfies the query because it contains "methacrylic" is more valuable than one that contains the word "computer", the former being much more infrequent.

Nonetheless, trying to use these scorers on automatically expanded queries may lead to many problems. For example, suppose you expanded

			   govern*
			   
as
			   government OR governance OR governor OR governing 
			   
(we are here assuming that the four terms above are the only ones that appear in the dictionary and start with "govern"). Now, since "governance" is presumably much rarer than "government", we expect all documents containing only "governance" to be given a high score. Using
			   government + governance + governor + governing 
			   
the scorer acts on this bunch of words as a whole, and the frequency is assumed to be the maximum frequency (hence, it is the same for all words), avoiding the "governance"-prevalence problem.
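The effect can be illustrated numerically. The sketch below is not an MG4J scorer: the logarithmic weight, the collection size and all frequencies are hypothetical values chosen only to show how taking the maximum frequency flattens the difference between rare and common expansions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy illustration of per-term versus multiterm frequency; not an MG4J scorer. */
class MultitermFrequencySketch {

    /** A generic inverse-frequency weight: the rarer the term, the larger the weight. */
    static double weight(int numberOfDocuments, int frequency) {
        return Math.log((double) numberOfDocuments / frequency);
    }

    /** The frequency attributed to every term of a multiterm query: the maximum one. */
    static int multitermFrequency(Map<String, Integer> frequencies) {
        int max = 0;
        for (int f : frequencies.values()) max = Math.max(max, f);
        return max;
    }

    public static void main(String[] args) {
        int numberOfDocuments = 100000; // hypothetical collection size
        Map<String, Integer> frequencies = new LinkedHashMap<>(); // hypothetical frequencies
        frequencies.put("government", 5000);
        frequencies.put("governance", 40);
        frequencies.put("governor", 900);
        frequencies.put("governing", 700);

        int shared = multitermFrequency(frequencies);
        for (Map.Entry<String, Integer> e : frequencies.entrySet())
            System.out.printf("%-10s OR weight = %.2f, multiterm weight = %.2f%n", e.getKey(),
                weight(numberOfDocuments, e.getValue()), weight(numberOfDocuments, shared));
    }
}
```

Under the OR expansion the weight of "governance" is much larger than that of "government"; under the multiterm expansion all four terms receive the same weight.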

Queries available on indices with positions

Ordered conjunctive queries

The operator of ordered conjunction < works like AND, but requires the subqueries to be satisfied in the exact order in which they are specified, even though not necessarily consecutively. For example:

			meeting < schedule
			
will only be matched by documents that contain in their subject at least one occurrence of the word "meeting" followed (maybe not immediately) by at least one occurrence of the word "schedule". Again, for example, if the subject was
			schedule the meeting (should we schedule this meeting or not)?
			
then the above query will have only one witness: "meeting (should we schedule"; the other two minimal intervals that contain both words ("schedule the meeting" and "schedule this meeting") are not witnesses because words appear in the wrong order.

Note that the ordering between witnesses is strict: for instance, the query

			meeting < meeting
			
has as only witness "meeting (should we schedule this meeting". The single word "meeting" alone is not a witness for the query.
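Both examples of ordered conjunction can be checked with a stand-alone sketch working on word positions (counted from 0, so that in the sample subject "schedule" occurs at 0 and 5 and "meeting" at 2 and 7). The class `OrderedAndSketch` and the array-based representation are assumptions for illustration and are unrelated to the actual implementation in {@link it.unimi.dsi.mg4j.search.OrderedAndDocumentIterator}.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a two-term ordered conjunction; this is not MG4J code. */
class OrderedAndSketch {

    /**
     * Returns the minimal intervals {left, right} such that left is a position
     * of the first term, right is a position of the second term, and
     * left < right (strictly). Both arrays must be sorted.
     */
    static List<int[]> orderedWitnesses(int[] first, int[] second) {
        List<int[]> result = new ArrayList<>();
        int i = 0, lastLeft = -1;
        for (int right : second) {
            // Move to the last occurrence of the first term strictly before right.
            while (i + 1 < first.length && first[i + 1] < right) i++;
            if (first.length == 0 || first[i] >= right) continue;
            int left = first[i];
            // If the left end did not move, a shorter interval starting there
            // has already been emitted, so this one would not be minimal.
            if (left != lastLeft) {
                result.add(new int[] { left, right });
                lastLeft = left;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] schedule = { 0, 5 };
        int[] meeting = { 2, 7 };
        for (int[] w : orderedWitnesses(meeting, schedule))
            System.out.println("meeting < schedule: [" + w[0] + ".." + w[1] + "]");
        for (int[] w : orderedWitnesses(meeting, meeting))
            System.out.println("meeting < meeting:  [" + w[0] + ".." + w[1] + "]");
    }
}
```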

In this case, it makes no sense (and it is indeed forbidden) to select a different index for the subqueries to be matched.

Consecutivity (phrasal queries)

You can specify that you want some terms to appear consecutively by using " (quotes). For example:

				"meeting schedule"
				
will be matched if the terms "meeting" and "schedule" appear in this order, and consecutively, in the subject. Inside quotes, you can also use subqueries, surrounding them with parentheses, like in:
				"meeting (schedule OR time)"
				
that is matched by documents whose subject contains the term "meeting" followed by either "schedule" or "time". A witness will this time be necessarily an interval of exactly two words (the first being "meeting" and the second being either "schedule" or "time").

More precisely, the quotes operator is satisfied if there is a sequence of consecutive witnesses, with each witness coming from a different subquery, in the same order in which the queries appear.
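For the common case in which every subquery has only one-word witnesses (single terms, or disjunctions of terms), the rule above reduces to a simple position check. The following sketch makes that simplifying assumption; the class `PhraseSketch`, the sample subject and the positions are hypothetical and unrelated to {@link it.unimi.dsi.mg4j.search.ConsecutiveDocumentIterator}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy phrase matcher for subqueries with one-word witnesses; this is not MG4J code. */
class PhraseSketch {

    /**
     * components[k] holds the sorted positions at which the k-th subquery has a
     * one-word witness. Returns the intervals {left, right} in which the
     * subqueries are witnessed at consecutive positions, in query order.
     */
    static List<int[]> phraseWitnesses(int[]... components) {
        List<int[]> result = new ArrayList<>();
        if (components.length == 0) return result;
        for (int start : components[0]) {
            boolean matches = true;
            // The k-th subquery must be witnessed exactly k positions after the start.
            for (int k = 1; k < components.length && matches; k++)
                matches = Arrays.binarySearch(components[k], start + k) >= 0;
            if (matches) result.add(new int[] { start, start + components.length - 1 });
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical subject: "meeting time and meeting schedule for the meeting"
        int[] meeting = { 0, 3, 7 };
        int[] scheduleOrTime = { 1, 4 }; // positions of "time" and "schedule", merged
        // Models the query "meeting (schedule OR time)".
        for (int[] w : phraseWitnesses(meeting, scheduleOrTime))
            System.out.println("[" + w[0] + ".." + w[1] + "]");
    }
}
```

Every witness produced this way spans exactly as many words as there are subqueries, in agreement with the two-word witnesses described above.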

Note that

				"meeting schedule OR time"
				
would be invalid: if you want to use operators within quotes, you should do so within parentheses. Moreover, within quotes you cannot change index. So you can say
				body:"meeting schedule"
				
but you cannot use
				"body:meeting subject:schedule"
				

The symbol $ (dollar) can be used to specify an arbitrary word in a consecutive query. For instance,

			meeting $ schedule
			
will match "meeting our schedule" as well as "meeting my schedule". You can add dollars also at the start of a phrase, but not at the end (in the latter case, they will be ignored).

Proximity limit

As we have discussed, when a document matches a given query, there will be one or more witnesses within the document. Each such witness is a consecutive sequence of positions in the document that witness the matching. For example, consider the query

			  body:((meeting schedule) OR "John Smith") OR subject:alarm
			  

This query will be matched by documents that contain the term "alarm" in their subject, and by documents that contain either the terms "meeting" and "schedule" or the (exact) sentence "John Smith" in their body. For every document that matches the query, there will be two sets of matching intervals, one about the body and the other about the subject; at least one of these two sets will be nonempty (because of the OR keyword). Intervals concerning the subject will simply be intervals of length one that correspond to the positions where the term "alarm" appears in the subject. Intervals concerning the body will be either intervals of length two corresponding to the positions where the sentence "John Smith" appears in the body, or intervals of length two or more where both "meeting" and "schedule" appear.

You might want to accept only matching intervals up to a certain length; for example, suppose you don't want to take into consideration intervals that contain "meeting" and "schedule" too far apart, say at a distance greater than 10 words. You can do this by using the proximity limit operator ~. Just rewrite the previous query as

			  body:((meeting schedule)~10 OR "John Smith") OR subject:alarm
			  

This way, you are simply discarding the matching intervals that contain the terms "meeting" and "schedule" if their length (number of words) is greater than 10 (i.e., if "meeting" and "schedule" are separated by more than 8 words).
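The arithmetic can be made explicit: an interval from position left to position right (both included) has length right - left + 1. The sketch below filters a list of intervals by that length; the class `LowPassSketch` and the sample intervals are assumptions for illustration, not the code of {@link it.unimi.dsi.mg4j.search.LowPassDocumentIterator}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy proximity filter over intervals {left, right}; this is not MG4J code. */
class LowPassSketch {

    /** Keeps only the intervals whose length (number of words) is at most maxLength. */
    static List<int[]> lowPass(List<int[]> intervals, int maxLength) {
        List<int[]> result = new ArrayList<>();
        for (int[] interval : intervals) {
            int length = interval[1] - interval[0] + 1; // both extremes are included
            if (length <= maxLength) result.add(interval);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical witnesses of (meeting schedule) in some body.
        List<int[]> witnesses = Arrays.asList(
            new int[] { 3, 12 },   // 10 words: the two terms are separated by 8 words
            new int[] { 20, 30 }); // 11 words: the two terms are separated by 9 words
        for (int[] w : lowPass(witnesses, 10))
            System.out.println("kept [" + w[0] + ".." + w[1] + "]");
    }
}
```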

The proximity limit operator can be used at any point, and limits the length of all matching intervals of the query it is applied to. Note, however, that it may only be used on full-text indices.

Difference

The Brouwerian difference operator is specified using - (minus). It is a rather esoteric operator that is rarely met by the end user, and that, given two subqueries, kills the witnesses of the first query (the minuend) that contain one or more witnesses of the second query (the subtrahend). By definition, for documents that satisfy the minuend, but not the subtrahend, the witnesses are unchanged. For instance, the following query

				schedule < meeting - this
				
will be matched only if the term "schedule" and the term "meeting" appear in this order without the term "this" in between. If the subject is
				schedule the meeting (should we schedule this meeting or not)?
				
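the only valid witness is "schedule the meeting" (the other witness of the minuend, "schedule this meeting", contains the term "this" and is therefore killed), and indeed, the following query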
				schedule < meeting - (this | the)
				
will not match at all the subject above, as all witnesses of the minuend are killed by witnesses of the subtrahend.

As an additional feature, you can specify a left and a right margin that will be used to enlarge the intervals of the subtrahend. For instance,

				"schedule < meeting - [[1,2]] this"
				

will kill intervals of the minuend that contain the whole fragment "schedule this meeting or" (so no interval will be killed at all).
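The three difference examples can be replayed on word positions (counted from 0: in the sample subject the minuend "schedule < meeting" has witnesses 0..2 and 5..7, "the" is at position 1 and "this" at position 6). The sketch is an illustration only: the class `DifferenceSketch` is hypothetical, and it enlarges subtrahend intervals without clipping them to the document boundaries, which is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy Brouwerian difference over intervals {left, right}; this is not MG4J code. */
class DifferenceSketch {

    /**
     * Returns the minuend intervals that do not contain any subtrahend interval,
     * after enlarging each subtrahend interval by the given margins.
     */
    static List<int[]> difference(List<int[]> minuend, List<int[]> subtrahend,
                                  int leftMargin, int rightMargin) {
        List<int[]> result = new ArrayList<>();
        for (int[] m : minuend) {
            boolean killed = false;
            for (int[] s : subtrahend) {
                // Assumption: no clipping of the enlarged interval at position 0.
                int left = s[0] - leftMargin, right = s[1] + rightMargin;
                if (m[0] <= left && right <= m[1]) { killed = true; break; }
            }
            if (!killed) result.add(m);
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> minuend = Arrays.asList(new int[] { 0, 2 }, new int[] { 5, 7 });
        List<int[]> thisOnly = Arrays.asList(new int[] { 6, 6 });
        List<int[]> thisOrThe = Arrays.asList(new int[] { 1, 1 }, new int[] { 6, 6 });

        System.out.println("- this:         " + difference(minuend, thisOnly, 0, 0).size() + " witness(es)");
        System.out.println("- (this | the): " + difference(minuend, thisOrThe, 0, 0).size() + " witness(es)");
        System.out.println("- [[1,2]] this: " + difference(minuend, thisOnly, 1, 2).size() + " witness(es)");
    }
}
```

In the last case the enlarged subtrahend interval is 5..8, which is not contained in either witness of the minuend, so nothing is killed.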

Queries available on payload-based indices

Actually, the atomic queries discussed above (term, prefix, etc.) can be used with standard indices, that is, indices of fields containing text. For payload-based indices, which represent document metadata such as dates, the standard query available in MG4J is a range query in which the first and last valid values are specified by the user. The resulting query is satisfied by all documents whose field is in the range. Both the first and the last value can be omitted. For instance, the following query

				date:[ 20/2/2007 .. 23/2/2007 ]
				
will search for documents between 20 February and 23 February 2007, inclusive, whereas the query
				date:[ .. 23/2/2007 ]
				
will search for documents up to 23 February 2007. Note that in the built-in parser spaces are necessary. They make it possible to separate the different tokens composing the query.

Range queries must not be used as a generic query mechanism, but rather to refine the result of a query over document content: a ranked query composed solely of a range query will have to scan the whole payload-based index just to return a few results.
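The semantics of a range with optional endpoints, and the reason a lone range query is expensive, can be sketched as follows. The class `DateRangeSketch`, the use of `java.time.LocalDate` as payload and the sample dates are assumptions for illustration; MG4J's own payload classes are not used here.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Toy range query over per-document date payloads; this is not MG4J code. */
class DateRangeSketch {

    /** Tests an inclusive range; a null endpoint means that the endpoint was omitted. */
    static boolean inRange(LocalDate payload, LocalDate first, LocalDate last) {
        return (first == null || !payload.isBefore(first))
            && (last == null || !payload.isAfter(last));
    }

    /**
     * Returns the numbers of the documents whose payload is in the range.
     * Every payload is examined: this full scan is what makes a query
     * composed only of a range expensive.
     */
    static List<Integer> matching(LocalDate[] payloads, LocalDate first, LocalDate last) {
        List<Integer> result = new ArrayList<>();
        for (int document = 0; document < payloads.length; document++)
            if (inRange(payloads[document], first, last)) result.add(document);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical payloads: the date of documents 0, 1, 2 and 3.
        LocalDate[] dates = {
            LocalDate.of(2007, 2, 19), LocalDate.of(2007, 2, 20),
            LocalDate.of(2007, 2, 23), LocalDate.of(2007, 2, 24)
        };
        // Models date:[ 20/2/2007 .. 23/2/2007 ]
        System.out.println(matching(dates, LocalDate.of(2007, 2, 20), LocalDate.of(2007, 2, 23)));
        // Models date:[ .. 23/2/2007 ]
        System.out.println(matching(dates, null, LocalDate.of(2007, 2, 23)));
    }
}
```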

Building and composing document iterators

The {@link it.unimi.dsi.mg4j.search} package contains all the classes needed to build a query and to match it against a certain collection of indices. This is actually only the semantic counterpart to a query; for the syntactic aspects, please refer to the {@link it.unimi.dsi.mg4j.query.nodes} package.

Basic classes

An {@link it.unimi.dsi.mg4j.search.Interval} represents a consecutive set of natural numbers, that is, a witness within a document (in this case, numbers represent the positions within a document: 0 is the position of the first word, 1 is the position of the second and so on). An {@link it.unimi.dsi.mg4j.search.IntervalIterator} is an iterator that returns intervals: typically, an interval iterator will return all intervals witnessing a certain query for a certain document (and a certain index).

For example, the query

			  body:((meeting schedule)~10 OR "John Smith") OR subject:alarm
			  
will give rise to an interval iterator for the body and an interval iterator for the subject: the former will return intervals within the body witnessing the first part of the query, and the latter will return the intervals witnessing the second part of the query. Note that even upon a matching document either iterator may actually return no interval (because the overall query is disjunctive); nonetheless, the two iterators cannot be both empty.

It is always understood that intervals are returned in increasing order (of their left, or equivalently right, extreme).

A {@link it.unimi.dsi.mg4j.search.DocumentIterator} is used to scan a whole collection of indices for a query. At every given moment, the iterator will be able to return the next document matching the query, and, for full-text indices, you will also be able to {@linkplain it.unimi.dsi.mg4j.search.DocumentIterator#intervalIterator(it.unimi.dsi.mg4j.index.Index) get the interval iterators of the witnesses for that document and for a specific index}.

Obtaining and composing document iterators

The simplest kind of {@link it.unimi.dsi.mg4j.search.DocumentIterator} you can build is an {@link it.unimi.dsi.mg4j.index.IndexIterator}: it is a document iterator that scans a specific index for a specific term. You don't actually build an index iterator directly, but you rather obtain one by calling the {@link it.unimi.dsi.mg4j.index.Index#documents(CharSequence)} (or, equivalently, {@link it.unimi.dsi.mg4j.index.IndexReader#documents(CharSequence)}) method, that returns the set of documents containing a given term (and witnesses will be the single occurrences of such term).

Hence, for example, the following snippet opens a full-text index whose basename is mail-subject, and prints out all documents containing the word "meeting", each with the sequence of positions where the word appears (all intervals will be actually singletons). (Note that a document iterator over a single index is itself iterable, and the {@link it.unimi.dsi.mg4j.search.DocumentIterator#iterator()} method is actually an alias for {@link it.unimi.dsi.mg4j.search.DocumentIterator#intervalIterator()}).

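			import it.unimi.dsi.mg4j.index.Index;
			import it.unimi.dsi.mg4j.search.DocumentIterator;
			import it.unimi.dsi.mg4j.search.Interval;

			// Open the full-text index whose basename is "mail-subject".
			Index subjectIndex = Index.getInstance( "mail-subject" );
			// Obtain an iterator over the documents containing "meeting".
			DocumentIterator it = subjectIndex.documents( "meeting" );
			while ( it.hasNext() ) {
				System.out.println( "Document #: " + it.nextDocument() );
				System.out.print( "\tPositions:" );
				// Each interval is a singleton holding one position of the term.
				for ( Interval interval: it )
					System.out.print( " " + interval );
				System.out.println();
			}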
		

A number of classes in this package can be used to compose iterators; more precisely, for each query operator discussed above there is a corresponding class in this package. Each such class has a factory method that allows one to build new document iterators by composing existing iterators.

For example, the following snippet shows how to search for mails containing the words "meeting", "schedule" and "monday".

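			import it.unimi.dsi.mg4j.index.Index;
			import it.unimi.dsi.mg4j.search.AndDocumentIterator;
			import it.unimi.dsi.mg4j.search.DocumentIterator;
			import it.unimi.dsi.mg4j.search.Interval;

			Index subjectIndex = Index.getInstance( "mail-subject" );
			// Compose the three term iterators into a single conjunction.
			DocumentIterator it = AndDocumentIterator.getInstance( 
				subjectIndex.documents( "meeting" ), 
				subjectIndex.documents( "schedule" ), 
				subjectIndex.documents( "monday" ) 
			);
			while ( it.hasNext() ) {
				System.out.println( "Document #: " + it.nextDocument() );
				System.out.print( "\tPositions:" );
				// Each interval is a minimal range of positions containing all three terms.
				for ( Interval interval: it )
					System.out.print( " " + interval );
				System.out.println();
			}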
		

The following table shows the correspondence between query operators and composition classes:

Operator                   Class
AND & (conjunction)        {@link it.unimi.dsi.mg4j.search.AndDocumentIterator}
OR | (disjunction)         {@link it.unimi.dsi.mg4j.search.OrDocumentIterator}
NOT ! (negation)           {@link it.unimi.dsi.mg4j.search.NotDocumentIterator}
+ (multiterm)              {@link it.unimi.dsi.mg4j.index.MultiTermIndexIterator}
"..." (phrase)             {@link it.unimi.dsi.mg4j.search.ConsecutiveDocumentIterator}
< (ordered conjunction)    {@link it.unimi.dsi.mg4j.search.OrderedAndDocumentIterator}
~ (proximity)              {@link it.unimi.dsi.mg4j.search.LowPassDocumentIterator}
- (difference)             {@link it.unimi.dsi.mg4j.search.DifferenceDocumentIterator}
[ .. ] (range)             {@link it.unimi.dsi.mg4j.search.PayloadPredicateDocumentIterator}

Note, however, that {@link it.unimi.dsi.mg4j.search.PayloadPredicateDocumentIterator} is actually a completely generic predicate-based class that just returns documents whose payload satisfies a predicate.

Queries and document iterators

Even though it is perfectly legal to build document iterators by using these classes directly, this is not the natural way to do that. One should rather build a syntactic object corresponding to a query, and then make it into a document iterator that is, in some sense, the semantic counterpart of the query itself. To have more information about how this works exactly, please consult the overview of the {@link it.unimi.dsi.mg4j.query.nodes} package.

Java source files in this package (file name, type, comment):

AbstractCompositeDocumentIterator.java (Class): An abstract iterator on documents, based on a list of component iterators.

The constructor caches into AbstractCompositeDocumentIterator.documentIterator the component iterators, and sets up a number of protected fields that can be useful to implementors.

AbstractDocumentIterator.java (Class): An abstract iterator on documents that implements IntIterator.hasNext() and IntIterator.nextInt() using DocumentIterator.nextDocument().

As explained elsewhere, since MG4J 1.2 the iteration logic has been made fully lazy, and the standard IntIterator methods are available as a convenience, but their use in performance-sensitive environments is strongly discouraged.

AbstractIntersectionDocumentIterator.java (Class): An abstract iterator on documents, generating the intersection of the documents returned by a number of document iterators.

To be usable, this class must be subclassed so as to provide also an iterator on intervals. Such iterators must be instantiable using AbstractIntersectionDocumentIterator.getComposedIntervalIterator(Index). The latter is an example of a non-static factory method, that is, a factory method which depends on the enclosing instance.

AbstractOrderedIntervalDocumentIterator.java (Class): An abstract document iterator helping in the implementation of it.unimi.dsi.mg4j.search.ConsecutiveDocumentIterator and it.unimi.dsi.mg4j.search.OrderedAndDocumentIterator.

AbstractUnionDocumentIterator.java (Class): A document iterator on documents, generating the union of the documents returned by a number of document iterators.

AndDocumentIterator.java (Class): A document iterator that returns the AND of a number of document iterators.

This class adds to it.unimi.dsi.mg4j.search.AbstractIntersectionDocumentIterator an interval iterator generating the AND of the intervals returned for each of the documents involved.

CachingDocumentIterator.java (Class): A decorator that caches the intervals produced by the underlying document iterator.

Often, scorers exhaust the intervals produced by a document iterator to compute their result.

ConsecutiveDocumentIterator.java (Class): An iterator returning documents containing consecutive intervals (in query order) satisfying the underlying queries.

As an additional service, this class makes it possible to specify gaps between intervals.

DifferenceDocumentIterator.java (Class): A document iterator that computes the Brouwerian difference between two given document iterators.

In the lattice of interval antichains, the Brouwerian difference is obtained by deleting from the first operand all intervals that contain some interval of the second operand.

DocumentIterator.java (Interface): An iterator over documents (pointers) and their intervals.

Warning: the semantics of DocumentIterator.nextDocument() has changed significantly in MG4J 1.2.

Warning: from MG4J 1.2, most methods throw an IOException (such exceptions used to be caught and wrapped into a RuntimeException).

Warning: the semantics of DocumentIterator.skipTo(int) has changed significantly in MG4J 1.1.

Each call to DocumentIterator.nextDocument() will return a document pointer, or -1 if no more documents are available.

DocumentIteratorBuilderVisitor.java (Class): A it.unimi.dsi.mg4j.query.nodes.QueryBuilderVisitor that builds a it.unimi.dsi.mg4j.search.DocumentIterator resolving the queries using the objects in it.unimi.dsi.mg4j.search.

This elementary builder visitor invokes it.unimi.dsi.mg4j.index.Index.documents(CharSequence) to build the leaf iterators.

DocumentIterators.java (Class): A class providing static methods and objects that do useful things with document iterators.

Interval.java (Class): An integral interval.

IntervalIterator.java (Interface): An iterator over intervals. Apart from the usual methods of a (type-specific) iterator, it has a special (optional) IntervalIterator.reset() method that allows one to reset the iterator: the exact meaning of this operation is decided by the implementing classes.

IntervalIterators.java (Class): A class providing static methods and objects that do useful things with interval iterators.

Intervals.java (Class): A class providing static methods and objects that do useful things with intervals.

LowPassDocumentIterator.java (Class): A document iterator that filters another document iterator, returning just intervals (and containing documents) whose length does not exceed a given threshold.

NotDocumentIterator.java (Class): A document iterator that returns documents not returned by its underlying iterator, and returns just it.unimi.dsi.mg4j.search.IntervalIterators.TRUE on all interval iterators.

OrderedAndDocumentIterator.java (Class): An iterator returning documents containing nonoverlapping intervals in query order satisfying the underlying queries.

OrDocumentIterator.java (Class): An iterator on documents that returns the OR of a number of document iterators.

This class adds to it.unimi.dsi.mg4j.search.AbstractUnionDocumentIterator an interval iterator generating the OR of the intervals returned for each of the documents involved.

PayloadPredicateDocumentIterator.java (Class): A document iterator that filters an IndexIterator, returning just documents whose payload satisfies a given predicate. The interval iterators are computed by delegation to the underlying IndexIterator.

Besides the classic PayloadPredicateDocumentIterator.skipTo(int) method, this class provides a PayloadPredicateDocumentIterator.skipUnconditionallyTo(int) method that skips to a given document even if the document does not match the predicate.
