Package Name | Comment
| com.sleepycat.db |
| lucli |
Lucene Command Line Interface
| net.sf.snowball |
Snowball system classes.
| net.sf.snowball.ext |
Snowball generated stemmer classes.
| org.apache.lucene |
Top-level package.
| org.apache.lucene.analysis |
API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.
Parsing? Tokenization? Analysis!
Lucene, an indexing and search library, accepts only plain text input.
Parsing
Applications that build their search capabilities upon Lucene may support documents in various formats – HTML, XML, PDF, Word – just to name a few.
Lucene does not care about the Parsing of these and other document formats, and it is the responsibility of the
application using Lucene to use an appropriate Parser to convert the original format into plain text before passing that plain text to Lucene.
Tokenization
Plain text passed to Lucene for indexing goes through a process generally called tokenization – namely breaking of the
input text into small indexing elements –
{@link org.apache.lucene.analysis.Token Tokens}.
The way input text is broken into tokens very
much dictates further capabilities of search upon that text.
For instance, sentence beginnings and endings can be identified to provide for more accurate phrase
and proximity searches (though sentence identification is not provided by Lucene).
In some cases simply breaking the input text into tokens is not enough – a deeper Analysis is needed,
providing for several functions, including (but not limited to):
- Stemming –
Replacing words with their stems.
For instance with English stemming "bikes" is replaced by "bike";
now query "bike" can find both documents containing "bike" and those containing "bikes".
- Stop Words Filtering –
Common words like "the", "and" and "a" rarely add any value to a search.
Removing them shrinks the index size and increases performance.
It may also reduce some "noise" and actually improve search quality.
- Text Normalization –
Stripping accents and other character markings can make for better searching.
- Synonym Expansion –
Adding in synonyms at the same token position as the current word can mean better
matching when users search with words in the synonym set.
Core Analysis
The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There
are three main classes in the package from which all analysis processes are derived. These are:
- {@link org.apache.lucene.analysis.Analyzer} – An Analyzer is responsible for building a {@link org.apache.lucene.analysis.TokenStream} which can be consumed
by the indexing and searching processes. See below for more information on implementing your own Analyzer.
- {@link org.apache.lucene.analysis.Tokenizer} – A Tokenizer is a {@link org.apache.lucene.analysis.TokenStream} and is responsible for breaking
up incoming text into {@link org.apache.lucene.analysis.Token}s. In most cases, an Analyzer will use a Tokenizer as the first step in
the analysis process.
- {@link org.apache.lucene.analysis.TokenFilter} – A TokenFilter is also a {@link org.apache.lucene.analysis.TokenStream} and is responsible
for modifying {@link org.apache.lucene.analysis.Token}s that have been created by the Tokenizer. Common modifications performed by a
TokenFilter are: deletion, stemming, synonym injection, and down casing. Not all Analyzers require TokenFilters.
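To make this composition concrete, here is a minimal sketch of a custom Analyzer (a hypothetical class, not one shipped with Lucene) that chains a Tokenizer and a TokenFilter from this package:
public class SimpleLowerCaseAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // the Tokenizer produces the initial TokenStream from the raw text...
    TokenStream result = new WhitespaceTokenizer(reader);
    // ...and a TokenFilter wraps it, modifying each Token it produces
    result = new LowerCaseFilter(result);
    return result;
  }
}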
Hints, Tips and Traps
The synergy between {@link org.apache.lucene.analysis.Analyzer} and {@link org.apache.lucene.analysis.Tokenizer}
is sometimes confusing. To ease this confusion, here are some clarifications:
- The {@link org.apache.lucene.analysis.Analyzer} is responsible for the entire task of
creating tokens out of the input text, while the {@link org.apache.lucene.analysis.Tokenizer}
is only responsible for breaking the input text into tokens. Very likely, tokens created
by the {@link org.apache.lucene.analysis.Tokenizer} would be modified or even omitted
by the {@link org.apache.lucene.analysis.Analyzer} (via one or more
{@link org.apache.lucene.analysis.TokenFilter}s) before being returned.
- {@link org.apache.lucene.analysis.Tokenizer} is a {@link org.apache.lucene.analysis.TokenStream},
but {@link org.apache.lucene.analysis.Analyzer} is not.
- {@link org.apache.lucene.analysis.Analyzer} is "field aware", but
{@link org.apache.lucene.analysis.Tokenizer} is not.
Lucene Java provides a number of analysis capabilities, the most commonly used one being the {@link
org.apache.lucene.analysis.standard.StandardAnalyzer}. Many applications will have a long and industrious life with nothing more
than the StandardAnalyzer. However, there are a few other classes/packages that are worth mentioning:
- {@link org.apache.lucene.analysis.PerFieldAnalyzerWrapper} – Most Analyzers perform the same operation on all
{@link org.apache.lucene.document.Field}s. The PerFieldAnalyzerWrapper can be used to associate a different Analyzer with different
{@link org.apache.lucene.document.Field}s.
- The contrib/analyzers library located at the root of the Lucene distribution has a number of different Analyzer implementations to solve a variety
of different problems related to searching. Many of the Analyzers are designed to analyze non-English languages.
- The {@link org.apache.lucene.analysis.snowball contrib/snowball library}
located at the root of the Lucene distribution has Analyzer and TokenFilter
implementations for a variety of Snowball stemmers.
See http://snowball.tartarus.org
for more information on Snowball stemmers.
- There are a variety of Tokenizer and TokenFilter implementations in this package. Take a look around, chances are someone has implemented what you need.
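For instance, the PerFieldAnalyzerWrapper mentioned above might be wired up as follows (a sketch; the field name "id" and the choice of KeywordAnalyzer are illustrative):
PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
// keep the "id" field as a single untokenized term; all other
// fields fall back to the default StandardAnalyzer
wrapper.addAnalyzer("id", new KeywordAnalyzer());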
Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze the slower the indexing (in most cases).
Perhaps your application would be just fine using the simple {@link org.apache.lucene.analysis.WhitespaceTokenizer} combined with a
{@link org.apache.lucene.analysis.StopFilter}. The contrib/benchmark library can be useful for testing out the speed of the analysis process.
Invoking the Analyzer
Applications usually do not invoke analysis – Lucene does it for them:
- At indexing, as a consequence of
{@link org.apache.lucene.index.IndexWriter#addDocument(org.apache.lucene.document.Document) addDocument(doc)},
the Analyzer in effect for indexing is invoked for each indexed field of the added document.
- At search, as a consequence of
{@link org.apache.lucene.queryParser.QueryParser#parse(java.lang.String) QueryParser.parse(queryText)},
the QueryParser may invoke the Analyzer in effect.
Note that for some queries analysis does not take place, e.g. wildcard queries.
However, an application might invoke Analysis of any text for testing or for any other purpose, something like:
Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer
TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
Token t = ts.next();
while (t != null) {
  System.out.println("token: " + t);
  t = ts.next();
}
Indexing Analysis vs. Search Analysis
Selecting the "correct" analyzer is crucial
for search quality, and can also affect indexing and search performance.
The "correct" analyzer differs between applications.
The Lucene Java wiki page
AnalysisParalysis
provides some data on "analyzing your analyzer".
Here are some rules of thumb:
- Test test test... (did we say test?)
- Beware of over-analysis – it might hurt indexing performance.
- Start with the same analyzer for indexing and search; otherwise searches would not find what they are supposed to...
- In some cases a different analyzer is required for indexing and search, for instance:
- Certain searches require more stop words to be filtered. (i.e. more than those that were filtered at indexing.)
- Query expansion by synonyms, acronyms, auto spell correction, etc.
This might sometimes require a modified analyzer – see the next section on how to do that.
Implementing your own Analyzer
Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and set of TokenFilters to create a new Analyzer,
or creating both the Analyzer and a Tokenizer or TokenFilter. Before pursuing this approach, you may find it worthwhile
to explore the contrib/analyzers library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists.
If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter) have a look at
the source code of any one of the many samples located in this package.
The following sections discuss some aspects of implementing your own analyzer.
Field Section Boundaries
When {@link org.apache.lucene.document.Document#add(org.apache.lucene.document.Fieldable) document.add(field)}
is called multiple times for the same field name, we could say that each such call creates a new
section for that field in that document.
In fact, a separate call to
{@link org.apache.lucene.analysis.Analyzer#tokenStream(java.lang.String, java.io.Reader) tokenStream(field,reader)}
would take place for each of these so called "sections".
However, the default Analyzer behavior is to treat all these sections as one large section.
This allows phrase search and proximity search to seamlessly cross
boundaries between these "sections".
In other words, if a certain field "f" is added like this:
document.add(new Field("f","first ends",...);
document.add(new Field("f","starts two",...);
indexWriter.addDocument(document);
Then, a phrase search for "ends starts" would find that document.
Where desired, this behavior can be modified by introducing a "position gap" between consecutive field "sections",
simply by overriding
{@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap(java.lang.String) Analyzer.getPositionIncrementGap(fieldName)}:
Analyzer myAnalyzer = new StandardAnalyzer() {
  public int getPositionIncrementGap(String fieldName) {
    return 10;
  }
};
Token Position Increments
By default, all tokens created by Analyzers and Tokenizers have a
{@link org.apache.lucene.analysis.Token#getPositionIncrement() position increment} of one.
This means that the position stored for that token in the index would be one more than
that of the previous token.
Recall that phrase and proximity searches rely on position info.
If the selected analyzer filters the stop words "is" and "the", then for a document
containing the string "blue is the sky", only the tokens "blue", "sky" are indexed,
with position("sky") = 1 + position("blue"). Now, a phrase query "blue is the sky"
would find that document, because the same analyzer filters the same stop words from
that query. But also the phrase query "blue sky" would find that document.
If this behavior does not fit the application's needs,
a modified analyzer can be used that would further increment the positions of
tokens following a removed stop word, using
{@link org.apache.lucene.analysis.Token#setPositionIncrement(int)}.
This can be done with something like:
public TokenStream tokenStream(final String fieldName, Reader reader) {
  final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
  TokenStream res = new TokenStream() {
    public Token next() throws IOException {
      int extraIncrement = 0;
      while (true) {
        Token t = ts.next();
        if (t != null) {
          if (stopWords.contains(t.termText())) {
            extraIncrement++; // filter this word
            continue;
          }
          if (extraIncrement > 0) {
            t.setPositionIncrement(t.getPositionIncrement() + extraIncrement);
          }
        }
        return t;
      }
    }
  };
  return res;
}
Now, with this modified analyzer, the phrase query "blue sky" would find that document.
But note that this is not yet a perfect solution, because any phrase query "blue w1 w2 sky"
where both w1 and w2 are stop words would match that document.
A few more use cases for modifying position increments are:
- Inhibiting phrase and proximity matches in sentence boundaries – for this, a tokenizer that
identifies a new sentence can add 1 to the position increment of the first token of the new sentence.
- Injecting synonyms – here, synonyms of a token should be added after that token,
and their position increment should be set to 0.
As a result, all synonyms of a token would be considered to appear in exactly the
same position as that token, and so they would be seen by phrase and proximity searches.
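To illustrate the synonym case, here is a sketch of such a TokenFilter (a hypothetical class; a real filter would look synonyms up in a map) that injects one fixed synonym at position increment 0:
public class SingleSynonymFilter extends TokenFilter {
  private Token pending; // an injected synonym waiting to be returned

  public SingleSynonymFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    if (pending != null) {
      Token syn = pending;
      pending = null;
      return syn; // emitted right after the original token
    }
    Token t = input.next();
    if (t != null && t.termText().equals("quick")) {
      pending = new Token("fast", t.startOffset(), t.endOffset());
      pending.setPositionIncrement(0); // same position as "quick"
    }
    return t;
  }
}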
| org.apache.lucene.analysis.br |
Analyzer for Brazilian.
| org.apache.lucene.analysis.cjk |
Analyzer for Chinese, Japanese and Korean.
| org.apache.lucene.analysis.cn |
Analyzer for Chinese.
| org.apache.lucene.analysis.cz |
Analyzer for Czech.
| org.apache.lucene.analysis.de |
Analyzer for German.
| org.apache.lucene.analysis.el |
Analyzer for Greek.
| org.apache.lucene.analysis.fr |
Analyzer for French.
| org.apache.lucene.analysis.ngram |
| org.apache.lucene.analysis.nl |
Analyzer for Dutch.
| org.apache.lucene.analysis.payloads |
Provides various convenience classes for creating payloads on Tokens.
| org.apache.lucene.analysis.ru |
Analyzer for Russian.
| org.apache.lucene.analysis.sinks |
Implementations of the SinkTokenizer that might be useful.
| org.apache.lucene.analysis.snowball |
{@link org.apache.lucene.analysis.TokenFilter} and {@link
org.apache.lucene.analysis.Analyzer} implementations that use Snowball
stemmers.
| org.apache.lucene.analysis.standard |
A fast grammar-based tokenizer constructed with JFlex.
| org.apache.lucene.analysis.th |
| org.apache.lucene.ant |
Ant task to create Lucene indexes.
| org.apache.lucene.benchmark |
| org.apache.lucene.benchmark.byTask |
Benchmarking Lucene By Tasks
This package provides "task based" performance benchmarking of Lucene.
One can use the predefined benchmarks, or create new ones.
Contained packages:
Package | Description
stats | Statistics maintained when running benchmark tasks.
tasks | Benchmark tasks.
feeds | Sources for benchmark inputs: documents and queries.
utils | Utilities used for the benchmark, and for the reports.
programmatic | Sample performance test written programmatically.
Table Of Contents
- Benchmarking By Tasks
- How to use
- Benchmark "algorithm"
- Supported tasks/commands
- Benchmark properties
- Example input algorithm and the result benchmark report.
- Results record counting clarified
Benchmarking By Tasks
Benchmark Lucene using task primitives.
A benchmark is composed of some predefined tasks, allowing for creating an index,
adding documents, optimizing, searching, generating reports, and more.
A benchmark run takes an "algorithm" file that contains a description of the
sequence of tasks making up the run, and some properties defining a few
additional characteristics of the benchmark run.
How to use
The easiest way to run a benchmark is to use the predefined ant task:
- ant run-task
- would run the micro-standard.alg "algorithm".
- ant run-task -Dtask.alg=conf/compound-penalty.alg
- would run the compound-penalty.alg "algorithm".
- ant run-task -Dtask.alg=[full-path-to-your-alg-file]
- would run your perf test "algorithm".
- java org.apache.lucene.benchmark.byTask.programmatic.Sample
- would run a performance test programmatically - without using an alg
file. This is less readable, and less convenient, but possible.
You may find the existing tasks sufficient for defining the benchmark you
need; otherwise, you can extend the framework to meet your needs, as explained
herein.
Each benchmark run has a DocMaker and a QueryMaker. These two should usually
match, so that "meaningful" queries are used for a certain collection.
Properties set at the header of the alg file define which "makers" should be
used. You can also specify your own makers, implementing the DocMaker and
QueryMaker interfaces.
The benchmark .alg file contains the benchmark "algorithm". The syntax is described
below. Within the algorithm, you can specify groups of commands, assign them
names, specify commands that should be repeated,
do commands in serial or in parallel,
and also control the speed of "firing" the commands.
This allows you, for instance, to specify
that an index should be opened for update,
documents should be added to it one by one but not faster than 20 docs a minute,
and, in parallel with this,
some N queries should be searched against that index,
again, no more than 2 queries a second.
You can have the searches all share an index reader,
or have each of them open its own reader and close it afterwards.
If the commands available for use in the algorithm do not meet your needs,
you can add commands by adding a new task under
org.apache.lucene.benchmark.byTask.tasks -
you should extend the PerfTask abstract class.
Make sure that your new task class name is suffixed by Task.
Assume you added the class "WonderfulTask" – doing so also enables the
command "Wonderful" to be used in the algorithm.
External classes: It is sometimes useful to invoke the benchmark
package with your external alg file that configures the use of your own
doc/query maker and/or HTML parser. You can do this without
modifying the benchmark package code, by passing your class path
with the benchmark.ext.classpath property:
- ant run-task -Dtask.alg=[full-path-to-your-alg-file]
-Dbenchmark.ext.classpath=/mydir/classes
-Dtask.mem=512M
Benchmark "algorithm"
The following is an informal description of the supported syntax.
-
Measuring: When a command is executed, statistics for the elapsed
execution time and memory consumption are collected.
At any time, those statistics can be printed, using one of the
available ReportTasks.
-
Comments start with '#'.
-
Serial sequences are enclosed within '{ }'.
-
Parallel sequences are enclosed within
'[ ]'
-
Sequence naming: To name a sequence, put
'"name"' just after
'{' or '['.
Example - { "ManyAdds" AddDoc } : 1000000 -
would
name the sequence of 1M add docs "ManyAdds", and this name would later appear
in statistic reports.
If you don't specify a name for a sequence, it is given one: you can see it as
the algorithm is printed just before benchmark execution starts.
-
Repeating:
To repeat sequence tasks N times, add ': N' just
after the
sequence closing tag - '}' or
']' or '>'.
Example - [ AddDoc ] : 4 - would do 4 addDoc
in parallel, spawning 4 threads at once.
Example - [ AddDoc AddDoc ] : 4 - would do
8 addDoc in parallel, spawning 8 threads at once.
Example - { AddDoc } : 30 - would do addDoc
30 times in a row.
Example - { AddDoc AddDoc } : 30 - would do
addDoc 60 times in a row.
Exhaustive repeating: use * instead of
a number to repeat exhaustively.
This is sometimes useful, for adding as many files as a doc maker can create,
without iterating over the same file again, especially when the exact
number of documents is not known in advance. For instance, TREC files extracted
from a zip file. Note: when using this, you must also set
doc.maker.forever to false.
Example - { AddDoc } : * - would add docs
until the doc maker is "exhausted".
-
Command parameter: a command can optionally take a single parameter.
If a certain command does not support a parameter, or if the parameter is of
the wrong type,
reading the algorithm will fail with an exception and the test will not start.
Currently the following tasks take optional parameters:
- AddDoc takes a numeric parameter, indicating the required size of
added document. Note: if the DocMaker implementation used in the test
does not support makeDoc(size), an exception would be thrown and the test
would fail.
- DeleteDoc takes a numeric parameter, indicating the docid to be
deleted. The latter is not very useful for loops, since the docid is
fixed, so for deletion in loops it is better to use the
doc.delete.step property.
- SetProp takes a mandatory
name,value parameter,
with ',' used as the separator.
- SearchTravRetTask and SearchTravTask take a numeric
parameter, indicating the required traversal size.
- SearchTravRetLoadFieldSelectorTask takes a string
parameter: a comma separated list of Fields to load.
Example - AddDoc(2000) - would add a document
of size 2000 (~bytes).
See conf/task-sample.alg for how this can be used, for instance, to check
which is faster, adding
many smaller documents, or few larger documents.
Next candidates for supporting a parameter may be the Search tasks,
for controlling the query size.
-
Statistic recording elimination: a sequence can also end with
'>',
in which case child tasks would not store their statistics.
This can be useful to avoid exploding stats data, for adding say 1M docs.
Example - { "ManyAdds" AddDoc > : 1000000 -
would add a million docs, measure the total, but not save stats for each addDoc.
Notice that the granularity of System.currentTimeMillis() (which is used
here) is system dependent,
and in some systems an operation that takes 5 ms to complete may show 0 ms
latency time in performance measurements.
Therefore it is sometimes more accurate to look at the elapsed time of a larger
sequence, as demonstrated here.
-
Rate:
To set a rate (ops/sec or ops/min) for a sequence, add
': N : R' just after sequence closing tag.
This would specify N repetitions at a rate of R operations/sec.
Use 'R/sec' or
'R/min'
to explicitly specify that the rate is per second or per minute.
The default is per second.
Example - [ AddDoc ] : 400 : 3 - would do 400
addDoc in parallel, starting up to 3 threads per second.
Example - { AddDoc } : 100 : 200/min - would
do 100 addDoc serially,
waiting before starting the next add if the rate would otherwise exceed 200 adds/min.
-
Command names: Each class "AnyNameTask" in the
package org.apache.lucene.benchmark.byTask.tasks,
that extends PerfTask, is supported as command "AnyName" that can be
used in the benchmark "algorithm" description.
This makes it possible to add new commands by just adding such classes.
Supported tasks/commands
Existing tasks can be divided into a few groups:
regular index/search work tasks, report tasks, and control tasks.
-
Report tasks: There are a few Report commands for generating reports.
Only task runs that were completed are reported.
(The 'Report tasks' themselves are not measured and not reported.)
-
RepAll - all (completed) task runs.
-
RepSumByName - all statistics,
aggregated by name. So, if AddDoc was executed 2000 times,
only 1 report line would be created for it, aggregating all those
2000 statistic records.
-
RepSelectByPref prefixWord - all
records for tasks whose name starts with
prefixWord.
-
RepSumByPref prefixWord - all
records for tasks whose name starts with
prefixWord,
aggregated by their full task name.
-
RepSumByNameRound - all statistics,
aggregated by name and by Round.
So, if AddDoc was executed 2000 times in each of 3
rounds, 3 report lines would be
created for it,
aggregating all those 2000 statistic records in each round.
See more about rounds in the NewRound
command description below.
-
RepSumByPrefRound prefixWord -
similar to RepSumByNameRound,
just that only tasks whose name starts with
prefixWord are included.
If needed, additional reports can be added by extending the abstract class
ReportTask, and by
manipulating the statistics data in Points and TaskStats.
- Control tasks: A few of the tasks control the benchmark algorithm
as a whole:
-
ClearStats - clears all statistics.
Further reports would only include task runs that start after this
call.
-
NewRound - virtually start a new round of
performance test.
Although this command can be placed anywhere, it mostly makes sense at
the end of an outermost sequence.
This increments a global "round counter". All task runs that
start from this point on
record the new, updated round counter as their round number.
This would appear in reports.
In particular, see RepSumByNameRound above.
An additional effect of NewRound is that numeric and boolean
properties defined (at the head
of the .alg file) as a sequence of values, e.g.
merge.factor=mrg:10:100:10:100 would
increment (cyclic) to the next value.
Note: this would also be reflected in the reports, in this case under a
column that would be named "mrg".
-
ResetInputs - DocMaker and the
various QueryMakers
would reset their counters to start.
The way these Maker interfaces work, each call for makeDocument()
or makeQuery() creates the next document or query
that it "knows" to create.
If that pool is "exhausted", the "maker" starts over again.
The ResetInputs command
therefore allows making the rounds comparable.
It is therefore useful to invoke ResetInputs together with NewRound.
-
ResetSystemErase - reset all index
and input data and call gc.
Does NOT reset statistics. This contains ResetInputs.
All writers/readers are nullified, deleted, closed.
Index is erased.
Directory is erased.
You would have to call CreateIndex once this was called...
-
ResetSystemSoft - reset all
index and input data and call gc.
Does NOT reset statistics. This contains ResetInputs.
All writers/readers are nullified, closed.
Index is NOT erased.
Directory is NOT erased.
This is useful for testing performance on an existing index,
for instance if the construction of a large index
took a very long time and now you would like to test
its search or update performance.
-
Other existing tasks are quite straightforward and are
only briefly described here.
-
CreateIndex and
OpenIndex both leave the
index open for later update operations.
CloseIndex would close it.
-
OpenReader, similarly, would
leave an index reader open for later search operations.
But this has further semantics.
If a Read operation is performed, and an open reader exists,
it would be used.
Otherwise, the read operation would open its own reader
and close it when the read operation is done.
This allows testing various scenarios - sharing a reader,
searching with a "cold" reader, with a "warmed" reader, etc.
The read operations affected by this are:
Warm,
Search,
SearchTrav (search and traverse),
and SearchTravRet (search
and traverse and retrieve).
Notice that each of the 3 search task types maintains
its own queryMaker instance.
Benchmark properties
Properties are read from the header of the .alg file, and
define several parameters of the performance test.
As mentioned above for the NewRound task,
numeric and boolean properties that are defined as a sequence
of values, e.g. merge.factor=mrg:10:100:10:100
would increment (cyclic) to the next value,
when NewRound is called, and would also
appear as a named column in the reports (column
name would be "mrg" in this example).
Some of the currently defined properties are:
-
analyzer - full
class name of the analyzer to use.
The same analyzer would be used for the entire test.
-
directory - which directory implementation to use
for the performance test (e.g. FSDirectory, as in the sample below).
-
Index work parameters:
Multi int/boolean values would be iterated with calls to NewRound.
They would also be added as columns in the reports; the first string in the
sequence is the column name.
(Make sure it is no shorter than any value in the sequence.)
- max.buffered
Example: max.buffered=buf:10:10:100:100 -
this would define using maxBufferedDocs of 10 in iterations 0 and 1,
and 100 in iterations 2 and 3.
-
merge.factor - which
merge factor to use.
-
compound - whether the index is
using the compound format or not. Valid values are "true" and "false".
Here is a list of currently defined properties:
- Root directory for data and indexes:
- work.dir (default is System property "benchmark.work.dir" or "work".)
- Docs and queries creation:
- analyzer
- doc.maker
- doc.maker.forever
- html.parser
- doc.stored
- doc.tokenized
- doc.term.vector
- doc.term.vector.positions
- doc.term.vector.offsets
- doc.store.body.bytes
- docs.dir
- query.maker
- file.query.maker.file
- file.query.maker.default.field
- Logging:
- doc.add.log.step
- doc.delete.log.step
- log.queries
- task.max.depth.log
- doc.tokenize.log.step
- Index writing:
- compound
- merge.factor
- max.buffered
- directory
- ram.flush.mb
- autocommit
- Doc deletion:
- doc.delete.step
For sample use of these properties see the *.alg files under conf.
Example input algorithm and the result benchmark report
The following example is in conf/sample.alg:
# --------------------------------------------------------
#
# Sample: what is the effect of doc size on indexing time?
#
# There are two parts in this test:
# - PopulateShort adds 2N documents of length L
# - PopulateLong adds N documents of length 2L
# Which one would be faster?
# The comparison is done twice.
#
# --------------------------------------------------------
# -------------------------------------------------------------------------------------
# multi val params are iterated by NewRound's, added to reports, start with column name.
merge.factor=mrg:10:20
max.buffered=buf:100:1000
compound=true
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=500
docs.dir=reuters-out
doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker
# tasks at this depth or less would print when they start
task.max.depth.log=2
log.queries=false
# -------------------------------------------------------------------------------------
{
{ "PopulateShort"
CreateIndex
{ AddDoc(4000) > : 20000
Optimize
CloseIndex
>
ResetSystemErase
{ "PopulateLong"
CreateIndex
{ AddDoc(8000) > : 10000
Optimize
CloseIndex
>
ResetSystemErase
NewRound
} : 2
RepSumByName
RepSelectByPref Populate
The command line for running this sample:
ant run-task -Dtask.alg=conf/sample.alg
The output report from running this test contains the following:
Operation       round  mrg   buf  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem   avgTotalMem
PopulateShort       0   10   100       1       20003  119.6      167.26  12,959,120    14,241,792
PopulateLong        0   10   100       1       10003   74.3      134.57  17,085,208    20,635,648
PopulateShort       1   20  1000       1       20003  143.5      139.39  63,982,040    94,756,864
PopulateLong        1   20  1000       1       10003   77.0      129.92  87,309,608   100,831,232
Results record counting clarified
Two columns in the results table indicate records counts: records-per-run and
records-per-second. What does it mean?
Almost every task gets 1 in this count just for being executed.
Task sequences aggregate the counts of their child tasks,
plus their own count of 1.
So, a task sequence containing 5 other task sequences, each running a single
other task 10 times, would have a count of 1 + 5 * (1 + 10) = 56.
The traverse and retrieve tasks "count" more: a traverse task
would add 1 for each traversed result (hit), and a retrieve task would
additionally add 1 for each retrieved doc. So, regular Search would
count 1, SearchTrav that traverses 10 hits would count 11, and a
SearchTravRet task that retrieves (and traverses) 10, would count 21.
Confusing? This might help: always examine the elapsedSec column,
and always compare "apples to apples", i.e. it is interesting to check how the
rec/s changed for the same task (or sequence) between two
different runs, but it is not very useful to know how the rec/s
differs between Search and SearchTrav tasks. For
the latter, elapsedSec would bring more insight.
| org.apache.lucene.benchmark.byTask.feeds |
Sources for benchmark inputs: documents and queries.
| org.apache.lucene.benchmark.byTask.programmatic |
Sample performance test written programmatically - no algorithm file is needed here.
| org.apache.lucene.benchmark.byTask.stats |
Statistics maintained when running benchmark tasks.
| org.apache.lucene.benchmark.byTask.tasks |
Extendable benchmark tasks.
| org.apache.lucene.benchmark.byTask.utils |
Utilities used for the benchmark, and for the reports.
| org.apache.lucene.benchmark.quality |
Search Quality Benchmarking.
This package allows benchmarking the search quality of a Lucene application.
In order to use this package you should provide a search index, quality queries, and judging information, as demonstrated in the sample code below.
For benchmarking TREC collections with TREC QRels, take a look at the
trec package.
Here is sample code used to run the TREC 2006 queries 701-850 on the .Gov2 collection:
File topicsFile = new File("topics-701-850.txt");
File qrelsFile = new File("qrels-701-850.txt");
Searcher searcher = new IndexSearcher("index");
int maxResults = 1000;
String docNameField = "docname";
PrintWriter logger = new PrintWriter(System.out,true);
// use trec utilities to read trec topics into quality queries
TrecTopicsReader qReader = new TrecTopicsReader();
QualityQuery qqs[] = qReader.readQueries(new BufferedReader(new FileReader(topicsFile)));
// prepare judge, with trec utilities that read from a QRels file
Judge judge = new TrecJudge(new BufferedReader(new FileReader(qrelsFile)));
// validate topics & judgments match each other
judge.validateData(qqs, logger);
// set the parsing of quality queries into Lucene queries.
QualityQueryParser qqParser = new SimpleQQParser("title", "body");
// run the benchmark
QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, docNameField);
SubmissionReport submitLog = null;
QualityStats stats[] = qrun.execute(maxResults, judge, submitLog, logger);
// print an average sum of the results
QualityStats avg = QualityStats.average(stats);
avg.log("SUMMARY",2,logger, " ");
Some immediate ways to adapt this program to your needs are to supply your own quality queries, QualityQueryParser implementation, or Judge implementation.
| org.apache.lucene.benchmark.quality.trec |
Utilities for Trec related quality benchmarking, feeding from Trec Topics and QRels inputs.
| org.apache.lucene.benchmark.quality.utils |
Miscellaneous utilities for search quality benchmarking: query parsing, submission reports.
| org.apache.lucene.benchmark.standard |
| org.apache.lucene.benchmark.stats |
| org.apache.lucene.benchmark.utils |
| org.apache.lucene.demo |
| org.apache.lucene.demo.html |
| org.apache.lucene.document |
The logical representation of a {@link org.apache.lucene.document.Document} for indexing and searching.
The document package provides the user level logical representation of content to be indexed and searched. The
package also provides utilities for working with {@link org.apache.lucene.document.Document}s and {@link org.apache.lucene.document.Fieldable}s.
Document and Fieldable
A {@link org.apache.lucene.document.Document} is a collection of {@link org.apache.lucene.document.Fieldable}s. A
{@link org.apache.lucene.document.Fieldable} is a logical representation of a user's content that needs to be indexed or stored.
{@link org.apache.lucene.document.Fieldable}s have a number of properties that tell Lucene how to treat the content (like indexed, tokenized,
stored, etc.) See the {@link org.apache.lucene.document.Field} implementation of {@link org.apache.lucene.document.Fieldable}
for specifics on these properties.
Note: it is common to refer to {@link org.apache.lucene.document.Document}s having {@link org.apache.lucene.document.Field}s, even though technically they have
{@link org.apache.lucene.document.Fieldable}s.
Working with Documents
First and foremost, a {@link org.apache.lucene.document.Document} is something created by the user application. It is your job
to create Documents based on the content of the files you are working with in your application (Word, txt, PDF, Excel or any other format).
How this is done is completely up to you. That being said, there are many tools available in other projects that can ease
the process of taking a file and converting it into a Lucene {@link org.apache.lucene.document.Document}. To see an example of this,
take a look at the Lucene demo and the associated source code
for extracting content from HTML.
The {@link org.apache.lucene.document.DateTools} and {@link org.apache.lucene.document.NumberTools} classes are utility
classes to make dates, times and longs searchable (remember, Lucene only searches text).
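For example, a Document with a tokenized text field and a DateTools-encoded date field might be built like this (field names are illustrative):
Document doc = new Document();
doc.add(new Field("title", "Hello Lucene",
    Field.Store.YES, Field.Index.TOKENIZED));
// encode the time as a searchable string, rounded to minute resolution
doc.add(new Field("modified",
    DateTools.timeToString(System.currentTimeMillis(), DateTools.Resolution.MINUTE),
    Field.Store.YES, Field.Index.UN_TOKENIZED));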
The {@link org.apache.lucene.document.FieldSelector} class provides a mechanism to tell Lucene how to load Documents from
storage. If no FieldSelector is used, all Fieldables on a Document will be loaded. As an example of the FieldSelector usage, consider
the common use case of
displaying search results on a web page and then having users click through to see the full document. In this scenario, it is often
the case that there are many small fields and one or two large fields (containing the contents of the original file). Before the FieldSelector,
the full Document had to be loaded, including the large fields, in order to display the results. Now, using the FieldSelector, one
can {@link org.apache.lucene.document.FieldSelectorResult#LAZY_LOAD} the large fields, thus only loading the large fields
when a user clicks on the actual link to view the original content.
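A sketch of that scenario using the SetBasedFieldSelector from this package (field names are illustrative):
// load "title" eagerly, but defer "body" until it is actually accessed
FieldSelector selector = new SetBasedFieldSelector(
    Collections.singleton("title"),   // fields to load immediately
    Collections.singleton("body"));   // fields to load lazily
Document doc = indexReader.document(docId, selector);
String title = doc.get("title");  // already loaded
String body = doc.get("body");    // loaded from storage only now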
| org.apache.lucene.index |
Code to maintain and access indices.
| org.apache.lucene.index.memory |
High-performance single-document main memory Apache Lucene fulltext search index.
| org.apache.lucene.index.store |
| org.apache.lucene.misc |
| org.apache.lucene.queryParser |
A simple query parser implemented with JavaCC.
Note that JavaCC defines lots of public classes, methods and fields
that do not need to be public. These clutter the documentation.
Sorry.
Note that because JavaCC defines a class named Token, org.apache.lucene.analysis.Token
must always be fully qualified in source code in this package.
| org.apache.lucene.queryParser.analyzing |
| org.apache.lucene.queryParser.precedence |
| org.apache.lucene.queryParser.surround.parser |
Surround parser package
This package contains the QueryParser.jj source file for the Surround parser.
Parsing the text of a query results in a SrndQuery in the
org.apache.lucene.queryParser.surround.query package.
| org.apache.lucene.queryParser.surround.query |
Surround query package
This package contains SrndQuery and its subclasses.
The parser in the org.apache.lucene.queryParser.surround.parser package
normally generates a SrndQuery.
For searching, an org.apache.lucene.search.Query is provided by
the SrndQuery.makeLuceneQueryField method.
For this, TermQuery, BooleanQuery and SpanQuery are used from Lucene.
| org.apache.lucene.search |
Code to search indices.
Table Of Contents
- Search Basics
- The Query Classes
- Changing the Scoring
Search
Search over indices.
Applications usually call {@link
org.apache.lucene.search.Searcher#search(Query)} or {@link
org.apache.lucene.search.Searcher#search(Query,Filter)}.
Query Classes
Of the various implementations of
Query, the
TermQuery
is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the
specified
Term,
which is a word that occurs in a certain
Field.
Thus, a TermQuery identifies and scores all
Documents that have a Field with the specified string in it.
Constructing a TermQuery
is as simple as:
TermQuery tq = new TermQuery(new Term("fieldName", "term"));
In this example, the Query identifies all Documents that have the Field named "fieldName"
containing the word "term".
Things start to get interesting when one combines multiple
TermQuery instances into a BooleanQuery.
A BooleanQuery contains multiple
BooleanClauses,
where each clause contains a sub-query (Query
instance) and an operator (from BooleanClause.Occur)
describing how that sub-query is combined with the other clauses:
- SHOULD — Use this operator when a clause can occur in the result set, but is not required.
If a query is made up of all SHOULD clauses, then every document in the result
set matches at least one of these clauses.
- MUST — Use this operator when a clause is required to occur in the result set. Every
document in the result set will match
all such clauses.
- MUST NOT — Use this operator when a
clause must not occur in the result set. No
document in the result set will match
any such clauses.
Boolean queries are constructed by adding two or more
BooleanClause
instances. If too many clauses are added, a TooManyClauses
exception will be thrown during searching. This most often occurs
when a Query
is rewritten into a BooleanQuery with many
TermQuery clauses,
for example by WildcardQuery.
The default setting for the maximum number
of clauses is 1024, but this can be changed via the
static method setMaxClauseCount
in BooleanQuery.
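For example, a query requiring the term "lucene" and merely preferring "apache" (field name illustrative) could be built as:
BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("content", "lucene")), BooleanClause.Occur.MUST);
bq.add(new TermQuery(new Term("content", "apache")), BooleanClause.Occur.SHOULD);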
Phrases
Another common search is to find documents containing certain phrases. This
is handled two different ways:
-
PhraseQuery
— Matches a sequence of
Terms.
PhraseQuery uses a slop factor to determine
how many positions may occur between any two terms in the phrase and still be considered a match.
-
SpanNearQuery
— Matches a sequence of other
SpanQuery
instances. SpanNearQuery allows for
much more
complicated phrase queries since it is constructed from other SpanQuery
instances, instead of only TermQuery
instances.
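For example, a PhraseQuery for "apache lucene" that tolerates one intervening word (field name illustrative):
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("content", "apache"));
pq.add(new Term("content", "lucene"));
pq.setSlop(1); // allow one position of movement between the terms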
The
RangeQuery
matches all documents that occur in the
exclusive range of a lower
Term
and an upper
Term.
For example, one could find all documents
that have terms beginning with the letters a through c. This type of Query is frequently used to
find
documents that occur in a specific date range.
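For instance, a date range over terms indexed as yyyyMMdd strings (the field name and encoding are illustrative):
RangeQuery rq = new RangeQuery(
    new Term("date", "20020101"),
    new Term("date", "20021231"),
    false); // false = endpoints are excluded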
While the
PrefixQuery
has a different implementation, it is essentially a special case of the
WildcardQuery.
The PrefixQuery allows an application
to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing
for the use of * (matches 0 or more characters) and ? (matches exactly one character) wildcards.
Note that the WildcardQuery can be quite slow. Also
note that
WildcardQuery should
not start with * or ?, as these are extremely slow.
To remove this protection and allow a wildcard at the beginning of a term, see method
setAllowLeadingWildcard in
QueryParser.
A
FuzzyQuery
matches documents that contain terms similar to the specified term. Similarity is
determined using
Levenshtein (edit) distance.
This type of query can be useful when accounting for spelling variations in the collection.
Changing Similarity
Chances are DefaultSimilarity is sufficient for all
your searching needs.
However, in some applications it may be necessary to customize your Similarity implementation. For instance, some
applications do not need to
distinguish between shorter and longer documents (see a "fair" similarity).
To change Similarity, one must do so for both indexing and
searching, and the changes must happen before
either of these actions take place. Although in theory there is nothing stopping you from changing mid-stream, it
just isn't well-defined what is going to happen.
To make this change, implement your own Similarity (likely
you'll want to simply subclass
DefaultSimilarity) and then use the new
class by calling
IndexWriter.setSimilarity
before indexing and
Searcher.setSimilarity
before searching.
If you are interested in use cases for changing your similarity, see the Lucene users' mailing list thread Overriding Similarity.
In summary, here are a few use cases:
SweetSpotSimilarity — SweetSpotSimilarity gives small increases
as the frequency increases a small amount
and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is
more significant.
Overriding tf — In some applications, it doesn't matter what the score of a document is as long as a
matching term occurs. In these
cases people have overridden Similarity to return 1 from the tf() method.
Changing Length Normalization — By overriding lengthNorm,
it is possible to discount how the length of a field contributes
to a score. In DefaultSimilarity,
lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
1 / (numTerms in field), all fields will be treated
"fairly".
In general, Chris Hostetter sums it up best in saying (from the Lucene users's mailing list):
[One would override the Similarity in] ... any situation where you know more about your data than just
that
it's "text" is a situation where it *might* make sense to override your
Similarity method.
Changing Scoring — Expert Level
Changing scoring is an expert level task, so tread carefully and be prepared to share your code if
you want help.
With the warning out of the way, it is possible to change a lot more than just the Similarity
when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by
three main classes:
-
Query — The abstract object representation of the
user's information need.
-
Weight — The internal interface representation of
the user's Query, so that Query objects may be reused.
-
Scorer — An abstract class containing common
functionality for scoring. Provides both scoring and explanation capabilities.
Details on each of these classes, and their children, can be found in the subsections below.
The Query Class
In some sense, the
Query
class is where it all begins. Without a Query, there would be
nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it
is often responsible
for creating them or coordinating the functionality between them. The
Query class has several methods that are important for
derived classes:
- createWeight(Searcher searcher) — A
Weight is the internal representation of the
Query, so each Query implementation must
provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight
interface.
- rewrite(IndexReader reader) — Rewrites queries into primitive queries. Primitive queries are:
TermQuery,
BooleanQuery, OTHERS????
The Weight Interface
The
Weight
interface provides an internal representation of the Query so that it can be reused. Any
Searcher
dependent state should be stored in the Weight implementation,
not in the Query class. The interface defines six methods that must be implemented:
-
Weight#getQuery() — Pointer to the
Query that this Weight represents.
-
Weight#getValue() — The weight for
this Query. For example, the TermQuery.TermWeight value is
equal to the idf^2 * boost * queryNorm
-
Weight#sumOfSquaredWeights() — The sum of squared weights. For TermQuery, this is (idf *
boost)^2
-
Weight#normalize(float) — Determine the query normalization factor. The query normalization may
allow for comparing scores between queries.
-
Weight#scorer(IndexReader) — Construct a new
Scorer
for this Weight. See
The Scorer Class
below for help defining a Scorer. As the name implies, the
Scorer is responsible for doing the actual scoring of documents given the Query.
-
Weight#explain(IndexReader, int) — Provide a means for explaining why a given document was
scored
the way it was.
The Scorer Class
The
Scorer
abstract class provides common scoring functionality for all Scorer implementations and
is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which
must be implemented:
-
Scorer#next() — Advances to the next
document that matches this Query, returning true if and only
if there is another document that matches.
-
Scorer#doc() — Returns the id of the
Document
that contains the match. It is not valid until next() has been called at least once.
-
Scorer#score() — Return the score of the
current document. This value can be determined in any
appropriate way for an application. For instance, the
TermScorer
returns the tf * Weight.getValue() * fieldNorm.
-
Scorer#skipTo(int) — Skip ahead in
the document matches to the document whose id is greater than
or equal to the passed in value. In many instances, skipTo can be
implemented more efficiently than simply looping through all the matching documents until
the target document is identified.
-
Scorer#explain(int) — Provides
details on why the score came about.
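Putting these methods together, a consumer of a Scorer (normally Lucene itself) drives matching roughly like this (a sketch, assuming an existing scorer instance):
while (scorer.next()) {           // advance to the next matching document
  int doc = scorer.doc();         // id of the current match
  float score = scorer.score();   // its score for this Query
  // ... collect the (doc, score) pair ...
}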
Why would I want to add my own Query?
In a nutshell, you want to add your own custom Query implementation when you think that Lucene's
aren't appropriate for the
task that you want to do. You might be doing some cutting edge research or you need more information
back
out of Lucene (similar to Doug adding SpanQuery functionality).
Examples
FILL IN HERE
| org.apache.lucene.search.function |
Programmatic control over documents scores.
The function package provides tight control over documents scores.
WARNING: The status of the search.function package is experimental. The APIs
introduced here might change in the future and will not be supported anymore
in such a case.
Two types of queries are available in this package:
-
Custom Score queries - allowing one to set the score
of a matching document as a mathematical expression over scores
of that document by contained (sub) queries.
-
Field score queries - allowing one to base the score of a
document on numeric values of indexed fields.
Some possible uses of these queries:
-
Normalizing the document scores by values indexed in a special field -
for instance, experimenting with a different doc length normalization.
-
Introducing some static scoring element to the score of a document -
for instance using some topological attribute of the links to/from a document.
-
Computing the score of a matching document as an arbitrary odd function of
its score by a certain query.
Performance and Quality Considerations:
-
When scoring by values of indexed fields,
these values are loaded into memory.
Unlike the regular scoring, where the required information is read from
disk as necessary, here field values are loaded once and cached by Lucene in memory
for further use, anticipating reuse by further queries. While all this is carefully
cached with performance in mind, it is recommended to
use these features only when the default Lucene scoring does
not match your "special" application needs.
-
Use only with carefully selected fields, because in most cases,
search quality with regular Lucene scoring
would outperform that of scoring by field values.
-
Values of fields used for scoring should match.
Do not apply on a field containing arbitrary (long) text.
Do not mix values in the same field if that field is used for scoring.
-
Smaller (shorter) field tokens mean less RAM (something always desired).
When using FieldScoreQuery,
select the shortest FieldScoreQuery.Type
that is sufficient for the used field values.
-
Reusing IndexReaders/IndexSearchers is essential, because the caching of field tokens
is based on an IndexReader. Whenever a new IndexReader is used, values currently in the cache
cannot be used and new values must be loaded from disk. So replace/refresh readers/searchers in
a controlled manner.
History and Credits:
-
A large part of the code of this package originated from Yonik's FunctionQuery code that was
imported from Solr
(see LUCENE-446).
-
The idea behind CustomScoreQurey is borrowed from
the "Easily create queries that transform sub-query scores arbitrarily" contribution by Mike Klaas
(see LUCENE-850)
though the implementation and API here are different.
Code sample:
Note: code snippets here should work, but they were never really compiled... so,
the test sources under TestCustomScoreQuery, TestFieldScoreQuery and TestOrdValues
may also be useful.
-
Using field (byte) values as scores:
Indexing:
f = new Field("score", "7", Field.Store.NO, Field.Index.UN_TOKENIZED);
f.setOmitNorms(true);
d1.add(f);
Search:
Query q = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
Document d1 above would get a score of 7.
-
Manipulating scores
Dividing the original score of each document by the square root of its docid
(just to demonstrate what it takes to manipulate scores this way)
Query q = queryParser.parse("my query text");
CustomScoreQuery customQ = new CustomScoreQuery(q) {
  public float customScore(int doc, float subQueryScore, float valSrcScore) {
    return (float) (subQueryScore / Math.sqrt(doc));
  }
};
For more informative debug info on the custom query, also override the name() method:
CustomScoreQuery customQ = new CustomScoreQuery(q) {
  public float customScore(int doc, float subQueryScore, float valSrcScore) {
    return (float) (subQueryScore / Math.sqrt(doc));
  }
  public String name() {
    return "1/sqrt(docid)";
  }
};
Taking the square root of the original score and multiplying it by a "short field driven score", i.e. the
short value that was indexed for the scored doc in a certain field:
Query q = queryParser.parse("my query text");
FieldScoreQuery qf = new FieldScoreQuery("shortScore", FieldScoreQuery.Type.SHORT);
CustomScoreQuery customQ = new CustomScoreQuery(q, qf) {
  public float customScore(int doc, float subQueryScore, float valSrcScore) {
    return (float) (Math.sqrt(subQueryScore) * valSrcScore);
  }
  public String name() {
    return "shortVal*sqrt(score)";
  }
};
| org.apache.lucene.search.highlight |
The highlight package contains classes to provide "keyword in context" features
typically used to highlight search terms in the text of results pages.
The Highlighter class is the central component and can be used to extract the
most interesting sections of a piece of text and highlight them, with the help of
Fragmenter, FragmentScorer, Formatter classes.
Example Usage
//... Above, create documents with two fields, one with term vectors (tv) and one without (notv)
IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser("notv", analyzer);
Query query = parser.parse("million");
//query = query.rewrite(reader); //required to expand search terms
Hits hits = searcher.search(query);
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
for (int i = 0; i < 10; i++) {
String text = hits.doc(i).get("notv");
TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), hits.id(i), "notv", analyzer);
TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);//highlighter.getBestFragments(tokenStream, text, 3, "...");
for (int j = 0; j < frag.length; j++) {
if ((frag[j] != null) && (frag[j].getScore() > 0)) {
System.out.println((frag[j].toString()));
}
}
//Term vector
text = hits.doc(i).get("tv");
tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), hits.id(i), "tv", analyzer);
frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);
for (int j = 0; j < frag.length; j++) {
if ((frag[j] != null) && (frag[j].getScore() > 0)) {
System.out.println((frag[j].toString()));
}
}
System.out.println("-------------");
}
New features 06/02/2005
This release adds options for encoding (thanks to Nicko Cadell).
An "Encoder" implementation such as the new SimpleHTMLEncoder class can be passed to the highlighter to encode
all those non-xhtml standard characters such as & into legal values. This simple class may not suffice for
some languages - Commons Lang has an implementation that could be used: escapeHtml(String) in
http://svn.apache.org/viewcvs.cgi/jakarta/commons/proper/lang/trunk/src/java/org/apache/commons/lang/StringEscapeUtils.java?rev=137958&view=markup
New features 22/12/2004
This release adds some new capabilities:
- Faster highlighting using Term vector support
- New formatting options to use color intensity to show informational value
- Options for better summarization by using term IDF scores to influence fragment selection
The highlighter takes a TokenStream as input. Until now these streams have typically been produced
using an Analyzer but the new class TokenSources provides helper methods for obtaining TokenStreams from
the new TermVector position support (see latest CVS version).
The new class GradientFormatter can use a scale of colors to highlight terms according to their score.
A subtle use of color can help emphasise the reasons for matching (useful when doing "MoreLikeThis" queries and
you want to see what the basis of the similarities are).
The QueryScorer class has a new constructor which can use an IndexReader to derive the IDF (inverse document frequency)
for each term in order to influence the score. This is useful for helping to extract the most significant sections
of a document and in supplying scores used by the new GradientFormatter to color significant words more strongly.
The QueryScorer.getMaxWeight method is useful when passed to the GradientFormatter constructor to define the top score
which is associated with the top color.
| org.apache.lucene.search.payloads |
The payloads package provides Query mechanisms for finding and using payloads.
The following Query implementations are provided:
- BoostingTermQuery -- Boost a term's score based on the value of the payload located at that term
| org.apache.lucene.search.regex |
Regular expression Query.
| org.apache.lucene.search.similar |
Document similarity query generators.
| org.apache.lucene.search.spans |
The calculus of spans.
A span is a <doc,startPosition,endPosition> tuple.
The following span query operators are implemented:
- A SpanTermQuery matches all spans
containing a particular Term.
- A SpanNearQuery matches spans
which occur near one another, and can be used to implement things like
phrase search (when constructed from SpanTermQueries) and inter-phrase
proximity (when constructed from other SpanNearQueries).
- A SpanOrQuery merges spans from a
number of other SpanQueries.
- A SpanNotQuery removes spans
matching one SpanQuery which overlap
another. This can be used, e.g., to implement within-paragraph
search.
- A SpanFirstQuery matches spans
matching q whose end position is less than n.
This can be used to constrain matches to the first
part of the document.
In all cases, output spans are minimally inclusive. In other words, a
span formed by matching a span in x and y starts at the lesser of the
two starts and ends at the greater of the two ends.
For example, a span query which matches "John Kerry" within ten
words of "George Bush" within the first 100 words of the document
could be constructed with:
SpanQuery john = new SpanTermQuery(new Term("content", "john"));
SpanQuery kerry = new SpanTermQuery(new Term("content", "kerry"));
SpanQuery george = new SpanTermQuery(new Term("content", "george"));
SpanQuery bush = new SpanTermQuery(new Term("content", "bush"));
SpanQuery johnKerry =
new SpanNearQuery(new SpanQuery[] {john, kerry}, 0, true);
SpanQuery georgeBush =
new SpanNearQuery(new SpanQuery[] {george, bush}, 0, true);
SpanQuery johnKerryNearGeorgeBush =
new SpanNearQuery(new SpanQuery[] {johnKerry, georgeBush}, 10, false);
SpanQuery johnKerryNearGeorgeBushAtStart =
new SpanFirstQuery(johnKerryNearGeorgeBush, 100);
Span queries may be freely intermixed with other Lucene queries.
So, for example, the above query can be restricted to documents which
also use the word "iraq" with:
Query query = new BooleanQuery();
query.add(johnKerryNearGeorgeBushAtStart, true, false);
query.add(new TermQuery(new Term("content", "iraq")), true, false);
| org.apache.lucene.search.spell |
Suggest alternate spellings for words.
Also see the spell checker Wiki page.
| org.apache.lucene.store |
Binary i/o API, used for all index data.
| org.apache.lucene.store.db |
| org.apache.lucene.store.je |
| org.apache.lucene.swing.models |
Decorators for JTable TableModel and JList ListModel encapsulating Lucene indexing and searching functionality.
| org.apache.lucene.util |
Some utility classes.
| org.apache.lucene.wikipedia.analysis |
| org.apache.lucene.wordnet |
WordNet Lucene Synonyms Integration
This package uses synonyms defined by WordNet to build a
Lucene index storing them, which in turn can be used for query expansion.
You normally run {@link org.apache.lucene.wordnet.Syns2Index} once to build the query index/"database", and then call
{@link org.apache.lucene.wordnet.SynExpand#expand SynExpand.expand(...)} to expand a query.
Instructions
- Download the WordNet prolog database, gunzip, untar, etc.
- Invoke Syns2Index as appropriate to build a synonym index.
It takes 2 arguments: the path to wn_s.pl from that WordNet download, and the index name.
- Update your UI so that as appropriate you call SynExpand.expand(...) to expand user queries with synonyms.
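A sketch of the expansion call, assuming the SynExpand.expand(query, searcher, analyzer, field, boost) signature (the index path, field name and boost are illustrative):
Searcher syns = new IndexSearcher("synindex"); // the index built by Syns2Index
Query expanded = SynExpand.expand("big fast car",
    syns, new StandardAnalyzer(), "contents", 0.9f);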
| org.apache.lucene.xmlparser |
| org.apache.lucene.xmlparser.builders |
| org.apache.regexp |
This package exists to allow access to useful package protected data within
Jakarta Regexp. This data has now been opened up with an accessor, but
an official release with that change has not been made to date.