java.lang.Object
  org.apache.lucene.search.highlight.TokenSources

public class TokenSources

    Hides implementation issues associated with obtaining a TokenStream for use with the
    highlighter. A TokenStream can be obtained either from TermFreqVectors with offsets and
    (optionally) positions, or from the Analyzer class by re-parsing the stored content.

    author: maharwood
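For TokenSources to build a token stream from term vectors rather than by re-analysis, the
field must have been indexed with offsets (and, for the fastest path, positions) stored. A
minimal indexing sketch, assuming the Lucene 2.x-era Field API; the field name "content" and
the variable someText are hypothetical placeholders:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Store the original text and record offsets plus positions in the
    // term vector so TokenSources can rebuild a TokenStream without
    // re-parsing the stored content.
    Document doc = new Document();
    doc.add(new Field("content", someText,
        Field.Store.YES,
        Field.Index.TOKENIZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));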
Method Summary

public static TokenStream getAnyTokenStream(IndexReader reader, int docId, String field, Analyzer analyzer)
    A convenience method that tries a number of approaches to getting a token stream.
    The cost of finding there are no term vectors in the index is minimal (1,000 invocations
    still register 0 ms).

public static TokenStream getTokenStream(TermPositionVector tpv)

public static TokenStream getTokenStream(TermPositionVector tpv, boolean tokenPositionsGuaranteedContiguous)
    Low-level API. Returns a token stream, or null if no offset info is available in the
    index. Can be used to feed the highlighter with a pre-parsed token stream; see the
    detailed documentation below.

public static TokenStream getTokenStream(IndexReader reader, int docId, String field)

public static TokenStream getTokenStream(IndexReader reader, int docId, String field, Analyzer analyzer)
getAnyTokenStream

public static TokenStream getAnyTokenStream(IndexReader reader, int docId, String field, Analyzer analyzer) throws IOException

    A convenience method that tries a number of approaches to getting a token stream. The cost
    of finding there are no term vectors in the index is minimal (1,000 invocations still
    register 0 ms), so this "lazy" (flexible?) approach to coding is probably acceptable.

    Parameters:
        reader -
        docId -
        field -
        analyzer -
    Returns:
        null if the field was not stored correctly
    Throws:
        IOException
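A usage sketch, assuming an open IndexReader, an Analyzer matching the one used at index
time, and a Highlighter already configured with a scorer; the field name "content" is a
hypothetical placeholder:

    // Let getAnyTokenStream pick the cheapest source: term vectors if they
    // are stored for this field, otherwise re-analysis of the stored text.
    TokenStream tokenStream =
        TokenSources.getAnyTokenStream(reader, docId, "content", analyzer);
    String text = reader.document(docId).get("content");
    String fragment = highlighter.getBestFragment(tokenStream, text);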
getTokenStream

public static TokenStream getTokenStream(TermPositionVector tpv, boolean tokenPositionsGuaranteedContiguous)

    Low-level API. Returns a token stream, or null if no offset info is available in the
    index. This can be used to feed the highlighter with a pre-parsed token stream.

    In my tests, the times to recreate 1,000 token streams using this method were:
    - with TermVector offset-only data stored: 420 milliseconds
    - with TermVector offset AND position data stored: 271 milliseconds
      (N.B. timings for TermVector with position data are based on a tokenizer with
      contiguous positions - no overlaps or gaps)

    The cost of not using TermPositionVector to store pre-parsed content, and instead using
    an analyzer to re-parse the original content:
    - re-analyzing the original content: 980 milliseconds

    The re-analysis timings will typically vary depending on:
    1) The complexity of the analyzer code (the timings above used a
       stemmer/lowercaser/stopword combination)
    2) The number of other fields (Lucene reads ALL fields off the disk when accessing just
       one document field - this can cost dearly!)
    3) Use of compression on field storage - this could be faster because of compression
       (less disk IO) or slower (more CPU burn), depending on the content.

    Parameters:
        tpv -
        tokenPositionsGuaranteedContiguous - true if the token position numbers have no
            overlaps or gaps. If looking to eke out the last drops of performance, set to
            true. If in doubt, set to false.
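A sketch of this low-level path, assuming the field was indexed with
Field.TermVector.WITH_POSITIONS_OFFSETS; the field name "content" is a hypothetical
placeholder:

    // Fetch the stored term vector and rebuild the token stream from it,
    // skipping re-analysis entirely. getTokenStream returns null if the
    // index holds no offset information for this field.
    TermPositionVector tpv =
        (TermPositionVector) reader.getTermFreqVector(docId, "content");
    // false is the safe choice unless the tokenizer is known to emit
    // contiguous positions (no overlaps or gaps).
    TokenStream tokenStream = TokenSources.getTokenStream(tpv, false);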