java.lang.Object
  org.apache.lucene.analysis.Token

Token
public class Token implements Cloneable

A Token is an occurrence of a term from the text of a field. It consists of
a term's text, the start and end offset of the term in the text of the field,
and a type string.
The start and end offsets permit applications to re-associate a token with
its source text, e.g., to display highlighted query terms in a document
browser, or to show matching text fragments in a KWIC (KeyWord In Context)
display, etc.
The type is an interned string, assigned by a lexical analyzer
(a.k.a. tokenizer), naming the lexical or syntactic class that the token
belongs to. For example, an end-of-sentence marker token might be implemented
with type "eos". The default token type is "word".
A Token can optionally have metadata (a.k.a. Payload) in the form of a variable
length byte array. Use
TermPositions.getPayloadLength and
TermPositions.getPayload(byte[], int) to retrieve the payloads from the index.
WARNING: The status of the Payloads feature is experimental.
The APIs introduced here might change in the future, in which case they
will no longer be supported.
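The sketch below is one way to read those payloads back with the TermPositions methods named above. It is an illustrative snippet, not part of Lucene: the class name, the field "contents", and the term "lucene" are made-up sample values.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.Directory;

public class PayloadReadSketch {
  // Dumps the payloads stored for a single term. The field name "contents"
  // and the term "lucene" are assumed sample values.
  public static void dumpPayloads(Directory dir) throws IOException {
    IndexReader reader = IndexReader.open(dir);
    TermPositions positions = reader.termPositions(new Term("contents", "lucene"));
    try {
      while (positions.next()) {                     // one entry per matching document
        for (int i = 0; i < positions.freq(); i++) {
          positions.nextPosition();                  // advance before reading the payload
          if (positions.isPayloadAvailable()) {
            byte[] payload =
                positions.getPayload(new byte[positions.getPayloadLength()], 0);
            System.out.println("doc " + positions.doc()
                + ": " + payload.length + " payload bytes");
          }
        }
      }
    } finally {
      positions.close();
      reader.close();
    }
  }
}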
NOTE: As of 2.3, Token stores the term text
internally as a malleable char[] termBuffer instead of
String termText. The indexing code and core tokenizers
have been changed to re-use a single Token instance, changing
its buffer and other fields in-place as the Token is
processed. This provides substantially better indexing
performance as it saves the GC cost of new'ing a Token and
String for every term. The APIs that accept String
termText are still available but a warning about the
associated performance cost has been added (below). The
Token.termText() method has been deprecated.
Tokenizers and filters should try to re-use a Token
instance when possible for best performance, by
implementing the
TokenStream.next(Token) API.
Failing that, to create a new Token you should first use
one of the constructors that start with null text. Then
you should call either
Token.termBuffer() or
Token.resizeTermBuffer(int) to retrieve the Token's
termBuffer. Fill in the characters of your term into this
buffer, and finally call
Token.setTermLength(int) to
set the length of the term text. See LUCENE-969
for details.
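As a concrete illustration of that pattern, here is a minimal sketch that builds a Token without going through the String-based APIs. The class and method names and the sample term are invented for this example; the Token calls are the ones described above.

import org.apache.lucene.analysis.Token;

public class TokenBuildSketch {
  // Builds a Token as described above: start from a constructor with
  // null text, fill the termBuffer directly, then record the valid length.
  public static Token makeToken(String term, int start) {
    Token token = new Token(start, start + term.length()); // null text, offsets only
    char[] buffer = token.resizeTermBuffer(term.length()); // ensure capacity
    term.getChars(0, term.length(), buffer, 0);            // copy the characters in
    token.setTermLength(term.length());                    // mark how many chars are valid
    return token;
  }
}

A filter implementing TokenStream.next(Token) would apply the same calls to the Token instance passed in, rather than allocating a new one.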
See Also: org.apache.lucene.index.Payload
Constructor Summary

public Token()
    Constructs a Token with null text.
public Token(int start, int end)
    Constructs a Token with null text and start & end offsets.
public Token(int start, int end, String typ)
    Constructs a Token with null text and start & end offsets plus the Token type.
public Token(String text, int start, int end)
    Constructs a Token with the given term text, and start & end offsets.
public Token(String text, int start, int end, String typ)
    Constructs a Token with the given text, start and end offsets, & type.
Method Summary

public void clear()
    Resets the term text, payload, and positionIncrement to default. Other fields such as startOffset, endOffset and the token type are not reset since they are normally overwritten by the tokenizer.
public Object clone()
final public int endOffset()
    Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text.
public Payload getPayload()
    Returns this Token's payload.
public int getPositionIncrement()
    Returns the position increment of this Token.
public char[] resizeTermBuffer(int newSize)
    Grows the termBuffer to at least size newSize.
public void setEndOffset(int offset)
    Set the ending offset.
public void setPayload(Payload payload)
    Sets this Token's payload.
public void setPositionIncrement(int positionIncrement)
    Set the position increment.
public void setStartOffset(int offset)
    Set the starting offset.
final public void setTermBuffer(char[] buffer, int offset, int length)
    Copies the contents of buffer, starting at offset for length characters, into the termBuffer array.
final public void setTermLength(int length)
    Set number of valid characters (length of the term) in the termBuffer array.
public void setTermText(String text)
    Sets the Token's term text.
final public void setType(String type)
    Set the lexical type.
final public int startOffset()
    Returns this Token's starting offset, the position of the first character corresponding to this token in the source text. Note that the difference between endOffset() and startOffset() may not be equal to termText.length(), as the term text may have been altered by a stemmer or some other filter.
final public char[] termBuffer()
    Returns the internal termBuffer character array which you can then directly alter.
final public int termLength()
    Return number of valid characters (length of the term) in the termBuffer array.
final public String termText()
    Returns the Token's term text.
public String toString()
final public String type()
    Returns this Token's lexical type.
Field Detail

positionIncrement
int positionIncrement

startOffset
int startOffset

termBuffer
char[] termBuffer

termLength
int termLength
Constructor Detail

Token
public Token()

Constructs a Token with null text.
Token
public Token(int start, int end)

Constructs a Token with null text and start & end offsets.
Parameters:
  start - start offset
  end - end offset
Token
public Token(int start, int end, String typ)

Constructs a Token with null text and start & end offsets plus the Token type.
Parameters:
  start - start offset
  end - end offset
  typ - token type
Token
public Token(String text, int start, int end)

Constructs a Token with the given term text, and start & end offsets. The type defaults to "word".
NOTE: for better indexing speed you should instead use the char[] termBuffer methods to set the term text.
Parameters:
  text - term text
  start - start offset
  end - end offset
Token
public Token(String text, int start, int end, String typ)

Constructs a Token with the given text, start and end offsets, & type.
NOTE: for better indexing speed you should instead use the char[] termBuffer methods to set the term text.
Parameters:
  text - term text
  start - start offset
  end - end offset
  typ - token type
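For quick tests or one-off tokens the String-based constructors are convenient. A small hypothetical example follows; the source text, offsets, and the "eos" type are made-up sample values.

import org.apache.lucene.analysis.Token;

public class TokenConstructionSketch {
  public static void main(String[] args) {
    // Tokens for the (assumed) source text "The quick fox.":
    Token word = new Token("quick", 4, 9);       // type defaults to "word"
    Token eos  = new Token(".", 13, 14, "eos");  // explicit lexical type
    System.out.println(word + " / " + eos);
  }
}

As the NOTE above says, production tokenizers should prefer the char[] termBuffer path shown earlier.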
Method Detail

clear
public void clear()

Resets the term text, payload, and positionIncrement to default.
Other fields such as startOffset, endOffset and the token type are
not reset since they are normally overwritten by the tokenizer.
endOffset
final public int endOffset()

Returns this Token's ending offset, one greater than the position of the
last character corresponding to this token in the source text.
getPayload
public Payload getPayload()

Returns this Token's payload.
resizeTermBuffer
public char[] resizeTermBuffer(int newSize)

Grows the termBuffer to at least size newSize.
Parameters:
  newSize - minimum size of the new termBuffer
Returns: newly created termBuffer with length >= newSize
setPayload
public void setPayload(Payload payload)

Sets this Token's payload.
setPositionIncrement
public void setPositionIncrement(int positionIncrement)

Set the position increment. This determines the position of this token
relative to the previous Token in a TokenStream, used in phrase searching.
The default value is one.
Some common uses for this are:
- Set it to zero to put multiple terms in the same position. This is
  useful if, e.g., a word has multiple stems. Searches for phrases
  including either stem will match. In this case, all but the first stem's
  increment should be set to zero: the increment of the first instance
  should be one. Repeating a token with an increment of zero can also be
  used to boost the scores of matches on that token. (See the sketch
  following this entry.)
- Set it to values greater than one to inhibit exact phrase matches.
  If, for example, one does not want phrases to match across removed stop
  words, then one could build a stop word filter that removes stop words and
  also sets the increment to the number of stop words removed before each
  non-stop word. Then exact phrase queries will only match when the terms
  occur with no intervening stop words.
See Also: org.apache.lucene.index.TermPositions
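As a sketch of the first use above, the hypothetical filter below injects the synonym "vehicle" whenever it sees the term "car", at the same position as the original token. The class name and the word pair are invented for this example; this is not Lucene's own synonym handling.

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class CarSynonymSketchFilter extends TokenFilter {
  private Token pending;  // synonym waiting to be emitted

  public CarSynonymSketchFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    if (pending != null) {            // emit the stacked synonym first
      Token synonym = pending;
      pending = null;
      return synonym;
    }
    Token token = input.next();
    if (token == null) {
      return null;
    }
    String term = new String(token.termBuffer(), 0, token.termLength());
    if ("car".equals(term)) {
      pending = new Token("vehicle", token.startOffset(), token.endOffset(), token.type());
      pending.setPositionIncrement(0);  // same position as "car"
    }
    return token;
  }
}

Because the synonym's increment is zero, phrase queries containing either "car" or "vehicle" will match at that position.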
setTermBuffer
final public void setTermBuffer(char[] buffer, int offset, int length)

Copies the contents of buffer, starting at offset for length characters,
into the termBuffer array.
NOTE: for better indexing speed you should instead retrieve the termBuffer,
using Token.termBuffer() or Token.resizeTermBuffer(int), and fill it in
directly to set the term text. This saves an extra copy.
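A brief sketch of the copying variant; the buffer contents and offsets are sample values made up for illustration.

import org.apache.lucene.analysis.Token;

public class SetTermBufferSketch {
  public static void main(String[] args) {
    // Pretend ioBuffer holds raw characters read from a Reader; the term
    // "lucene" occupies positions 4..9 of that buffer.
    char[] ioBuffer = "xxx lucene yyy".toCharArray();
    Token token = new Token(4, 10);        // offsets of "lucene" in the source
    token.setTermBuffer(ioBuffer, 4, 6);   // copies the 6 characters into termBuffer
    System.out.println(token);
  }
}

A tokenizer that already owns such a buffer can avoid this copy by writing into Token.termBuffer() directly, as the NOTE above suggests.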
setTermLength
final public void setTermLength(int length)

Set number of valid characters (length of the term) in
the termBuffer array.
setTermText
public void setTermText(String text)

Sets the Token's term text.
NOTE: for better indexing speed you should instead use the char[]
termBuffer methods to set the term text.
startOffset
final public int startOffset()

Returns this Token's starting offset, the position of the first character
corresponding to this token in the source text.
Note that the difference between endOffset() and startOffset() may not be
equal to termText.length(), as the term text may have been altered by a
stemmer or some other filter.
termBuffer
final public char[] termBuffer()

Returns the internal termBuffer character array which you can then directly
alter. If the array is too small for your token, use
Token.resizeTermBuffer(int) to increase it. After altering the buffer be
sure to call Token.setTermLength(int) to record the number of valid
characters that were placed into the termBuffer.
termLength
final public int termLength()

Return number of valid characters (length of the term)
in the termBuffer array.
type
final public String type()

Returns this Token's lexical type. Defaults to "word".