java.lang.Object
  org.apache.lucene.analysis.Token

Token
public class Token implements Cloneable

A Token is an occurrence of a term from the text of a field. It consists of
a term's text, the start and end offset of the term in the text of the field,
and a type string.
The start and end offsets permit applications to re-associate a token with
its source text, e.g., to display highlighted query terms in a document
browser, or to show matching text fragments in a KWIC (KeyWord In Context)
display, etc.
The type is an interned string, assigned by a lexical analyzer
(a.k.a. tokenizer), naming the lexical or syntactic class that the token
belongs to. For example, an end-of-sentence marker token might be implemented
with type "eos". The default token type is "word".
A Token can optionally have metadata (a.k.a. Payload) in the form of a variable
length byte array. Use
TermPositions.getPayloadLength and
TermPositions.getPayload(byte[], int) to retrieve the payloads from the index.
WARNING: The status of the Payloads feature is experimental.
The APIs introduced here might change in the future, in which case they
will no longer be supported.
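The sketch below is one way to read those payloads back with the TermPositions methods named above. It is an illustrative snippet, not part of Lucene: the class name, the field "contents", and the term "lucene" are made-up sample values.

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.Directory;

public class PayloadReadSketch {
  // Dumps the payloads stored for a single term. The field name "contents"
  // and the term "lucene" are assumed sample values.
  public static void dumpPayloads(Directory dir) throws IOException {
    IndexReader reader = IndexReader.open(dir);
    TermPositions positions = reader.termPositions(new Term("contents", "lucene"));
    try {
      while (positions.next()) {                     // one entry per matching document
        for (int i = 0; i < positions.freq(); i++) {
          positions.nextPosition();                  // advance before reading the payload
          if (positions.isPayloadAvailable()) {
            byte[] payload =
                positions.getPayload(new byte[positions.getPayloadLength()], 0);
            System.out.println("doc " + positions.doc()
                + ": " + payload.length + " payload bytes");
          }
        }
      }
    } finally {
      positions.close();
      reader.close();
    }
  }
}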
NOTE: As of 2.3, Token stores the term text
internally as a malleable char[] termBuffer instead of
String termText. The indexing code and core tokenizers
have been changed to re-use a single Token instance, changing
its buffer and other fields in-place as the Token is
processed. This provides substantially better indexing
performance as it saves the GC cost of new'ing a Token and
String for every term. The APIs that accept String
termText are still available but a warning about the
associated performance cost has been added (below). The
Token.termText() method has been deprecated.
Tokenizers and filters should try to re-use a Token
instance when possible for best performance, by
implementing the
TokenStream.next(Token) API.
Failing that, to create a new Token you should first use
one of the constructors that start with null text. Then
you should call either
Token.termBuffer() or
Token.resizeTermBuffer(int) to retrieve the Token's
termBuffer. Fill in the characters of your term into this
buffer, and finally call
Token.setTermLength(int) to
set the length of the term text. See LUCENE-969
for details.
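As a concrete illustration of that pattern, here is a minimal sketch that builds a Token without going through the String-based APIs. The class and method names and the sample term are invented for this example; the Token calls are the ones described above.

import org.apache.lucene.analysis.Token;

public class TokenBuildSketch {
  // Builds a Token as described above: start from a constructor with
  // null text, fill the termBuffer directly, then record the valid length.
  public static Token makeToken(String term, int start) {
    Token token = new Token(start, start + term.length()); // null text, offsets only
    char[] buffer = token.resizeTermBuffer(term.length()); // ensure capacity
    term.getChars(0, term.length(), buffer, 0);            // copy the characters in
    token.setTermLength(term.length());                    // mark how many chars are valid
    return token;
  }
}

A filter implementing TokenStream.next(Token) would apply the same calls to the Token instance passed in, rather than allocating a new one.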
See Also: org.apache.lucene.index.Payload
Constructor Summary

public Token()
    Constructs a Token with null text.
public Token(int start, int end)
    Constructs a Token with null text and start & end offsets.
public Token(int start, int end, String typ)
    Constructs a Token with null text and start & end offsets plus the Token type.
public Token(String text, int start, int end)
    Constructs a Token with the given term text, and start & end offsets.
public Token(String text, int start, int end, String typ)
    Constructs a Token with the given text, start and end offsets, & type.
Method Summary

public void clear()
    Resets the term text, payload, and positionIncrement to default. Other fields such as startOffset, endOffset and the token type are not reset since they are normally overwritten by the tokenizer.
public Object clone()
final public int endOffset()
    Returns this Token's ending offset, one greater than the position of the last character corresponding to this token in the source text.
public Payload getPayload()
    Returns this Token's payload.
public int getPositionIncrement()
    Returns the position increment of this Token.
public char[] resizeTermBuffer(int newSize)
    Grows the termBuffer to at least size newSize.
public void setEndOffset(int offset)
    Set the ending offset.
public void setPayload(Payload payload)
    Sets this Token's payload.
public void setPositionIncrement(int positionIncrement)
    Set the position increment.
public void setStartOffset(int offset)
    Set the starting offset.
final public void setTermBuffer(char[] buffer, int offset, int length)
    Copies the contents of buffer, starting at offset for length characters, into the termBuffer array.
final public void setTermLength(int length)
    Set number of valid characters (length of the term) in the termBuffer array.
public void setTermText(String text)
    Sets the Token's term text.
final public void setType(String type)
    Set the lexical type.
final public int startOffset()
    Returns this Token's starting offset, the position of the first character corresponding to this token in the source text. Note that the difference between endOffset() and startOffset() may not be equal to termText.length(), as the term text may have been altered by a stemmer or some other filter.
final public char[] termBuffer()
    Returns the internal termBuffer character array which you can then directly alter.
final public int termLength()
    Return number of valid characters (length of the term) in the termBuffer array.
final public String termText()
    Returns the Token's term text.
public String toString()
final public String type()
    Returns this Token's lexical type.
Field Detail

positionIncrement
int positionIncrement

startOffset
int startOffset

termBuffer
char[] termBuffer

termLength
int termLength
Constructor Detail

Token
public Token()

Constructs a Token with null text.
Token
public Token(int start, int end)

Constructs a Token with null text and start & end offsets.
Parameters:
  start - start offset
  end - end offset
Token
public Token(int start, int end, String typ)

Constructs a Token with null text and start & end offsets plus the Token type.
Parameters:
  start - start offset
  end - end offset
  typ - token type
Token
public Token(String text, int start, int end)

Constructs a Token with the given term text, and start & end offsets. The type defaults to "word".
NOTE: for better indexing speed you should instead use the char[] termBuffer methods to set the term text.
Parameters:
  text - term text
  start - start offset
  end - end offset
Token
public Token(String text, int start, int end, String typ)

Constructs a Token with the given text, start and end offsets, & type.
NOTE: for better indexing speed you should instead use the char[] termBuffer methods to set the term text.
Parameters:
  text - term text
  start - start offset
  end - end offset
  typ - token type
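For quick tests or one-off tokens the String-based constructors are convenient. A small hypothetical example follows; the source text, offsets, and the "eos" type are made-up sample values.

import org.apache.lucene.analysis.Token;

public class TokenConstructionSketch {
  public static void main(String[] args) {
    // Tokens for the (assumed) source text "The quick fox.":
    Token word = new Token("quick", 4, 9);       // type defaults to "word"
    Token eos  = new Token(".", 13, 14, "eos");  // explicit lexical type
    System.out.println(word + " / " + eos);
  }
}

As the NOTE above says, production tokenizers should prefer the char[] termBuffer path shown earlier.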
Method Detail

clear
public void clear()

Resets the term text, payload, and positionIncrement to default.
Other fields such as startOffset, endOffset and the token type are
not reset since they are normally overwritten by the tokenizer.
endOffset
final public int endOffset()

Returns this Token's ending offset, one greater than the position of the
last character corresponding to this token in the source text.
getPayload
public Payload getPayload()

Returns this Token's payload.
resizeTermBuffer
public char[] resizeTermBuffer(int newSize)

Grows the termBuffer to at least size newSize.
Parameters:
  newSize - minimum size of the new termBuffer
Returns: newly created termBuffer with length >= newSize
setPayload
public void setPayload(Payload payload)

Sets this Token's payload.
setPositionIncrement
public void setPositionIncrement(int positionIncrement)

Set the position increment. This determines the position of this token
relative to the previous Token in a TokenStream, used in phrase searching.
The default value is one.
Some common uses for this are:
- Set it to zero to put multiple terms in the same position. This is
  useful if, e.g., a word has multiple stems. Searches for phrases
  including either stem will match. In this case, all but the first stem's
  increment should be set to zero: the increment of the first instance
  should be one. Repeating a token with an increment of zero can also be
  used to boost the scores of matches on that token. (See the sketch
  following this entry.)
- Set it to values greater than one to inhibit exact phrase matches.
  If, for example, one does not want phrases to match across removed stop
  words, then one could build a stop word filter that removes stop words and
  also sets the increment to the number of stop words removed before each
  non-stop word. Then exact phrase queries will only match when the terms
  occur with no intervening stop words.
See Also: org.apache.lucene.index.TermPositions
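As a sketch of the first use above, the hypothetical filter below injects the synonym "vehicle" whenever it sees the term "car", at the same position as the original token. The class name and the word pair are invented for this example; this is not Lucene's own synonym handling.

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class CarSynonymSketchFilter extends TokenFilter {
  private Token pending;  // synonym waiting to be emitted

  public CarSynonymSketchFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    if (pending != null) {            // emit the stacked synonym first
      Token synonym = pending;
      pending = null;
      return synonym;
    }
    Token token = input.next();
    if (token == null) {
      return null;
    }
    String term = new String(token.termBuffer(), 0, token.termLength());
    if ("car".equals(term)) {
      pending = new Token("vehicle", token.startOffset(), token.endOffset(), token.type());
      pending.setPositionIncrement(0);  // same position as "car"
    }
    return token;
  }
}

Because the synonym's increment is zero, phrase queries containing either "car" or "vehicle" will match at that position.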
setTermBuffer
final public void setTermBuffer(char[] buffer, int offset, int length)

Copies the contents of buffer, starting at offset for length characters,
into the termBuffer array.
NOTE: for better indexing speed you should instead retrieve the termBuffer,
using Token.termBuffer() or Token.resizeTermBuffer(int), and fill it in
directly to set the term text. This saves an extra copy.
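A brief sketch of the copying variant; the buffer contents and offsets are sample values made up for illustration.

import org.apache.lucene.analysis.Token;

public class SetTermBufferSketch {
  public static void main(String[] args) {
    // Pretend ioBuffer holds raw characters read from a Reader; the term
    // "lucene" occupies positions 4..9 of that buffer.
    char[] ioBuffer = "xxx lucene yyy".toCharArray();
    Token token = new Token(4, 10);        // offsets of "lucene" in the source
    token.setTermBuffer(ioBuffer, 4, 6);   // copies the 6 characters into termBuffer
    System.out.println(token);
  }
}

A tokenizer that already owns such a buffer can avoid this copy by writing into Token.termBuffer() directly, as the NOTE above suggests.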
setTermLength
final public void setTermLength(int length)

Set number of valid characters (length of the term) in
the termBuffer array.
setTermText
public void setTermText(String text)

Sets the Token's term text.
NOTE: for better indexing speed you should instead use the char[]
termBuffer methods to set the term text.
startOffset
final public int startOffset()

Returns this Token's starting offset, the position of the first character
corresponding to this token in the source text.
Note that the difference between endOffset() and startOffset() may not be
equal to termText.length(), as the term text may have been altered by a
stemmer or some other filter.
termBuffer
final public char[] termBuffer()

Returns the internal termBuffer character array which you can then directly
alter. If the array is too small for your token, use
Token.resizeTermBuffer(int) to increase it. After altering the buffer be
sure to call Token.setTermLength(int) to record the number of valid
characters that were placed into the termBuffer.
termLength
final public int termLength()

Return number of valid characters (length of the term)
in the termBuffer array.
type
final public String type()

Returns this Token's lexical type. Defaults to "word".