| org.apache.lucene.analysis.cjk.CJKTokenizer
CJKTokenizer | final public class CJKTokenizer extends Tokenizer (Code) | | CJKTokenizer was modified from StopTokenizer, which does a decent job for
most European languages. It tokenizes double-byte characters differently:
a token is returned for every two adjacent characters, with overlapping matches.
Example: "java C1C2C3C4" (where C1 to C4 each stand for one double-byte character)
is segmented into "java" "C1C2" "C2C3" "C3C4".
Zero-length tokens ("") also need to be filtered out.
Digits: a digit, '+', or '#' is tokenized as a letter.
For more information on Asian-language (Chinese, Japanese, Korean) text segmentation,
please search Google.
author: Che, Dong |
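The overlapping two-character segmentation described above can be illustrated with a small standalone sketch. This is not the Lucene implementation: the class `CjkBigramSketch` and its helper methods are hypothetical names introduced here, and the code assumes that "double-byte" means any letter outside the Basic Latin block, that a lone double-byte character is emitted as a single token, and that supplementary (surrogate-pair) characters can be ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and classification rules are assumptions. */
public class CjkBigramSketch {

    /** Basic Latin letters and digits, plus '+' and '#', count as word characters. */
    static boolean isLatinWordChar(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN
                && (Character.isLetterOrDigit(c) || c == '+' || c == '#');
    }

    /** Assumption: any other letter is treated as a double-byte character. */
    static boolean isDoubleByteChar(char c) {
        return Character.isLetter(c) && !isLatinWordChar(c);
    }

    /** Splits text into Latin words and overlapping two-character tokens. */
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int start = i;
            if (isLatinWordChar(c)) {
                // Consume the whole run as one token, e.g. "java" or "c++".
                while (i < text.length() && isLatinWordChar(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            } else if (isDoubleByteChar(c)) {
                // Find the end of the double-byte run.
                while (i < text.length() && isDoubleByteChar(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    // Assumption: a lone character becomes its own token.
                    tokens.add(text.substring(start, i));
                } else {
                    // Emit overlapping pairs: C1C2, C2C3, C3C4, ...
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else {
                // Separators are skipped, so no zero-length token is produced.
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four CJK characters stand in for C1C2C3C4 from the example.
        System.out.println(segment("java \u4e2d\u6587\u5206\u8bcd"));
    }
}
```

Because separators are skipped rather than emitted, this sketch never yields an empty token, which sidesteps the zero-length filtering step mentioned in the description.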
Constructor Summary | |
public | CJKTokenizer(Reader in) Construct a token stream processing the given input. |
Method Summary | |
final public Token | next() Returns the next token in the stream, or null at EOS. |
CJKTokenizer | public CJKTokenizer(Reader in)(Code) | | Construct a token stream processing the given input.
Parameters: in - I/O reader |
next | final public Token next() throws java.io.IOException(Code) | | Returns the next token in the stream, or null at EOS.
See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html
for details on Character.UnicodeBlock.
Returns: Token
Throws: java.io.IOException - if a read error occurs in the input stream |
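The "next token, or null at EOS" contract implies a simple pull loop on the caller's side. The sketch below shows that loop against a stand-in class, since it is meant to run without Lucene on the classpath: `WhitespacePullTokenizer` and `drain` are hypothetical names, the stand-in only splits on whitespace, and it returns a String rather than a Token. Only the shape of the loop and the null-at-end convention are taken from the method description above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of consuming a pull-style tokenizer. */
public class NextLoopSketch {

    /** Hypothetical stand-in for a tokenizer; splits on whitespace only. */
    static final class WhitespacePullTokenizer {
        private final Reader in;

        WhitespacePullTokenizer(Reader in) {
            this.in = in;
        }

        /** Returns the next token, or null at end of stream. */
        String next() throws IOException {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        break; // a token is complete
                    }
                    // Leading whitespace is skipped.
                } else {
                    sb.append((char) c);
                }
            }
            return sb.length() > 0 ? sb.toString() : null;
        }
    }

    /** Calls next() until it returns null and collects the tokens. */
    static List<String> drain(String text) {
        WhitespacePullTokenizer tokenizer =
                new WhitespacePullTokenizer(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            String token;
            while ((token = tokenizer.next()) != null) {
                tokens.add(token);
            }
        } catch (IOException e) {
            // A read error surfaces here, mirroring the declared IOException.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(drain("java lucene  cjk"));
    }
}
```

With the real class, the same loop would presumably wrap a `CJKTokenizer` constructed from a `Reader` and test the returned `Token` against null.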