| java.lang.Object java.text.BreakIterator java.text.RuleBasedBreakIterator java.text.DictionaryBasedBreakIterator
DictionaryBasedBreakIterator | class DictionaryBasedBreakIterator extends RuleBasedBreakIterator | | A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary
to further subdivide ranges of text beyond what is possible using just the
state-table-based algorithm. This is necessary, for example, to handle
word and line breaking in Thai, which doesn't use spaces between words. The
state-table-based algorithm used by RuleBasedBreakIterator is used to divide
up text as far as possible, and then contiguous ranges of letters are
repeatedly compared against a list of known words (i.e., the dictionary)
to divide them up into words.
DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator,
but adds one more special substitution name: <dictionary>. This substitution
name is used to identify characters in words in the dictionary. The idea is that
if the iterator passes over a chunk of text that includes two or more characters
in a row that are included in <dictionary>, it goes back through that range and
derives additional break positions (if possible) using the dictionary.
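
The dictionary pass described above can be illustrated with a small sketch. This is not the class's actual implementation, which is not shown here. The class name DictionarySplitSketch, the method split, and the use of an in-memory Set<String> as the dictionary are all assumptions made for illustration, and unspaced Latin letters stand in for Thai text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: divides a run of letters into known words.
class DictionarySplitSketch {

    /**
     * Returns the break offsets (relative to the start of run) that divide the
     * run into dictionary words, or an empty list if no complete division exists.
     */
    static List<Integer> split(String run, Set<String> dictionary) {
        int n = run.length();
        // prev[i] is the start of a dictionary word ending at i, provided the
        // text before that word is itself divisible; -1 means "not reachable".
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0; // the empty prefix is trivially divisible
        for (int end = 1; end <= n; end++) {
            for (int start = end - 1; start >= 0; start--) {
                if (prev[start] >= 0 && dictionary.contains(run.substring(start, end))) {
                    prev[end] = start;
                    break;
                }
            }
        }
        List<Integer> breaks = new ArrayList<>();
        if (prev[n] < 0) {
            return breaks; // the run cannot be divided using this dictionary
        }
        // Walk back from the end of the run, collecting each word boundary.
        for (int pos = n; pos > 0; pos = prev[pos]) {
            breaks.add(0, pos);
        }
        return breaks;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("the", "cat", "sat"));
        System.out.println(split("thecatsat", dictionary)); // prints [3, 6, 9]
    }
}
```

When the run cannot be fully matched, the sketch returns no extra breaks, which mirrors the "if possible" qualifier in the description: the boundaries found by the state-table pass would simply be left as they are.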
DictionaryBasedBreakIterator is also constructed with the filename of a dictionary
file. It follows a prescribed search path to locate the dictionary (right now,
it looks for it in /com/ibm/text/resources in each directory in the classpath,
and won't find it in JAR files, but this location is likely to change). The
dictionary file is in a serialized binary format. We have a very primitive (and
slow) BuildDictionaryFile utility for creating dictionary files, but aren't
currently making it public. Contact us for help.
|
Constructor Summary | |
public | DictionaryBasedBreakIterator(String description, InputStream dictionaryStream) Constructs a DictionaryBasedBreakIterator.
Parameters: description - Same as the description parameter on RuleBasedBreakIterator, except for the special meaning of "<dictionary>". |
Method Summary | |
public int | first() Sets the current iteration position to the beginning of the text. |
public int | following(int offset) Sets the current iteration position to the first boundary position after the specified position. |
protected int | handleNext() This is the implementation function for next(). |
public int | last() Sets the current iteration position to the end of the text. |
protected int | lookupCategory(char c) Looks up a character category for a character. |
protected RuleBasedBreakIterator.Builder | makeBuilder() Returns a Builder that is customized to build a DictionaryBasedBreakIterator. |
public int | preceding(int offset) Sets the current iteration position to the last boundary position before the specified position. |
public int | previous() Advances the iterator one step backwards. |
public void | setText(CharacterIterator newText) |
DictionaryBasedBreakIterator | public DictionaryBasedBreakIterator(String description, InputStream dictionaryStream) throws IOException | | Constructs a DictionaryBasedBreakIterator.
Parameters: description - Same as the description parameter on RuleBasedBreakIterator, except for the special meaning of "<dictionary>". This parameter is just passed through to RuleBasedBreakIterator's constructor.
Parameters: dictionaryStream - An input stream supplying the dictionary file to use. |
first | public int first() | | Sets the current iteration position to the beginning of the text
(i.e., the CharacterIterator's starting offset).
Returns: The offset of the beginning of the text. |
following | public int following(int offset) | | Sets the current iteration position to the first boundary position after
the specified position.
Parameters: offset - The position to begin searching forward from.
Returns: The position of the first boundary after "offset". |
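
A short sketch of how following() behaves from a caller's point of view. Because this class may not be directly constructible from application code, the example assumes the public factory BreakIterator.getWordInstance as a stand-in. Which concrete iterator that factory returns depends on the runtime and locale. The class name FollowingSketch and its helper method are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical illustration: querying the next boundary after an offset.
class FollowingSketch {

    /** Returns the first word boundary strictly after offset, or BreakIterator.DONE. */
    static int nextBoundaryAfter(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        return words.following(offset);
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(nextBoundaryAfter(text, 0)); // prints 5
        System.out.println(nextBoundaryAfter(text, 7)); // prints 11
        // Asking for a boundary after the end of the text yields DONE.
        System.out.println(nextBoundaryAfter(text, text.length()) == BreakIterator.DONE); // prints true
    }
}
```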
handleNext | protected int handleNext() | | This is the implementation function for next().
|
last | public int last() | | Sets the current iteration position to the end of the text
(i.e., the CharacterIterator's ending offset).
Returns: The text's past-the-end offset. |
lookupCategory | protected int lookupCategory(char c) | | Looks up a character category for a character.
|
makeBuilder | protected RuleBasedBreakIterator.Builder makeBuilder() | | Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
This is the same as RuleBasedBreakIterator.Builder, except for the extra code
to handle the <dictionary> tag.
|
preceding | public int preceding(int offset) | | Sets the current iteration position to the last boundary position
before the specified position.
Parameters: offset - The position to begin searching from.
Returns: The position of the last boundary before "offset". |
previous | public int previous() | | Advances the iterator one step backwards.
Returns: The position of the last boundary before the current iteration position. |
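
The backward-moving methods are typically combined with last() to walk the boundaries from the end of the text. The sketch below again assumes the public factory BreakIterator.getWordInstance in place of a directly constructed DictionaryBasedBreakIterator. The class name BackwardSketch and its helper method are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration: collecting boundaries from the end of the text.
class BackwardSketch {

    /** Returns every word boundary in text, from the last boundary to the first. */
    static List<Integer> boundariesBackward(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        List<Integer> result = new ArrayList<>();
        // last() positions the iterator at the end; previous() steps back until DONE.
        for (int pos = words.last(); pos != BreakIterator.DONE; pos = words.previous()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boundariesBackward("Hello world")); // prints [11, 6, 5, 0]
    }
}
```

The loop ends when previous() returns BreakIterator.DONE, the inherited constant listed under the fields from java.text.BreakIterator.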
Fields inherited from java.text.RuleBasedBreakIterator | final protected static byte IGNORE
|
Fields inherited from java.text.BreakIterator | final public static int DONE
|