| java.lang.Object com.ibm.icu.text.BreakIterator com.ibm.icu.text.RuleBasedBreakIterator com.ibm.icu.text.DictionaryBasedBreakIterator
DictionaryBasedBreakIterator | public class DictionaryBasedBreakIterator extends RuleBasedBreakIterator (Code) | | A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary
to further subdivide ranges of text beyond what is possible using just the
state-table-based algorithm. This is necessary, for example, to handle
word and line breaking in Thai, which doesn't use spaces between words. The
state-table-based algorithm used by RuleBasedBreakIterator_Old is used to divide
up text as far as possible, and then contiguous ranges of letters are
repeatedly compared against a list of known words (i.e., the dictionary)
to divide them up into words.
DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator_Old,
but adds one more special substitution name: _dictionary_. This substitution
name is used to identify characters in words in the dictionary. The idea is that
if the iterator passes over a chunk of text that includes two or more characters
in a row that are included in _dictionary_, it goes back through that range and
derives additional break positions (if possible) using the dictionary.
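
As a rough illustration of the subdivision step described above, the following sketch splits a run of letters by repeatedly taking the longest dictionary word that starts at the current position. It is a simplified greedy longest-match, not the algorithm ICU actually uses. The class name DictionarySplitSketch, the subdivide method, and the small English word list are hypothetical, chosen only to keep the example readable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative only: greedy longest-match subdivision of a letter run. */
final class DictionarySplitSketch {

    /**
     * Returns break offsets inside text[start, end); the list always ends with end.
     * At each position the longest dictionary word is taken. If no word matches,
     * the rest of the range is left undivided.
     */
    static List<Integer> subdivide(String text, int start, int end, Set<String> dictionary) {
        List<Integer> breaks = new ArrayList<>();
        int pos = start;
        while (pos < end) {
            int next = -1;
            // Try the longest candidate first, then shrink.
            for (int candidate = end; candidate > pos; candidate--) {
                if (dictionary.contains(text.substring(pos, candidate))) {
                    next = candidate;
                    break;
                }
            }
            if (next < 0) {
                break; // no dictionary word starts here; stop subdividing
            }
            breaks.add(next);
            pos = next;
        }
        // The end of the range is always a boundary (the rule-based pass found it).
        if (breaks.isEmpty() || breaks.get(breaks.size() - 1) != end) {
            breaks.add(end);
        }
        return breaks;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("break", "iterator", "word");
        // "wordbreakiterator" has no spaces, much like a run of Thai letters.
        System.out.println(subdivide("wordbreakiterator", 0, 17, dictionary)); // prints [4, 9, 17]
    }
}
```

A greedy match can pick a wrong word when a shorter one would let the rest of the range match, which is one reason a production implementation needs more than this.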
DictionaryBasedBreakIterator is also constructed with the filename of a dictionary
file. It uses Class.getResource() to locate the dictionary file. The
dictionary file is in a serialized binary format. We have a very primitive (and
slow) BuildDictionaryFile utility for creating dictionary files, but aren't
currently making it public. Contact us for help.
|
Method Summary | |
public int | first() | Sets the current iteration position to the beginning of the text.
public int | following(int offset) | Sets the current iteration position to the first boundary position after the specified position.
public int | getRuleStatus() | Return the status tag from the break rule that determined the most recently returned break position.
public int | getRuleStatusVec(int[] fillInArray) | Get the status (tag) values from the break rule(s) that determined the most recently returned break position.
protected int | handleNext() | This is the implementation function for next().
public int | last() | Sets the current iteration position to the end of the text.
public int | preceding(int offset) | Sets the current iteration position to the last boundary position before the specified position.
public int | previous() | Advances the iterator one step backwards.
public void | setText(CharacterIterator newText) |
DictionaryBasedBreakIterator | public DictionaryBasedBreakIterator(String rules, InputStream dictionaryStream) throws IOException(Code) | | Constructs a DictionaryBasedBreakIterator.
Parameters: rules - Same as the rules parameter on RuleBasedBreakIterator, except for the special meaning of "_dictionary_". This parameter is just passed through to the RuleBasedBreakIterator constructor.
Parameters: dictionaryStream - the stream containing the dictionary data |
DictionaryBasedBreakIterator | public DictionaryBasedBreakIterator(InputStream compiledRules, InputStream dictionaryStream) throws IOException(Code) | | Constructs a DictionaryBasedBreakIterator from precompiled rules.
Parameters: compiledRules - an input stream containing the binary (flattened) compiled rules.
Parameters: dictionaryStream - an input stream containing the dictionary data |
first | public int first()(Code) | | Sets the current iteration position to the beginning of the text
(i.e., the CharacterIterator's starting offset).
Returns: The offset of the beginning of the text. |
following | public int following(int offset)(Code) | | Sets the current iteration position to the first boundary position after
the specified position.
Parameters: offset - The position to begin searching forward from.
Returns: The position of the first boundary after "offset". |
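
The behaviour of following() can be tried out with a short sketch. Constructing a DictionaryBasedBreakIterator needs rule and dictionary data that this page does not provide, so the example uses the JDK's java.text.BreakIterator as a stand-in, on the assumption that its following() has the same contract as the method documented here. The class name FollowingSketch and the helper boundaryAfter are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Illustrative only: uses the JDK word iterator as a stand-in. */
final class FollowingSketch {

    /** Returns the first word boundary strictly after offset, or BreakIterator.DONE. */
    static int boundaryAfter(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        return words.following(offset);
    }

    public static void main(String[] args) {
        String text = "Hello world"; // word boundaries at 0, 5, 6 and 11
        System.out.println(boundaryAfter(text, 2)); // prints 5
        System.out.println(boundaryAfter(text, 5)); // prints 6: the result is strictly after the offset
    }
}
```

With the JDK iterator, passing the offset of the last boundary returns BreakIterator.DONE, so callers that loop on following() should check for that value.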
getRuleStatus | public int getRuleStatus()(Code) | | Return the status tag from the break rule that determined the most recently
returned break position.
TODO: not supported with dictionary based break iterators.
Returns: the status from the break rule that determined the most recently returned break position. |
getRuleStatusVec | public int getRuleStatusVec(int[] fillInArray)(Code) | | Get the status (tag) values from the break rule(s) that determined the most
recently returned break position. The values appear in the rule source
within brackets, {123}, for example. The default status value for rules
that do not explicitly provide one is zero.
TODO: not supported for dictionary based break iterator.
Parameters: fillInArray - an array to be filled in with the status values.
Returns: The number of rule status values from rules that determined the most recent boundary returned by the break iterator. In the event that the array is too small, the return value is the total number of status values that were available, not the reduced number that were actually returned. |
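
The return-value rule for an undersized array is easy to misread, so here is a minimal model of it. This is not ICU code: StatusVecSketch and fillStatus are hypothetical names, and the available array stands in for whatever status values the iterator holds internally.

```java
/** Illustrative model of the fill-in-array contract described above. */
final class StatusVecSketch {

    /**
     * Copies as many values as fit into fillInArray and returns the total
     * number that were available, which may exceed fillInArray.length.
     */
    static int fillStatus(int[] available, int[] fillInArray) {
        int copied = Math.min(available.length, fillInArray.length);
        System.arraycopy(available, 0, fillInArray, 0, copied);
        return available.length;
    }

    public static void main(String[] args) {
        int[] out = new int[2];
        int total = fillStatus(new int[] {100, 200, 300}, out);
        // total is 3 even though only two values were copied.
        System.out.println(total + " available, " + Math.min(total, out.length) + " copied");
    }
}
```

A caller that sees a return value larger than the array length can allocate an array of that size and call again.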
handleNext | protected int handleNext()(Code) | | This is the implementation function for next().
|
last | public int last()(Code) | | Sets the current iteration position to the end of the text
(i.e., the CharacterIterator's ending offset).
Returns: The text's past-the-end offset. |
preceding | public int preceding(int offset)(Code) | | Sets the current iteration position to the last boundary position
before the specified position.
Parameters: offset - The position to begin searching from.
Returns: The position of the last boundary before "offset". |
previous | public int previous()(Code) | | Advances the iterator one step backwards.
Returns: The position of the last boundary position before the current iteration position. |
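
To show how last() and previous() combine into a backward traversal, the sketch below collects every boundary from the end of the text to the start. It again uses the JDK's java.text.BreakIterator as a stand-in, assuming the same navigation contract as the methods documented here. BackwardSketch and boundariesBackward are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: walks word boundaries from the end of the text to the start. */
final class BackwardSketch {

    static List<Integer> boundariesBackward(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        List<Integer> result = new ArrayList<>();
        // last() positions the iterator at the past-the-end offset;
        // previous() steps back one boundary until it returns DONE.
        for (int pos = words.last(); pos != BreakIterator.DONE; pos = words.previous()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boundariesBackward("Hello world")); // prints [11, 6, 5, 0]
    }
}
```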