| java.lang.Object com.ibm.icu.text.BreakIterator
All known Subclasses: com.ibm.icu.text.RuleBasedBreakIterator, com.ibm.icu.text.BreakIteratorFactory
BreakIterator | abstract public class BreakIterator implements Cloneable | | A class that locates boundaries in text. This class defines a protocol for
objects that break up a piece of natural-language text according to a set
of criteria. Instances or subclasses of BreakIterator can be provided, for
example, to break a piece of text into words, sentences, or logical characters
according to the conventions of some language or group of languages.
We provide five built-in types of BreakIterator:
- getTitleInstance() returns a BreakIterator that locates boundaries
between title breaks.
- getSentenceInstance() returns a BreakIterator that locates boundaries
between sentences. This is useful for triple-click selection, for example.
- getWordInstance() returns a BreakIterator that locates boundaries between
words. This is useful for double-click selection or "find whole words" searches.
This type of BreakIterator makes sure there is a boundary position at the
beginning and end of each legal word. (Numbers count as words, too.) Whitespace
and punctuation are kept separate from real words.
- getLineInstance() returns a BreakIterator that locates positions where it is
legal for a text editor to wrap lines. This is similar to word breaking, but
not the same: punctuation and whitespace are generally kept with words (you don't
want a line to start with whitespace, for example), and some special characters
can force a position to be considered a line-break position or prevent a position
from being a line-break position.
- getCharacterInstance() returns a BreakIterator that locates boundaries between
logical characters. Because of the structure of the Unicode encoding, a logical
character may be stored internally as more than one Unicode code point. (A with an
umlaut may be stored as an a followed by a separate combining umlaut character,
for example, but the user still thinks of it as one character.) This iterator allows
various processes (especially text editors) to treat as characters the units of text
that a user would think of as characters, rather than the units of text that the
computer sees as "characters".
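The difference between stored characters and logical characters can be seen with a short sketch. It uses `java.text.BreakIterator` from the JDK as a stand-in so that it runs without the ICU4J jar; the assumption is that swapping the import for `com.ibm.icu.text.BreakIterator` leaves these calls unchanged. The class name `LogicalCharacters` is hypothetical.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); the page documents com.ibm.icu.text.BreakIterator
import java.util.ArrayList;
import java.util.List;

class LogicalCharacters {
    /** Splits text into the units a character-break iterator reports. */
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> units = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            units.add(text.substring(start, end));
        }
        return units;
    }

    public static void main(String[] args) {
        // 'a' followed by U+0308 COMBINING DIAERESIS, then 'b': three stored chars.
        String text = "a\u0308b";
        System.out.println(text.length() + " stored chars, "
                + split(text).size() + " logical characters");
    }
}
```

A text editor would move the caret over the units returned by `split`, not over individual `char` values.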
BreakIterator's interface follows an "iterator" model (hence the name), meaning it
has a concept of a "current position" and methods like first(), last(), next(),
and previous() that update the current position. All BreakIterators uphold the
following invariants:
- The beginning and end of the text are always treated as boundary positions.
- The current position of the iterator is always a boundary position (random-
access methods move the iterator to the nearest boundary position before or
after the specified position, not _to_ the specified position).
- DONE is used as a flag to indicate when iteration has stopped. DONE is only
returned when the current position is the end of the text and the user calls next(),
or when the current position is the beginning of the text and the user calls
previous().
- Break positions are numbered by the positions of the characters that follow
them. Thus, under normal circumstances, the position before the first character
is 0, the position after the first character is 1, and the position after the
last character is equal to the length of the string (1 plus the index of the last character).
- The client can change the position of an iterator, or the text it analyzes,
at will, but cannot change the behavior. A user who wants different behavior
must instantiate a new iterator.
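The invariants about text ends and DONE can be checked directly. The following is a minimal sketch, assuming the JDK's `java.text.BreakIterator` behaves like the ICU class for these calls; `BoundaryInvariants` and `holdsFor` are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption) for com.ibm.icu.text.BreakIterator

class BoundaryInvariants {
    /** Returns true if the start/end and DONE invariants hold for the given text. */
    static boolean holdsFor(String text) {
        BreakIterator it = BreakIterator.getWordInstance();
        it.setText(text);

        // The beginning of the text is a boundary, and first() moves the iterator there.
        boolean startIsBoundary = it.first() == 0 && it.current() == 0;
        // Stepping backward from the beginning yields DONE.
        boolean doneBeforeStart = it.previous() == BreakIterator.DONE;
        // The end of the text is a boundary at offset text.length().
        boolean endIsBoundary = it.last() == text.length();
        // Stepping forward from the end yields DONE.
        boolean doneAfterEnd = it.next() == BreakIterator.DONE;

        return startIsBoundary && doneBeforeStart && endIsBoundary && doneAfterEnd;
    }

    public static void main(String[] args) {
        System.out.println(holdsFor("Hello, world!"));
    }
}
```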
BreakIterator accesses the text it analyzes through a CharacterIterator, which makes
it possible to use BreakIterator to analyze text in any text-storage vehicle that
provides a CharacterIterator interface.
NOTE: Some types of BreakIterator can take a long time to create, and
instances of BreakIterator are not currently cached by the system. For
optimal performance, keep instances of BreakIterator around as long as makes
sense. For example, when word-wrapping a document, don't create and destroy a
new BreakIterator for each line. Create one break iterator for the whole document
(or whatever stretch of text you're wrapping) and use it to do the whole job of
wrapping the text.
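One way to follow this advice is to hold a single line-break iterator in a field and retarget it with setText() for each paragraph. The sketch below is a simple greedy wrapper, not a production algorithm; it uses the JDK's `java.text.BreakIterator` as a stand-in (an assumption), and `ReusedLineBreaker` and `wrap` are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption) for com.ibm.icu.text.BreakIterator
import java.util.ArrayList;
import java.util.List;

class ReusedLineBreaker {
    // Created once; every call to wrap() reuses it.
    private final BreakIterator lineBreaker = BreakIterator.getLineInstance();

    /**
     * Greedy wrap: a line is closed at the last legal break once the next
     * break would exceed width. Trailing whitespace counts toward the width
     * before the line is trimmed, so this is only an approximation.
     */
    List<String> wrap(String paragraph, int width) {
        lineBreaker.setText(paragraph); // retarget the existing iterator
        List<String> lines = new ArrayList<>();
        int lineStart = lineBreaker.first();
        int lastFit = lineStart;
        for (int b = lineBreaker.next(); b != BreakIterator.DONE; b = lineBreaker.next()) {
            if (b - lineStart > width && lastFit > lineStart) {
                lines.add(paragraph.substring(lineStart, lastFit).trim());
                lineStart = lastFit;
            }
            lastFit = b;
        }
        if (lineStart < paragraph.length()) {
            lines.add(paragraph.substring(lineStart).trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        ReusedLineBreaker wrapper = new ReusedLineBreaker();
        for (String paragraph : new String[] {"The quick brown fox", "jumps over the lazy dog"}) {
            wrapper.wrap(paragraph, 10).forEach(System.out::println);
        }
    }
}
```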
Examples:
Creating and using text boundaries
public static void main(String[] args) {
    if (args.length == 1) {
        String stringToExamine = args[0];
        // print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(stringToExamine);
        printEachForward(boundary, stringToExamine);
        // print each sentence in reverse order
        boundary = BreakIterator.getSentenceInstance(Locale.US);
        boundary.setText(stringToExamine);
        printEachBackward(boundary, stringToExamine);
        printFirst(boundary, stringToExamine);
        printLast(boundary, stringToExamine);
    }
}
Print each element in order
public static void printEachForward(BreakIterator boundary, String source) {
    int start = boundary.first();
    for (int end = boundary.next();
         end != BreakIterator.DONE;
         start = end, end = boundary.next()) {
        System.out.println(source.substring(start, end));
    }
}
Print each element in reverse order
public static void printEachBackward(BreakIterator boundary, String source) {
    int end = boundary.last();
    for (int start = boundary.previous();
         start != BreakIterator.DONE;
         end = start, start = boundary.previous()) {
        System.out.println(source.substring(start, end));
    }
}
Print first element
public static void printFirst(BreakIterator boundary, String source) {
    int start = boundary.first();
    int end = boundary.next();
    System.out.println(source.substring(start, end));
}
Print last element
public static void printLast(BreakIterator boundary, String source) {
    int end = boundary.last();
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
Print the element at a specified position
public static void printAt(BreakIterator boundary, int pos, String source) {
    int end = boundary.following(pos);
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
Find the next word
public static int nextWordStartAfter(int pos, String text) {
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);
    int last = wb.following(pos);
    int current = wb.next();
    while (current != BreakIterator.DONE) {
        // a segment that contains at least one letter is treated as a word
        for (int p = last; p < current; p++) {
            if (Character.isLetter(text.charAt(p)))
                return last;
        }
        last = current;
        current = wb.next();
    }
    return BreakIterator.DONE;
}
(The iterator returned by BreakIterator.getWordInstance() is unique in that
the break positions it returns don't represent both the start and end of the
thing being iterated over. That is, a sentence-break iterator returns breaks
that each represent the end of one sentence and the beginning of the next.
With the word-break iterator, the characters between two boundaries might be a
word, or they might be the punctuation or whitespace between two words. The
above code uses a simple heuristic to determine which boundary is the beginning
of a word: If the characters between this boundary and the next boundary
include at least one letter (this can be an alphabetical letter, a CJK ideograph,
a Hangul syllable, a Kana character, etc.), then the text between this boundary
and the next is a word; otherwise, it's the material between words.)
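The same heuristic can be used to collect only the real words of a string and skip the punctuation and whitespace segments. This is a minimal sketch; it uses the JDK's `java.text.BreakIterator` as a stand-in (an assumption), and `WordExtractor` with its methods is a hypothetical helper.

```java
import java.text.BreakIterator; // JDK stand-in (assumption) for com.ibm.icu.text.BreakIterator
import java.util.ArrayList;
import java.util.List;

class WordExtractor {
    /** Returns the segments between word boundaries that contain at least one letter. */
    static List<String> words(String text) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        List<String> words = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            if (containsLetter(text, start, end)) {
                words.add(text.substring(start, end));
            }
        }
        return words;
    }

    /** The heuristic from the text: a segment with any letter counts as a word. */
    private static boolean containsLetter(String text, int start, int end) {
        for (int p = start; p < end; p++) {
            if (Character.isLetter(text.charAt(p))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));
    }
}
```

Replacing `Character.isLetter` with `Character.isLetterOrDigit` would also keep numbers, which the word iterator treats as words.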
See Also: CharacterIterator |
Inner Class: abstract static class BreakIteratorServiceShim | |
Field Summary | |
final public static int | DONE DONE is returned by previous() and next() after all valid boundaries have been returned. |
final public static int | KIND_CHARACTER |
final public static int | KIND_LINE |
final public static int | KIND_SENTENCE |
final public static int | KIND_TITLE |
final public static int | KIND_WORD |
Constructor Summary | |
protected | BreakIterator() Default constructor. |
Method Summary | |
public Object | clone() Clone method. |
abstract public int | current() Return the iterator's current position. |
abstract public int | first() Return the first boundary position. |
abstract public int | following(int offset) Sets the iterator's current iteration position to be the first boundary position following the specified position. |
public static synchronized Locale[] | getAvailableLocales() Returns a list of locales for which BreakIterators can be used. Returns: An array of Locales. |
public static synchronized ULocale[] | getAvailableULocales() Returns a list of locales for which BreakIterators can be used. Returns: An array of ULocales. |
public static BreakIterator | getBreakInstance(ULocale where, int kind) Get a particular kind of BreakIterator for a locale. Avoids writing a switch statement with getXYZInstance(where) calls. |
public static BreakIterator | getCharacterInstance() Returns a new instance of BreakIterator that locates logical-character boundaries. |
public static BreakIterator | getCharacterInstance(Locale where) Returns a new instance of BreakIterator that locates logical-character boundaries. Parameters: where - A Locale specifying the language of the text being analyzed. |
public static BreakIterator | getCharacterInstance(ULocale where) Returns a new instance of BreakIterator that locates logical-character boundaries. Parameters: where - A ULocale specifying the language of the text being analyzed. |
public static BreakIterator | getLineInstance() Returns a new instance of BreakIterator that locates legal line-wrapping positions. |
public static BreakIterator | getLineInstance(Locale where) Returns a new instance of BreakIterator that locates legal line-wrapping positions. Parameters: where - A Locale specifying the language of the text being broken. |
public static BreakIterator | getLineInstance(ULocale where) Returns a new instance of BreakIterator that locates legal line-wrapping positions. Parameters: where - A ULocale specifying the language of the text being broken. |
final public ULocale | getLocale(ULocale.Type type) Return the locale that was used to create this object, or null. This may differ from the locale requested at the time of this object's creation. |
public static BreakIterator | getSentenceInstance() Returns a new instance of BreakIterator that locates sentence boundaries. |
public static BreakIterator | getSentenceInstance(Locale where) Returns a new instance of BreakIterator that locates sentence boundaries. Parameters: where - A Locale specifying the language of the text being analyzed. |
public static BreakIterator | getSentenceInstance(ULocale where) Returns a new instance of BreakIterator that locates sentence boundaries. Parameters: where - A ULocale specifying the language of the text being analyzed. |
abstract public CharacterIterator | getText() Returns a CharacterIterator over the text being analyzed. For at least some subclasses of BreakIterator, this is a reference to the actual iterator being used by the BreakIterator, and therefore, this function's return value should be treated as const. |
public static BreakIterator | getTitleInstance() Returns a new instance of BreakIterator that locates title boundaries. This function assumes the text being analyzed is in the default locale's language. |
public static BreakIterator | getTitleInstance(Locale where) Returns a new instance of BreakIterator that locates title boundaries. The iterator returned locates title boundaries as described for Unicode 3.2 only. |
public static BreakIterator | getTitleInstance(ULocale where) Returns a new instance of BreakIterator that locates title boundaries. The iterator returned locates title boundaries as described for Unicode 3.2 only. |
public static BreakIterator | getWordInstance() Returns a new instance of BreakIterator that locates word boundaries. |
public static BreakIterator | getWordInstance(Locale where) Returns a new instance of BreakIterator that locates word boundaries. Parameters: where - A Locale specifying the language of the text to be analyzed. |
public static BreakIterator | getWordInstance(ULocale where) Returns a new instance of BreakIterator that locates word boundaries. Parameters: where - A ULocale specifying the language of the text to be analyzed. |
public boolean | isBoundary(int offset) Return true if the specified position is a boundary position. |
abstract public int | last() Return the last boundary position. |
abstract public int | next(int n) Advances the specified number of steps forward in the text (a negative number, therefore, advances backwards). |
abstract public int | next() Advances the iterator forward one boundary. |
public int | preceding(int offset) Sets the iterator's current iteration position to be the last boundary position preceding the specified position. |
abstract public int | previous() Advances the iterator backward one boundary. |
public static Object | registerInstance(BreakIterator iter, Locale locale, int kind) Register a new break iterator of the indicated kind, to use in the given locale. |
public static Object | registerInstance(BreakIterator iter, ULocale locale, int kind) Register a new break iterator of the indicated kind, to use in the given locale. |
final void | setLocale(ULocale valid, ULocale actual) Set information about the locales that were used to create this object. |
public void | setText(String newText) Sets the iterator to analyze a new piece of text. |
abstract public void | setText(CharacterIterator newText) Sets the iterator to analyze a new piece of text. |
public static boolean | unregister(Object key) Unregister a previously-registered BreakIterator using the key returned from the register call. |
DONE | final public static int DONE | | DONE is returned by previous() and next() after all valid
boundaries have been returned.
|
KIND_CHARACTER | final public static int KIND_CHARACTER | | |
KIND_LINE | final public static int KIND_LINE | | |
KIND_SENTENCE | final public static int KIND_SENTENCE | | |
KIND_TITLE | final public static int KIND_TITLE | | |
KIND_WORD | final public static int KIND_WORD | | |
BreakIterator | protected BreakIterator() | | Default constructor. There is no state that is carried by this abstract
base class.
|
clone | public Object clone() | | Clone method. Creates another BreakIterator with the same behavior and
current state as this one.
Returns: The clone. |
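A clone is useful when two positions in the same text have to be tracked at once. The sketch below assumes the JDK's `java.text.BreakIterator` as a stand-in for the ICU class; `IndependentClone` and its method are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption) for com.ibm.icu.text.BreakIterator

class IndependentClone {
    /** Returns {original position, clone position} after advancing only the clone. */
    static int[] positionsAfterCloneAdvance(String text) {
        BreakIterator original = BreakIterator.getWordInstance();
        original.setText(text);
        original.first();

        // The clone starts with the same text and position as the original.
        BreakIterator copy = (BreakIterator) original.clone();
        copy.next(); // moves the clone only

        return new int[] { original.current(), copy.current() };
    }

    public static void main(String[] args) {
        int[] positions = positionsAfterCloneAdvance("Hello, world!");
        System.out.println("original=" + positions[0] + " clone=" + positions[1]);
    }
}
```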
current | abstract public int current() | | Return the iterator's current position.
Returns: The iterator's current position. |
first | abstract public int first() | | Return the first boundary position. This is always the beginning
index of the text this iterator iterates over. For example, if
the iterator iterates over a whole string, this function will
always return 0. This function also updates the iteration position
to point to the beginning of the text.
Returns: The character offset of the beginning of the stretch of text being broken. |
following | abstract public int following(int offset) | | Sets the iterator's current iteration position to be the first
boundary position following the specified position. (Whether the
specified position is itself a boundary position or not doesn't
matter; this function always moves the iteration position to the
first boundary after the specified position.) If the specified
position is the past-the-end position, returns DONE.
Parameters: offset - The character position to start searching from.
Returns: The position of the first boundary position following "offset" (whether or not "offset" itself is a boundary position), or DONE if "offset" is the past-the-end offset. |
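A common use of following() is to find the segment that contains a given offset: the boundary after the offset is the segment's end, and one step back with previous() gives its start. This is a minimal sketch using the JDK's `java.text.BreakIterator` as a stand-in (an assumption); `RandomAccess.segmentAt` is a hypothetical helper.

```java
import java.text.BreakIterator; // JDK stand-in (assumption) for com.ibm.icu.text.BreakIterator

class RandomAccess {
    /**
     * Returns the word-break segment containing the character at offset,
     * or an empty string when offset is the past-the-end position.
     */
    static String segmentAt(String text, int offset) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        int end = wb.following(offset);   // first boundary strictly after offset
        if (end == BreakIterator.DONE) {
            return "";
        }
        int start = wb.previous();        // the boundary just before that one
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(segmentAt("Hello, world!", 8));
    }
}
```

Because following() ignores whether the offset is itself a boundary, an offset that sits exactly on a boundary selects the segment that starts there.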
getAvailableLocales | public static synchronized Locale[] getAvailableLocales() | | Returns a list of locales for which BreakIterators can be used.
Returns: An array of Locales. All of the locales in the array can be used when creating a BreakIterator. |
getAvailableULocales | public static synchronized ULocale[] getAvailableULocales() | | Returns a list of locales for which BreakIterators can be used.
Returns: An array of ULocales. All of the locales in the array can be used when creating a BreakIterator. |
getBreakInstance | public static BreakIterator getBreakInstance(ULocale where, int kind) | | Get a particular kind of BreakIterator for a locale.
Avoids writing a switch statement with getXYZInstance(where) calls.
|
getCharacterInstance | public static BreakIterator getCharacterInstance() | | Returns a new instance of BreakIterator that locates logical-character
boundaries. This function assumes that the text being analyzed is
in the default locale's language.
Returns: A new instance of BreakIterator that locates logical-character boundaries. |
getCharacterInstance | public static BreakIterator getCharacterInstance(Locale where) | | Returns a new instance of BreakIterator that locates logical-character
boundaries.
Parameters: where - A Locale specifying the language of the text being analyzed.
Returns: A new instance of BreakIterator that locates logical-character boundaries. |
getCharacterInstance | public static BreakIterator getCharacterInstance(ULocale where) | | Returns a new instance of BreakIterator that locates logical-character
boundaries.
Parameters: where - A ULocale specifying the language of the text being analyzed.
Returns: A new instance of BreakIterator that locates logical-character boundaries. |
getLineInstance | public static BreakIterator getLineInstance() | | Returns a new instance of BreakIterator that locates legal
line-wrapping positions. This function assumes the text being broken
is in the default locale's language.
Returns: A new instance of BreakIterator that locates legal line-wrapping positions. |
getLineInstance | public static BreakIterator getLineInstance(Locale where) | | Returns a new instance of BreakIterator that locates legal
line-wrapping positions.
Parameters: where - A Locale specifying the language of the text being broken.
Returns: A new instance of BreakIterator that locates legal line-wrapping positions. |
getLineInstance | public static BreakIterator getLineInstance(ULocale where) | | Returns a new instance of BreakIterator that locates legal
line-wrapping positions.
Parameters: where - A ULocale specifying the language of the text being broken.
Returns: A new instance of BreakIterator that locates legal line-wrapping positions. |
getSentenceInstance | public static BreakIterator getSentenceInstance() | | Returns a new instance of BreakIterator that locates sentence boundaries.
This function assumes the text being analyzed is in the default locale's
language.
Returns: A new instance of BreakIterator that locates sentence boundaries. |
getSentenceInstance | public static BreakIterator getSentenceInstance(Locale where) | | Returns a new instance of BreakIterator that locates sentence boundaries.
Parameters: where - A Locale specifying the language of the text being analyzed.
Returns: A new instance of BreakIterator that locates sentence boundaries. |
getSentenceInstance | public static BreakIterator getSentenceInstance(ULocale where) | | Returns a new instance of BreakIterator that locates sentence boundaries.
Parameters: where - A ULocale specifying the language of the text being analyzed.
Returns: A new instance of BreakIterator that locates sentence boundaries. |
getText | abstract public CharacterIterator getText() | | Returns a CharacterIterator over the text being analyzed.
For at least some subclasses of BreakIterator, this is a reference
to the actual iterator being used by the BreakIterator,
and therefore, this function's return value should be treated as
const. No guarantees are made about the current position
of this iterator when it is returned. If you need to move that
position to examine the text, clone this function's return value first.
Returns: A CharacterIterator over the text being analyzed. |
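The advice to clone before moving can be sketched as follows. The example reads the analyzed text back through a cloned CharacterIterator and leaves the break iterator's own position untouched. It uses the JDK's `java.text.BreakIterator` as a stand-in (an assumption); `SafeTextAccess.readAll` is a hypothetical helper.

```java
import java.text.BreakIterator;      // JDK stand-in (assumption) for com.ibm.icu.text.BreakIterator
import java.text.CharacterIterator;

class SafeTextAccess {
    /** Reads the whole analyzed text through a clone of the iterator's CharacterIterator. */
    static String readAll(BreakIterator it) {
        // Clone first: moving the returned iterator itself could disturb the BreakIterator.
        CharacterIterator copy = (CharacterIterator) it.getText().clone();
        StringBuilder sb = new StringBuilder();
        for (char c = copy.first(); c != CharacterIterator.DONE; c = copy.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText("Hello, world!");
        wb.next();
        System.out.println(readAll(wb) + " / position still " + wb.current());
    }
}
```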
getTitleInstance | public static BreakIterator getTitleInstance() | | Returns a new instance of BreakIterator that locates title boundaries.
This function assumes the text being analyzed is in the default locale's
language. The iterator returned locates title boundaries as described for
Unicode 3.2 only. For Unicode 4.0 and above title boundary iteration,
please use a word boundary iterator.
See Also: BreakIterator.getWordInstance
Returns: A new instance of BreakIterator that locates title boundaries. |
getTitleInstance | public static BreakIterator getTitleInstance(Locale where) | | Returns a new instance of BreakIterator that locates title boundaries.
The iterator returned locates title boundaries as described for
Unicode 3.2 only. For Unicode 4.0 and above title boundary iteration,
please use a word boundary iterator.
See Also: BreakIterator.getWordInstance
Parameters: where - A Locale specifying the language of the text being analyzed.
Returns: A new instance of BreakIterator that locates title boundaries. |
getTitleInstance | public static BreakIterator getTitleInstance(ULocale where) | | Returns a new instance of BreakIterator that locates title boundaries.
The iterator returned locates title boundaries as described for
Unicode 3.2 only. For Unicode 4.0 and above title boundary iteration,
please use a word boundary iterator.
See Also: BreakIterator.getWordInstance
Parameters: where - A ULocale specifying the language of the text being analyzed.
Returns: A new instance of BreakIterator that locates title boundaries. |
getWordInstance | public static BreakIterator getWordInstance() | | Returns a new instance of BreakIterator that locates word boundaries.
This function assumes that the text being analyzed is in the default
locale's language.
Returns: An instance of BreakIterator that locates word boundaries. |
getWordInstance | public static BreakIterator getWordInstance(Locale where) | | Returns a new instance of BreakIterator that locates word boundaries.
Parameters: where - A Locale specifying the language of the text to be analyzed.
Returns: An instance of BreakIterator that locates word boundaries. |
getWordInstance | public static BreakIterator getWordInstance(ULocale where) | | Returns a new instance of BreakIterator that locates word boundaries.
Parameters: where - A ULocale specifying the language of the text to be analyzed.
Returns: An instance of BreakIterator that locates word boundaries. |
isBoundary | public boolean isBoundary(int offset) | | Return true if the specified position is a boundary position. If the
function returns true, the current iteration position is set to the
specified position; if the function returns false, the current
iteration position is set as though following() had been called.
Parameters: offset - the offset to check.
Returns: True if "offset" is a boundary position. |
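isBoundary() is handy for snapping an arbitrary offset, such as a mouse position, to a legal boundary. The sketch below calls following() explicitly instead of relying on the side effect described above, so it does not depend on where a failed isBoundary() leaves the iterator. It uses the JDK's `java.text.BreakIterator` as a stand-in (an assumption); `BoundaryCheck.snapForward` is a hypothetical helper.

```java
import java.text.BreakIterator; // JDK stand-in (assumption) for com.ibm.icu.text.BreakIterator

class BoundaryCheck {
    /** Returns offset if it is a word boundary, otherwise the next boundary after it. */
    static int snapForward(String text, int offset) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        if (wb.isBoundary(offset)) {
            return offset;
        }
        // Explicit call: does not rely on the position isBoundary() left behind.
        return wb.following(offset);
    }

    public static void main(String[] args) {
        System.out.println(snapForward("Hello, world!", 3));
    }
}
```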
last | abstract public int last() | | Return the last boundary position. This is always the "past-the-end"
index of the text this iterator iterates over. For example, if the
iterator iterates over a whole string (call it "text"), this function
will always return text.length(). This function also updates the
iteration position to point to the end of the text.
Returns: The character offset of the end of the stretch of text being broken. |
next | abstract public int next(int n) | | Advances the specified number of steps forward in the text (a negative
number, therefore, advances backwards). If this causes the iterator
to advance off either end of the text, this function returns DONE;
otherwise, this function returns the position of the appropriate
boundary. Calling this function is equivalent to calling next() or
previous() n times.
Parameters: n - The number of boundaries to advance over (if positive, moves forward; if negative, moves backwards).
Returns: The position of the boundary n boundaries from the current iteration position, or DONE if moving n boundaries causes the iterator to advance off either end of the text. |
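The stated equivalence between next(n) and n single steps can be sketched as a side-by-side comparison. The example uses the JDK's `java.text.BreakIterator` as a stand-in (an assumption); `MultiStep` and its methods are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption) for com.ibm.icu.text.BreakIterator

class MultiStep {
    /** Position reached from the start of text with one next(n) call. */
    static int viaNextN(String text, int n) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        wb.first();
        return wb.next(n);
    }

    /** Position reached from the start of text with n separate next() calls (n >= 0). */
    static int viaLoop(String text, int n) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        int pos = wb.first();
        for (int i = 0; i < n; i++) {
            pos = wb.next();
        }
        return pos;
    }

    public static void main(String[] args) {
        String text = "Hello, world!";
        System.out.println(viaNextN(text, 3) == viaLoop(text, 3));
        // Stepping past the end of the text reports DONE.
        System.out.println(viaNextN(text, 100) == BreakIterator.DONE);
    }
}
```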
next | abstract public int next() | | Advances the iterator forward one boundary. The current iteration
position is updated to point to the next boundary position after the
current position, and this is also the value that is returned. If
the current position is equal to the value returned by last(), or to
DONE, this function returns DONE and sets the current position to
DONE.
Returns: The position of the first boundary position following the iteration position. |
preceding | public int preceding(int offset) | | Sets the iterator's current iteration position to be the last
boundary position preceding the specified position. (Whether the
specified position is itself a boundary position or not doesn't
matter; this function always moves the iteration position to the
last boundary before the specified position.) If the specified
position is the starting position, returns DONE.
Parameters: offset - The character position to start searching from.
Returns: The position of the last boundary position preceding "offset" (whether or not "offset" itself is a boundary position), or DONE if "offset" is the starting offset of the iterator. |
previous | abstract public int previous() | | Advances the iterator backward one boundary. The current iteration
position is updated to point to the last boundary position before
the current position, and this is also the value that is returned. If
the current position is equal to the value returned by first(), or to
DONE, this function returns DONE and sets the current position to
DONE.
Returns: The position of the last boundary position preceding the iteration position. |
registerInstance | public static Object registerInstance(BreakIterator iter, Locale locale, int kind) | | Register a new break iterator of the indicated kind, to use in the given locale.
Clones of the iterator will be returned
if a request for a break iterator of the given kind matches or falls back to
this locale.
Parameters: iter - the BreakIterator instance to adopt.
Parameters: locale - the Locale for which this instance is to be registered.
Parameters: kind - the type of iterator for which this instance is to be registered.
Returns: a registry key that can be used to unregister this instance. |
registerInstance | public static Object registerInstance(BreakIterator iter, ULocale locale, int kind) | | Register a new break iterator of the indicated kind, to use in the given locale.
Clones of the iterator will be returned
if a request for a break iterator of the given kind matches or falls back to
this locale.
Parameters: iter - the BreakIterator instance to adopt.
Parameters: locale - the ULocale for which this instance is to be registered.
Parameters: kind - the type of iterator for which this instance is to be registered.
Returns: a registry key that can be used to unregister this instance. |
setLocale | final void setLocale(ULocale valid, ULocale actual) | | Set information about the locales that were used to create this
object. If the object was not constructed from locale data,
both arguments should be set to null. Otherwise, neither
should be null. The actual locale must be at the same level or
less specific than the valid locale. This method is intended
for use by factories or other entities that create objects of
this class.
Parameters: valid - the most specific locale containing any resource data, or null
Parameters: actual - the locale containing data used to construct this object, or null
See Also: com.ibm.icu.util.ULocale
See Also: com.ibm.icu.util.ULocale.VALID_LOCALE
See Also: com.ibm.icu.util.ULocale.ACTUAL_LOCALE |
setText | public void setText(String newText) | | Sets the iterator to analyze a new piece of text. The new
piece of text is passed in as a String, and the current
iteration position is reset to the beginning of the string.
(The old text is dropped.)
Parameters: newText - A String containing the text to analyze with this BreakIterator. |
setText | abstract public void setText(CharacterIterator newText) | | Sets the iterator to analyze a new piece of text. The
BreakIterator is passed a CharacterIterator through which
it will access the text itself. The current iteration
position is reset to the CharacterIterator's start index.
(The old iterator is dropped.)
Parameters: newText - A CharacterIterator referring to the text to analyze with this BreakIterator (the iterator's current position is ignored, but its other state is significant). |
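The "other state" of the CharacterIterator includes its begin and end indices, so a sub-range of a larger string can be analyzed without copying it. In this minimal sketch, first() and last() report the range limits as offsets into the original string. It uses the JDK's `java.text.BreakIterator` and `StringCharacterIterator` as stand-ins (an assumption); `RangeAnalysis.bounds` is a hypothetical helper.

```java
import java.text.BreakIterator;            // JDK stand-in (assumption) for com.ibm.icu.text.BreakIterator
import java.text.StringCharacterIterator;

class RangeAnalysis {
    /** Returns {first(), last()} for a word iterator restricted to text[begin, end). */
    static int[] bounds(String text, int begin, int end) {
        BreakIterator wb = BreakIterator.getWordInstance();
        // Arguments: text, begin index, end index, initial position (the position is ignored).
        wb.setText(new StringCharacterIterator(text, begin, end, begin));
        return new int[] { wb.first(), wb.last() };
    }

    public static void main(String[] args) {
        int[] b = bounds("Hello, world!", 7, 12);
        System.out.println(b[0] + ".." + b[1]);
    }
}
```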
unregister | public static boolean unregister(Object key) | | Unregister a previously-registered BreakIterator using the key returned from the
register call. Key becomes invalid after this call and should not be used again.
Parameters: key - the registry key returned by a previous call to registerInstance.
Returns: true if the iterator for the key was successfully unregistered. |
|
|