| java.lang.Object sun.text.Normalizer
Normalizer | final public class Normalizer implements Cloneable(Code) | | Normalizer transforms Unicode text into an equivalent composed or
decomposed form, allowing for easier sorting and searching of text.
Normalizer supports the standard normalization forms described in
Unicode Technical Report #15.
Characters with accents or other adornments can be encoded in
several different ways in Unicode. For example, take the character "Á"
(A-acute). In Unicode, this can be encoded as a single character (the
"composed" form):
00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
0041 LATIN CAPITAL LETTER A
0301 COMBINING ACUTE ACCENT
To a user of your program, however, both of these sequences should be
treated as the same "user-level" character "Á". When you are searching or
comparing text, you must ensure that these two sequences are treated
equivalently. In addition, you must handle characters with more than one
accent. Sometimes the order of a character's combining accents is
significant, while in other cases accent sequences in different orders are
really equivalent.
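The equivalence of the two encodings can be checked with a minimal sketch. It assumes the public java.text.Normalizer API (available since Java 6) rather than the internal sun.text.Normalizer class described on this page; the class and method names CanonicalEquivalenceDemo and canonicallyEqual are hypothetical.

```java
import java.text.Normalizer;

final class CanonicalEquivalenceDemo {
    // Composed form: U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
    static final String COMPOSED = "\u00C1";
    // Decomposed form: U+0041 followed by U+0301 COMBINING ACUTE ACCENT
    static final String DECOMPOSED = "A\u0301";

    /** Compares two strings after bringing both to the canonical composed form (NFC). */
    static boolean canonicallyEqual(String a, String b) {
        // Assumption: java.text.Normalizer stands in for sun.text.Normalizer here.
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        System.out.println(COMPOSED.equals(DECOMPOSED));            // prints false
        System.out.println(canonicallyEqual(COMPOSED, DECOMPOSED)); // prints true
    }
}
```

A plain equals call sees two different char sequences; only the normalized comparison treats them as the same user-level character.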
Similarly, the string "ffi" can be encoded as three separate letters:
0066 LATIN SMALL LETTER F
0066 LATIN SMALL LETTER F
0069 LATIN SMALL LETTER I
or as the single character
FB03 LATIN SMALL LIGATURE FFI
The ffi ligature is not a distinct semantic character, and strictly speaking
it shouldn't be in Unicode at all, but it was included for compatibility
with existing character sets that already provided it. The Unicode standard
identifies such characters by giving them "compatibility" decompositions
into the corresponding semantic characters. When sorting and searching, you
will often want to use these mappings.
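The difference between canonical and compatibility mappings for the ffi ligature can be sketched as follows. This again assumes the public java.text.Normalizer API as a stand-in for the class documented here; LigatureDemo and its methods are hypothetical names.

```java
import java.text.Normalizer;

final class LigatureDemo {
    // U+FB03 LATIN SMALL LIGATURE FFI
    static final String LIGATURE = "\uFB03";

    /** Canonical decomposition (NFD): leaves compatibility characters alone. */
    static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Compatibility decomposition (NFKD): maps the ligature to its semantic letters. */
    static String compatibility(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        System.out.println(canonical(LIGATURE).length());     // prints 1
        System.out.println(compatibility(LIGATURE));          // prints ffi
    }
}
```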
Normalizer helps solve these problems by transforming text into the
canonical composed and decomposed forms as shown in the first example above.
In addition, you can have it perform compatibility decompositions so that
you can treat compatibility characters the same as their equivalents.
Finally, Normalizer rearranges accents into the proper canonical
order, so that you do not have to worry about accent rearrangement on your
own.
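Accent reordering can be illustrated with a short sketch, assuming the public java.text.Normalizer API in place of the internal class. COMBINING DOT BELOW (U+0323) and COMBINING DOT ABOVE (U+0307) belong to different combining classes, so their relative order is not significant and normalization puts them into one canonical order. COMBINING ACUTE ACCENT (U+0301) and COMBINING DIAERESIS (U+0308) share a combining class, so their order is significant and is preserved. The class name AccentOrderDemo is hypothetical.

```java
import java.text.Normalizer;

final class AccentOrderDemo {
    // Same base letter, same two accents, typed in different orders.
    static final String DOT_ABOVE_FIRST = "a\u0307\u0323";
    static final String DOT_BELOW_FIRST = "a\u0323\u0307";

    /** Canonical decomposition, which also sorts combining marks into canonical order. */
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // Different combining classes: both inputs normalize to the same sequence.
        System.out.println(nfd(DOT_ABOVE_FIRST).equals(nfd(DOT_BELOW_FIRST))); // prints true
        // Same combining class: the typed order is significant and is kept.
        System.out.println(nfd("a\u0301\u0308").equals(nfd("a\u0308\u0301"))); // prints false
    }
}
```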
Normalizer adds one optional behavior, Normalizer.IGNORE_HANGUL,
that differs from the standard Unicode Normalization Forms. This option
can be passed to the Normalizer constructors and to the static
compose and decompose methods. This option, and any that are added
in the future, will be turned off by default.
There are three common usage models for Normalizer. In the first,
the static normalize() method is used to process an
entire input string at once. Second, you can create a Normalizer
object and use it to iterate through the normalized form of a string by
calling first() and next(). Finally, you can use the
setIndex() and getIndex() methods to perform
random-access iteration, which is very useful for searching.
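The first usage model, normalizing whole strings before comparing them, can be sketched for a simple search. This is an illustration under assumptions, not the class's own search support: it uses the public java.text.Normalizer API, and NormalizedSearch.contains is a hypothetical helper.

```java
import java.text.Normalizer;

final class NormalizedSearch {
    /**
     * Returns true if needle occurs in haystack once both are brought to
     * canonical decomposed form (NFD). Hypothetical helper for illustration.
     */
    static boolean contains(String haystack, String needle) {
        String h = Normalizer.normalize(haystack, Normalizer.Form.NFD);
        String n = Normalizer.normalize(needle, Normalizer.Form.NFD);
        return h.contains(n);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9 au lait";  // e-acute stored as one composed character
        String query = "cafe\u0301";        // e followed by a combining acute accent
        System.out.println(text.contains(query));   // prints false
        System.out.println(contains(text, query));  // prints true
    }
}
```

Note that a plain substring match on decomposed text also matches a base letter without its accents (an unaccented "cafe" is found inside the decomposed "café"), which may or may not be what a given application wants.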
Note: Normalizer objects behave like iterators and have
methods such as setIndex, next, previous, etc.
You should note that while the setIndex and getIndex refer
to indices in the underlying input text being processed, the
next and previous methods iterate through characters
in the normalized output. This means that there is not
necessarily a one-to-one correspondence between characters returned
by next and previous and the indices passed to and
returned from setIndex and getIndex. It is for this
reason that Normalizer does not implement the
CharacterIterator interface.
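The mismatch between input indices and normalized output positions can be seen with a small sketch. It assumes the public java.text.Normalizer API and only compares string offsets; it does not reproduce the setIndex and getIndex behavior of this class. IndexMismatchDemo is a hypothetical name.

```java
import java.text.Normalizer;

final class IndexMismatchDemo {
    /** Position of c in the canonically decomposed (NFD) form of input, or -1. */
    static int normalizedIndexOf(String input, char c) {
        return Normalizer.normalize(input, Normalizer.Form.NFD).indexOf(c);
    }

    public static void main(String[] args) {
        String input = "\u00C1b";  // A-acute (one char) followed by 'b'
        System.out.println(input.indexOf('b'));            // prints 1
        System.out.println(normalizedIndexOf(input, 'b')); // prints 2
    }
}
```

The letter 'b' sits at index 1 in the input but at position 2 in the decomposed output, because the single composed character expands to two characters.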
Note: Normalizer is currently based on version 3.0
of the Unicode Standard.
It will be updated as later versions of Unicode are released. If you are
using this class on a JDK that supports an earlier version of Unicode, it
is possible that Normalizer may generate composed or decomposed
characters for which your JDK's
java.lang.Character class does not
have any data.
author: Laura Werner, Mark Davis |
Inner Class :final public static class Mode | |
Field Summary | |
final public static Mode | COMPOSE Canonical decomposition followed by canonical composition. | final public static Mode | COMPOSE_COMPAT Compatibility decomposition followed by canonical composition. | final public static Mode | DECOMP Canonical decomposition. | final public static Mode | DECOMP_COMPAT Compatibility decomposition. | final public static char | DONE Constant indicating that the end of the iteration has been reached. | final static char | HANGUL_BASE | final static char | HANGUL_LIMIT | final public static int | IGNORE_HANGUL Option to disable Hangul/Jamo composition and decomposition.
This option applies to Korean text,
which can be represented either in the Jamo alphabet or in Hangul
characters, which are really just two or three Jamo combined
into one visual glyph. | final public static Mode | NO_OP Null operation for use with the
Normalizer constructors and the static
normalize method. | final static int | STR_INDEX_SHIFT | final static int | STR_LENGTH_MASK |
Constructor Summary | |
public | Normalizer(String str, Mode mode) Creates a new Normalizer object for iterating over the
normalized form of a given string.
Parameters: str - The string to be normalized. | public | Normalizer(String str, Mode mode, int opt) Creates a new Normalizer object for iterating over the
normalized form of a given string.
The options parameter specifies which optional
Normalizer features are to be enabled for this object.
Parameters: str - The string to be normalized. | public | Normalizer(CharacterIterator iter, Mode mode) Creates a new Normalizer object for iterating over the
normalized form of the given text.
Parameters: iter - The input text to be normalized. | public | Normalizer(CharacterIterator iter, Mode mode, int opt) Creates a new Normalizer object for iterating over the
normalized form of the given text.
Parameters: iter - The input text to be normalized. |
Method Summary | |
public Object | clone() Clones this Normalizer object. | public static String | compose(String source, boolean compat, int options) Compose a String.
The options parameter specifies which optional
Normalizer features are to be enabled for this operation.
Currently the only available option is
Normalizer.IGNORE_HANGUL .
If you want the default behavior corresponding
to Unicode Normalization Form C or KC,
use 0 for this argument.
Parameters: source - the string to be composed. Parameters: compat - Perform compatibility decomposition before composition. If this argument is false, only canonical decomposition will be performed. Parameters: options - the optional features to be enabled. | final static int | composeAction(int baseIndex, int comIndex) | final static int | composeLookup(char ch) | public char | current() Return the current character in the normalized text. | public static String | decompose(String source, boolean compat, int options) Static method to decompose a String.
The options parameter specifies which optional
Normalizer features are to be enabled for this operation.
Currently the only available option is
Normalizer.IGNORE_HANGUL .
The desired options should be OR'ed together to determine the value
of this argument. | public static String | decompose(String source, boolean compat, int options, boolean addSingleQuotation) | final static void | explode(StringBuffer target, int index) | public char | first() Return the first character in the normalized text. | final public int | getBeginIndex() Retrieve the index of the start of the input text. | final public static int | getClass(char ch) | final public int | getEndIndex() Retrieve the index of the end of the input text. | final public int | getIndex() Retrieve the current iteration position in the input text that is
being normalized. | public Mode | getMode() | public boolean | getOption(int option) Determine whether an option is turned on or off. | static int | hangulToJamo(char ch, StringBuffer result, int decompLimit) Convert a single Hangul syllable into one or more Jamo characters. | final static int | jamoAppend(char ch, int limit, StringBuffer dest) | public char | last() Return the last character in the normalized text. | public char | next() Return the current character in the normalized text and advance
the iteration position by one. | public static String | normalize(String str, Mode mode, int options) Normalizes a String using the given normalization operation. | public static String | normalize(String str, Mode mode, int options, boolean addSingleQuotation) | final static char | pairExplode(StringBuffer target, int action) | public char | previous() Return the previous character in the normalized text and decrement
the iteration position by one. | public void | reset() | public char | setIndex(int index) Set the iteration position in the input text that is being normalized
and return the first normalized character at that position.
Parameters: index - the desired index in the input text. | public void | setIndexOnly(int index) | public void | setMode(Mode newMode) Set the normalization mode for this object.
Note: If the normalization mode is changed while iterating
over a string, calls to next and previous may
return previously buffered characters in the old normalization mode
until the iteration is able to re-sync at the next base character.
It is safest to call setText(), first(), last(), etc.
after calling setMode. | public void | setOption(int option, boolean value) Set options that affect this Normalizer's operation.
Options do not change the basic composition or decomposition operation
that is being performed, but they control whether
certain optional portions of the operation are done.
Currently the only available option is Normalizer.IGNORE_HANGUL.
| public void | setText(String newText) Set the input text over which this Normalizer will iterate. | public void | setText(CharacterIterator newText) Set the input text over which this Normalizer will iterate. |
DONE | final public static char DONE(Code) | | Constant indicating that the end of the iteration has been reached.
This is guaranteed to have the same value as
CharacterIterator.DONE .
|
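Because DONE has the same value as CharacterIterator.DONE, the usual sentinel-terminated loop applies. The sketch below shows that loop on a plain java.text.StringCharacterIterator; using the same pattern with the first() and next() methods of this class is an assumption based on their documented return values. DoneLoopDemo is a hypothetical name.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

final class DoneLoopDemo {
    /** Collects every character of the iterator, stopping at the DONE sentinel. */
    static String collect(CharacterIterator it) {
        StringBuilder out = new StringBuilder();
        // first() resets to the start; next() returns DONE once the end is passed.
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect(new StringCharacterIterator("abc"))); // prints abc
    }
}
```

Since the sentinel is the char value U+FFFF, text that legitimately contains that code unit cannot be told apart from the end of iteration by this loop alone.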
HANGUL_BASE | final static char HANGUL_BASE(Code) | | |
HANGUL_LIMIT | final static char HANGUL_LIMIT(Code) | | |
IGNORE_HANGUL | final public static int IGNORE_HANGUL(Code) | | Option to disable Hangul/Jamo composition and decomposition.
This option applies to Korean text,
which can be represented either in the Jamo alphabet or in Hangul
characters, which are really just two or three Jamo combined
into one visual glyph. Since Jamo takes up more storage space than
Hangul, applications that process only Hangul text may wish to turn
this option on when decomposing text.
The Unicode standard treats Hangul to Jamo conversion as a
canonical decomposition, so this option must be turned off if you
wish to transform strings into one of the standard
Unicode Normalization Forms.
See Also: Normalizer.setOption |
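The storage cost mentioned above can be seen in a short sketch of the standard behavior, that is, with Hangul processing left enabled. It assumes the public java.text.Normalizer API, which has no equivalent of IGNORE_HANGUL; HangulStorageDemo is a hypothetical name.

```java
import java.text.Normalizer;

final class HangulStorageDemo {
    // U+D55C HANGUL SYLLABLE HAN, a single precomposed syllable
    static final String SYLLABLE = "\uD55C";

    /** Canonical decomposition: the syllable becomes its conjoining Jamo. */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Canonical composition: conjoining Jamo are recombined into a syllable. */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String jamo = decompose(SYLLABLE);
        System.out.println(SYLLABLE.length());              // prints 1
        System.out.println(jamo.length());                  // prints 3
        System.out.println(compose(jamo).equals(SYLLABLE)); // prints true
    }
}
```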
NO_OP | final public static Mode NO_OP(Code) | | Null operation for use with the
Normalizer constructors and the static
normalize method. This value tells
the Normalizer to do nothing but return unprocessed characters
from the underlying String or CharacterIterator. If you have code which
requires raw text at some times and normalized text at others, you can
use NO_OP for the cases where you want raw text, rather
than having a separate code path that bypasses Normalizer
altogether.
See Also: Normalizer.setMode |
STR_INDEX_SHIFT | final static int STR_INDEX_SHIFT(Code) | | |
STR_LENGTH_MASK | final static int STR_LENGTH_MASK(Code) | | |
Normalizer | public Normalizer(String str, Mode mode)(Code) | | Creates a new Normalizer object for iterating over the
normalized form of a given string.
Parameters: str - The string to be normalized. The normalization will start at the beginning of the string. Parameters: mode - The normalization mode. |
Normalizer | public Normalizer(String str, Mode mode, int opt)(Code) | | Creates a new Normalizer object for iterating over the
normalized form of a given string.
The options parameter specifies which optional
Normalizer features are to be enabled for this object.
Parameters: str - The string to be normalized. The normalization will start at the beginning of the string. Parameters: mode - The normalization mode. Parameters: opt - Any optional features to be enabled. Currently the only available option is Normalizer.IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument. |
Normalizer | public Normalizer(CharacterIterator iter, Mode mode)(Code) | | Creates a new Normalizer object for iterating over the
normalized form of the given text.
Parameters: iter - The input text to be normalized. The normalization will start at the beginning of the string. Parameters: mode - The normalization mode. |
Normalizer | public Normalizer(CharacterIterator iter, Mode mode, int opt)(Code) | | Creates a new Normalizer object for iterating over the
normalized form of the given text.
Parameters: iter - The input text to be normalized. The normalization will start at the beginning of the string. Parameters: mode - The normalization mode. Parameters: opt - Any optional features to be enabled. Currently the only available option is Normalizer.IGNORE_HANGUL. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument. |
clone | public Object clone()(Code) | | Clones this Normalizer object. All properties of this
object are duplicated in the new object, including the cloning of any
CharacterIterator that was passed in to the constructor
or to setText(CharacterIterator).
However, the text storage underlying
the CharacterIterator is not duplicated unless the
iterator's clone method does so.
|
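The cloning behavior of an underlying iterator can be sketched with java.text.StringCharacterIterator, whose clone carries its own iteration position. This illustrates the CharacterIterator side only and makes no claim about the internal state of Normalizer itself; IteratorCloneDemo is a hypothetical name.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

final class IteratorCloneDemo {
    /**
     * Clones an iterator over text, advances only the clone, and returns the
     * current character of the original followed by that of the clone.
     */
    static char[] advanceCloneOnly(String text) {
        CharacterIterator original = new StringCharacterIterator(text);
        CharacterIterator copy = (CharacterIterator) original.clone();
        copy.next(); // moves the clone; the original keeps its position
        return new char[] { original.current(), copy.current() };
    }

    public static void main(String[] args) {
        char[] result = advanceCloneOnly("ab");
        System.out.println(result[0]); // prints a
        System.out.println(result[1]); // prints b
    }
}
```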
compose | public static String compose(String source, boolean compat, int options)(Code) | | Compose a String.
The options parameter specifies which optional
Normalizer features are to be enabled for this operation.
Currently the only available option is
Normalizer.IGNORE_HANGUL .
If you want the default behavior corresponding
to Unicode Normalization Form C or KC,
use 0 for this argument.
Parameters: source - the string to be composed. Parameters: compat - Perform compatibility decomposition before composition. If this argument is false, only canonical decomposition will be performed. Parameters: options - the optional features to be enabled. Returns: the composed string. |
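The effect of the compat flag with options set to 0 corresponds to Normalization Form C or KC. A minimal sketch of that mapping follows, assuming the public java.text.Normalizer API; the two-argument compose helper in ComposeSketch is hypothetical and is not the method documented above.

```java
import java.text.Normalizer;

final class ComposeSketch {
    /**
     * Hypothetical stand-in for compose(source, compat, 0):
     * compat == false selects Form C, compat == true selects Form KC.
     */
    static String compose(String source, boolean compat) {
        Normalizer.Form form = compat ? Normalizer.Form.NFKC : Normalizer.Form.NFC;
        return Normalizer.normalize(source, form);
    }

    public static void main(String[] args) {
        System.out.println(compose("A\u0301", false).length()); // prints 1
        System.out.println(compose("\uFB03", false).length());  // prints 1
        System.out.println(compose("\uFB03", true));            // prints ffi
    }
}
```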
composeAction | final static int composeAction(int baseIndex, int comIndex)(Code) | | |
composeLookup | final static int composeLookup(char ch)(Code) | | |
current | public char current()(Code) | | Return the current character in the normalized text.
|
decompose | public static String decompose(String source, boolean compat, int options)(Code) | | Static method to decompose a String.
The options parameter specifies which optional
Normalizer features are to be enabled for this operation.
Currently the only available option is
Normalizer.IGNORE_HANGUL .
The desired options should be OR'ed together to determine the value
of this argument. If you want the default behavior corresponding
to Unicode Normalization Form D or KD,
use 0 for this argument.
Parameters: source - the string to be decomposed. Parameters: compat - Perform compatibility decomposition. If this argument is false, only canonical decomposition will be performed. Returns: the decomposed string. |
decompose | public static String decompose(String source, boolean compat, int options, boolean addSingleQuotation)(Code) | | |
first | public char first()(Code) | | Return the first character in the normalized text. This resets
the Normalizer's position to the beginning of the text.
|
getBeginIndex | final public int getBeginIndex()(Code) | | Retrieve the index of the start of the input text. This is the begin index
of the CharacterIterator or the start (i.e. 0) of the String
over which this Normalizer is iterating.
|
getClass | final public static int getClass(char ch)(Code) | | |
getEndIndex | final public int getEndIndex()(Code) | | Retrieve the index of the end of the input text. This is the end index
of the CharacterIterator or the length of the String
over which this Normalizer is iterating.
|
getIndex | final public int getIndex()(Code) | | Retrieve the current iteration position in the input text that is
being normalized. This method is useful in applications such as
searching, where you need to be able to determine the position in
the input text that corresponds to a given normalized output character.
|
getOption | public boolean getOption(int option)(Code) | | Determine whether an option is turned on or off.
See Also: Normalizer.setOption |
hangulToJamo | static int hangulToJamo(char ch, StringBuffer result, int decompLimit)(Code) | | Convert a single Hangul syllable into one or more Jamo characters.
Parameters: conjoin - If true, decompose Jamo into conjoining Jamo. |
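The arithmetic behind converting a Hangul syllable to conjoining Jamo is defined by the Unicode Standard and can be sketched independently of this class. The version below is a sketch under assumptions: it takes a StringBuilder, omits the decompLimit parameter, and returns the number of characters appended, which may differ from what the method documented above returns.

```java
final class HangulToJamoSketch {
    // Constants from the Unicode Standard's Hangul syllable decomposition algorithm.
    static final int S_BASE = 0xAC00;  // first precomposed syllable
    static final int L_BASE = 0x1100;  // first leading consonant
    static final int V_BASE = 0x1161;  // first vowel
    static final int T_BASE = 0x11A7;  // one before the first trailing consonant
    static final int L_COUNT = 19;
    static final int V_COUNT = 21;
    static final int T_COUNT = 28;
    static final int N_COUNT = V_COUNT * T_COUNT;  // 588 syllables per leading consonant
    static final int S_COUNT = L_COUNT * N_COUNT;  // 11172 syllables in total

    /**
     * Appends the Jamo for ch to result and returns how many chars were appended.
     * Characters outside the syllable block are appended unchanged.
     */
    static int hangulToJamo(char ch, StringBuilder result) {
        int sIndex = ch - S_BASE;
        if (sIndex < 0 || sIndex >= S_COUNT) {
            result.append(ch);
            return 1;
        }
        char leading = (char) (L_BASE + sIndex / N_COUNT);
        char vowel = (char) (V_BASE + (sIndex % N_COUNT) / T_COUNT);
        int trailingIndex = sIndex % T_COUNT;
        result.append(leading).append(vowel);
        if (trailingIndex == 0) {
            return 2;  // syllable has no trailing consonant
        }
        result.append((char) (T_BASE + trailingIndex));
        return 3;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        System.out.println(hangulToJamo('\uD55C', sb)); // prints 3
        System.out.println(hangulToJamo('\uAC00', sb)); // prints 2
    }
}
```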
last | public char last()(Code) | | Return the last character in the normalized text. This resets
the Normalizer's position to be just before
the input text corresponding to that normalized character.
|
next | public char next()(Code) | | Return the current character in the normalized text and advance
the iteration position by one. If the end
of the text has already been reached,
Normalizer.DONE is returned.
|
normalize | public static String normalize(String str, Mode mode, int options)(Code) | | Normalizes a String using the given normalization operation.
The options parameter specifies which optional
Normalizer features are to be enabled for this operation.
Currently the only available option is
Normalizer.IGNORE_HANGUL .
If you want the default behavior corresponding to one of the standard
Unicode Normalization Forms, use 0 for this argument.
Parameters: str - the input string to be normalized. Parameters: mode - the normalization mode. Parameters: options - the optional features to be enabled. |
normalize | public static String normalize(String str, Mode mode, int options, boolean addSingleQuotation)(Code) | | |
previous | public char previous()(Code) | | Return the previous character in the normalized text and decrement
the iteration position by one. If the beginning
of the text has already been reached,
Normalizer.DONE is returned.
|
reset | public void reset()(Code) | | |
setIndex | public char setIndex(int index)(Code) | | Set the iteration position in the input text that is being normalized
and return the first normalized character at that position.
Parameters: index - the desired index in the input text. Returns: the first normalized character that is the result of iterating forward starting at the given index. throws: IllegalArgumentException - if the given index is less than Normalizer.getBeginIndex or greater than Normalizer.getEndIndex. |
setIndexOnly | public void setIndexOnly(int index)(Code) | | |
setMode | public void setMode(Mode newMode)(Code) | | Set the normalization mode for this object.
Note: If the normalization mode is changed while iterating
over a string, calls to next and previous may
return previously buffered characters in the old normalization mode
until the iteration is able to re-sync at the next base character.
It is safest to call setText(), first(), last(), etc.
after calling setMode.
Parameters: newMode - the new mode for this Normalizer. The supported modes are: COMPOSE, COMPOSE_COMPAT, DECOMP, DECOMP_COMPAT, and NO_OP. See Also: Normalizer.getMode |
setOption | public void setOption(int option, boolean value)(Code) | | Set options that affect this Normalizer's operation.
Options do not change the basic composition or decomposition operation
that is being performed, but they control whether
certain optional portions of the operation are done.
Currently the only available option is:
- Normalizer.IGNORE_HANGUL - Do not decompose Hangul syllables into the Jamo alphabet
and vice-versa. This option is off by default (i.e. Hangul processing
is enabled) since the Unicode standard specifies that Hangul to Jamo
is a canonical decomposition. For any of the standard Unicode Normalization
Forms, you should leave this option off.
Parameters: option - the option whose value is to be set. Parameters: value - the new setting for the option. Use true to turn the option on and false to turn it off. See Also: Normalizer.getOption |
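Options that are OR'ed together into one int are typically stored as bit flags. The sketch below shows one way setOption and getOption could work on such a field. It is an illustration only: the flag value 0x0001 and the class OptionFlagsSketch are hypothetical, and the actual value of IGNORE_HANGUL and the internal representation are not given on this page.

```java
final class OptionFlagsSketch {
    // Hypothetical flag value; the real constant's value is not documented here.
    static final int IGNORE_HANGUL = 0x0001;

    // All options start turned off, matching the documented default.
    private int options = 0;

    /** Turns a single option bit on or off without disturbing the others. */
    void setOption(int option, boolean value) {
        if (value) {
            options |= option;   // set the bit
        } else {
            options &= ~option;  // clear the bit
        }
    }

    /** Reports whether the given option bit is currently set. */
    boolean getOption(int option) {
        return (options & option) != 0;
    }

    public static void main(String[] args) {
        OptionFlagsSketch sketch = new OptionFlagsSketch();
        System.out.println(sketch.getOption(IGNORE_HANGUL)); // prints false
        sketch.setOption(IGNORE_HANGUL, true);
        System.out.println(sketch.getOption(IGNORE_HANGUL)); // prints true
    }
}
```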
setText | public void setText(String newText)(Code) | | Set the input text over which this Normalizer will iterate.
The iteration position will be reset to the beginning.
Parameters: newText - The new string to be normalized. |
setText | public void setText(CharacterIterator newText)(Code) | | Set the input text over which this Normalizer will iterate.
The iteration position will be reset to the beginning.
Parameters: newText - The new text to be normalized. |
|
|