| java.lang.Object sun.text.normalizer.NormalizerBase
NormalizerBase | final public class NormalizerBase implements Cloneable | | Unicode Normalization
Unicode normalization API
normalize transforms Unicode text into an equivalent composed or
decomposed form, allowing for easier sorting and searching of text.
normalize supports the standard normalization forms described in
Unicode Standard Annex #15 — Unicode Normalization Forms.
Characters with accents or other adornments can be encoded in
several different ways in Unicode. For example, take the character A-acute.
In Unicode, this can be encoded as a single character (the
"composed" form):
00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
0041 LATIN CAPITAL LETTER A
0301 COMBINING ACUTE ACCENT
To a user of your program, however, both of these sequences should be
treated as the same "user-level" character "A with acute accent". When you
are searching or comparing text, you must ensure that these two sequences are
treated equivalently. In addition, you must handle characters with more than
one accent. Sometimes the order of a character's combining accents is
significant, while in other cases accent sequences in different orders are
really equivalent.
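The equivalence of the two A-acute encodings can be sketched as follows. This is a minimal illustration that assumes the public java.text.Normalizer class as a stand-in, since NormalizerBase is JDK-internal and may not be accessible from application code; the class name AcuteForms and its helper method are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class; java.text.Normalizer stands in for the internal NormalizerBase.
class AcuteForms {
    static final String COMPOSED = "\u00C1";    // LATIN CAPITAL LETTER A WITH ACUTE
    static final String DECOMPOSED = "A\u0301"; // LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

    // Two strings are canonically equivalent when their NFD forms are identical.
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        // A plain comparison sees two different char sequences.
        System.out.println(COMPOSED.equals(DECOMPOSED));                 // false
        // After normalization both represent the same user-level character.
        System.out.println(canonicallyEquivalent(COMPOSED, DECOMPOSED)); // true
    }
}
```

Comparing NFD forms is only one possible choice here; comparing NFC forms would work equally well for this purpose.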
Similarly, the string "ffi" can be encoded as three separate letters:
0066 LATIN SMALL LETTER F
0066 LATIN SMALL LETTER F
0069 LATIN SMALL LETTER I
or as the single character
FB03 LATIN SMALL LIGATURE FFI
The ffi ligature is not a distinct semantic character, and strictly speaking
it shouldn't be in Unicode at all, but it was included for compatibility
with existing character sets that already provided it. The Unicode standard
identifies such characters by giving them "compatibility" decompositions
into the corresponding semantic characters. When sorting and searching, you
will often want to use these mappings.
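As a rough sketch of the compatibility mapping described above, the example below again assumes the public java.text.Normalizer class rather than NormalizerBase itself. The class name LigatureDemo and the helper foldCompatibility are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class; uses the public java.text.Normalizer API as a stand-in.
class LigatureDemo {
    static final String LIGATURE = "\uFB03"; // LATIN SMALL LIGATURE FFI

    // Applies compatibility decomposition (NFKD) to the input.
    static String foldCompatibility(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        // Canonical decomposition leaves the ligature unchanged,
        // because it has only a compatibility decomposition.
        System.out.println(Normalizer.normalize(LIGATURE, Normalizer.Form.NFD).equals(LIGATURE)); // true
        // Compatibility decomposition maps it to the three letters f, f, i.
        System.out.println(foldCompatibility(LIGATURE)); // ffi
    }
}
```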
normalize helps solve these problems by transforming text into
the canonical composed and decomposed forms as shown in the first example
above. In addition, you can have it perform compatibility decompositions so
that you can treat compatibility characters the same as their equivalents.
Finally, normalize rearranges accents into the proper canonical
order, so that you do not have to worry about accent rearrangement on your
own.
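The accent reordering mentioned above can be illustrated with a short sketch, assuming the public java.text.Normalizer class as a stand-in for NormalizerBase. The class name CombiningOrderDemo is hypothetical. It shows one case where the order of the marks is not significant (marks with different combining classes) and one where it is (marks with the same combining class).

```java
import java.text.Normalizer;

// Hypothetical demo class; uses the public java.text.Normalizer API as a stand-in.
class CombiningOrderDemo {
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // U+0301 (acute, class 230) and U+0323 (dot below, class 220) differ in
        // combining class, so both input orders normalize to the same sequence.
        String acuteThenDot = "a\u0301\u0323";
        String dotThenAcute = "a\u0323\u0301";
        System.out.println(nfd(acuteThenDot).equals(nfd(dotThenAcute))); // true

        // U+0301 (acute) and U+0308 (diaeresis) share class 230, so their
        // relative order is significant and is preserved by normalization.
        String acuteThenDiaeresis = "a\u0301\u0308";
        String diaeresisThenAcute = "a\u0308\u0301";
        System.out.println(nfd(acuteThenDiaeresis).equals(nfd(diaeresisThenAcute))); // false
    }
}
```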
Form FCD, "Fast C or D", is also designed for collation.
It allows an algorithm that works under "canonical closure" (as collation
does), i.e., one that treats precomposed characters and their decomposed
equivalents the same, to work on strings that are not necessarily
normalized.
It is not a normalization form because it does not provide for uniqueness of
representation. Multiple strings may be canonically equivalent (their NFDs
are identical) and may all conform to FCD without being identical themselves.
The form is defined such that the "raw decomposition", the recursive
canonical decomposition of each character, results in a string that is
canonically ordered. This means that precomposed characters are allowed
as long as their decompositions do not need canonical reordering.
Its advantage for a process like collation is that all NFD and most NFC texts
- and many unnormalized texts - already conform to FCD and do not need to be
normalized (NFD) for such a process. The FCD quick check will return YES for
most strings in practice.
normalize(FCD) may be implemented with NFD.
For more details on FCD see the collation design document:
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/collation/ICU_collation_design.htm
ICU collation performs either NFD or FCD normalization automatically if
normalization is turned on for the collator object. Beyond collation and
string search, normalized strings may be useful for string equivalence
comparisons, transliteration/transcription, unique representations, etc.
The W3C generally recommends exchanging text in NFC.
Note also that most legacy character encodings use only precomposed forms and
often do not encode any combining marks by themselves. For conversion to such
character encodings the Unicode text needs to be normalized to NFC.
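The need for NFC before conversion to a legacy encoding can be sketched as follows. ISO-8859-1 is used here only as an example of such an encoding, and the public java.text.Normalizer class is assumed as a stand-in for NormalizerBase. The class name LegacyEncodingDemo and its helpers are hypothetical.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class; ISO-8859-1 serves as an example legacy encoding.
class LegacyEncodingDemo {
    // Reports whether every character of s can be represented in ISO-8859-1.
    static boolean encodableInLatin1(String s) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(s);
    }

    // Normalizes to NFC first, then converts to ISO-8859-1 bytes.
    static byte[] toLatin1(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String decomposed = "A\u0301"; // A + COMBINING ACUTE ACCENT
        // The combining mark has no ISO-8859-1 representation on its own.
        System.out.println(encodableInLatin1(decomposed)); // false
        // After NFC the text is the single precomposed character, one byte long.
        System.out.println(toLatin1(decomposed).length);   // 1
    }
}
```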
For more usage examples, see the Unicode Standard Annex.
|
Inner Class :public static class Mode | |
Inner Class :final public static class QuickCheckResult | |
Field Summary | |
final public static int | DONE Constant indicating that the end of the iteration has been reached. |
final public static QuickCheckResult | MAYBE Indicates it cannot be determined if string is in the normalized format without further thorough checks. |
final public static Mode | NFC Canonical decomposition followed by canonical composition. |
final public static Mode | NFD Canonical decomposition. |
final public static Mode | NFKC Compatibility decomposition followed by canonical composition. |
final public static Mode | NFKD Compatibility decomposition. |
final public static QuickCheckResult | NO |
final public static Mode | NONE No decomposition/composition. |
final public static int | UNICODE_3_2 Options bit set value to select Unicode 3.2 normalization (except NormalizationCorrections). |
final public static int | UNICODE_3_2_0_ORIGINAL |
final public static int | UNICODE_LATEST |
final public static QuickCheckResult | YES |
Constructor Summary | |
public | NormalizerBase(String str, Mode mode, int opt) Creates a new Normalizer object for iterating over the normalized form of a given string. The options parameter specifies which optional Normalizer features are to be enabled for this object. Parameters: str - The string to be normalized. |
public | NormalizerBase(CharacterIterator iter, Mode mode) Creates a new Normalizer object for iterating over the normalized form of the given text. Parameters: iter - The input text to be normalized. |
public | NormalizerBase(CharacterIterator iter, Mode mode, int opt) Creates a new Normalizer object for iterating over the normalized form of the given text. Parameters: iter - The input text to be normalized. |
public | NormalizerBase(String str, Mode mode) Creates a new Normalizer object for iterating over the normalized form of a given string. Parameters: str - The string to be normalized. |
Method Summary | |
public Object | clone() Clones this Normalizer object. |
public static String | compose(String str, boolean compat, int options) Compose a string. |
public int | current() |
public static String | decompose(String str, boolean compat) Decompose a string. The string will be decomposed according to the specified mode. Parameters: str - The string to decompose. Parameters: compat - If true the string will be decomposed according to NFKD rules and if false will be decomposed according to NFD rules. |
public static String | decompose(String str, boolean compat, int options) Decompose a string. The string will be decomposed according to the specified mode. Parameters: str - The string to decompose. Parameters: compat - If true the string will be decomposed according to NFKD rules and if false will be decomposed according to NFD rules. Parameters: options - The normalization options, ORed together (0 for no options). |
public int | endIndex() |
public int | getBeginIndex() Retrieve the index of the start of the input text. |
public int | getEndIndex() Retrieve the index of the end of the input text. |
public int | getIndex() Retrieve the current iteration position in the input text that is being normalized. |
public Mode | getMode() |
public static boolean | isNFSkippable(int c, Mode mode) |
public static boolean | isNormalized(String str, Normalizer.Form form) Test if a string is in a given normalization form. |
public static boolean | isNormalized(String str, Normalizer.Form form, int options) Test if a string is in a given normalization form. |
public int | next() Return the next character in the normalized text and advance the iteration position by one. |
public static int | normalize(char[] src, int srcStart, int srcLimit, char[] dest, int destStart, int destLimit, Mode mode, int options) Normalize a string. The string will be normalized according to the specified normalization mode and options. Parameters: src - The char array to normalize. Parameters: srcStart - Start index of the source Parameters: srcLimit - Limit index of the source Parameters: dest - The char buffer to fill in Parameters: destStart - Start index of the destination buffer Parameters: destLimit - End index of the destination buffer Parameters: mode - The normalization mode; one of Normalizer.NONE, Normalizer.NFD, Normalizer.NFC, Normalizer.NFKC, Normalizer.NFKD, Normalizer.DEFAULT Parameters: options - The normalization options, ORed together (0 for no options). |
public static String | normalize(String str, Normalizer.Form form) Normalizes a String using the given normalization form. |
public static String | normalize(String str, Normalizer.Form form, int options) Normalizes a String using the given normalization form. |
public int | previous() Return the previous character in the normalized text and decrement the iteration position by one. |
public void | reset() Reset the index to the beginning of the text. |
public int | setIndex(int index) Set the iteration position in the input text that is being normalized and return the first normalized character at that position. Note: This method sets the position in the input text, while NormalizerBase.next and NormalizerBase.previous iterate through characters in the normalized output. |
public void | setIndexOnly(int index) Set the iteration position in the input text that is being normalized, without any immediate normalization. |
public void | setMode(Mode newMode) Set the normalization mode for this object. Note: If the normalization mode is changed while iterating over a string, calls to NormalizerBase.next and NormalizerBase.previous may return previously buffered characters in the old normalization mode until the iteration is able to re-sync at the next base character. It is safest to call setText(), NormalizerBase.first, NormalizerBase.last, etc. after calling setMode. |
public void | setText(String newText) Set the input text over which this Normalizer will iterate. |
public void | setText(CharacterIterator newText) Set the input text over which this Normalizer will iterate. |
DONE | final public static int DONE | | Constant indicating that the end of the iteration has been reached.
This is guaranteed to have the same value as
UCharacterIterator.DONE.
|
MAYBE | final public static QuickCheckResult MAYBE | | Indicates it cannot be determined if string is in the normalized
format without further thorough checks.
|
NFC | final public static Mode NFC | | Canonical decomposition followed by canonical composition.
|
NFD | final public static Mode NFD | | Canonical decomposition.
|
NFKC | final public static Mode NFKC | | Compatibility decomposition followed by canonical composition.
|
NFKD | final public static Mode NFKD | | Compatibility decomposition.
|
NO | final public static QuickCheckResult NO | | Indicates that string is not in the normalized format.
|
NONE | final public static Mode NONE | | No decomposition/composition.
|
UNICODE_3_2 | final public static int UNICODE_3_2 | | Options bit set value to select Unicode 3.2 normalization
(except NormalizationCorrections).
At most one Unicode version can be selected at a time.
|
UNICODE_3_2_0_ORIGINAL | final public static int UNICODE_3_2_0_ORIGINAL | | |
UNICODE_LATEST | final public static int UNICODE_LATEST | | |
YES | final public static QuickCheckResult YES | | Indicates that string is in the normalized format.
|
NormalizerBase | public NormalizerBase(String str, Mode mode, int opt) | | Creates a new Normalizer object for iterating over the
normalized form of a given string.
The options parameter specifies which optional
Normalizer features are to be enabled for this object.
Parameters: str - The string to be normalized. The normalization will start at the beginning of the string. Parameters: mode - The normalization mode. Parameters: opt - Any optional features to be enabled. Currently the only available option is NormalizerBase.UNICODE_3_2. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument. |
NormalizerBase | public NormalizerBase(CharacterIterator iter, Mode mode) | | Creates a new Normalizer object for iterating over the
normalized form of the given text.
Parameters: iter - The input text to be normalized. The normalization will start at the beginning of the text. Parameters: mode - The normalization mode. |
NormalizerBase | public NormalizerBase(CharacterIterator iter, Mode mode, int opt) | | Creates a new Normalizer object for iterating over the
normalized form of the given text.
Parameters: iter - The input text to be normalized. The normalization will start at the beginning of the text. Parameters: mode - The normalization mode. Parameters: opt - Any optional features to be enabled. Currently the only available option is NormalizerBase.UNICODE_3_2. If you want the default behavior corresponding to one of the standard Unicode Normalization Forms, use 0 for this argument. |
NormalizerBase | public NormalizerBase(String str, Mode mode) | | Creates a new Normalizer object for iterating over the
normalized form of a given string.
Parameters: str - The string to be normalized. The normalization will start at the beginning of the string. Parameters: mode - The normalization mode. |
clone | public Object clone() | | Clones this Normalizer object. All properties of this
object are duplicated in the new object, including the cloning of any
CharacterIterator that was passed in to the constructor
or to setText(CharacterIterator).
However, the text storage underlying
the CharacterIterator is not duplicated unless the
iterator's clone method does so.
|
compose | public static String compose(String str, boolean compat, int options) | | Compose a string.
The string will be composed according to the specified mode.
Parameters: str - The string to compose. Parameters: compat - If true the string will be composed according to NFKC rules and if false will be composed according to NFC rules. Parameters: options - The only recognized option is UNICODE_3_2. Returns: The composed string |
current | public int current() | | Return the current character in the normalized text.
Returns: The code point as an int |
decompose | public static String decompose(String str, boolean compat) | | Decompose a string.
The string will be decomposed according to the specified mode.
Parameters: str - The string to decompose. Parameters: compat - If true the string will be decomposed according to NFKD rules and if false will be decomposed according to NFD rules. Returns: The decomposed string |
decompose | public static String decompose(String str, boolean compat, int options) | | Decompose a string.
The string will be decomposed according to the specified mode.
Parameters: str - The string to decompose. Parameters: compat - If true the string will be decomposed according to NFKD rules and if false will be decomposed according to NFD rules. Parameters: options - The normalization options, ORed together (0 for no options). Returns: The decomposed string |
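The role of the compat flag in compose and decompose can be sketched with the public java.text.Normalizer API. This is not the NormalizerBase implementation; it only assumes that the flag selects between the forms named in the parameter descriptions above, and it ignores the options argument. The class name ComposeDecomposeSketch is hypothetical.

```java
import java.text.Normalizer;

// Hypothetical sketch; mirrors the documented meaning of the compat flag
// using the public java.text.Normalizer API, without any options support.
class ComposeDecomposeSketch {
    // compat == true selects NFKD, compat == false selects NFD.
    static String decompose(String str, boolean compat) {
        return Normalizer.normalize(str, compat ? Normalizer.Form.NFKD : Normalizer.Form.NFD);
    }

    // compat == true selects NFKC, compat == false selects NFC.
    static String compose(String str, boolean compat) {
        return Normalizer.normalize(str, compat ? Normalizer.Form.NFKC : Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(decompose("\u00C1", false).length()); // 2: A + combining acute
        System.out.println(decompose("\uFB03", true));           // ffi
        System.out.println(compose("A\u0301", false).length());  // 1: precomposed A-acute
    }
}
```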
endIndex | public int endIndex() | | Retrieve the index of the end of the input text. This is the end index
of the CharacterIterator or the length of the String
over which this Normalizer is iterating.
Returns: The end index of the input text |
getBeginIndex | public int getBeginIndex() | | Retrieve the index of the start of the input text. This is the begin
index of the CharacterIterator or the start (i.e. 0) of the
String over which this Normalizer is iterating.
Returns: The begin index of the input text See Also: NormalizerBase.startIndex |
getEndIndex | public int getEndIndex() | | Retrieve the index of the end of the input text. This is the end index
of the CharacterIterator or the length of the String
over which this Normalizer is iterating.
Returns: The end index of the input text See Also: NormalizerBase.endIndex |
getIndex | public int getIndex() | | Retrieve the current iteration position in the input text that is
being normalized. This method is useful in applications such as
searching, where you need to be able to determine the position in
the input text that corresponds to a given normalized output character.
Note: This method reports a position in the input text, while
NormalizerBase.next and
NormalizerBase.previous iterate through characters in the
normalized output. This means that there is not necessarily a one-to-one
correspondence between characters returned by next and
previous and the indices passed to and returned from
setIndex and
NormalizerBase.getIndex.
Returns: The current iteration position |
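Why input indices and normalized output characters do not correspond one-to-one can be seen from a small sketch. It uses the public java.text.Normalizer class rather than the NormalizerBase iterator, so it only compares lengths of input and output and does not show getIndex itself. The class name IndexMappingDemo is hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class; compares input length with normalized output length.
class IndexMappingDemo {
    static int normalizedLength(String input, Normalizer.Form form) {
        return Normalizer.normalize(input, form).length();
    }

    public static void main(String[] args) {
        // One input char (the ffi ligature) expands to three output chars under NFKD,
        // so output position 3 ('x') corresponds to input index 1.
        String expanding = "\uFB03x";
        System.out.println(expanding.length());                                  // 2
        System.out.println(normalizedLength(expanding, Normalizer.Form.NFKD));   // 4

        // Two input chars (A + combining acute) contract to one output char under NFC.
        String contracting = "A\u0301";
        System.out.println(contracting.length());                                // 2
        System.out.println(normalizedLength(contracting, Normalizer.Form.NFC));  // 1
    }
}
```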
isNFSkippable | public static boolean isNFSkippable(int c, Mode mode) | | Internal API
|
isNormalized | public static boolean isNormalized(String str, Normalizer.Form form) | | Test if a string is in a given normalization form.
This is semantically equivalent to str.equals(normalize(str, form)).
Unlike quickCheck(), this function returns a definitive result,
never a "maybe".
For NFD, NFKD, and FCD, both functions work exactly the same.
For NFC and NFKC where quickCheck may return "maybe", this function will
perform further tests to arrive at a true/false result.
Parameters: str - the input string to be checked to see if it is normalized Parameters: form - the normalization form |
isNormalized | public static boolean isNormalized(String str, Normalizer.Form form, int options) | | Test if a string is in a given normalization form.
This is semantically equivalent to str.equals(normalize(str, form, options)).
Unlike quickCheck(), this function returns a definitive result,
never a "maybe".
For NFD, NFKD, and FCD, both functions work exactly the same.
For NFC and NFKC where quickCheck may return "maybe", this function will
perform further tests to arrive at a true/false result.
Parameters: str - the input string to be checked to see if it is normalized Parameters: form - the normalization form Parameters: options - the optional features to be enabled. |
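The stated equivalence between isNormalized and comparing a string with its own normalized form can be checked with a short sketch. It assumes the public java.text.Normalizer class, which offers isNormalized(CharSequence, Normalizer.Form) without an options argument. The class name IsNormalizedDemo and the helper isNormalizedByComparison are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class; uses the public java.text.Normalizer API as a stand-in.
class IsNormalizedDemo {
    // The "semantically equivalent" formulation: compare the string with its normalized form.
    static boolean isNormalizedByComparison(String s, Normalizer.Form form) {
        return s.equals(Normalizer.normalize(s, form));
    }

    public static void main(String[] args) {
        String decomposed = "A\u0301"; // already in NFD, not in NFC
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFD)); // true
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
        // Both formulations agree.
        System.out.println(isNormalizedByComparison(decomposed, Normalizer.Form.NFC)
                == Normalizer.isNormalized(decomposed, Normalizer.Form.NFC));          // true
    }
}
```

The direct isNormalized call is usually preferable because it does not need to build the normalized string when the input is already in the requested form.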
next | public int next() | | Return the next character in the normalized text and advance
the iteration position by one. If the end
of the text has already been reached,
NormalizerBase.DONE is returned.
Returns: The code point as an int |
normalize | public static int normalize(char[] src, int srcStart, int srcLimit, char[] dest, int destStart, int destLimit, Mode mode, int options) | | Normalize a string.
The string will be normalized according to the specified normalization
mode and options.
Parameters: src - The char array to normalize. Parameters: srcStart - Start index of the source Parameters: srcLimit - Limit index of the source Parameters: dest - The char buffer to fill in Parameters: destStart - Start index of the destination buffer Parameters: destLimit - End index of the destination buffer Parameters: mode - The normalization mode; one of Normalizer.NONE, Normalizer.NFD, Normalizer.NFC, Normalizer.NFKC, Normalizer.NFKD, Normalizer.DEFAULT Parameters: options - The normalization options, ORed together (0 for no options). Returns: The total buffer size needed; if greater than the length of the result, the output was truncated. Throws: IndexOutOfBoundsException - if the target capacity is less than the required length |
normalize | public static String normalize(String str, Normalizer.Form form) | | Normalizes a String using the given normalization form.
Parameters: str - the input string to be normalized. Parameters: form - the normalization form |
normalize | public static String normalize(String str, Normalizer.Form form, int options) | | Normalizes a String using the given normalization form.
Parameters: str - the input string to be normalized. Parameters: form - the normalization form Parameters: options - the optional features to be enabled. |
previous | public int previous() | | Return the previous character in the normalized text and decrement
the iteration position by one. If the beginning
of the text has already been reached,
NormalizerBase.DONE is returned.
Returns: The code point as an int |
reset | public void reset() | | Reset the index to the beginning of the text.
This is equivalent to setIndexOnly(startIndex).
|
setIndex | public int setIndex(int index) | | Set the iteration position in the input text that is being normalized
and return the first normalized character at that position.
Note: This method sets the position in the input text,
while
NormalizerBase.next and
NormalizerBase.previous iterate through characters
in the normalized output. This means that there is not
necessarily a one-to-one correspondence between characters returned
by next and previous and the indices passed to and
returned from setIndex and
NormalizerBase.getIndex.
Parameters: index - the desired index in the input text. Returns: the first normalized character that is the result of iterating forward starting at the given index, as a code point int. Throws: IllegalArgumentException - if the given index is less than NormalizerBase.getBeginIndex or greater than NormalizerBase.getEndIndex. |
setIndexOnly | public void setIndexOnly(int index) | | Set the iteration position in the input text that is being normalized,
without any immediate normalization.
After setIndexOnly(), getIndex() will return the same index that is
specified here.
Parameters: index - the desired index in the input text. |
setText | public void setText(String newText) | | Set the input text over which this Normalizer will iterate.
The iteration position is set to the beginning of the input text.
Parameters: newText - The new string to be normalized. |
setText | public void setText(CharacterIterator newText) | | Set the input text over which this Normalizer will iterate.
The iteration position is set to the beginning of the input text.
Parameters: newText - The new text to be normalized. |
|
|