| java.lang.Object sun.text.normalizer.UTF16
UTF16 | final public class UTF16 (Code) | | Standalone utility class providing UTF16 character conversions and
indexing conversions.
Code that uses strings alone rarely needs modification.
By design, UTF-16 does not allow overlap, so searching for strings is a safe
operation. Similarly, concatenation is always safe. Substringing is safe if
the start and end are both on UTF-32 boundaries. In normal code, the values
for start and end are on those boundaries, since they arose from operations
like searching. If not, the nearest UTF-32 boundaries can be determined
using bounds() .
Examples:
The following examples illustrate use of some of these methods.
// iteration forwards: Original
for (int i = 0; i < s.length(); ++i) {
char ch = s.charAt(i);
doSomethingWith(ch);
}
// iteration forwards: Changes for UTF-32
int ch;
for (int i = 0; i < s.length(); i+=UTF16.getCharCount(ch)) {
ch = UTF16.charAt(s,i);
doSomethingWith(ch);
}
// iteration backwards: Original
for (int i = s.length() -1; i >= 0; --i) {
char ch = s.charAt(i);
doSomethingWith(ch);
}
// iteration backwards: Changes for UTF-32
int ch;
for (int i = s.length() - 1; i >= 0; i -= UTF16.getCharCount(ch)) {
ch = UTF16.charAt(s,i);
doSomethingWith(ch);
}
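The loops above call sun.text.normalizer.UTF16, which is a JDK-internal class and is normally not callable from application code. As a minimal sketch of the same two patterns, the following uses the public java.lang API (String.codePointAt, String.codePointBefore, Character.charCount) as stand-ins. The class name CodePointIteration and its methods are hypothetical and are not part of the documented class.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointIteration {
    // Forward: read the code point starting at i, then advance by its width in chars.
    static List<Integer> forward(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp);
        }
        return out;
    }

    // Backward: read the code point ending just before i, then step back by its width.
    static List<Integer> backward(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i);
            out.add(cp);
            i -= Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDD0Az"; // 'a', U+1D50A (one surrogate pair), 'z'
        for (int cp : forward(s)) {
            System.out.println("forward:  U+" + Integer.toHexString(cp).toUpperCase());
        }
        for (int cp : backward(s)) {
            System.out.println("backward: U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

Unlike the backward loop above, this variant starts at s.length() and looks at the code point before the index, so the index always rests on a code point boundary.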
Notes:
-
Naming: For clarity, High and Low surrogates are called
Lead and Trail in the API, which gives a better
sense of their ordering in a string. offset16 and
offset32 are used to distinguish offsets to UTF-16
boundaries vs offsets to UTF-32 boundaries. int char32 is
used to contain UTF-32 characters, as opposed to char16 ,
which is a UTF-16 code unit.
-
Roundtripping Offsets: You can always roundtrip from a
UTF-32 offset to a UTF-16 offset and back. Because of the difference in
structure, you can roundtrip from a UTF-16 offset to a UTF-32 offset and
back if and only if
bounds(string, offset16) != TRAIL .
-
Exceptions: The error checking will throw an exception
if indices are out of bounds. Other than that, all methods will
behave reasonably, even if unmatched surrogates or out-of-bounds UTF-32
values are present.
UCharacter.isLegal() can be used to check
for validity if desired.
-
Unmatched Surrogates: If the string contains unmatched
surrogates, then these are counted as one UTF-32 value. This matches
their iteration behavior, which is vital. It also matches common display
practice as missing glyphs (see the Unicode Standard Section 5.4, 5.5).
-
Optimization: The method implementations may need
optimization if the compiler doesn't fold static final methods. Since
surrogate pairs will form an exceedingly small percentage of all the text
in the world, the singleton case should always be optimized for.
author: Mark Davis, with help from Markus Scherer |
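The roundtripping note can be illustrated with public java.lang.String methods. This is only a sketch under assumptions: String.offsetByCodePoints and String.codePointCount stand in for the offset conversions the note refers to, and the isInsidePair check stands in for the bounds(string, offset16) == TRAIL condition. The class and method names are hypothetical.

```java
final class OffsetRoundTrip {
    // UTF-32 (code point) offset -> UTF-16 (char) offset.
    static int toOffset16(String s, int offset32) {
        return s.offsetByCodePoints(0, offset32);
    }

    // UTF-16 (char) offset -> UTF-32 (code point) offset.
    static int toOffset32(String s, int offset16) {
        return s.codePointCount(0, offset16);
    }

    // True when offset16 points at the trail half of a well-formed surrogate pair.
    static boolean isInsidePair(String s, int offset16) {
        return offset16 > 0 && offset16 < s.length()
                && Character.isHighSurrogate(s.charAt(offset16 - 1))
                && Character.isLowSurrogate(s.charAt(offset16));
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDD0Az"; // 'a', U+1D50A (one surrogate pair), 'z'
        for (int offset16 = 0; offset16 <= s.length(); offset16++) {
            int back = toOffset16(s, toOffset32(s, offset16));
            System.out.println(offset16 + " -> " + back
                    + " insidePair=" + isInsidePair(s, offset16));
        }
    }
}
```

In this sketch the UTF-16 to UTF-32 to UTF-16 trip returns the original offset except when the offset falls between the two halves of a pair, which mirrors the condition stated in the note.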
Method Summary | |
public static StringBuffer | append(StringBuffer target, int char32) Append a single UTF-32 value to the end of a StringBuffer.
If a validity check is required, use
isLegal()
on char32 before calling.
Parameters: target - the buffer to append to Parameters: char32 - value to append. | public static int | charAt(String source, int offset16) Extract a single UTF-32 value from a string.
Used when iterating forwards or backwards (with
UTF16.getCharCount()), as well as for random access. | public static int | charAt(char[] source, int start, int limit, int offset16) Extract a single UTF-32 value from a substring.
Used when iterating forwards or backwards (with
UTF16.getCharCount()), as well as for random access. | public static int | getCharCount(int char32) Determines how many chars this char32 requires.
If a validity check is required, use
isLegal() on
char32 before calling.
Parameters: char32 - the input codepoint. | public static char | getLeadSurrogate(int char32) Returns the lead surrogate.
If a validity check is required, use
isLegal()
on char32 before calling.
Parameters: char32 - the input character. | public static char | getTrailSurrogate(int char32) Returns the trail surrogate.
If a validity check is required, use
isLegal()
on char32 before calling.
Parameters: char32 - the input character. | public static boolean | isLeadSurrogate(char char16) Determines whether the character is a lead surrogate.
Parameters: char16 - the input character. | public static boolean | isSurrogate(char char16) Determines whether the code value is a surrogate.
Parameters: char16 - the input character. | public static boolean | isTrailSurrogate(char char16) Determines whether the character is a trail surrogate.
Parameters: char16 - the input character. | public static int | moveCodePointOffset(char[] source, int start, int limit, int offset16, int shift32) Shifts offset16 by the given number of code points within a subarray. | public static String | valueOf(int char32) Convenience method corresponding to String.valueOf(char). |
CODEPOINT_MAX_VALUE | final public static int CODEPOINT_MAX_VALUE(Code) | | The highest Unicode code point value (scalar value) according to the
Unicode Standard.
|
CODEPOINT_MIN_VALUE | final public static int CODEPOINT_MIN_VALUE(Code) | | The lowest Unicode code point value.
|
LEAD_SURROGATE_MAX_VALUE | final public static int LEAD_SURROGATE_MAX_VALUE(Code) | | Lead surrogate maximum value
|
LEAD_SURROGATE_MIN_VALUE | final public static int LEAD_SURROGATE_MIN_VALUE(Code) | | Lead surrogate minimum value
|
SUPPLEMENTARY_MIN_VALUE | final public static int SUPPLEMENTARY_MIN_VALUE(Code) | | The minimum value for Supplementary code points
|
SURROGATE_MIN_VALUE | final public static int SURROGATE_MIN_VALUE(Code) | | Surrogate minimum value
|
TRAIL_SURROGATE_MAX_VALUE | final public static int TRAIL_SURROGATE_MAX_VALUE(Code) | | Trail surrogate maximum value
|
TRAIL_SURROGATE_MIN_VALUE | final public static int TRAIL_SURROGATE_MIN_VALUE(Code) | | Trail surrogate minimum value
|
append | public static StringBuffer append(StringBuffer target, int char32)(Code) | | Append a single UTF-32 value to the end of a StringBuffer.
If a validity check is required, use
isLegal()
on char32 before calling.
Parameters: target - the buffer to append to. Parameters: char32 - the value to append. Returns: the updated StringBuffer. exception: IllegalArgumentException - thrown when char32 does not lie within the range of Unicode code points. |
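A minimal sketch of the append behavior described above, written by hand rather than taken from the class source. The class name AppendSketch is hypothetical; the range check and surrogate arithmetic follow the standard UTF-16 encoding rules.

```java
final class AppendSketch {
    // Appends one code point to the buffer as one or two UTF-16 code units.
    static StringBuffer append(StringBuffer target, int char32) {
        if (char32 < 0 || char32 > 0x10FFFF) {
            throw new IllegalArgumentException(
                    "Illegal code point: " + Integer.toHexString(char32));
        }
        if (char32 >= 0x10000) {
            target.append((char) (0xD7C0 + (char32 >> 10)));   // lead surrogate
            target.append((char) (0xDC00 + (char32 & 0x3FF))); // trail surrogate
        } else {
            target.append((char) char32); // fits in a single char
        }
        return target;
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer("x");
        append(sb, 0x1D50A);
        System.out.println(sb.length()); // 'x' plus one surrogate pair
    }
}
```

In application code, the public StringBuffer.appendCodePoint(int) method covers the same use case.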
charAt | public static int charAt(String source, int offset16)(Code) | | Extract a single UTF-32 value from a string.
Used when iterating forwards or backwards (with
UTF16.getCharCount()), as well as for random access. If a
validity check is required, use
UCharacter.isLegal() on the return value.
If the char retrieved is part of a surrogate pair, its supplementary
character will be returned. If a complete supplementary character is
not found, the incomplete character will be returned.
Parameters: source - string of UTF-16 chars Parameters: offset16 - UTF-16 offset to the start of the character. Returns: the UTF-32 value that contains the char at offset16. The boundaries of that code point are the same as in bounds32(). exception: IndexOutOfBoundsException - thrown if offset16 is out of bounds. |
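The lookup rule described above (return the whole supplementary character whether offset16 lands on its lead or its trail, and return an unmatched surrogate as-is) can be sketched as follows. This is an illustrative reimplementation on top of java.lang.Character, not the class's actual source; the class name CharAtSketch is hypothetical.

```java
final class CharAtSketch {
    static int charAt(String source, int offset16) {
        // String.charAt throws an IndexOutOfBoundsException subclass when out of range.
        char single = source.charAt(offset16);
        if (Character.isHighSurrogate(single)) {
            // Lead: combine with the following trail, if there is one.
            int next = offset16 + 1;
            if (next < source.length() && Character.isLowSurrogate(source.charAt(next))) {
                return Character.toCodePoint(single, source.charAt(next));
            }
        } else if (Character.isLowSurrogate(single)) {
            // Trail: combine with the preceding lead, if there is one.
            int prev = offset16 - 1;
            if (prev >= 0 && Character.isHighSurrogate(source.charAt(prev))) {
                return Character.toCodePoint(source.charAt(prev), single);
            }
        }
        return single; // BMP char or unmatched surrogate, returned unchanged
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDD0Az"; // 'a', U+1D50A (one surrogate pair), 'z'
        for (int i = 0; i < s.length(); i++) {
            System.out.println(i + ": U+" + Integer.toHexString(charAt(s, i)).toUpperCase());
        }
    }
}
```

Note that the public String.codePointAt differs on one point: at a trail surrogate it returns the trail alone instead of looking back for the lead.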
charAt | public static int charAt(char[] source, int start, int limit, int offset16)(Code) | | Extract a single UTF-32 value from a substring.
Used when iterating forwards or backwards (with
UTF16.getCharCount()), as well as for random access. If a
validity check is required, use
UCharacter.isLegal()
on the return value.
If the char retrieved is part of a surrogate pair, its supplementary
character will be returned. If a complete supplementary character is
not found, the incomplete character will be returned.
Parameters: source - array of UTF-16 chars Parameters: start - start offset of the substring in the source array Parameters: limit - limit offset of the substring in the source array Parameters: offset16 - UTF-16 offset relative to start Returns: the UTF-32 value that contains the char at offset16. The boundaries of that code point are the same as in bounds32(). exception: IndexOutOfBoundsException - thrown if offset16 is not within the range of start and limit. |
getCharCount | public static int getCharCount(int char32)(Code) | | Determines how many chars this char32 requires.
If a validity check is required, use
isLegal() on
char32 before calling.
Parameters: char32 - the input code point. Returns: 2 if char32 is in supplementary space, otherwise 1. |
getLeadSurrogate | public static char getLeadSurrogate(int char32)(Code) | | Returns the lead surrogate.
If a validity check is required, use
isLegal()
on char32 before calling.
Parameters: char32 - the input character. Returns: the lead surrogate if getCharCount(char32) is 2, and 0 otherwise (note: 0 is not a valid lead surrogate). |
getTrailSurrogate | public static char getTrailSurrogate(int char32)(Code) | | Returns the trail surrogate.
If a validity check is required, use
isLegal()
on char32 before calling.
Parameters: char32 - the input character. Returns: the trail surrogate if getCharCount(char32) is 2; otherwise the character itself. |
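The return values documented for getCharCount, getLeadSurrogate and getTrailSurrogate follow from the standard UTF-16 arithmetic. A hand-written sketch, assuming the usual value 0x10000 for SUPPLEMENTARY_MIN_VALUE (this page does not list constant values); the class name SurrogateMath is hypothetical:

```java
final class SurrogateMath {
    static final int SUPPLEMENTARY_MIN_VALUE = 0x10000; // assumed standard value

    // 1 for a BMP code point, 2 for a supplementary code point.
    static int getCharCount(int char32) {
        return char32 < SUPPLEMENTARY_MIN_VALUE ? 1 : 2;
    }

    // Lead surrogate for a supplementary code point, 0 otherwise.
    static char getLeadSurrogate(int char32) {
        if (char32 >= SUPPLEMENTARY_MIN_VALUE) {
            // 0xD7C0 is 0xD800 - (0x10000 >> 10), folding the offset subtraction in.
            return (char) (0xD7C0 + (char32 >> 10));
        }
        return 0;
    }

    // Trail surrogate for a supplementary code point, the char itself otherwise.
    static char getTrailSurrogate(int char32) {
        if (char32 >= SUPPLEMENTARY_MIN_VALUE) {
            return (char) (0xDC00 + (char32 & 0x3FF));
        }
        return (char) char32;
    }

    public static void main(String[] args) {
        int cp = 0x1D50A;
        System.out.println(Integer.toHexString(getLeadSurrogate(cp)));
        System.out.println(Integer.toHexString(getTrailSurrogate(cp)));
    }
}
```

As the notes on validity checks imply, this sketch does not validate its input, so values above 0x10FFFF produce meaningless surrogates.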
isLeadSurrogate | public static boolean isLeadSurrogate(char char16)(Code) | | Determines whether the character is a lead surrogate.
Parameters: char16 - the input character. Returns: true iff the input character is a lead surrogate. |
isSurrogate | public static boolean isSurrogate(char char16)(Code) | | Determines whether the code value is a surrogate.
Parameters: char16 - the input character. Returns: true iff the input character is a surrogate. |
isTrailSurrogate | public static boolean isTrailSurrogate(char char16)(Code) | | Determines whether the character is a trail surrogate.
Parameters: char16 - the input character. Returns: true iff the input character is a trail surrogate. |
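The three surrogate tests reduce to range checks against the surrogate constants. The sketch below assumes the standard Unicode ranges (lead 0xD800 to 0xDBFF, trail 0xDC00 to 0xDFFF), since this page lists the constant names but not their values. The class name SurrogateChecks is hypothetical.

```java
final class SurrogateChecks {
    // Assumed values: the standard Unicode surrogate ranges.
    static final int LEAD_SURROGATE_MIN_VALUE = 0xD800;
    static final int LEAD_SURROGATE_MAX_VALUE = 0xDBFF;
    static final int TRAIL_SURROGATE_MIN_VALUE = 0xDC00;
    static final int TRAIL_SURROGATE_MAX_VALUE = 0xDFFF;

    static boolean isLeadSurrogate(char char16) {
        return LEAD_SURROGATE_MIN_VALUE <= char16 && char16 <= LEAD_SURROGATE_MAX_VALUE;
    }

    static boolean isTrailSurrogate(char char16) {
        return TRAIL_SURROGATE_MIN_VALUE <= char16 && char16 <= TRAIL_SURROGATE_MAX_VALUE;
    }

    // The lead and trail ranges are adjacent, so one range check covers both.
    static boolean isSurrogate(char char16) {
        return LEAD_SURROGATE_MIN_VALUE <= char16 && char16 <= TRAIL_SURROGATE_MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(isLeadSurrogate('\uD835'));
        System.out.println(isTrailSurrogate('\uDD0A'));
        System.out.println(isSurrogate('a'));
    }
}
```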
moveCodePointOffset | public static int moveCodePointOffset(char[] source, int start, int limit, int offset16, int shift32)(Code) | | Shifts offset16 by the given number of code points within a subarray.
Parameters: source - char array Parameters: start - start position of the subarray to operate on Parameters: limit - limit position of the subarray to operate on Parameters: offset16 - UTF-16 position to shift, relative to start Parameters: shift32 - number of code points to shift Returns: the new shifted offset16, relative to start. exception: IndexOutOfBoundsException - if the new offset16 is out of bounds with respect to the subarray or the subarray bounds are out of range. |
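A rough public-API analogue of this method is Character.offsetByCodePoints, which works with absolute array indices rather than offsets relative to start. The wrapper below is a sketch under that assumption, with a hypothetical class name; its handling of unmatched surrogates, or of a pair split by the subarray boundary, may differ from the documented method.

```java
final class MoveOffsetSketch {
    // Shifts offset16 (relative to start) by shift32 code points within [start, limit).
    static int moveCodePointOffset(char[] source, int start, int limit,
                                   int offset16, int shift32) {
        int absolute = Character.offsetByCodePoints(
                source, start, limit - start, start + offset16, shift32);
        return absolute - start; // convert back to an offset relative to start
    }

    public static void main(String[] args) {
        // Subarray [2, 6) holds 'a', U+1D50A (one surrogate pair), 'z'.
        char[] text = "xxa\uD835\uDD0Azxx".toCharArray();
        System.out.println(moveCodePointOffset(text, 2, 6, 0, 2));
        System.out.println(moveCodePointOffset(text, 2, 6, 4, -1));
    }
}
```

Shifting past either end of the subarray makes Character.offsetByCodePoints throw IndexOutOfBoundsException, which matches the exception listed above.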
valueOf | public static String valueOf(int char32)(Code) | | Convenience method corresponding to String.valueOf(char). Returns a one
or two char string containing the UTF-32 value in UTF16 format. If a
validity check is required, use
isLegal()
on char32 before calling.
Parameters: char32 - the input character. Returns: the string value of char32 in UTF-16 format. exception: IllegalArgumentException - thrown if char32 is an invalid code point. |
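For code that cannot reach this internal class, a comparable conversion can be sketched with the public Character.toChars method, which returns one or two chars and rejects values outside the code point range. The class name ValueOfSketch is hypothetical.

```java
final class ValueOfSketch {
    // Returns a one- or two-char string holding the given code point.
    static String valueOf(int char32) {
        // Character.toChars throws IllegalArgumentException for an invalid code point.
        return new String(Character.toChars(char32));
    }

    public static void main(String[] args) {
        System.out.println(valueOf(0x61).length());    // BMP code point
        System.out.println(valueOf(0x1D50A).length()); // supplementary code point
    }
}
```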
|
|