Binary Ordered Compression for Unicode
Users are strongly encouraged to read the ICU paper on
BOCU before attempting to use this class.
BOCU is used to compress unicode text into a stream of unsigned
bytes. For many kinds of text the compression compares favorably
to UTF-8, and for some kinds of text (such as CJK) it does better.
The resulting bytes will compare in the same order as the original
code points. The byte stream does not contain the values 0, 1, or
2.
One example of a use of BOCU is in
com.ibm.icu.text.Collator.getCollationKey(String) for a RuleBasedCollator object with
collation strength IDENTICAL. The result CollationKey will consist of the
collation order of the source string followed by the BOCU result of the
source string.
Unlike a UTF encoding, BOCU-compressed text is not suitable for
random access.
Method: Slope Detection Remember the previous code point
(initial 0). For each code point in the string, encode the
difference with the previous one. Similar to a UTF, the length of
the byte sequence is encoded in the lead bytes. Unlike a UTF, the
trail byte values may overlap with lead/single byte values. The
signedness of the difference must be encoded as the most
significant part.
We encode differences with few bytes if their absolute values
are small. For correct ordering, we must treat the entire value
range -10ffff..+10ffff in ascending order, which forbids encoding
the sign and the absolute value separately. Instead, we split the
lead byte range in the middle and encode non-negative values going
up and negative values going down.
For very small absolute values, the difference is added to a
middle byte value for single-byte encoded differences. For
somewhat larger absolute values, the difference is divided by the
number of byte values available, the modulo is used for one trail
byte, and the remainder is added to a lead byte avoiding the
single-byte range. For large absolute values, the difference is
similarly encoded in three bytes. (Syn Wee, I need examples
here.)
BOCU does not use byte values 0, 1, or 2, but uses all other
byte values for lead and single bytes, so that the middle range of
single bytes is as large as possible.
Note that the lead byte ranges overlap some, but that the
sequences as a whole are well ordered. I.e., even if the lead byte
is the same for sequences of different lengths, the trail bytes
establish correct order. It would be possible to encode slightly
larger ranges for each length (>1) by subtracting the lower bound
of the range. However, that would also slow down the calculation.
(Syn Wee, need an example).
For the actual string encoding, an optimization moves the
previous code point value to the middle of its Unicode script block
to minimize the differences in same-script text runs. (Syn Wee,
need an example.)
author: Syn Wee Quek since: release 2.2, May 3rd 2002 |