A compression engine implementing the Standard Compression Scheme
for Unicode (SCSU) as outlined in Unicode Technical
Report #6.
SCSU works by using dynamically positioned windows, each
consisting of 128 consecutive Unicode characters. During compression,
characters within a window are encoded in the compressed stream as the bytes
0x80 - 0xFF. SCSU provides transparency for the characters
(bytes) in the range U+0000 - U+00FF. SCSU approximates the
storage size of traditional character sets: for example, 1 byte per
character for ASCII or Latin-1 text, and 2 bytes per character for CJK
ideographs.
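The window mapping described above can be sketched in a few lines. This is an illustration only, not the code used by UnicodeCompressor: the class name ScsuWindowSketch, its methods, and the two window start positions are hypothetical, and the sketch assumes that a 128-character window maps one-to-one onto the 128 byte values 0x80 through 0xFF.

```java
public class ScsuWindowSketch {
    /** Number of consecutive characters covered by one window. */
    public static final int WINDOW_SIZE = 128;

    /** Returns true if c lies inside the window that begins at windowStart. */
    public static boolean inWindow(int windowStart, char c) {
        return c >= windowStart && c < windowStart + WINDOW_SIZE;
    }

    /**
     * Encodes a character that lies inside the window as a single byte value:
     * 0x80 plus the character's offset from the start of the window.
     */
    public static int encodeInWindow(int windowStart, char c) {
        if (!inWindow(windowStart, c)) {
            throw new IllegalArgumentException("character is outside the window");
        }
        return 0x80 + (c - windowStart);
    }

    public static void main(String[] args) {
        // Hypothetical window positions chosen for illustration.
        int latin1Window = 0x0080;   // covers U+0080 through U+00FF
        int cyrillicWindow = 0x0400; // covers U+0400 through U+047F

        // U+00E9 sits at offset 0x69 in the Latin-1 window, so it maps to 0xE9,
        // the same value as its Latin-1 byte.
        System.out.println(String.format("0x%02X",
                encodeInWindow(latin1Window, '\u00E9')));

        // U+042F sits at offset 0x2F in the Cyrillic window, so it maps to 0xAF.
        System.out.println(String.format("0x%02X",
                encodeInWindow(cyrillicWindow, '\u042F')));

        // A CJK ideograph such as U+4E2D falls outside both windows and
        // cannot be written as a single window byte.
        System.out.println(inWindow(cyrillicWindow, '\u4E2D'));
    }
}
```

With the window placed at U+0080, a Latin-1 character encodes to its own Latin-1 byte value, which is one way to read the transparency property mentioned above. Text that stays inside a single window costs one byte per character, which matches the storage estimate given for ASCII and Latin-1 text.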
USAGE
The static methods on UnicodeCompressor may be used in a
straightforward manner to compress simple strings:
String s = ... ; // get string from somewhere
byte [] compressed = UnicodeCompressor.compress(s);
The static methods have a fairly large memory footprint.
For finer-grained control over memory usage,
UnicodeCompressor offers more powerful APIs allowing
iterative compression:
// Compress an array "chars" of length "len" using a buffer of 512 bytes
// to the OutputStream "out"
UnicodeCompressor myCompressor = new UnicodeCompressor();
final int BUFSIZE = 512;
byte [] byteBuffer = new byte [ BUFSIZE ];
int bytesWritten = 0;
int [] unicharsRead = new int [1];
int totalCharsCompressed = 0;
int totalBytesWritten = 0;
do {
    // do the compression
    bytesWritten = myCompressor.compress(chars, totalCharsCompressed,
                                         len, unicharsRead,
                                         byteBuffer, 0, BUFSIZE);
    // do something with the current set of bytes
    out.write(byteBuffer, 0, bytesWritten);
    // update the no. of characters compressed
    totalCharsCompressed += unicharsRead[0];
    // update the no. of bytes written
    totalBytesWritten += bytesWritten;
} while (totalCharsCompressed < len);
myCompressor.reset(); // reuse compressor
See Also: UnicodeDecompressor
Author: Stephen F. Booth