| java.lang.Object com.healthmarketscience.jackcess.scsu.SCSU
All known Subclasses: com.healthmarketscience.jackcess.scsu.Expand,
SCSU | abstract public class SCSU (Code) | | Encoding text data in Unicode often requires more storage than using
an existing 8-bit character set that is limited to the subset of characters
actually found in the text. The Unicode Compression Algorithm reduces
the necessary storage while retaining the universality of Unicode.
A full description of the algorithm can be found in document
http://www.unicode.org/unicode/reports/tr6.html
Summary
The goals of the Unicode Compression Algorithm are the ability to:
Express all code points in Unicode
Approximate storage size for traditional character sets
Work well for short strings
Provide transparency for Latin-1 data
Support very simple decoders
Support simple as well as sophisticated encoders
If needed, further compression can be achieved by layering standard
file or disk-block based compression algorithms on top.
Features
Languages using small alphabets would contain runs of characters that
are coded close together in Unicode. These runs are interrupted only
by punctuation characters, which are themselves coded in proximity to
each other in Unicode (usually in the ASCII range).
Two basic mechanisms in the compression algorithm account for these two
cases: sliding windows and static windows. A window is an area of 128
consecutive characters in Unicode. In the compressed data stream, each
character from a sliding window would be represented as a byte between
0x80 and 0xFF, while a byte from 0x20 to 0x7F (as well as CR, LF, and
TAB) would always mean an ASCII character (or control).
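The window mechanism described above can be illustrated from the encoder's side. The following is only a sketch: the class and method names are hypothetical and not part of this package, and the window offset 0x0400 (basic Cyrillic) is assumed for illustration.

```java
final class WindowByteSketch {
    /** Size of a window: 128 consecutive code points. */
    static final int WINDOW_SIZE = 0x80;

    /**
     * Returns the single byte (as an int in 0x80..0xFF) that represents ch
     * relative to a sliding window starting at windowOffset, or -1 if ch
     * lies outside that window.
     */
    static int toWindowByte(char ch, int windowOffset) {
        int delta = ch - windowOffset;
        if (delta < 0 || delta >= WINDOW_SIZE) {
            return -1;
        }
        return 0x80 + delta;
    }

    /** True for bytes that always stand for themselves: 0x20..0x7F, CR, LF, TAB. */
    static boolean isPassThrough(int b) {
        return (b >= 0x20 && b <= 0x7F) || b == 0x0D || b == 0x0A || b == 0x09;
    }

    public static void main(String[] args) {
        int cyrillic = 0x0400; // assumed window offset, chosen for illustration
        for (char ch : "\u0416\u0443\u043A".toCharArray()) {
            System.out.printf("U+%04X -> 0x%02X%n", (int) ch, toWindowByte(ch, cyrillic));
        }
    }
}
```

Each Cyrillic letter in the sample string falls inside the same window, so it costs one byte instead of two, while ASCII punctuation between such letters passes through unchanged.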
Notes on the Java implementation
A limitation of Java is the exclusive use of a signed byte data type.
The following workarounds are required:
Copying a byte to an integer variable and adding 256 for 'negative'
bytes gives an integer in the range 0-255.
Values of char are between 0x0000 and 0xFFFF in Java. Arithmetic on
char values is unsigned.
Extended characters require an int to store them. The sign is not an
issue because Unicode has only 1024*1024 + 65536 code points in total,
far fewer than a positive int can hold.
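The workarounds listed above might look as follows in practice. This is a minimal sketch with hypothetical names. It uses only standard `java.lang` calls and is not taken from the class itself.

```java
final class SignedByteSketch {
    /** Widens a signed byte to the range 0..255 by adding 256 to negative values. */
    static int toUnsigned(byte b) {
        int value = b;          // sign-extending copy: -128..127
        if (value < 0) {
            value += 256;       // 'negative' bytes map to 128..255
        }
        return value;           // same result as (b & 0xFF)
    }

    /** Combines a UTF-16 surrogate pair into one extended code point held in an int. */
    static int toCodePoint(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        System.out.println(toUnsigned((byte) 0x96));
        // char is unsigned: widening 0xFFFF to int never yields a negative value
        System.out.println((int) '\uFFFF');
        System.out.println(Integer.toHexString(toCodePoint('\uD83D', '\uDE00')));
    }
}
```

The masking form `b & 0xFF` is the more common idiom today; the add-256 form is shown because it matches the wording of the note.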
|
Field Summary | |
final static byte | SC0 | SCn Change to Window n.
final static byte | SC1 |
final static byte | SC2 |
final static byte | SC3 |
final static byte | SC4 |
final static byte | SC5 |
final static byte | SC6 |
final static byte | SC7 |
final static byte | SCU |
final static byte | SD0 |
final static byte | SD1 |
final static byte | SD2 |
final static byte | SD3 |
final static byte | SD4 |
final static byte | SD5 |
final static byte | SD6 |
final static byte | SD7 |
final static byte | SDX |
final static byte | SQ0 | SQn Quote from Window n.
final static byte | SQ1 |
final static byte | SQ2 |
final static byte | SQ3 |
final static byte | SQ4 |
final static byte | SQ5 |
final static byte | SQ6 |
final static byte | SQ7 |
final static byte | SQU |
final static byte | Srs |
final static byte | UC0 |
final static byte | UC1 |
final static byte | UC2 |
final static byte | UC3 |
final static byte | UC4 |
final static byte | UC5 |
final static byte | UC6 |
final static byte | UC7 |
final static byte | UD0 |
final static byte | UD1 |
final static byte | UD2 |
final static byte | UD3 |
final static byte | UD4 |
final static byte | UD5 |
final static byte | UD6 |
final static byte | UD7 |
final static byte | UDX |
final static byte | UQU |
final static byte | Urs |
int | dynamicOffset | Dynamic window offsets, initialized to default values.
final static int | fixedOffset |
final static int | fixedThreshold |
final static int | gapOffset |
final static int | gapThreshold | Unicode code points from 3400 to E000 are not addressable by dynamic window, since in these areas no short run alphabets are found.
final static int | initialDynamicOffset |
final static int | reservedStart |
final static int | staticOffset |
SC0 | final static byte SC0(Code) | | SCn Change to Window n.
If the following bytes are less than 0x80, interpret them
as command bytes or pass them through, else add the offset
for dynamic window n.
|
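The SCn behaviour described above can be sketched as a small decoding loop. This is an illustration only, not the decoder shipped in this package: the class name is hypothetical, the tag values 0x10..0x17 for SC0..SC7 and the initial window offsets are assumed from the SCSU specification rather than read from this class, and every other tag is rejected instead of handled.

```java
final class SingleByteModeSketch {
    // Assumed tag values (from the SCSU specification): SC0..SC7 = 0x10..0x17.
    static final int SC0 = 0x10;
    static final int SC7 = 0x17;

    // Assumed initial offsets of the 8 dynamic windows (all in the BMP).
    static final int[] INITIAL_DYNAMIC_OFFSET = {
        0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00
    };

    /** Decodes input that uses only pass-through bytes, window bytes and SCn tags. */
    static String decode(byte[] in, int[] dynamicOffset) {
        StringBuilder out = new StringBuilder();
        int window = 0; // window 0 is assumed active at the start
        for (byte raw : in) {
            int b = raw & 0xFF;
            if (b >= 0x80) {
                // byte from the active dynamic window: add that window's offset
                out.append((char) (dynamicOffset[window] + (b - 0x80)));
            } else if (b >= SC0 && b <= SC7) {
                window = b - SC0; // SCn: change to window n
            } else if (b >= 0x20 || b == 0x09 || b == 0x0A || b == 0x0D) {
                out.append((char) b); // ASCII passes through
            } else {
                throw new IllegalArgumentException(
                    "tag not handled in this sketch: 0x" + Integer.toHexString(b));
            }
        }
        return out.toString();
    }
}
```

For example, the bytes 0x41, 0xE9, 0x12, 0x96 would decode to 'A', U+00E9 (window 0), then U+0416 after SC2 selects the window at 0x0400.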
SC1 | final static byte SC1(Code) | | |
SC2 | final static byte SC2(Code) | | |
SC3 | final static byte SC3(Code) | | |
SC4 | final static byte SC4(Code) | | |
SC5 | final static byte SC5(Code) | | |
SC6 | final static byte SC6(Code) | | |
SC7 | final static byte SC7(Code) | | |
SCU | final static byte SCU(Code) | | |
SD0 | final static byte SD0(Code) | | |
SD1 | final static byte SD1(Code) | | |
SD2 | final static byte SD2(Code) | | |
SD3 | final static byte SD3(Code) | | |
SD4 | final static byte SD4(Code) | | |
SD5 | final static byte SD5(Code) | | |
SD6 | final static byte SD6(Code) | | |
SD7 | final static byte SD7(Code) | | |
SDX | final static byte SDX(Code) | | |
SQ0 | final static byte SQ0(Code) | | SQn Quote from Window n.
If the following byte is less than 0x80, quote from
static window n, else quote from dynamic window n.
|
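The SQn rule can be sketched as a single lookup. The names below are hypothetical, and the offset tables are assumed from the SCSU specification rather than copied from this class. Unlike SCn, a quote affects only the one byte that follows it and leaves the active window unchanged.

```java
final class QuoteSketch {
    // Assumed constant offsets of the 8 static windows.
    static final int[] STATIC_OFFSET = {
        0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000
    };

    // Assumed initial offsets of the 8 dynamic windows.
    static final int[] INITIAL_DYNAMIC_OFFSET = {
        0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00
    };

    /**
     * Resolves the byte b (0..255) that follows an SQn tag.
     * Bytes below 0x80 index static window n; others index dynamic window n.
     */
    static char quote(int n, int b, int[] dynamicOffset) {
        if (b < 0x80) {
            return (char) (STATIC_OFFSET[n] + b);
        }
        return (char) (dynamicOffset[n] + (b - 0x80));
    }
}
```

With these assumed tables, `quote(4, 0x14, ...)` yields U+2014 from static window 4 (general punctuation), and `quote(2, 0x96, ...)` yields U+0416 from dynamic window 2.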
SQ1 | final static byte SQ1(Code) | | |
SQ2 | final static byte SQ2(Code) | | |
SQ3 | final static byte SQ3(Code) | | |
SQ4 | final static byte SQ4(Code) | | |
SQ5 | final static byte SQ5(Code) | | |
SQ6 | final static byte SQ6(Code) | | |
SQ7 | final static byte SQ7(Code) | | |
SQU | final static byte SQU(Code) | | |
Srs | final static byte Srs(Code) | | |
UC0 | final static byte UC0(Code) | | |
UC1 | final static byte UC1(Code) | | |
UC2 | final static byte UC2(Code) | | |
UC3 | final static byte UC3(Code) | | |
UC4 | final static byte UC4(Code) | | |
UC5 | final static byte UC5(Code) | | |
UC6 | final static byte UC6(Code) | | |
UC7 | final static byte UC7(Code) | | |
UD0 | final static byte UD0(Code) | | |
UD1 | final static byte UD1(Code) | | |
UD2 | final static byte UD2(Code) | | |
UD3 | final static byte UD3(Code) | | |
UD4 | final static byte UD4(Code) | | |
UD5 | final static byte UD5(Code) | | |
UD6 | final static byte UD6(Code) | | |
UD7 | final static byte UD7(Code) | | |
UDX | final static byte UDX(Code) | | |
UQU | final static byte UQU(Code) | | |
Urs | final static byte Urs(Code) | | |
dynamicOffset | int dynamicOffset(Code) | | Dynamic window offsets, initialized to default values.
|
fixedOffset | final static int fixedOffset(Code) | | Table of fixed predefined offsets, and byte values that index into
|
fixedThreshold | final static int fixedThreshold(Code) | | |
gapOffset | final static int gapOffset(Code) | | |
gapThreshold | final static int gapThreshold(Code) | | Unicode code points from 3400 to E000 are not addressable by
dynamic window, since in these areas no short run alphabets are
found. Therefore add gapOffset to all values from gapThreshold.
|
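How gapThreshold, gapOffset, reservedStart, fixedThreshold and fixedOffset might combine when a window is defined from an index byte is sketched below. The constant values are assumptions taken from the SCSU specification, not read from this class, and the class and method names are hypothetical.

```java
final class WindowOffsetSketch {
    // All values below are assumed from the SCSU specification.
    static final int GAP_THRESHOLD = 0x68;
    static final int GAP_OFFSET = 0xAC00;
    static final int RESERVED_START = 0xA8;
    static final int FIXED_THRESHOLD = 0xF9;
    static final int[] FIXED_OFFSET = {
        0x00C0, 0x0250, 0x0370, 0x0530, 0x3040, 0x30A0, 0xFF60
    };

    /** Maps a window-definition index byte (0..255) to a window start offset. */
    static int offsetFor(int index) {
        if (index <= 0 || index > 0xFF) {
            throw new IllegalArgumentException("index out of range or reserved: " + index);
        }
        if (index < GAP_THRESHOLD) {
            return index * 0x80;                 // windows below 0x3400
        }
        if (index < RESERVED_START) {
            return index * 0x80 + GAP_OFFSET;    // skip the gap: windows from 0xE000
        }
        if (index < FIXED_THRESHOLD) {
            throw new IllegalArgumentException("reserved index: " + index);
        }
        return FIXED_OFFSET[index - FIXED_THRESHOLD]; // predefined windows
    }
}
```

Under these assumptions, index 0x67 gives 0x3380, the last window below the gap, and index 0x68 gives 0x68 * 0x80 + 0xAC00 = 0xE000, the first window above it.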
initialDynamicOffset | final static int initialDynamicOffset(Code) | | initial offsets for the 8 dynamic (sliding) windows
|
reservedStart | final static int reservedStart(Code) | | |
staticOffset | final static int staticOffset(Code) | | constant offsets for the 8 static windows
|
getCurrentWindow | protected int getCurrentWindow()(Code) | | return the index of the active dynamic window
|
isCompressible | public static boolean isCompressible(char ch)(Code) | | whether a character is compressible
|
reset | public void reset()(Code) | | reset is only needed to bail out after an exception and
restart with new input
|
selectWindow | protected void selectWindow(int iWindow)(Code) | | select the active dynamic window
|