| java.lang.Object com.healthmarketscience.jackcess.scsu.SCSU com.healthmarketscience.jackcess.scsu.Expand
Expand | public class Expand extends SCSU (Code) | | Reference decoder for the Standard Compression Scheme for Unicode (SCSU)
Notes on the Java implementation
A limitation of Java is the exclusive use of a signed byte data type.
The following workarounds are required:
Copying a byte to an integer variable and adding 256 for 'negative'
bytes gives an integer in the range 0-255.
Values of char are between 0x0000 and 0xFFFF in Java. Arithmetic on
char values is unsigned.
Extended characters require an int to store them. The sign is not an
issue because only 1024*1024 + 65536 extended characters exist.
|
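The first workaround above can be sketched as follows. This is a minimal illustration only: the class `UnsignedByteSketch` and its method names are hypothetical and are not part of `Expand` or `SCSU`.

```java
// Hypothetical helper for illustration; not part of the Expand/SCSU classes.
final class UnsignedByteSketch {

    // Copy the byte to an int and add 256 for 'negative' bytes,
    // giving a value in the range 0-255.
    static int toUnsigned(byte b) {
        int i = b;
        return (i < 0) ? i + 256 : i;
    }

    // Equivalent form using a mask instead of a conditional add.
    static int toUnsignedMask(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE9;                    // stored as a signed byte
        System.out.println(b);                   // prints -23
        System.out.println(toUnsigned(b));       // prints 233
        System.out.println(toUnsignedMask(b));   // prints 233
    }
}
```

Both forms give the same result; the mask is the more common idiom, while the conditional add mirrors the wording of the note.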
Field Summary | |
protected int | iIn | protected int | iOut |
Method Summary | |
public int | bytesRead() | public static char | charFromTwoBytes(byte hi, byte lo) | public int | charsWritten() | protected void | defineExtendedWindow(char chOffset) (re-)define (and select) a window as an extended dynamic window
The surrogate area in Unicode allows access to 2**20 codes beyond the
first 64K codes by combining one of 1024 characters from the High
Surrogate Area with one of 1024 characters from the Low Surrogate
Area (see Unicode 2.0 for the details).
The tags SDX and UDX set the window such that each subsequent byte in
the range 80 to FF represents a surrogate pair. | protected void | defineWindow(int iWindow, byte bOffset) (re-)define (and select) a dynamic window
A sliding window position cannot start at any Unicode value,
so rather than providing an absolute offset, this function takes
an index value which selects among the possible starting values.
Most scripts in Unicode start on or near a half-block boundary
so the default behaviour is to multiply the index by 0x80. | public String | expand(byte[] in) | protected String | expandSingleByte(byte[] in) | protected int | expandUnicode(byte[] in, int iCur, StringBuffer sb) | public void | reset() |
iIn | protected int iIn(Code) | | input cursor used by the following functions
|
iOut | protected int iOut(Code) | | string buffer length used by the following functions
|
bytesRead | public int bytesRead()(Code) | | |
charFromTwoBytes | public static char charFromTwoBytes(byte hi, byte lo)(Code) | | assemble a char from two bytes
In Java, bytes are signed quantities, while chars are unsigned.
Parameters: hi - most significant byte Parameters: lo - least significant byte Returns: the character |
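One way such an assembly could be written is sketched below. The signature matches the one documented above, but the body and the enclosing class `TwoByteCharSketch` are assumptions for illustration, not the library's actual source.

```java
// Hypothetical sketch; the real Expand.charFromTwoBytes body may differ.
final class TwoByteCharSketch {

    // Mask each byte to 0-255 before shifting, so that sign extension
    // of a 'negative' byte cannot leak into the other half of the char.
    static char charFromTwoBytes(byte hi, byte lo) {
        return (char) (((hi & 0xFF) << 8) | (lo & 0xFF));
    }

    public static void main(String[] args) {
        char c = charFromTwoBytes((byte) 0x30, (byte) 0xA2);
        System.out.println(Integer.toHexString(c));   // prints 30a2
    }
}
```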
charsWritten | public int charsWritten()(Code) | | |
defineExtendedWindow | protected void defineExtendedWindow(char chOffset)(Code) | | (re-)define (and select) a window as an extended dynamic window
The surrogate area in Unicode allows access to 2**20 codes beyond the
first 64K codes by combining one of 1024 characters from the High
Surrogate Area with one of 1024 characters from the Low Surrogate
Area (see Unicode 2.0 for the details).
The tags SDX and UDX set the window such that each subsequent byte in
the range 80 to FF represents a surrogate pair. The following diagram
shows how the bits in the two bytes following the SDX or UDX, and a
subsequent data byte, map onto the bits in the resulting surrogate pair.
hbyte lbyte data
nnnwwwww zzzzzyyy 1xxxxxxx
high-surrogate low-surrogate
110110wwwwwzzzzz 110111yyyxxxxxxx
Parameters: chOffset - Since the three top bits of chOffset are not needed to set the location of the extended window, they are used instead to select the window, thereby reducing the number of needed command codes. The bottom 13 bits of chOffset are used to calculate the offset relative to a 7-bit input data byte to yield the 20 bits expressed by each surrogate pair. |
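The bit mapping in the diagram can be checked with a small sketch. It assumes chOffset is the 16-bit value formed from hbyte and lbyte, and that the window base is 0x10000 plus the 13-bit offset times 0x80. The class `ExtendedWindowSketch` and its method names are hypothetical and do not reproduce the library's code.

```java
// Hypothetical sketch of the SDX/UDX bit layout; not the library's source.
final class ExtendedWindowSketch {

    // Top three bits (nnn) select the window.
    static int windowIndex(char chOffset) {
        return chOffset >> 13;
    }

    // Bottom 13 bits (wwwwwzzzzzyyy) plus the 7 data bits (xxxxxxx)
    // give the 20 bits carried by the surrogate pair.
    static char[] surrogatePair(char chOffset, byte data) {
        int offset13 = chOffset & 0x1FFF;
        int x = data & 0x7F;
        int codePoint = 0x10000 + (offset13 << 7) + x;
        return Character.toChars(codePoint);   // {high surrogate, low surrogate}
    }

    public static void main(String[] args) {
        // hbyte = 0x20 (nnn = 001, wwwww = 0), lbyte = 0x08 (zzzzz = 1, yyy = 0)
        char chOffset = (char) 0x2008;
        char[] pair = surrogatePair(chOffset, (byte) 0x80);
        System.out.println(windowIndex(chOffset));          // prints 1
        System.out.println(Integer.toHexString(pair[0]));   // prints d801
        System.out.println(Integer.toHexString(pair[1]));   // prints dc00
    }
}
```

Because the low 7 bits of the window base are zero, adding the data bits never carries, which is why the diagram can show the fields as plain concatenation.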
defineWindow | protected void defineWindow(int iWindow, byte bOffset) throws IllegalInputException(Code) | | (re-)define (and select) a dynamic window
A sliding window position cannot start at any Unicode value,
so rather than providing an absolute offset, this function takes
an index value which selects among the possible starting values.
Most scripts in Unicode start on or near a half-block boundary
so the default behaviour is to multiply the index by 0x80. Han,
Hangul, Surrogates and other scripts between 0x3400 and 0xDFFF
show very poor locality--therefore no sliding window can be set
there. A jumpOffset is added to the index value to skip that region,
and only 167 index values total are required to select all eligible
half-blocks.
Finally, a few scripts straddle half block boundaries. For them, a
table of fixed offsets is used, and the index values from 0xF9 to
0xFF are used to select these special offsets.
After (re-)defining a window's location, it is selected so it is ready
for use.
Recall that all windows are of the same length (128 code positions).
Parameters: iWindow - index of the window to be (re-)defined Parameters: bOffset - index for the new offset value |
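The index-to-offset mapping described above might look like the sketch below. The multiplier 0x80, the skipped region, the total of 167 half-block indices, and the 0xF9 to 0xFF range come from the description. The exact range boundaries (0x01 to 0x67 and 0x68 to 0xA7), the jump offset 0xAC00, and the fixed-offset table values are assumptions taken from the published SCSU offset table and should be checked against it. The class `WindowOffsetSketch` is hypothetical, and it throws `IllegalArgumentException` where the real method declares `IllegalInputException`.

```java
// Hypothetical sketch; range boundaries and table values are assumptions.
final class WindowOffsetSketch {

    // Special offsets for scripts that straddle half-block boundaries,
    // selected by index values 0xF9..0xFF (values assumed from the SCSU table).
    private static final int[] FIXED_OFFSETS = {
        0x00C0, 0x0250, 0x0370, 0x0530, 0x3040, 0x30A0, 0xFF60
    };

    // Assumed jump that skips the region 0x3400..0xDFFF.
    private static final int JUMP_OFFSET = 0xAC00;

    static int offsetFor(byte bOffset) {
        int index = bOffset & 0xFF;               // undo Java's signed byte
        if (index >= 0x01 && index <= 0x67) {
            return index * 0x80;                  // half-blocks below 0x3400
        }
        if (index >= 0x68 && index <= 0xA7) {
            return index * 0x80 + JUMP_OFFSET;    // half-blocks from 0xE000 up
        }
        if (index >= 0xF9) {
            return FIXED_OFFSETS[index - 0xF9];   // table of special offsets
        }
        throw new IllegalArgumentException("reserved offset index: " + index);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(offsetFor((byte) 0x67)));   // prints 3380
        System.out.println(Integer.toHexString(offsetFor((byte) 0x68)));   // prints e000
        System.out.println(Integer.toHexString(offsetFor((byte) 0xF9)));   // prints c0
    }
}
```

Under these assumed boundaries the two linear ranges cover 103 + 64 = 167 index values, which agrees with the count given in the description.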
expandUnicode | protected int expandUnicode(byte[] in, int iCur, StringBuffer sb) throws IllegalInputException, EndOfInputException(Code) | | expand input that is in Unicode mode
Parameters: in - input byte array to be expanded Parameters: iCur - starting index Parameters: sb - string buffer to which to append expanded input Returns: the index of the last byte processed |
reset | public void reset()(Code) | | reset is called to start on new input without creating a new
instance
|
|
|