| java.lang.Object java.io.Reader org.geoserver.ows.util.UCSReader
UCSReader | public class UCSReader extends Reader (Code) | | Reader for UCS-2 and UCS-4 encodings.
(more precisely ISO-10646-UCS-(2|4) encodings).
This variant is modified to handle supplementary Unicode code points
correctly, though this required a lot of new code and definitely
reduced performance compared to the original version. I tried my best
to preserve existing code and comments wherever possible.
I performed some basic tests, but not thorough ones, so
some bugs may still lurk in the code. -AK
author: Neil Graham, IBM version: $Id: UCSReader.java 6177 2007-02-19 10:11:27Z aaime $ |
Field Summary | |
final public static int | CHAR_BUFFER_INITIAL_SIZE Starting size of the internal char buffer. |
final public static int | DEFAULT_BUFFER_SIZE Default byte buffer size (8192, larger than that of ASCIIReader since it's reasonable to surmise that the average UCS-4-encoded file should be 4 times as large as the average ASCII-encoded file). |
final public static int | MAX_CODE_POINT The maximum value of a Unicode code point. |
final public static int | MIN_CODE_POINT The minimum value of a Unicode code point. |
final public static int | MIN_SUPPLEMENTARY_CODE_POINT The minimum value of a supplementary code point. |
final public static short | UCS2BE |
final public static short | UCS2LE |
final public static short | UCS4BE |
final public static short | UCS4LE |
protected byte[] | fBuffer Byte buffer. |
protected char[] | fCharBuf Stores previously read or "excess" characters that may appear during read method invocations because one input UCS-4 supplementary character results in two output Java chars: a high surrogate and a low surrogate code unit. Because of that, if the read() method encounters a supplementary code point in the input stream, it returns the UTF-16-encoded high surrogate code unit and stores the low surrogate in the buffer. |
protected int | fCharCount Count of Java chars currently stored in the fCharBuf array. |
protected short | fEncoding |
protected InputStream | fInputStream Input stream. |
Constructor Summary | |
public | UCSReader(InputStream inputStream, short encoding) Constructs an ISO-10646-UCS-(2|4) reader from the specified
input stream using default buffer size. | public | UCSReader(InputStream inputStream, int size, short encoding) Constructs an ISO-10646-UCS-(2|4) reader from the source
input stream using explicitly specified initial buffer size. |
Method Summary | |
public void | close() Close the stream. |
public String | getByteOrder() Returns byte order ("endianness") of the encoding currently in use by this character stream. |
public String | getEncoding() Returns the encoding currently in use by this character stream. Returns: encoding of this stream. |
protected boolean | isSupplementaryCodePoint(int codePoint) Determines whether the specified character (Unicode code point) is in the supplementary character range. |
public void | mark(int readAheadLimit) Mark the present position in the stream. |
public boolean | markSupported() Tell whether this stream supports the mark() operation. |
public int | read() Read a single character. |
public int | read(char[] ch, int offset, int length) Read characters into a portion of an array. |
protected int | readUCS2(char[] ch, int offset, int length) Read UCS-2 characters into a portion of an array. |
public boolean | ready() Tell whether this stream is ready to be read. Returns: true if the next read() is guaranteed not to block for input, false otherwise. |
public void | reset() Reset the stream. |
public long | skip(long n) Skip characters. |
CHAR_BUFFER_INITIAL_SIZE | final public static int CHAR_BUFFER_INITIAL_SIZE(Code) | | Starting size of the internal char buffer. The internal char buffer is
maintained to hold excess chars that may be left over from a previous read
operation when working with UCS-4 data (it is never used for UCS-2).
|
DEFAULT_BUFFER_SIZE | final public static int DEFAULT_BUFFER_SIZE(Code) | | Default byte buffer size (8192, larger than that of ASCIIReader
since it's reasonable to surmise that the average UCS-4-encoded
file should be 4 times as large as the average ASCII-encoded file).
|
MAX_CODE_POINT | final public static int MAX_CODE_POINT(Code) | | The maximum value of a Unicode code point.
|
MIN_CODE_POINT | final public static int MIN_CODE_POINT(Code) | | The minimum value of a Unicode code point.
|
MIN_SUPPLEMENTARY_CODE_POINT | final public static int MIN_SUPPLEMENTARY_CODE_POINT(Code) | | The minimum value of a supplementary code point.
|
UCS2BE | final public static short UCS2BE(Code) | | |
UCS2LE | final public static short UCS2LE(Code) | | |
UCS4BE | final public static short UCS4BE(Code) | | |
UCS4LE | final public static short UCS4LE(Code) | | |
fBuffer | protected byte[] fBuffer(Code) | | Byte buffer.
|
fCharBuf | protected char[] fCharBuf(Code) | | Stores previously read or "excess" characters that may appear during
read method invocations because one input UCS-4 supplementary
character results in two output Java chars: a high surrogate and
a low surrogate code unit.
Because of that, if the read() method encounters a supplementary
code point in the input stream, it returns the UTF-16-encoded high surrogate
code unit and stores the low surrogate in the buffer. The next time it is called,
read() returns this low surrogate instead of reading
more bytes from the InputStream. Similarly, if
read(char[], int, int) is invoked to read, for example,
10 chars into the specified buffer, and 4 of them turn out to
be supplementary Unicode characters, each written as two chars, then we
end up with 4 excess chars that we cannot immediately return or
push back to the input stream. So we need to store them in the buffer
to await further read invocations.
Note that the char buffer functions like a stack, i.e. chars and surrogate
pairs are stored in reverse order.
|
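The stack-like behaviour described above can be illustrated with a small sketch. This is not the actual UCSReader code: the class `PendingCharStack` and its methods are hypothetical names made up for this example. It only shows why storing excess chars in reverse order lets later reads pop them back in their original order.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the actual UCSReader internals.
final class PendingCharStack {
    private char[] buf = new char[2]; // small initial size, grows on demand
    private int count;                // number of chars currently stored

    /** Stores excess chars last-first, so the earliest char ends up on top. */
    void pushAll(char[] excess) {
        for (int i = excess.length - 1; i >= 0; i--) {
            if (count == buf.length) {
                buf = Arrays.copyOf(buf, buf.length * 2);
            }
            buf[count++] = excess[i];
        }
    }

    boolean isEmpty() {
        return count == 0;
    }

    /** Removes and returns the char on top of the stack. */
    char pop() {
        if (count == 0) {
            throw new IllegalStateException("no pending chars");
        }
        return buf[--count];
    }
}
```

A later read would drain this stack with `pop()` before touching the input stream again, which is the order of operations the description above implies.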
fCharCount | protected int fCharCount(Code) | | Count of Java chars currently stored in the
fCharBuf array.
|
fEncoding | protected short fEncoding(Code) | | The kind of data this reader is dealing with: one of UCS2LE, UCS2BE, UCS4LE or UCS4BE.
|
UCSReader | public UCSReader(InputStream inputStream, short encoding)(Code) | | Constructs an ISO-10646-UCS-(2|4) reader from the specified
input stream using the default buffer size. The endianness and exact input
encoding (UCS-2 or UCS-4) must also be known
in advance.
Parameters: inputStream - input stream with UCS-2|4 encoded data
Parameters: encoding - One of UCS2LE, UCS2BE, UCS4LE or UCS4BE. |
UCSReader | public UCSReader(InputStream inputStream, int size, short encoding)(Code) | | Constructs an ISO-10646-UCS-(2|4) reader from the source
input stream using an explicitly specified initial buffer size. The endianness
and exact input encoding (UCS-2 or UCS-4) must also
be known in advance.
Parameters: inputStream - input stream with UCS-2|4 encoded data
Parameters: size - The initial buffer size. Make sure this number is divisible by 4 if you plan to read UCS-4 with this class.
Parameters: encoding - One of UCS2LE, UCS2BE, UCS4LE or UCS4BE |
close | public void close() throws IOException(Code) | | Close the stream. Once a stream has been closed, further
read, ready, mark,
or reset invocations will throw an IOException.
Closing a previously closed stream, however, has no effect.
exception: IOException - If an I/O error occurs |
getByteOrder | public String getByteOrder()(Code) | | Returns byte order ("endianness") of the encoding currently in use by
this character stream. This is a string with two possible values:
LITTLE_ENDIAN and BIG_ENDIAN. Maybe using
a named constant is a better alternative, but I just don't like them.
Feel free to change this behavior if you think that would be
better.
Returns: LITTLE_ENDIAN or BIG_ENDIAN, depending on the byte order of the current encoding of this stream. |
getEncoding | public String getEncoding()(Code) | | Returns the encoding currently in use by this character stream.
Returns: encoding of this stream, either ISO-10646-UCS-2 or ISO-10646-UCS-4. The problem is that this string doesn't indicate the byte order of that encoding. What to do, then? Unlike UTF-16, the byte order cannot be made part of the encoding name in this case, yet it can still be critical. Currently you can find out the byte order by invoking the getByteOrder method. |
isSupplementaryCodePoint | protected boolean isSupplementaryCodePoint(int codePoint)(Code) | | Determines whether the specified character (Unicode code point)
is in the supplementary character range. The method call is
equivalent to the expression:
codePoint >= 0x10000 && codePoint <= 0x10ffff
Stolen from JDK 1.5 java.lang.Character class in
order to provide JDK 1.4 compatibility.
Parameters: codePoint - the character (Unicode code point) to be tested
Returns: true if the specified character is in the Unicode supplementary character range; false otherwise. |
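The range check can be written out as a short standalone sketch. The class name `SupplementaryCheck` is made up for this example, and the constant values are assumed to mirror those of `java.lang.Character` (0x10000 and 0x10FFFF), matching the expression quoted above.

```java
// Standalone sketch of the range check; the class name is hypothetical.
final class SupplementaryCheck {
    // Assumed to mirror the java.lang.Character constants of the same names.
    static final int MIN_SUPPLEMENTARY_CODE_POINT = 0x10000;
    static final int MAX_CODE_POINT = 0x10FFFF;

    /** True if codePoint lies in the supplementary range U+10000..U+10FFFF. */
    static boolean isSupplementaryCodePoint(int codePoint) {
        return codePoint >= MIN_SUPPLEMENTARY_CODE_POINT
                && codePoint <= MAX_CODE_POINT;
    }

    public static void main(String[] args) {
        System.out.println(isSupplementaryCodePoint(0x0041));  // BMP letter 'A'
        System.out.println(isSupplementaryCodePoint(0x1F600)); // outside the BMP
    }
}
```

On JDK 1.5 and later, `Character.isSupplementaryCodePoint(int)` performs the same test, so a local copy is only needed for older runtimes.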
mark | public void mark(int readAheadLimit) throws IOException(Code) | | Mark the present position in the stream. Subsequent calls to
reset will attempt to reposition the stream to this point.
Not all character-input streams support the mark operation.
This is one of them :) It relies on the marking facilities of the underlying
byte stream.
Parameters: readAheadLimit - Limit on the number of characters that may be read while still preserving the mark. After reading this many characters, attempting to reset the stream may fail.
exception: IOException - If the stream does not support mark, or if some other I/O error occurs |
markSupported | public boolean markSupported()(Code) | | Tell whether this stream supports the mark() operation.
|
read | public int read() throws IOException(Code) | | Read a single character. This method will block until a character is
available, an I/O error occurs, or the end of the stream is reached.
If a supplementary Unicode character is encountered in UCS-4
input, it will be encoded into a UTF-16 surrogate pair
according to RFC 2781. The high surrogate code unit will be returned
immediately, and the low surrogate saved in the internal buffer to be read
during the next read() or read(char[], int, int)
invocation. -AK
Returns: Java 16-bit char value containing a UTF-16 code unit, which may be either a code point from the Basic Multilingual Plane or one of the surrogate code units (high or low) of the pair representing a supplementary Unicode character (one in the 0x10000 - 0x10FFFF range) -AK
exception: IOException - when an I/O error occurs |
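The surrogate-pair encoding from RFC 2781 mentioned above can be sketched as follows. The class `SurrogateSplit` and its method `toSurrogates` are hypothetical names for this illustration and are not taken from UCSReader itself.

```java
// Sketch of the RFC 2781 encoding step; names are hypothetical.
final class SurrogateSplit {
    /** Returns {high, low} surrogates for a supplementary code point. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // fits in 20 bits
        char high = (char) (0xD800 | (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 | (offset & 0x3FF));  // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F600);
        // prints "D83D DE00"
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
    }
}
```

In terms of the behaviour described above, a single-character read would hand back the first element right away and keep the second for the next call.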
read | public int read(char[] ch, int offset, int length) throws IOException(Code) | | Read characters into a portion of an array. This method will block
until some input is available, an I/O error occurs, or the end of the
stream is reached.
I suspect that the whole thing works awfully slowly, so if you know
for sure that your UCS-4 input does not contain any
supplementary code points you probably should use the original
UCSReader class from the Xerces team
(org.apache.xerces.impl.io.UCSReader). -AK
Parameters: ch - Destination buffer
Parameters: offset - Offset at which to start storing characters
Parameters: length - Maximum number of characters to read
Returns: The number of characters read, or -1 if the end of the stream has been reached. Note that this is not the number of UCS-4 characters read, but the number of UTF-16 code units. These two are equal only if there were no supplementary Unicode code points among the chars read.
exception: IOException - If an I/O error occurs |
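To illustrate why the returned count is in UTF-16 code units rather than UCS-4 characters, here is a minimal standalone sketch that decodes big-endian UCS-4 bytes using only JDK classes. The class `Ucs4Decode` and method `decodeUcs4BE` are hypothetical and do not reproduce the buffered, incremental logic of this reader.

```java
// Minimal sketch, assuming well-formed big-endian UCS-4 input; names are hypothetical.
final class Ucs4Decode {
    /** Decodes big-endian UCS-4 bytes into UTF-16 code units. */
    static char[] decodeUcs4BE(byte[] in) {
        if (in.length % 4 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length; i += 4) {
            int codePoint = ((in[i] & 0xFF) << 24)
                    | ((in[i + 1] & 0xFF) << 16)
                    | ((in[i + 2] & 0xFF) << 8)
                    | (in[i + 3] & 0xFF);
            // Appends one char for BMP code points, two for supplementary ones.
            out.appendCodePoint(codePoint);
        }
        return out.toString().toCharArray();
    }

    public static void main(String[] args) {
        byte[] input = {
            0x00, 0x00, 0x00, 0x41,        // U+0041
            0x00, 0x01, (byte) 0xF6, 0x00  // U+1F600
        };
        // Two UCS-4 characters in, three UTF-16 code units out.
        System.out.println(decodeUcs4BE(input).length);
    }
}
```

Here two input characters produce three chars, so a caller counting characters of input has to account for surrogate pairs separately.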
readUCS2 | protected int readUCS2(char[] ch, int offset, int length) throws IOException(Code) | | Read UCS-2 characters into a portion of an array.
This method will block until some input is available, an I/O
error occurs, or the end of the stream is reached.
In the original UCSReader this code was part of the
read(char[], int, int) method, but I removed it
from there to reduce the complexity of the latter.
Parameters: ch - destination buffer
Parameters: offset - offset at which to start storing characters
Parameters: length - maximum number of characters to read
Returns: The number of characters read, or -1 if the end of the stream has been reached
exception: IOException - If an I/O error occurs |
ready | public boolean ready() throws IOException(Code) | | Tell whether this stream is ready to be read.
Returns: True if the next read() is guaranteed not to block for input, false otherwise. Note that returning false does not guarantee that the next read will block.
exception: IOException - If an I/O error occurs |
reset | public void reset() throws IOException(Code) | | Reset the stream. If the stream has been marked, then attempt to
reposition it at the mark. If the stream has not been marked, then
attempt to reset it in some way appropriate to the particular stream,
for example by repositioning it to its starting point. This stream
implementation does not support mark/reset
by itself; it relies on the underlying byte stream in this matter.
exception: IOException - If the stream has not been marked, or if the mark has been invalidated, or if the stream does not support reset(), or if some other I/O error occurs |
skip | public long skip(long n) throws IOException(Code) | | Skip characters. This method will block until some characters are
available, an I/O error occurs, or the end of the stream is reached.
Parameters: n - The number of characters to skip
Returns: The number of characters actually skipped
exception: IOException - If an I/O error occurs |