| java.lang.Object com.jclark.xml.tok.Encoding
All known Subclasses: com.jclark.xml.tok.UTF16BigEndianEncoding, com.jclark.xml.tok.UTF8Encoding, com.jclark.xml.tok.UTF16LittleEndianEncoding, com.jclark.xml.tok.InternalEncoding, com.jclark.xml.tok.SingleByteEncoding, com.jclark.xml.tok.ISO8859_1Encoding, com.jclark.xml.tok.ASCIIEncoding
Encoding | abstract public class Encoding (Code) | | An Encoding object corresponds to a possible
encoding (a mapping from characters to sequences of bytes).
It provides operations on byte arrays
that represent all or part of a parsed XML entity in that encoding.
The set of ASCII characters excluding $@\^`{}~
have a special status; these are called XML significant
characters.
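The definition above can be sketched as a small predicate. This is a minimal illustration that reads the sentence literally (every ASCII code point except the seven listed characters); the class and method names are hypothetical, and the library's own lookup tables are not shown on this page, so they may classify characters differently.

```java
final class XmlSignificant {
    // The ASCII characters that the description above excludes.
    private static final String EXCLUDED = "$@\\^`{}~";

    /** Returns true if c is an ASCII character that is not in the excluded set. */
    static boolean isXmlSignificant(char c) {
        return c < 0x80 && EXCLUDED.indexOf(c) < 0;
    }

    public static void main(String[] args) {
        System.out.println(isXmlSignificant('<'));      // true
        System.out.println(isXmlSignificant('~'));      // false
        System.out.println(isXmlSignificant('\u00e9')); // false (not ASCII)
    }
}
```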
This class imposes certain restrictions on an encoding:
- the encoding must be stateless;
- a single byte must not encode more than one character;
- all XML significant characters must be encoded by the same number
of bytes, and no character may be encoded by fewer bytes.
Several methods operate on byte subarrays. The subarray is specified
by a byte array buf and two integers,
off and end ; off
gives the index in buf of the first byte of the subarray
and end gives the
index in buf of the byte immediately after the last byte.
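The subarray convention is a half-open range: off is inclusive, end is exclusive, and the length is end - off. A minimal sketch, using a hypothetical helper that is not part of this class:

```java
import java.nio.charset.StandardCharsets;

final class SubarrayDemo {
    /** Decodes the bytes in the half-open range [off, end) of buf as ASCII. */
    static String asciiSlice(byte[] buf, int off, int end) {
        if (off < 0 || end > buf.length || off > end) {
            throw new IndexOutOfBoundsException("bad subarray: off=" + off + ", end=" + end);
        }
        return new String(buf, off, end - off, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] buf = "<doc>text</doc>".getBytes(StandardCharsets.US_ASCII);
        // off = 5 is the first 't'; end = 9 is the index just past the last 't'.
        System.out.println(asciiSlice(buf, 5, 9)); // text
    }
}
```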
Use the getInitialEncoding method to get an
Encoding object to use to start parsing an entity.
The main operations provided by Encoding are
tokenizeProlog , tokenizeContent and
tokenizeCdataSection ;
these are used to divide up an XML entity into tokens.
tokenizeProlog is used for the prolog of an XML document
as well as for the external subset and parameter entities (except
when referenced in an EntityValue );
it can also be used for parsing the Misc * that follows
the document element.
tokenizeContent is used for the document element and for
parsed general entities that are referenced in content
except for CDATA sections.
tokenizeCdataSection is used for CDATA sections, following
the <![CDATA[ up to and including the ]]> .
tokenizeAttributeValue and tokenizeEntityValue
are used to further divide up tokens returned by tokenizeProlog
and tokenizeContent ; they are also used to divide up entities
referenced in attribute values or entity values.
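The tokenize methods share a calling pattern: scan the first token of a subarray, read where it ended, then call again from that point. The toy below imitates that pattern for plain single-byte character data only. Everything in it (ToyToken, the constant values, the tokenize method) is hypothetical and stands in for the real Token and TOK_ constants, whose values this page does not give. Unlike a real tokenizer, it treats a CR at the very end of the buffer as a complete newline rather than waiting for more input.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ToyContentTokenizer {
    static final int DATA_CHARS = 0;   // hypothetical value
    static final int DATA_NEWLINE = 1; // hypothetical value

    /** Hypothetical stand-in for a token object that records where the token ended. */
    static final class ToyToken {
        int tokenEnd;
    }

    /** Scans the first token in [off, end) and stores its end index in token. */
    static int tokenize(byte[] buf, int off, int end, ToyToken token) {
        if (off >= end) {
            throw new IllegalArgumentException("empty subarray");
        }
        byte b = buf[off];
        if (b == '\r') {
            // CR followed by LF counts as one newline token.
            token.tokenEnd = (off + 1 < end && buf[off + 1] == '\n') ? off + 2 : off + 1;
            return DATA_NEWLINE;
        }
        if (b == '\n') {
            token.tokenEnd = off + 1;
            return DATA_NEWLINE;
        }
        int i = off + 1;
        while (i < end && buf[i] != '\r' && buf[i] != '\n') {
            i++;
        }
        token.tokenEnd = i;
        return DATA_CHARS;
    }

    /** Drives tokenize in a loop, advancing to the end of each token. */
    static List<String> tokenizeAll(byte[] buf) {
        List<String> out = new ArrayList<>();
        ToyToken token = new ToyToken();
        int off = 0;
        while (off < buf.length) {
            int type = tokenize(buf, off, buf.length, token);
            out.add(type == DATA_NEWLINE
                    ? "NEWLINE"
                    : "CHARS:" + new String(buf, off, token.tokenEnd - off, StandardCharsets.US_ASCII));
            off = token.tokenEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] buf = "ab\r\ncd\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(tokenizeAll(buf)); // [CHARS:ab, NEWLINE, CHARS:cd, NEWLINE]
    }
}
```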
version: $Revision: 1.15 $ $Date: 1998/12/28 08:05:18 $ |
Field Summary | |
final static int | BT_AMP | final static int | BT_APOS | final static int | BT_AST | final static int | BT_COMMA | final static int | BT_CR | final static int | BT_EQUALS | final static int | BT_EXCL | final static int | BT_GT | final static int | BT_LEAD2 | final static int | BT_LEAD3 | final static int | BT_LEAD4 | final static int | BT_LF | final static int | BT_LPAR | final static int | BT_LSQB | final static int | BT_LT | final static int | BT_MALFORM | final static int | BT_MINUS | final static int | BT_NAME | final static int | BT_NMSTRT | final static int | BT_NONXML | final static int | BT_NUM | final static int | BT_OTHER | final static int | BT_PERCNT | final static int | BT_PLUS | final static int | BT_QUEST | final static int | BT_QUOT | final static int | BT_RPAR | final static int | BT_RSQB | final static int | BT_S | final static int | BT_SEMI | final static int | BT_SOL | final static int | BT_VERBAR | final public static int | TOK_ATTRIBUTE_VALUE_S Represents a white space character in an attribute value,
excluding white space characters that are part of line boundaries. | final public static int | TOK_CDATA_SECT_CLOSE Represents the end of a CDATA section ]]> . | final public static int | TOK_CDATA_SECT_OPEN Represents the start of a CDATA section <![CDATA[ . | final public static int | TOK_CHAR_PAIR_REF Represents a numeric character reference (decimal or hexadecimal),
when the referenced character is greater than 0xFFFF and so is
represented by a pair of chars. | final public static int | TOK_CHAR_REF Represents a numeric character reference (decimal or hexadecimal),
when the referenced character is less than or equal to 0xFFFF
and so is represented by a single char. | final public static int | TOK_CLOSE_BRACKET Represents ] in the prolog. | final public static int | TOK_CLOSE_PAREN Represents a ) in the prolog that is not
followed immediately by any of
* , + or ? . | final public static int | TOK_CLOSE_PAREN_ASTERISK Represents )* in the prolog. | final public static int | TOK_CLOSE_PAREN_PLUS Represents )+ in the prolog. | final public static int | TOK_CLOSE_PAREN_QUESTION Represents )? in the prolog. | final public static int | TOK_COMMA Represents , in the prolog. | final public static int | TOK_COMMENT Represents a comment <!-- comment --> . | final public static int | TOK_COND_SECT_CLOSE Represents ]]> in the prolog. | final public static int | TOK_COND_SECT_OPEN Represents <![ in the prolog. | final public static int | TOK_DATA_CHARS Represents one or more characters of data. | final public static int | TOK_DATA_NEWLINE Represents a newline (CR, LF or CR followed by LF) in data. | final public static int | TOK_DECL_CLOSE Represents > in the prolog. | final public static int | TOK_DECL_OPEN Represents <!NAME in the prolog. | final public static int | TOK_EMPTY_ELEMENT_NO_ATTS Represents an empty element tag <name/> ,
that doesn't have any attribute specifications. | final public static int | TOK_EMPTY_ELEMENT_WITH_ATTS Represents an empty element tag <name att="val"/> ,
that contains one or more attribute specifications. | final public static int | TOK_END_TAG Represents a complete end-tag </name> . | final public static int | TOK_ENTITY_REF Represents a general entity reference. | final public static int | TOK_LITERAL Represents a literal (EntityValue, AttValue, SystemLiteral or
PubidLiteral). | final public static int | TOK_MAGIC_ENTITY_REF Represents a general entity reference to one of the 5 predefined
entities amp , lt , gt ,
quot , apos . | final public static int | TOK_NAME Represents a name in the prolog. | final public static int | TOK_NAME_ASTERISK Represents a name followed immediately by * . | final public static int | TOK_NAME_PLUS Represents a name followed immediately by + . | final public static int | TOK_NAME_QUESTION Represents a name followed immediately by ? . | final public static int | TOK_NMTOKEN Represents a name token in the prolog that is not a name. | final public static int | TOK_OPEN_BRACKET Represents [ in the prolog. | final public static int | TOK_OPEN_PAREN Represents a ( in the prolog. | final public static int | TOK_OR Represents | in the prolog. | final public static int | TOK_PARAM_ENTITY_REF Represents a parameter entity reference in the prolog. | final public static int | TOK_PERCENT Represents a % in the prolog that does not start
a parameter entity reference. | final public static int | TOK_PI Represents a processing instruction. | final public static int | TOK_POUND_NAME Represents #NAME in the prolog. | final public static int | TOK_PROLOG_S Represents whitespace in the prolog. | final public static int | TOK_START_TAG_NO_ATTS Represents a complete start-tag <name> ,
that doesn't have any attribute specifications. | final public static int | TOK_START_TAG_WITH_ATTS Represents a complete start-tag <name att="val"> ,
that contains one or more attribute specifications. | final public static int | TOK_XML_DECL Represents an XML declaration or text declaration (a processing
instruction whose target is xml ). | final static byte[] | asciiTypeTable | static byte[][] | charTypeTable |
Constructor Summary | |
| Encoding(int minBPC) |
Method Summary | |
abstract int | byteToAscii(byte[] buf, int off) | abstract int | byteType(byte[] buf, int off) | int | byteType2(byte[] buf, int off) | int | byteType3(byte[] buf, int off) | int | byteType4(byte[] buf, int off) | abstract boolean | charMatches(byte[] buf, int off, char c) | void | check2(byte[] buf, int off) | void | check3(byte[] buf, int off) | void | check4(byte[] buf, int off) | abstract public int | convert(byte[] sourceBuf, int sourceStart, int sourceEnd, char[] targetBuf, int targetStart) Convert bytes to characters. | int | extendCdata(byte[] buf, int off, int end) | int | extendData(byte[] buf, int off, int end) | final public Encoding | getEncoding(String name) Returns an Encoding corresponding to
the specified IANA character set name. | abstract public int | getFixedBytesPerChar() Returns the number of bytes required to represent each char ,
or zero if different char s are represented by different
numbers of bytes. | final public static Encoding | getInitialEncoding(byte[] buf, int off, int end, Token token) Returns an encoding object to be used to start parsing an external entity. | final public static Encoding | getInternalEncoding() Returns an Encoding object for use with internal entities. | final public int | getMinBytesPerChar() Returns the minimum number of bytes required to represent a single
character in this encoding. | final public String | getPublicId(byte[] buf, int off, int end) Checks that a literal contained in the specified byte subarray
is a legal public identifier and returns a string with
the normalized content of the public id. | final public Encoding | getSingleByteEncoding(String map) Returns an Encoding for entities encoded with
a single-byte encoding (an encoding in which each byte represents
exactly one character). | Encoding | getUTF16Encoding() | final public boolean | matchesXMLString(byte[] buf, int off, int end, String str) Returns true if the specified byte subarray is equal to the string. | abstract public void | movePosition(byte[] buf, int off, int end, Position pos) Moves a position forward. | final public int | skipIgnoreSect(byte[] buf, int off, int end) Skips over an ignored conditional section. | final public int | skipS(byte[] buf, int off, int end) Skips over XML whitespace characters at the start of the specified
subarray. | final public int | tokenizeAttributeValue(byte[] buf, int off, int end, Token token) Scans the first token of a byte subarray that contains part of
a literal attribute value. | final public int | tokenizeCdataSection(byte[] buf, int off, int end, Token token) Scans the first token of a byte subarray that starts with the
content of a CDATA section. | final public int | tokenizeContent(byte[] buf, int off, int end, ContentToken token) Scans the first token of a byte subarray that contains content. | final public int | tokenizeEntityValue(byte[] buf, int off, int end, Token token) Scans the first token of a byte subarray that contains part of
a literal entity value. | final public int | tokenizeProlog(byte[] buf, int off, int end, Token token) Scans the first token of a byte subarray that contains part of a
prolog. |
BT_AMP | final static int BT_AMP(Code) | | |
BT_APOS | final static int BT_APOS(Code) | | |
BT_AST | final static int BT_AST(Code) | | |
BT_COMMA | final static int BT_COMMA(Code) | | |
BT_CR | final static int BT_CR(Code) | | |
BT_EQUALS | final static int BT_EQUALS(Code) | | |
BT_EXCL | final static int BT_EXCL(Code) | | |
BT_GT | final static int BT_GT(Code) | | |
BT_LEAD2 | final static int BT_LEAD2(Code) | | |
BT_LEAD3 | final static int BT_LEAD3(Code) | | |
BT_LEAD4 | final static int BT_LEAD4(Code) | | |
BT_LF | final static int BT_LF(Code) | | |
BT_LPAR | final static int BT_LPAR(Code) | | |
BT_LSQB | final static int BT_LSQB(Code) | | |
BT_LT | final static int BT_LT(Code) | | |
BT_MALFORM | final static int BT_MALFORM(Code) | | |
BT_MINUS | final static int BT_MINUS(Code) | | |
BT_NAME | final static int BT_NAME(Code) | | |
BT_NMSTRT | final static int BT_NMSTRT(Code) | | |
BT_NONXML | final static int BT_NONXML(Code) | | |
BT_NUM | final static int BT_NUM(Code) | | |
BT_OTHER | final static int BT_OTHER(Code) | | |
BT_PERCNT | final static int BT_PERCNT(Code) | | |
BT_PLUS | final static int BT_PLUS(Code) | | |
BT_QUEST | final static int BT_QUEST(Code) | | |
BT_QUOT | final static int BT_QUOT(Code) | | |
BT_RPAR | final static int BT_RPAR(Code) | | |
BT_RSQB | final static int BT_RSQB(Code) | | |
BT_S | final static int BT_S(Code) | | |
BT_SEMI | final static int BT_SEMI(Code) | | |
BT_SOL | final static int BT_SOL(Code) | | |
BT_VERBAR | final static int BT_VERBAR(Code) | | |
TOK_ATTRIBUTE_VALUE_S | final public static int TOK_ATTRIBUTE_VALUE_S(Code) | | Represents a white space character in an attribute value,
excluding white space characters that are part of line boundaries.
|
TOK_CDATA_SECT_CLOSE | final public static int TOK_CDATA_SECT_CLOSE(Code) | | Represents the end of a CDATA section ]]> .
|
TOK_CDATA_SECT_OPEN | final public static int TOK_CDATA_SECT_OPEN(Code) | | Represents the start of a CDATA section <![CDATA[ .
|
TOK_CHAR_PAIR_REF | final public static int TOK_CHAR_PAIR_REF(Code) | | Represents a numeric character reference (decimal or hexadecimal),
when the referenced character is greater than 0xFFFF and so is
represented by a pair of chars.
|
TOK_CHAR_REF | final public static int TOK_CHAR_REF(Code) | | Represents a numeric character reference (decimal or hexadecimal),
when the referenced character is less than or equal to 0xFFFF
and so is represented by a single char.
|
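The split between TOK_CHAR_REF and TOK_CHAR_PAIR_REF follows from Java's 16-bit char: a code point above 0xFFFF needs a surrogate pair. A minimal sketch using the standard Character API (the class and method names here are hypothetical, not part of this library):

```java
final class CharRefDemo {
    /** Returns the UTF-16 chars that represent the given code point. */
    static char[] charsFor(int codePoint) {
        return Character.toChars(codePoint);
    }

    /** True if the code point is above 0xFFFF and so needs two chars. */
    static boolean needsPair(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(charsFor(0x41).length);    // 1, as for &#x41;
        System.out.println(charsFor(0x1F600).length); // 2, as for &#x1F600;
    }
}
```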
TOK_CLOSE_BRACKET | final public static int TOK_CLOSE_BRACKET(Code) | | Represents ] in the prolog.
|
TOK_CLOSE_PAREN | final public static int TOK_CLOSE_PAREN(Code) | | Represents a ) in the prolog that is not
followed immediately by any of
* , + or ? .
|
TOK_CLOSE_PAREN_ASTERISK | final public static int TOK_CLOSE_PAREN_ASTERISK(Code) | | Represents )* in the prolog.
|
TOK_CLOSE_PAREN_PLUS | final public static int TOK_CLOSE_PAREN_PLUS(Code) | | Represents )+ in the prolog.
|
TOK_CLOSE_PAREN_QUESTION | final public static int TOK_CLOSE_PAREN_QUESTION(Code) | | Represents )? in the prolog.
|
TOK_COMMA | final public static int TOK_COMMA(Code) | | Represents , in the prolog.
|
TOK_COMMENT | final public static int TOK_COMMENT(Code) | | Represents a comment <!-- comment --> .
This can occur both in the prolog and in content.
|
TOK_COND_SECT_CLOSE | final public static int TOK_COND_SECT_CLOSE(Code) | | Represents ]]> in the prolog.
|
TOK_COND_SECT_OPEN | final public static int TOK_COND_SECT_OPEN(Code) | | Represents <![ in the prolog.
|
TOK_DATA_CHARS | final public static int TOK_DATA_CHARS(Code) | | Represents one or more characters of data.
|
TOK_DATA_NEWLINE | final public static int TOK_DATA_NEWLINE(Code) | | Represents a newline (CR, LF or CR followed by LF) in data.
|
TOK_DECL_CLOSE | final public static int TOK_DECL_CLOSE(Code) | | Represents > in the prolog.
|
TOK_DECL_OPEN | final public static int TOK_DECL_OPEN(Code) | | Represents <!NAME in the prolog.
|
TOK_EMPTY_ELEMENT_NO_ATTS | final public static int TOK_EMPTY_ELEMENT_NO_ATTS(Code) | | Represents an empty element tag <name/> ,
that doesn't have any attribute specifications.
|
TOK_EMPTY_ELEMENT_WITH_ATTS | final public static int TOK_EMPTY_ELEMENT_WITH_ATTS(Code) | | Represents an empty element tag <name att="val"/> ,
that contains one or more attribute specifications.
|
TOK_END_TAG | final public static int TOK_END_TAG(Code) | | Represents a complete end-tag </name> .
|
TOK_ENTITY_REF | final public static int TOK_ENTITY_REF(Code) | | Represents a general entity reference.
|
TOK_LITERAL | final public static int TOK_LITERAL(Code) | | Represents a literal (EntityValue, AttValue, SystemLiteral or
PubidLiteral).
|
TOK_MAGIC_ENTITY_REF | final public static int TOK_MAGIC_ENTITY_REF(Code) | | Represents a general entity reference to one of the 5 predefined
entities amp , lt , gt ,
quot , apos .
|
TOK_NAME | final public static int TOK_NAME(Code) | | Represents a name in the prolog.
|
TOK_NAME_ASTERISK | final public static int TOK_NAME_ASTERISK(Code) | | Represents a name followed immediately by * .
|
TOK_NAME_PLUS | final public static int TOK_NAME_PLUS(Code) | | Represents a name followed immediately by + .
|
TOK_NAME_QUESTION | final public static int TOK_NAME_QUESTION(Code) | | Represents a name followed immediately by ? .
|
TOK_NMTOKEN | final public static int TOK_NMTOKEN(Code) | | Represents a name token in the prolog that is not a name.
|
TOK_OPEN_BRACKET | final public static int TOK_OPEN_BRACKET(Code) | | Represents [ in the prolog.
|
TOK_OPEN_PAREN | final public static int TOK_OPEN_PAREN(Code) | | Represents a ( in the prolog.
|
TOK_OR | final public static int TOK_OR(Code) | | Represents | in the prolog.
|
TOK_PARAM_ENTITY_REF | final public static int TOK_PARAM_ENTITY_REF(Code) | | Represents a parameter entity reference in the prolog.
|
TOK_PERCENT | final public static int TOK_PERCENT(Code) | | Represents a % in the prolog that does not start
a parameter entity reference.
This can occur in an entity declaration.
|
TOK_PI | final public static int TOK_PI(Code) | | Represents a processing instruction.
|
TOK_POUND_NAME | final public static int TOK_POUND_NAME(Code) | | Represents #NAME in the prolog.
|
TOK_PROLOG_S | final public static int TOK_PROLOG_S(Code) | | Represents whitespace in the prolog.
The token contains one or more whitespace characters.
|
TOK_START_TAG_NO_ATTS | final public static int TOK_START_TAG_NO_ATTS(Code) | | Represents a complete start-tag <name> ,
that doesn't have any attribute specifications.
|
TOK_START_TAG_WITH_ATTS | final public static int TOK_START_TAG_WITH_ATTS(Code) | | Represents a complete start-tag <name att="val"> ,
that contains one or more attribute specifications.
|
TOK_XML_DECL | final public static int TOK_XML_DECL(Code) | | Represents an XML declaration or text declaration (a processing
instruction whose target is xml ).
|
asciiTypeTable | final static byte[] asciiTypeTable(Code) | | |
charTypeTable | static byte[][] charTypeTable(Code) | | |
Encoding | Encoding(int minBPC)(Code) | | |
byteToAscii | abstract int byteToAscii(byte[] buf, int off)(Code) | | |
byteType | abstract int byteType(byte[] buf, int off)(Code) | | |
byteType2 | int byteType2(byte[] buf, int off)(Code) | | |
byteType3 | int byteType3(byte[] buf, int off)(Code) | | |
byteType4 | int byteType4(byte[] buf, int off)(Code) | | |
charMatches | abstract boolean charMatches(byte[] buf, int off, char c)(Code) | | |
convert | abstract public int convert(byte[] sourceBuf, int sourceStart, int sourceEnd, char[] targetBuf, int targetStart)(Code) | | Convert bytes to characters.
The bytes on sourceBuf between sourceStart
and sourceEnd are converted to characters and stored
in targetBuf starting at targetStart .
(targetBuf.length - targetStart) * getMinBytesPerChar()
must be greater than or equal to
sourceEnd - sourceStart .
If getFixedBytesPerChar returns a value greater than 0,
then the return value will be equal to
(sourceEnd - sourceStart)/getFixedBytesPerChar() .
Returns: the number of characters stored into targetBuf See Also: Encoding.getFixedBytesPerChar |
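The sizing rule for targetBuf and the fixed-width return value can be checked with a little arithmetic. The sketch below is an illustration under stated assumptions: minTargetLength and convertUtf16Be are hypothetical helpers, and the toy converter handles only big-endian UTF-16 (two bytes per char); it is not this library's implementation of convert.

```java
final class ConvertSizing {
    /**
     * Smallest targetBuf length that satisfies
     * (targetLength - targetStart) * minBpc >= sourceEnd - sourceStart.
     */
    static int minTargetLength(int sourceStart, int sourceEnd, int targetStart, int minBpc) {
        int sourceLen = sourceEnd - sourceStart;
        int maxChars = (sourceLen + minBpc - 1) / minBpc; // ceiling division
        return targetStart + maxChars;
    }

    /** Toy fixed-width converter: two big-endian bytes per char; a trailing odd byte is ignored. */
    static int convertUtf16Be(byte[] src, int start, int end, char[] dst, int dstStart) {
        int n = 0;
        for (int i = start; i + 1 < end; i += 2) {
            dst[dstStart + n++] = (char) (((src[i] & 0xFF) << 8) | (src[i + 1] & 0xFF));
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] src = {0x00, 0x68, 0x00, 0x69}; // "hi" in big-endian UTF-16
        char[] dst = new char[minTargetLength(0, src.length, 0, 2)];
        int n = convertUtf16Be(src, 0, src.length, dst, 0);
        // With a fixed 2 bytes per char, n equals (sourceEnd - sourceStart) / 2.
        System.out.println(n + " " + new String(dst, 0, n)); // 2 hi
    }
}
```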
getEncoding | final public Encoding getEncoding(String name)(Code) | | Returns an Encoding corresponding to
the specified IANA character set name.
Returns this Encoding if the name is null.
Returns null if the specified encoding is not supported.
Note that there are two distinct Encoding objects
associated with the name UTF-16 , one for
each possible byte order; if this Encoding
is UTF-16 with little-endian byte ordering, then
getEncoding("UTF-16") will return this,
otherwise it will return an Encoding for
UTF-16 with big-endian byte ordering.
Parameters: name - a string specifying the IANA name of the encoding; this is case insensitive |
getFixedBytesPerChar | abstract public int getFixedBytesPerChar()(Code) | | Returns the number of bytes required to represent each char ,
or zero if different char s are represented by different
numbers of bytes. The value returned will be 0, 1, 2, or 4.
|
getInitialEncoding | final public static Encoding getInitialEncoding(byte[] buf, int off, int end, Token token)(Code) | | Returns an encoding object to be used to start parsing an external entity.
The encoding is chosen based on the initial 4 bytes of the entity.
Parameters: buf - the byte array containing the initial bytes of the entity
Parameters: off - the index in buf of the first byte of the entity
Parameters: end - the index in buf following the last available byte of the entity; end - off must be greater than or equal to 4 unless the entity has fewer than 4 bytes, in which case it must be equal to the length of the entity
Parameters: token - receives information about the presence of a byte order mark; if the entity starts with a byte order mark then token.getTokenEnd() will return off + 2 , otherwise it will return off
See Also: TextDecl See Also: XmlDecl See Also: Encoding.TOK_XML_DECL See Also: Encoding.getEncoding See Also: Encoding.getInternalEncoding |
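The page says only that the choice depends on the initial 4 bytes and that a byte order mark moves the token end to off + 2. The sketch below shows one way such sniffing can work, following the autodetection patterns in Appendix F of the XML 1.0 Recommendation. It is not taken from this library's source; the class, method names, and returned labels are hypothetical, and it falls back to "UTF-8" for anything it does not recognize.

```java
final class BomSniffer {
    /** Returns off + 2 if [off, end) starts with a UTF-16 byte order mark, otherwise off. */
    static int tokenEndAfterBom(byte[] buf, int off, int end) {
        if (end - off >= 2) {
            int b0 = buf[off] & 0xFF;
            int b1 = buf[off + 1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return off + 2;
            }
        }
        return off;
    }

    /** Guesses an encoding family label from the first bytes of an entity. */
    static String guessFamily(byte[] buf, int off, int end) {
        int n = end - off;
        if (n >= 2) {
            int b0 = buf[off] & 0xFF;
            int b1 = buf[off + 1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE"; // byte order mark
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE"; // byte order mark
        }
        if (n >= 4) {
            int b0 = buf[off] & 0xFF, b1 = buf[off + 1] & 0xFF;
            int b2 = buf[off + 2] & 0xFF, b3 = buf[off + 3] & 0xFF;
            // "<?" encoded in UTF-16 without a byte order mark.
            if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
            if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";
        }
        return "UTF-8"; // fallback assumed for this sketch
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x3C};
        System.out.println(guessFamily(withBom, 0, 4) + " " + tokenEndAfterBom(withBom, 0, 4)); // UTF-16BE 2
    }
}
```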
getInternalEncoding | final public static Encoding getInternalEncoding()(Code) | | Returns an Encoding object for use with internal entities.
This is a UTF-16 big endian encoding, except that newlines
are assumed to have been normalized into line feed,
so carriage return is treated like a space.
|
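If an internal entity's text is held as a Java String, one plausible way to produce bytes matching this description is to normalize line ends to line feed and then encode as big-endian UTF-16 without a byte order mark. This is a sketch under that assumption; toInternalBytes is a hypothetical helper, and whether callers of this library are expected to prepare bytes this way is not stated on this page.

```java
import java.nio.charset.StandardCharsets;

final class InternalBytesDemo {
    /** Normalizes CRLF and lone CR to LF, then encodes as UTF-16BE (no byte order mark). */
    static byte[] toInternalBytes(String s) {
        String normalized = s.replace("\r\n", "\n").replace('\r', '\n');
        return normalized.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] bytes = toInternalBytes("a\r\nb");
        System.out.println(bytes.length); // 6: three chars, two bytes each
    }
}
```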
getMinBytesPerChar | final public int getMinBytesPerChar()(Code) | | Returns the minimum number of bytes required to represent a single
character in this encoding. The value will be 1, 2 or 4.
|
getPublicId | final public String getPublicId(byte[] buf, int off, int end) throws InvalidTokenException(Code) | | Checks that a literal contained in the specified byte subarray
is a legal public identifier and returns a string with
the normalized content of the public id.
The subarray includes the opening and closing quotes.
exception: InvalidTokenException - if it is not a legal public identifier |
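The check and normalization can be illustrated on a String rather than a byte subarray. The sketch below follows the XML 1.0 rules for public identifiers (the PubidChar production, with runs of white space collapsed to one space and leading and trailing white space removed). It is a hypothetical stand-in, not this method's implementation, and it throws IllegalArgumentException where the real method throws InvalidTokenException.

```java
final class PublicIdDemo {
    // Punctuation allowed by the XML 1.0 PubidChar production.
    private static final String PUNCT = "-'()+,./:=?;!*#@$_%";

    static boolean isPubidChar(char c) {
        return c == ' ' || c == '\r' || c == '\n'
                || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                || PUNCT.indexOf(c) >= 0;
    }

    /** Takes a literal including its quotes and returns the normalized public id. */
    static String normalize(String literal) {
        if (literal.length() < 2) {
            throw new IllegalArgumentException("literal must include both quotes");
        }
        String body = literal.substring(1, literal.length() - 1);
        StringBuilder sb = new StringBuilder();
        boolean pendingSpace = false;
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (!isPubidChar(c)) {
                throw new IllegalArgumentException("illegal character in public id at index " + i);
            }
            if (c == ' ' || c == '\r' || c == '\n') {
                pendingSpace = sb.length() > 0; // drop leading white space
            } else {
                if (pendingSpace) {
                    sb.append(' ');             // collapse a run to one space
                }
                pendingSpace = false;
                sb.append(c);
            }
        }
        return sb.toString();                   // trailing white space is never flushed
    }

    public static void main(String[] args) {
        System.out.println(normalize("\"  -//W3C//DTD  XHTML\n1.0//EN \"")); // -//W3C//DTD XHTML 1.0//EN
    }
}
```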
getSingleByteEncoding | final public Encoding getSingleByteEncoding(String map)(Code) | | Returns an Encoding for entities encoded with
a single-byte encoding (an encoding in which each byte represents
exactly one character).
Parameters: map - a string specifying the character represented by each byte; the string must have a length of 256; map.charAt(b) specifies the character encoded by byte b ; bytes that do not represent any character should be mapped to \uFFFD |
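A map string of the required shape is easy to build in code. The sketch below constructs one for 7-bit ASCII, mapping the upper 128 byte values to \uFFFD; the class and helper names are hypothetical. Java bytes are signed, so the lookup masks with 0xFF before indexing.

```java
final class SingleByteMapDemo {
    /** Builds a 256-char map for 7-bit ASCII; bytes 0x80 to 0xFF map to U+FFFD. */
    static String asciiMap() {
        StringBuilder sb = new StringBuilder(256);
        for (int b = 0; b < 256; b++) {
            sb.append(b < 0x80 ? (char) b : '\uFFFD');
        }
        return sb.toString();
    }

    /** Looks up the character for a byte, treating the byte as unsigned. */
    static char decode(String map, byte b) {
        return map.charAt(b & 0xFF);
    }

    public static void main(String[] args) {
        String map = asciiMap();
        System.out.println(map.length());                      // 256
        System.out.println(decode(map, (byte) 0x41));          // A
        System.out.println(decode(map, (byte) 0xE9) == '\uFFFD'); // true
    }
}
```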
matchesXMLString | final public boolean matchesXMLString(byte[] buf, int off, int end, String str)(Code) | | Returns true if the specified byte subarray is equal to the string.
The string must contain only XML significant characters.
|
movePosition | abstract public void movePosition(byte[] buf, int off, int end, Position pos)(Code) | | Moves a position forward.
On entry, pos gives the position of the byte at index
off in buf .
On exit, pos will give the position of the byte at index
end , which must be greater than or equal to off .
The bytes between off and end must encode
one or more complete characters.
A carriage return followed by a line feed will be treated as a single
line delimiter provided that they are given to movePosition
together.
|
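The note about carriage return and line feed is easiest to see in a small tracker. The sketch below assumes a single-byte encoding and a hypothetical position type with a 1-based line and 0-based column; the real Position class is not documented on this page. Feeding CR and LF in one call counts one line break, while splitting them across two calls counts two.

```java
import java.nio.charset.StandardCharsets;

final class LineColumnTracker {
    int line = 1;   // hypothetical: 1-based line number
    int column = 0; // hypothetical: 0-based column number

    /** Advances the position over the bytes in [off, end). */
    void move(byte[] buf, int off, int end) {
        while (off < end) {
            byte b = buf[off++];
            if (b == '\n') {
                line++;
                column = 0;
            } else if (b == '\r') {
                line++;
                column = 0;
                // CR LF is one delimiter only if the LF is visible in this call.
                if (off < end && buf[off] == '\n') {
                    off++;
                }
            } else {
                column++;
            }
        }
    }

    public static void main(String[] args) {
        byte[] buf = "ab\r\ncd".getBytes(StandardCharsets.US_ASCII);
        LineColumnTracker together = new LineColumnTracker();
        together.move(buf, 0, buf.length);
        System.out.println(together.line + ":" + together.column); // 2:2

        LineColumnTracker split = new LineColumnTracker();
        split.move(buf, 0, 3);          // ends right after the CR
        split.move(buf, 3, buf.length); // starts at the LF
        System.out.println(split.line + ":" + split.column); // 3:2
    }
}
```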
skipIgnoreSect | final public int skipIgnoreSect(byte[] buf, int off, int end) throws PartialTokenException, InvalidTokenException(Code) | | Skips over an ignored conditional section.
The subarray starts following the <![ IGNORE [ .
Returns: the index of the character following the closing ]]>
exception: PartialTokenException - if the subarray does not contain the complete ignored conditional section
exception: InvalidTokenException - if the ignored conditional section contains illegal characters |
skipS | final public int skipS(byte[] buf, int off, int end)(Code) | | Skips over XML whitespace characters at the start of the specified
subarray.
Returns: the index of the first non-whitespace character, or end if the subarray is all whitespace
|
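For a single-byte, ASCII-compatible encoding the behaviour of skipS can be sketched in a few lines. XML whitespace is space, tab, carriage return and line feed. The class below is a hypothetical illustration, not the library's code, which must also handle multi-byte encodings.

```java
import java.nio.charset.StandardCharsets;

final class SkipWhitespaceDemo {
    /** Returns the index of the first non-whitespace byte in [off, end), or end if none. */
    static int skipS(byte[] buf, int off, int end) {
        while (off < end) {
            byte b = buf[off];
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                break;
            }
            off++;
        }
        return off;
    }

    public static void main(String[] args) {
        byte[] buf = " \t\nabc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(skipS(buf, 0, buf.length)); // 3
        byte[] blank = "   ".getBytes(StandardCharsets.US_ASCII);
        System.out.println(skipS(blank, 0, blank.length)); // 3, which equals end
    }
}
```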