| java.lang.Object au.id.jericho.lib.html.Segment au.id.jericho.lib.html.CharacterReference
All known Subclasses: au.id.jericho.lib.html.NumericCharacterReference, au.id.jericho.lib.html.CharacterEntityReference
CharacterReference | abstract public class CharacterReference extends Segment | | Represents an HTML Character Reference,
implemented by the subclasses
CharacterEntityReference and
NumericCharacterReference .
This class, together with its subclasses, contains static methods to perform most required operations
without having to instantiate an object.
Instances of this class are useful when the positions of character references in a source document are required,
or to replace the found character references with customised text.
CharacterReference instances are obtained using one of the following methods:
|
Method Summary | |
final static StringBuffer | appendDecimalCharacterReferenceString(StringBuffer sb, int codePoint) | static StringBuffer | appendEncode(StringBuffer sb, CharSequence unencodedText, boolean whiteSpaceFormatting) | final static StringBuffer | appendHexadecimalCharacterReferenceString(StringBuffer sb, int codePoint) | final static StringBuffer | appendUnicodeText(StringBuffer sb, int codePoint) | public static String | decode(CharSequence encodedText) Decodes the specified HTML encoded text into normal text.
All character entity references and numeric character references
are converted to their respective characters.
This is equivalent to decode(encodedText,false).
Unterminated character references are dealt with according to the rules for
text outside of attribute values in the current compatibility mode.
Although character entity reference names are case sensitive, and in some cases differ from other entity references only by their case,
some browsers also recognise them in a case-insensitive way.
For this reason, all decoding methods in this library recognise character entity reference names even if they are in the wrong case.
Parameters: encodedText - the text to decode. | public static String | decode(CharSequence encodedText, boolean insideAttributeValue) Decodes the specified HTML encoded text into normal text.
All character entity references and numeric character references
are converted to their respective characters.
Unterminated character references are dealt with according to the
value of the insideAttributeValue parameter and the current compatibility mode.
Although character entity reference names are case sensitive, and in some cases differ from other entity references only by their case,
some browsers also recognise them in a case-insensitive way.
For this reason, all decoding methods in this library recognise character entity reference names even if they are in the wrong case.
Parameters: encodedText - the text to decode. Parameters: insideAttributeValue - specifies whether the encoded text is inside an attribute value. | public static String | decodeCollapseWhiteSpace(CharSequence text) Decodes the specified text after collapsing its
white space. | static String | decodeCollapseWhiteSpace(CharSequence text, boolean convertNonBreakingSpaces) | public static String | encode(CharSequence unencodedText) Encodes the specified text, escaping special characters into character references.
Each character is encoded only if the
CharacterReference.requiresEncoding(char) method would return true for that character,
using its
CharacterEntityReference if available, or a decimal
NumericCharacterReference if its unicode
code point is greater than U+007F.
The only exception to this is an apostrophe (U+0027),
which, depending on the current setting of the static
Config.IsApostropheEncoded property,
is either left unencoded (default setting), or encoded as the numeric character reference "&#39;".
This method never encodes an apostrophe into its character entity reference
"&apos;" (CharacterEntityReference._apos), as this entity is not defined for use in HTML. | public static String | encode(char ch) Encodes the specified character into a character reference if required.
The encoding of the character follows the same rules as for each character in the
CharacterReference.encode(CharSequence unencodedText) method.
Parameters: ch - the character to encode. | public static String | encodeWithWhiteSpaceFormatting(CharSequence unencodedText) Encodes the specified text, preserving line breaks, tabs and spaces for rendering by converting them to markup.
This performs the same encoding as the
CharacterReference.encode(CharSequence) method, but also performs the following conversions:
- Line breaks, being Carriage Return (U+000D) or Line Feed (U+000A) characters, and Form Feed characters (U+000C)
are converted to "<br />". | static CharacterReference | findPreviousOrNext(Source source, int pos, boolean previous) | public char | getChar() Returns the character represented by this character reference. | abstract public String | getCharacterReferenceString() Returns the encoded form of this character reference. | public static String | getCharacterReferenceString(int codePoint) Returns the encoded form of the specified unicode code point.
This method returns the character entity reference
encoded form of the unicode code point
if one exists, otherwise it returns the decimal numeric character reference
encoded form.
The only exception to this is an apostrophe (U+0027),
which is encoded as the numeric character reference "&#39;" instead of its character entity reference
"&apos;".
- Examples:
CharacterReference.getCharacterReferenceString(62) returns "&gt;"
CharacterReference.getCharacterReferenceString('>') returns "&gt;"
CharacterReference.getCharacterReferenceString('☺') returns "&#9786;"
Parameters: codePoint - the unicode code point to encode. | public int | getCodePoint() Returns the unicode code point represented by this character reference. | public static int | getCodePointFromCharacterReferenceString(CharSequence characterReferenceText) Parses a single encoded character reference text into a unicode code point.
The character reference must be at the start of the given text, but may contain other characters at the end.
If the text does not represent a valid character reference, this method returns
CharacterReference.INVALID_CODE_POINT .
This is equivalent to parse(characterReferenceText).getCodePoint(),
except that it returns
CharacterReference.INVALID_CODE_POINT if an invalid character reference is specified instead of throwing a
NullPointerException.
- Example:
CharacterReference.getCodePointFromCharacterReferenceString("&gt;") returns 62
Parameters: characterReferenceText - the text containing a single encoded character reference. | public String | getDecimalCharacterReferenceString() Returns the decimal encoded form of this character reference. | public static String | getDecimalCharacterReferenceString(int codePoint) Returns the decimal encoded form of the specified unicode code point.
- Example:
CharacterReference.getDecimalCharacterReferenceString('>') returns "&#62;"
Parameters: codePoint - the unicode code point to encode. | public static Writer | getEncodingFilterWriter(Writer writer) Returns a filter Writer that encodes
all text before passing it through to the specified Writer. | public String | getHexadecimalCharacterReferenceString() Returns the hexadecimal encoded form of this character reference. | public static String | getHexadecimalCharacterReferenceString(int codePoint) Returns the hexadecimal encoded form of the specified unicode code point.
- Example:
CharacterReference.getHexadecimalCharacterReferenceString('>') returns "&#x3e;"
Parameters: codePoint - the unicode code point to encode. | public String | getUnicodeText() Returns the unicode code point of this character reference in U+ notation. | public static String | getUnicodeText(int codePoint) Returns the specified unicode code point in U+ notation.
- Example:
CharacterReference.getUnicodeText('>') returns "U+003E"
Parameters: codePoint - the unicode code point. | public boolean | isTerminated() Indicates whether this character reference is terminated by a semicolon (; ). | public static CharacterReference | parse(CharSequence characterReferenceText) Parses a single encoded character reference text into a CharacterReference object.
The character reference must be at the start of the given text, but may contain other characters at the end.
The getEnd() method can be used on the resulting object to determine at which character position the character reference ended.
If the text does not represent a valid character reference, this method returns null.
Unterminated character references are always accepted, regardless of the settings in the current compatibility mode.
To decode all character references in a given text, use the
CharacterReference.decode(CharSequence) method instead.
- Example:
CharacterReference.parse("&gt;").getChar() returns '>'
Parameters: characterReferenceText - the text containing a single encoded character reference. | public static String | reencode(CharSequence encodedText) Re-encodes the specified text, equivalent to decoding and then encoding again.
This process ensures that the specified encoded text does not contain any remaining unencoded characters.
IMPLEMENTATION NOTE: At present this method simply calls the decode method
followed by the encode method, but a more efficient implementation
may be used in future.
Parameters: encodedText - the text to re-encode. | final public static boolean | requiresEncoding(char ch) Indicates whether the specified character would need to be encoded in HTML text.
This is the case if a character entity reference
exists for the character, or the unicode code point is greater than U+007F.
The only exception to this is an apostrophe (U+0027),
for which this method only returns true if the static
Config.IsApostropheEncoded property
is currently set to true.
Parameters: ch - the character to test. |
INVALID_CODE_POINT | final public static int INVALID_CODE_POINT | | Represents an invalid unicode code point.
This can be the result of parsing a numeric character reference outside of the valid unicode range of 0x000000-0x10FFFF, or any other invalid character reference.
|
MAX_CODE_POINT | final static int MAX_CODE_POINT | | The maximum code point allowed by unicode, 0x10FFFF (decimal 1114111).
This can be replaced by Character.MAX_CODE_POINT in Java 1.5.
|
MAX_ENTITY_REFERENCE_LENGTH | static int MAX_ENTITY_REFERENCE_LENGTH | | |
CharacterReference | CharacterReference(Source source, int begin, int end, int codePoint) | | |
decode | public static String decode(CharSequence encodedText) | | Decodes the specified HTML encoded text into normal text.
All character entity references and numeric character references
are converted to their respective characters.
This is equivalent to decode(encodedText,false).
Unterminated character references are dealt with according to the rules for
text outside of attribute values in the current compatibility mode.
Although character entity reference names are case sensitive, and in some cases differ from other entity references only by their case,
some browsers also recognise them in a case-insensitive way.
For this reason, all decoding methods in this library recognise character entity reference names even if they are in the wrong case.
Parameters: encodedText - the text to decode. Returns: the decoded string. See Also: CharacterReference.encode(CharSequence) |
decode | public static String decode(CharSequence encodedText, boolean insideAttributeValue) | | Decodes the specified HTML encoded text into normal text.
All character entity references and numeric character references
are converted to their respective characters.
Unterminated character references are dealt with according to the
value of the insideAttributeValue parameter and the current compatibility mode.
Although character entity reference names are case sensitive, and in some cases differ from other entity references only by their case,
some browsers also recognise them in a case-insensitive way.
For this reason, all decoding methods in this library recognise character entity reference names even if they are in the wrong case.
Parameters: encodedText - the text to decode. Parameters: insideAttributeValue - specifies whether the encoded text is inside an attribute value. Returns: the decoded string. See Also: CharacterReference.decode(CharSequence) See Also: CharacterReference.encode(CharSequence) |
decodeCollapseWhiteSpace | public static String decodeCollapseWhiteSpace(CharSequence text) | | Decodes the specified text after collapsing its
white space.
All leading and trailing white space is omitted, and any sections of internal white space are replaced by a single space.
The result is how the text would normally be rendered by a
user agent,
assuming it does not contain any tags.
Unterminated character references are dealt with according to the rules for
text outside of attribute values in the current compatibility mode.
See the discussion of the insideAttributeValue parameter of the
decode(CharSequence, boolean insideAttributeValue) method for a more detailed explanation of this topic.
Parameters: text - the source text. Returns: the decoded text with collapsed white space. See Also: FormControl.getPredefinedValues |
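The collapsing rule described above (leading and trailing white space dropped, internal runs replaced by a single space) can be sketched in plain Java. This is only an illustration, not the library's implementation: the class and method names are hypothetical, the set of white space characters is an assumption, and the decoding step is left out.

```java
final class CollapseWhiteSpaceSketch {
    // Assumption: white space means space, tab, LF, FF and CR.
    static boolean isWhiteSpace(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\n' || ch == '\f' || ch == '\r';
    }

    // Hypothetical helper: collapses white space only, without decoding character references.
    static String collapse(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isWhiteSpace(ch)) {
                // Remember the gap only if something has already been written,
                // which drops leading white space.
                pendingSpace = sb.length() > 0;
            } else {
                if (pendingSpace) sb.append(' ');
                pendingSpace = false;
                sb.append(ch);
            }
        }
        // A pending space at the end is never flushed, which drops trailing white space.
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapse("  Hello \r\n\t world  ") + "]"); // prints "[Hello world]"
    }
}
```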
decodeCollapseWhiteSpace | static String decodeCollapseWhiteSpace(CharSequence text, boolean convertNonBreakingSpaces) | | |
encode | public static String encode(char ch) | | Encodes the specified character into a character reference if required.
The encoding of the character follows the same rules as for each character in the
CharacterReference.encode(CharSequence unencodedText) method.
Parameters: ch - the character to encode. Returns: a character reference if appropriate, otherwise a string containing the original character. |
encodeWithWhiteSpaceFormatting | public static String encodeWithWhiteSpaceFormatting(CharSequence unencodedText) | | Encodes the specified text, preserving line breaks, tabs and spaces for rendering by converting them to markup.
This performs the same encoding as the
CharacterReference.encode(CharSequence) method, but also performs the following conversions:
- Line breaks, being Carriage Return (U+000D) or Line Feed (U+000A) characters, and Form Feed characters (U+000C)
are converted to "<br />". CR/LF pairs are treated as a single line break.
- Multiple consecutive spaces are converted so that every second space is converted to "&nbsp;"
while ensuring the last is always a normal space.
- Tab characters (U+0009) are converted as if they were four consecutive spaces.
The conversion of multiple consecutive spaces to alternating space/non-breaking-space allows the correct number of
spaces to be rendered, but also allows the line to wrap in the middle of it.
Note that zero-width spaces (U+200B) are converted to the numeric character reference
"&#8203;" through the normal encoding process, but IE6 does not render them properly
either encoded or unencoded.
There is no method provided to reverse this encoding.
Parameters: unencodedText - the text to encode. Returns: the encoded string with whitespace formatting converted to markup. See Also: CharacterReference.encode(CharSequence) |
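The three conversions listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the class and method names are hypothetical, the normal character reference encoding step is omitted, and the exact choice of which spaces in a run become "&nbsp;" is assumed from the rule that the last one stays a normal space.

```java
final class WhiteSpaceFormattingSketch {
    // Hypothetical helper: applies only the white space conversions, not the entity encoding.
    static String format(CharSequence text) {
        StringBuilder sb = new StringBuilder();
        int len = text.length();
        int i = 0;
        while (i < len) {
            char ch = text.charAt(i);
            if (ch == '\r' || ch == '\n' || ch == '\f') {
                // A CR immediately followed by LF counts as one line break.
                if (ch == '\r' && i + 1 < len && text.charAt(i + 1) == '\n') i++;
                sb.append("<br />");
                i++;
            } else if (ch == ' ' || ch == '\t') {
                // Measure the whole run, counting each tab as four spaces.
                int count = 0;
                while (i < len && (text.charAt(i) == ' ' || text.charAt(i) == '\t')) {
                    count += text.charAt(i) == '\t' ? 4 : 1;
                    i++;
                }
                // Alternate so that the final space in the run is a normal space.
                for (int remaining = count; remaining > 0; remaining--) {
                    sb.append(remaining % 2 == 0 ? "&nbsp;" : " ");
                }
            } else {
                sb.append(ch);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("a  b"));    // prints "a&nbsp; b"
        System.out.println(format("a\r\nb"));  // prints "a<br />b"
    }
}
```

A single space is left unchanged, so ordinary prose still wraps normally; only runs of two or more spaces (or any tab) produce non-breaking spaces.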
getChar | public char getChar() | | Returns the character represented by this character reference.
If this character reference represents a unicode
supplementary code point,
any bits outside of the least significant 16 bits of the code point are truncated, yielding an incorrect result.
Returns: the character represented by this character reference. |
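The truncation described above is the ordinary behaviour of narrowing an int to a char in Java. The sketch below uses only standard library calls; the class and method names are hypothetical and are not part of the library.

```java
final class SupplementaryTruncationSketch {
    // Narrowing keeps only the least significant 16 bits of the code point.
    static char truncate(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // a supplementary code point, above U+FFFF
        System.out.println(Integer.toHexString(truncate(codePoint))); // prints "f600"

        // The lossless representation is a surrogate pair of two chars.
        char[] pair = Character.toChars(codePoint);
        System.out.println(pair.length); // prints "2"
        System.out.println(new String(pair).codePointAt(0) == codePoint); // prints "true"
    }
}
```

When supplementary characters may occur, working with the int code point (as returned by getCodePoint()) avoids the loss.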
getCharacterReferenceString | public static String getCharacterReferenceString(int codePoint) | | Returns the encoded form of the specified unicode code point.
This method returns the character entity reference
encoded form of the unicode code point
if one exists, otherwise it returns the decimal numeric character reference
encoded form.
The only exception to this is an apostrophe (U+0027),
which is encoded as the numeric character reference "&#39;" instead of its character entity reference
"&apos;".
- Examples:
CharacterReference.getCharacterReferenceString(62) returns "&gt;"
CharacterReference.getCharacterReferenceString('>') returns "&gt;"
CharacterReference.getCharacterReferenceString('☺') returns "&#9786;"
Parameters: codePoint - the unicode code point to encode. Returns: the encoded form of the specified unicode code point. See Also: CharacterReference.getHexadecimalCharacterReferenceString(int codePoint) |
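The "entity name if one exists, otherwise decimal" rule can be sketched as below. This is an illustration only: the class and method names are hypothetical, and the entity lookup covers just four names, whereas the real library knows the full HTML entity set.

```java
final class ReferenceStringSketch {
    // Hypothetical, deliberately tiny lookup; returns null when no entity name is known.
    static String entityName(int codePoint) {
        switch (codePoint) {
            case '"': return "quot";
            case '&': return "amp";
            case '<': return "lt";
            case '>': return "gt";
            default:  return null; // note: no entry for the apostrophe (U+0027)
        }
    }

    static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    static String hexadecimal(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint) + ";";
    }

    // Entity form if a name is known, otherwise the decimal numeric form.
    static String referenceString(int codePoint) {
        String name = entityName(codePoint);
        return name != null ? "&" + name + ";" : decimal(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(referenceString(62));     // prints "&gt;"
        System.out.println(referenceString(0x263A)); // prints "&#9786;"
        System.out.println(referenceString('\''));   // prints "&#39;"
        System.out.println(hexadecimal('>'));        // prints "&#x3e;"
    }
}
```

Because the apostrophe has no entry in the lookup, it falls through to the decimal form, which mirrors the exception described above.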
getCodePoint | public int getCodePoint() | | Returns the unicode code point represented by this character reference.
Returns: the unicode code point represented by this character reference. |
getCodePointFromCharacterReferenceString | public static int getCodePointFromCharacterReferenceString(CharSequence characterReferenceText) | | Parses a single encoded character reference text into a unicode code point.
The character reference must be at the start of the given text, but may contain other characters at the end.
If the text does not represent a valid character reference, this method returns
CharacterReference.INVALID_CODE_POINT.
This is equivalent to parse(characterReferenceText).getCodePoint(),
except that it returns
CharacterReference.INVALID_CODE_POINT if an invalid character reference is specified instead of throwing a
NullPointerException.
- Example:
CharacterReference.getCodePointFromCharacterReferenceString("&gt;") returns 62
Parameters: characterReferenceText - the text containing a single encoded character reference. Returns: the unicode code point representing the specified text, or CharacterReference.INVALID_CODE_POINT if the text does not represent a valid character reference. |
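The contract above (reference at the start of the text, trailing characters allowed, a sentinel value instead of an exception) can be illustrated with a simplified parser. This sketch handles numeric references only, so unlike the real method it does not recognise named references such as "&gt;". The class name, method name, and the sentinel value of -1 are all assumptions made for illustration.

```java
final class CodePointParseSketch {
    // Hypothetical stand-in for INVALID_CODE_POINT; the actual value is not stated here.
    static final int INVALID = -1;

    // Parses a decimal or hexadecimal numeric character reference at the start of the text.
    static int parseNumeric(CharSequence text) {
        int len = text.length();
        if (len < 3 || text.charAt(0) != '&' || text.charAt(1) != '#') return INVALID;
        int i = 2;
        int radix = 10;
        if (text.charAt(i) == 'x' || text.charAt(i) == 'X') {
            radix = 16;
            i++;
        }
        int start = i;
        long value = 0;
        while (i < len) {
            int digit = Character.digit(text.charAt(i), radix);
            if (digit < 0) break; // first non-digit ends the reference; a semicolon is optional
            value = value * radix + digit;
            if (value > Character.MAX_CODE_POINT) return INVALID; // outside 0x10FFFF
            i++;
        }
        if (i == start) return INVALID; // no digits at all
        return (int) value;
    }

    public static void main(String[] args) {
        System.out.println(parseNumeric("&#62;"));          // prints "62"
        System.out.println(parseNumeric("&#x3e; trailing")); // prints "62"
        System.out.println(parseNumeric("&#1114112;"));     // prints "-1"
    }
}
```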
getEncodingFilterWriter | public static Writer getEncodingFilterWriter(Writer writer) | | Returns a filter Writer that encodes
all text before passing it through to the specified Writer.
Parameters: writer - the destination for the encoded text. Returns: a filter Writer that encodes all text before passing it through to the specified Writer. See Also: CharacterReference.encode(CharSequence unencodedText) |
getUnicodeText | public static String getUnicodeText(int codePoint) | | Returns the specified unicode code point in U+ notation.
- Example:
CharacterReference.getUnicodeText('>') returns "U+003E"
Parameters: codePoint - the unicode code point. Returns: the specified unicode code point in U+ notation. |
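U+ notation is the code point written in upper-case hexadecimal, padded to at least four digits. A minimal sketch with the standard library follows; the class and method names are hypothetical, and how the library formats code points above U+FFFF is not stated here, so that part is an assumption.

```java
final class UnicodeTextSketch {
    // Upper-case hexadecimal, zero-padded to a minimum of four digits.
    static String unicodeText(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(unicodeText('>')); // prints "U+003E"
    }
}
```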
isTerminated | public boolean isTerminated() | | Indicates whether this character reference is terminated by a semicolon (;).
Conversely, this library defines an unterminated character reference as one which does
not end with a semicolon.
The SGML specification allows unterminated character references in some circumstances, and because the
HTML 4.01 specification states simply that
"authors may use SGML character references",
it follows that they are also valid in HTML documents, although their use is strongly discouraged.
Unterminated character references are not allowed in XHTML documents.
Returns: true if this character reference is terminated by a semicolon, otherwise false. See Also: CharacterReference.decode(CharSequence encodedText,boolean insideAttributeValue) |
parse | public static CharacterReference parse(CharSequence characterReferenceText) | | Parses a single encoded character reference text into a CharacterReference object.
The character reference must be at the start of the given text, but may contain other characters at the end.
The getEnd() method can be used on the resulting object to determine at which character position the character reference ended.
If the text does not represent a valid character reference, this method returns null.
Unterminated character references are always accepted, regardless of the settings in the current compatibility mode.
To decode all character references in a given text, use the
CharacterReference.decode(CharSequence) method instead.
- Example:
CharacterReference.parse("&gt;").getChar() returns '>'
Parameters: characterReferenceText - the text containing a single encoded character reference. Returns: a CharacterReference object representing the specified text, or null if the text does not represent a valid character reference. See Also: CharacterReference.decode(CharSequence) |
reencode | public static String reencode(CharSequence encodedText) | | Re-encodes the specified text, equivalent to decoding and then encoding again.
This process ensures that the specified encoded text does not contain any remaining unencoded characters.
IMPLEMENTATION NOTE: At present this method simply calls the decode method
followed by the encode method, but a more efficient implementation
may be used in future.
Parameters: encodedText - the text to re-encode. Returns: the re-encoded string. |
requiresEncoding | final public static boolean requiresEncoding(char ch) | | Indicates whether the specified character would need to be encoded in HTML text.
This is the case if a character entity reference
exists for the character, or the unicode code point is greater than U+007F.
The only exception to this is an apostrophe (U+0027),
for which this method only returns true if the static
Config.IsApostropheEncoded property
is currently set to true.
Parameters: ch - the character to test. Returns: true if the specified character would need to be encoded in HTML text, otherwise false. |
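The rule above can be sketched as a small predicate. This is not the library's implementation: the class name and the static flag are hypothetical stand-ins (the flag plays the role of Config.IsApostropheEncoded), and the set of characters at or below U+007F that have an entity is assumed to be the double quote, ampersand, less-than and greater-than signs.

```java
final class RequiresEncodingSketch {
    // Hypothetical stand-in for the Config.IsApostropheEncoded property (unencoded by default).
    static boolean apostropheEncoded = false;

    static boolean requiresEncoding(char ch) {
        // The apostrophe is the one exception and depends solely on the flag.
        if (ch == '\'') return apostropheEncoded;
        // Everything above U+007F is encoded.
        if (ch > 0x7F) return true;
        // Assumption: the only characters at or below U+007F with a character entity reference.
        return ch == '"' || ch == '&' || ch == '<' || ch == '>';
    }

    public static void main(String[] args) {
        System.out.println(requiresEncoding('<'));      // prints "true"
        System.out.println(requiresEncoding('a'));      // prints "false"
        System.out.println(requiresEncoding('\u00e9')); // prints "true"
        System.out.println(requiresEncoding('\''));     // prints "false"
    }
}
```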
|
|