| java.lang.Object au.id.jericho.lib.html.Segment
All known Subclasses: au.id.jericho.lib.html.CharacterReference, au.id.jericho.lib.html.Element, au.id.jericho.lib.html.FormControl, au.id.jericho.lib.html.nodoc.SequentialListSegment, au.id.jericho.lib.html.Attribute, au.id.jericho.lib.html.Source, au.id.jericho.lib.html.Tag
Segment | public class Segment implements Comparable, CharSequence | | Represents a segment of a
Source document.
Many of the tag search methods are defined in this class.
The span of a segment is defined by the combination of its begin and end character positions.
|
Constructor Summary | |
public | Segment(Source source, int begin, int end) Constructs a new Segment within the specified source
document with the specified begin and end character positions. | | Segment(int length) | | Segment() |
Method Summary | |
final static StringBuffer | appendCollapseWhiteSpace(StringBuffer sb, CharSequence text) Collapses the white space
in the specified text. | final public char | charAt(int index) Returns the character at the specified index.
This is logically equivalent to toString().charAt(index)
for valid argument values 0 <= index < length() .
However because this implementation works directly on the underlying document source string,
it should not be assumed that an IndexOutOfBoundsException is thrown
for an invalid argument value.
Parameters: index - the index of the character. | public int | compareTo(Object o) Compares this Segment object to another object. | final public boolean | encloses(Segment segment) Indicates whether this Segment encloses the specified Segment .
This is the case if getBegin() <= segment.getBegin() && getEnd() >= segment.getEnd().
Parameters: segment - the segment to be tested for being enclosed by this segment. | final public boolean | encloses(int pos) Indicates whether this segment encloses the specified character position in the source document.
This is the case if getBegin() <= pos < getEnd().
Parameters: pos - the position in the Source document. | final public boolean | equals(Object object) Compares the specified object with this Segment for equality.
Returns true if and only if the specified object is also a Segment ,
and both segments have the same
Source , and the same begin and end positions.
Parameters: object - the object to be compared for equality with this Segment . | public String | extractText() Extracts the textual content from the HTML markup of this segment. | public String | extractText(boolean includeAttributes) Extracts the textual content from the HTML markup of this segment.
This method has been deprecated as of version 2.4 and replaced with the
Segment.getTextExtractor() method.
Parameters: includeAttributes - specifies whether the values of title, alt, label, and summary attributes are included in the output. | public List | findAllCharacterReferences() Returns a list of all
CharacterReference objects that are enclosed
by this segment. | public List | findAllElements() Returns a list of all
Element objects that are enclosed
by this segment. | public List | findAllElements(String name) Returns a list of all
Element objects with the specified name that are enclosed
by this segment.
The elements returned correspond exactly with the start tags returned in the
Segment.findAllStartTags(String name) method.
Specifying a null argument to the name parameter is equivalent to
Segment.findAllElements().
This method also returns elements consisting of unregistered
tags if the specified name is not a valid XML tag name.
Parameters: name - the name of the elements to find. | public List | findAllElements(StartTagType startTagType) Returns a list of all
Element objects with start tags of the specified type
that are enclosed
by this segment.
The elements returned correspond exactly with the start tags returned in the
Segment.findAllTags(TagType) method.
Parameters: startTagType - the type of start tags to find, must not be null. | public List | findAllElements(String attributeName, String value, boolean valueCaseSensitive) Returns a list of all
Element objects with the specified attribute name/value pair
that are enclosed
by this segment.
The elements returned correspond exactly with the start tags returned in the
Segment.findAllStartTags(String attributeName,String value,boolean valueCaseSensitive) method.
Parameters: attributeName - the attribute name (case insensitive) to search for, must not be null . Parameters: value - the value of the specified attribute to search for, must not be null . Parameters: valueCaseSensitive - specifies whether the attribute value matching is case sensitive. | public List | findAllStartTags() Returns a list of all
StartTag objects that are enclosed
by this segment. | public List | findAllStartTags(String name) Returns a list of all
StartTag objects with the specified name that are enclosed
by this segment.
See the
Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the name parameter is equivalent to
Segment.findAllStartTags() .
This method also returns unregistered
tags if the specified name is not a valid XML tag name.
Parameters: name - the name of the start tags to find. | public List | findAllStartTags(String attributeName, String value, boolean valueCaseSensitive) Returns a list of all
StartTag objects with the specified attribute name/value pair
that are enclosed
by this segment.
See the
Tag class documentation for more details about the behaviour of this method.
Parameters: attributeName - the attribute name (case insensitive) to search for, must not be null . Parameters: value - the value of the specified attribute to search for, must not be null . Parameters: valueCaseSensitive - specifies whether the attribute value matching is case sensitive. | public List | findAllTags() Returns a list of all
Tag objects that are enclosed
by this segment. | public List | findAllTags(TagType tagType) Returns a list of all
Tag objects of the specified type
that are enclosed
by this segment.
See the
Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the tagType parameter is equivalent to
Segment.findAllTags() .
Parameters: tagType - the type of tags to find. | public List | findFormControls() Returns a list of the
FormControl objects that are enclosed
by this segment. | public FormFields | findFormFields() Returns the
FormFields object representing all form fields that are enclosed
by this segment. | final public int | getBegin() Returns the character position in the
Source document at which this segment begins. | public List | getChildElements() Returns a list of the immediate children of this segment in the document element hierarchy. | public String | getDebugInfo() Returns a string representation of this object useful for debugging purposes. | final public int | getEnd() Returns the character position in the
Source document immediately after the end of this segment. | public Renderer | getRenderer() Performs a simple rendering of the HTML markup in this segment into text. | public TextExtractor | getTextExtractor() Extracts the textual content from the HTML markup of this segment. | public int | hashCode() Returns a hash code value for the segment. | public void | ignoreWhenParsing() Causes this segment to be ignored when parsing.
Ignored segments are treated as blank spaces by the parsing mechanism, but are included as normal text in all other functions.
This method was originally the only means of preventing server tags
located inside normal
tags from interfering with the parsing of the tags.
The most common scenario is where the attributes
of a normal tag use server tags to dynamically set the values of the attributes.
As of version 2.4 it is no longer necessary to use this method to ignore server tags
inside normal tags,
as the attributes parser now automatically ignores common server tags.
As of version 2.5 it is also unnecessary to use this method to ignore the contents of
SCRIPT elements (HTMLElementName.SCRIPT),
as the parser automatically ignores this content when performing a
full sequential parse.
This leaves only a few scenarios where calling this method still provides a significant benefit.
One such case is where XML-style server tags are used inside normal
tags.
Here is an example using an XML-style JSP tag:
<a href="<i18n:resource path="/Portal"/>?BACK=TRUE">back</a>
The first double-quote of "/Portal" will be interpreted as the end quote for the href attribute,
as there is no way for the parser to recognise the i18n:resource element as a server tag.
Such use of XML-style server tags inside normal
tags is generally seen as bad practice,
but it is nevertheless valid JSP. | final public boolean | isWhiteSpace() Indicates whether this segment consists entirely of
white space. | final public static boolean | isWhiteSpace(char ch) Indicates whether the specified character is white space.
The HTML 4.01 specification section 9.1
specifies the following white space characters:
- space (U+0020)
- tab (U+0009)
- form feed (U+000C)
- line feed (U+000A)
- carriage return (U+000D)
- zero-width space (U+200B)
Despite the explicit inclusion of the zero-width space in the HTML specification, Microsoft IE6 does not
recognise it as white space and renders it as an unprintable character (empty square).
Even zero-width spaces included using the numeric character reference &#x200B; are rendered this way.
Parameters: ch - the character to test. | final public int | length() Returns the length of the segment. | public Attributes | parseAttributes() Parses any
Attributes within this segment. | final public CharSequence | subSequence(int beginIndex, int endIndex) Returns a new character sequence that is a subsequence of this sequence.
This is logically equivalent to toString().subSequence(beginIndex,endIndex)
for valid values of beginIndex and endIndex .
However because this implementation works directly on the underlying document source string,
it should not be assumed that an IndexOutOfBoundsException is thrown
for invalid argument values as described in the String.subSequence(int,int) method.
Parameters: beginIndex - the begin index, inclusive. Parameters: endIndex - the end index, exclusive. | public String | toString() Returns the source text of this segment as a String . |
Segment | public Segment(Source source, int begin, int end) | | Constructs a new Segment within the specified source
document with the specified begin and end character positions.
Parameters: source - the Source document, must not be null . Parameters: begin - the character position in the source where this segment begins. Parameters: end - the character position in the source where this segment ends. |
Segment | Segment(int length) | | |
appendCollapseWhiteSpace | final static StringBuffer appendCollapseWhiteSpace(StringBuffer sb, CharSequence text) | | Collapses the white space
in the specified text.
All leading and trailing white space is omitted, and any sections of internal white space are replaced by a single space.
|
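The collapsing rule described above can be illustrated with a small standalone sketch. It does not use the library and is not its implementation; the class and method names (`CollapseWhiteSpaceSketch`, `appendCollapsed`) are hypothetical, and the white space set is assumed to be the one listed under isWhiteSpace(char).

```java
final class CollapseWhiteSpaceSketch {
    // Assumed white space set, taken from the isWhiteSpace(char) entry of this page.
    static boolean isWhiteSpace(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\f' || ch == '\n' || ch == '\r' || ch == '\u200B';
    }

    // Appends text to sb, dropping leading and trailing white space and
    // replacing each internal run of white space with a single space.
    static StringBuffer appendCollapsed(StringBuffer sb, CharSequence text) {
        boolean pendingSpace = false; // a white space run was seen and not yet emitted
        boolean wroteAny = false;     // this call has appended at least one character
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isWhiteSpace(ch)) {
                pendingSpace = true;
                continue;
            }
            if (pendingSpace && wroteAny) sb.append(' ');
            sb.append(ch);
            wroteAny = true;
            pendingSpace = false;
        }
        return sb; // a trailing run is never flushed, so trailing white space is omitted
    }

    public static void main(String[] args) {
        System.out.println(appendCollapsed(new StringBuffer(), "  Hello \n\t world  "));
    }
}
```

Returning the same buffer allows calls to be chained, which is presumably why the documented method has a StringBuffer return type.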
charAt | final public char charAt(int index) | | Returns the character at the specified index.
This is logically equivalent to toString().charAt(index)
for valid argument values 0 <= index < length() .
However because this implementation works directly on the underlying document source string,
it should not be assumed that an IndexOutOfBoundsException is thrown
for an invalid argument value.
Parameters: index - the index of the character. Returns: the character at the specified index. |
compareTo | public int compareTo(Object o) | | Compares this Segment object to another object.
If the argument is not a Segment , a ClassCastException is thrown.
A segment is considered to be before another segment if its begin position is earlier,
or in the case that both segments begin at the same position, its end position is earlier.
Segments that begin and end at the same position are considered equal for
the purposes of this comparison, even if they relate to different source documents.
Note: this class has a natural ordering that is inconsistent with equals.
This means that this method may return zero in some cases where calling the
Segment.equals(Object) method with the same argument returns false .
Parameters: o - the segment to be compared. Returns: a negative integer, zero, or a positive integer as this segment is before, equal to, or after the specified segment. throws: ClassCastException - if the argument is not a Segment. |
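The ordering rule, and the way it can disagree with equals, can be sketched with a hypothetical stand-in class. `SpanSketch` and its `sourceId` field are assumptions made for illustration only; the sketch uses a generic Comparable rather than the documented compareTo(Object), so it does not reproduce the ClassCastException behaviour.

```java
final class SpanSketch implements Comparable<SpanSketch> {
    final String sourceId; // hypothetical stand-in for the Source document
    final int begin;
    final int end;

    SpanSketch(String sourceId, int begin, int end) {
        this.sourceId = sourceId;
        this.begin = begin;
        this.end = end;
    }

    // Earlier begin sorts first; for equal begins, earlier end sorts first.
    // The source is deliberately not consulted, following the documented ordering.
    @Override
    public int compareTo(SpanSketch other) {
        if (begin != other.begin) return Integer.compare(begin, other.begin);
        return Integer.compare(end, other.end);
    }

    // Equality also requires the same source, so compareTo can return 0
    // for two spans that are not equal.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SpanSketch)) return false;
        SpanSketch s = (SpanSketch) o;
        return sourceId.equals(s.sourceId) && begin == s.begin && end == s.end;
    }

    // Sum of begin and end, as the hashCode entry describes for the current implementation.
    @Override
    public int hashCode() {
        return begin + end;
    }
}
```

One practical consequence of such an ordering is that sorted collections, which rely on compareTo, would treat spans from different sources with identical positions as duplicates.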
encloses | final public boolean encloses(int pos) | | Indicates whether this segment encloses the specified character position in the source document.
This is the case if getBegin() <= pos < getEnd().
Parameters: pos - the position in the Source document. Returns: true if this segment encloses the specified character position in the source document, otherwise false. |
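The two encloses conditions documented for this class treat a segment as a half-open interval: the begin position is inside, the end position is not. A minimal standalone sketch of both checks follows; `EnclosesSketch` and its plain int parameters are hypothetical and stand in for the getBegin() and getEnd() values.

```java
final class EnclosesSketch {
    // Position test: begin is included, end is excluded.
    static boolean encloses(int begin, int end, int pos) {
        return begin <= pos && pos < end;
    }

    // Segment test: the inner span may share either boundary with the outer span.
    static boolean encloses(int outerBegin, int outerEnd, int innerBegin, int innerEnd) {
        return outerBegin <= innerBegin && outerEnd >= innerEnd;
    }

    public static void main(String[] args) {
        System.out.println(encloses(2, 5, 5));       // position equal to end
        System.out.println(encloses(2, 5, 2, 5));    // a span compared with itself
    }
}
```

Note the asymmetry: a segment encloses itself, but it does not enclose the position returned by getEnd().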
equals | final public boolean equals(Object object) | | Compares the specified object with this Segment for equality.
Returns true if and only if the specified object is also a Segment,
and both segments have the same
Source, and the same begin and end positions.
Parameters: object - the object to be compared for equality with this Segment. Returns: true if the specified object is equal to this Segment, otherwise false. |
findAllElements | public List findAllElements() | | Returns a list of all
Element objects that are enclosed
by this segment.
The
Source.fullSequentialParse method should be called after construction of the
Source object
if this method is to be used on a large proportion of the source.
It is called automatically if this method is called on the
Source object itself.
The elements returned correspond exactly with the start tags returned in the
Segment.findAllStartTags() method.
Returns: a list of all Element objects that are enclosed by this segment. |
findAllElements | public List findAllElements(String name) | | Returns a list of all
Element objects with the specified name that are enclosed
by this segment.
The elements returned correspond exactly with the start tags returned in the
Segment.findAllStartTags(String name) method.
Specifying a null argument to the name parameter is equivalent to
Segment.findAllElements().
This method also returns elements consisting of unregistered
tags if the specified name is not a valid XML tag name.
Parameters: name - the name of the elements to find. Returns: a list of all Element objects with the specified name that are enclosed by this segment. |
findAllElements | public List findAllElements(StartTagType startTagType) | | Returns a list of all
Element objects with start tags of the specified type
that are enclosed
by this segment.
The elements returned correspond exactly with the start tags returned in the
Segment.findAllTags(TagType) method.
Parameters: startTagType - the type of start tags to find, must not be null. Returns: a list of all Element objects with start tags of the specified type that are enclosed by this segment. |
findAllElements | public List findAllElements(String attributeName, String value, boolean valueCaseSensitive) | | Returns a list of all
Element objects with the specified attribute name/value pair
that are enclosed
by this segment.
The elements returned correspond exactly with the start tags returned in the
Segment.findAllStartTags(String attributeName,String value,boolean valueCaseSensitive) method.
Parameters: attributeName - the attribute name (case insensitive) to search for, must not be null. Parameters: value - the value of the specified attribute to search for, must not be null. Parameters: valueCaseSensitive - specifies whether the attribute value matching is case sensitive. Returns: a list of all Element objects with the specified attribute name/value pair that are enclosed by this segment. |
findAllStartTags | public List findAllStartTags() | | Returns a list of all
StartTag objects that are enclosed
by this segment.
The
Source.fullSequentialParse method should be called after construction of the
Source object
if this method is to be used on a large proportion of the source.
It is called automatically if this method is called on the
Source object itself.
See the
Tag class documentation for more details about the behaviour of this method.
Returns: a list of all StartTag objects that are enclosed by this segment. |
findAllStartTags | public List findAllStartTags(String name) | | Returns a list of all
StartTag objects with the specified name that are enclosed
by this segment.
See the
Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the name parameter is equivalent to
Segment.findAllStartTags().
This method also returns unregistered
tags if the specified name is not a valid XML tag name.
Parameters: name - the name of the start tags to find. Returns: a list of all StartTag objects with the specified name that are enclosed by this segment. |
findAllStartTags | public List findAllStartTags(String attributeName, String value, boolean valueCaseSensitive) | | Returns a list of all
StartTag objects with the specified attribute name/value pair
that are enclosed
by this segment.
See the
Tag class documentation for more details about the behaviour of this method.
Parameters: attributeName - the attribute name (case insensitive) to search for, must not be null. Parameters: value - the value of the specified attribute to search for, must not be null. Parameters: valueCaseSensitive - specifies whether the attribute value matching is case sensitive. Returns: a list of all StartTag objects with the specified attribute name/value pair that are enclosed by this segment. |
findAllTags | public List findAllTags() | | Returns a list of all
Tag objects that are enclosed
by this segment.
The
Source.fullSequentialParse method should be called after construction of the
Source object
if this method is to be used on a large proportion of the source.
It is called automatically if this method is called on the
Source object itself.
See the
Tag class documentation for more details about the behaviour of this method.
Returns: a list of all Tag objects that are enclosed by this segment. |
findAllTags | public List findAllTags(TagType tagType) | | Returns a list of all
Tag objects of the specified type
that are enclosed
by this segment.
See the
Tag class documentation for more details about the behaviour of this method.
Specifying a null argument to the tagType parameter is equivalent to
Segment.findAllTags().
Parameters: tagType - the type of tags to find. Returns: a list of all Tag objects of the specified type that are enclosed by this segment. |
findFormControls | public List findFormControls() | | Returns a list of the
FormControl objects that are enclosed
by this segment.
Returns: a list of the FormControl objects that are enclosed by this segment. |
getBegin | final public int getBegin() | | Returns the character position in the
Source document at which this segment begins.
Returns: the character position in the Source document at which this segment begins. |
getChildElements | public List getChildElements() | | Returns a list of the immediate children of this segment in the document element hierarchy.
The returned list may include an element that extends beyond the end of this segment, as long as it begins within this segment.
An element found at the start of this segment is included in the list.
Note however that if this segment is an
Element , the overriding
Element.getChildElements method is called instead,
which only returns the children of the element.
Calling getChildElements() on an Element is usually more efficient than calling it on a Segment .
The objects in the list are all of type
Element .
The
Source.fullSequentialParse method should be called after construction of the
Source object
if this method is to be used on a large proportion of the source.
It is called automatically if this method is called on the
Source object itself.
See the
Source.getChildElements method for more details.
Returns: a list of the immediate children of this segment in the document element hierarchy, guaranteed not null. See Also: Element.getParentElement |
getDebugInfo | public String getDebugInfo() | | Returns a string representation of this object useful for debugging purposes.
Returns: a string representation of this object useful for debugging purposes. |
getEnd | final public int getEnd() | | Returns the character position in the
Source document immediately after the end of this segment.
The character at the position specified by this property is not included in the segment.
Returns: the character position in the Source document immediately after the end of this segment. |
getRenderer | public Renderer getRenderer() | | Performs a simple rendering of the HTML markup in this segment into text.
The output can be configured by setting any number of properties on the returned
Renderer instance before
obtaining its output.
Returns: an instance of Renderer based on this segment. See Also: Segment.getTextExtractor() |
hashCode | public int hashCode() | | Returns a hash code value for the segment.
The current implementation returns the sum of the begin and end positions, although this is not
guaranteed in future versions.
Returns: a hash code value for the segment. |
ignoreWhenParsing | public void ignoreWhenParsing() | | Causes this segment to be ignored when parsing.
Ignored segments are treated as blank spaces by the parsing mechanism, but are included as normal text in all other functions.
This method was originally the only means of preventing server tags
located inside normal
tags from interfering with the parsing of the tags.
The most common scenario is where the attributes
of a normal tag use server tags to dynamically set the values of the attributes.
As of version 2.4 it is no longer necessary to use this method to ignore server tags
inside normal tags,
as the attributes parser now automatically ignores common server tags.
As of version 2.5 it is also unnecessary to use this method to ignore the contents of
SCRIPT elements (HTMLElementName.SCRIPT),
as the parser automatically ignores this content when performing a
full sequential parse.
This leaves only a few scenarios where calling this method still provides a significant benefit.
One such case is where XML-style server tags are used inside normal
tags.
Here is an example using an XML-style JSP tag:
<a href="<i18n:resource path="/Portal"/>?BACK=TRUE">back</a>
The first double-quote of "/Portal" will be interpreted as the end quote for the href attribute,
as there is no way for the parser to recognise the i18n:resource element as a server tag.
Such use of XML-style server tags inside normal
tags is generally seen as bad practice,
but it is nevertheless valid JSP. The only way to ensure that this library is able to parse the normal tag surrounding it is to
find these server tags first and call the ignoreWhenParsing method to ignore them before parsing the rest of the document.
It is important to understand the difference between ignoring the segment when parsing and removing the segment completely.
Any text inside a segment that is ignored when parsing is treated by most functions as content, and as such is included in the output of
tools such as
TextExtractor and
Renderer .
To remove segments completely, create an
OutputDocument and call its
remove(Segment) or
replaceWithSpaces(int begin, int end) method for each segment.
Then create a new source document using
new Source(outputDocument.toString()) and perform the desired operations on this new source object.
Calling this method after the
Source.fullSequentialParse method has been called is not permitted and throws an IllegalStateException .
Any tags appearing in this segment that are found before this method is called will remain in the cache,
and so will continue to be found by the tag search methods.
If this is undesirable, the
Source.clearCache method can be called to remove them from the cache.
Calling the
Source.fullSequentialParse method after this method clears the cache automatically.
For best performance, this method should be called on all segments that need to be ignored without calling
any of the tag search methods in between.
See Also: Source.ignoreWhenParsing(Collection segments) |
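The effect of treating a span as blank spaces can be illustrated without the library. The sketch below is not how the parser works internally: it overwrites the server tag in a copy of the text with spaces so that every other character keeps its position, which is the property that makes the surrounding normal tag parseable. The class name, method name and the regular expression for the i18n tag are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlankOutSketch {
    // Hypothetical pattern for the XML-style server tag in the example above.
    private static final Pattern SERVER_TAG = Pattern.compile("<i18n:[^>]*/>");

    // Returns a copy of html in which every matched server tag is replaced by
    // the same number of spaces, so character positions are preserved.
    static String blankOutServerTags(String html) {
        StringBuilder sb = new StringBuilder(html);
        Matcher m = SERVER_TAG.matcher(html);
        while (m.find()) {
            for (int i = m.start(); i < m.end(); i++) {
                sb.setCharAt(i, ' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"<i18n:resource path=\"/Portal\"/>?BACK=TRUE\">back</a>";
        System.out.println(blankOutServerTags(html));
    }
}
```

After blanking, the first double quote following href= is matched by the quote after ?BACK=TRUE, so the attribute value is delimited as intended. Unlike this sketch, an ignored segment in the library is still included as normal text by functions other than parsing.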
isWhiteSpace | final public boolean isWhiteSpace() | | Indicates whether this segment consists entirely of white space.
Returns: true if this segment consists entirely of white space, otherwise false. |
isWhiteSpace | final public static boolean isWhiteSpace(char ch) | | Indicates whether the specified character is white space.
The HTML 4.01 specification section 9.1
specifies the following white space characters:
- space (U+0020)
- tab (U+0009)
- form feed (U+000C)
- line feed (U+000A)
- carriage return (U+000D)
- zero-width space (U+200B)
Despite the explicit inclusion of the zero-width space in the HTML specification, Microsoft IE6 does not
recognise it as white space and renders it as an unprintable character (empty square).
Even zero-width spaces included using the numeric character reference &#x200B; are rendered this way.
Parameters: ch - the character to test. Returns: true if the specified character is white space, otherwise false. |
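The six characters listed above can be checked with a simple switch. The following is a standalone sketch of that rule, not the library's code; `WhiteSpaceSketch` is a hypothetical name, and the result for an empty sequence is a choice made by this sketch rather than something the documentation states.

```java
final class WhiteSpaceSketch {
    // True only for the six characters listed for HTML 4.01 section 9.1.
    static boolean isWhiteSpace(char ch) {
        switch (ch) {
            case ' ':       // space (U+0020)
            case '\t':      // tab (U+0009)
            case '\f':      // form feed (U+000C)
            case '\n':      // line feed (U+000A)
            case '\r':      // carriage return (U+000D)
            case '\u200B':  // zero-width space (U+200B)
                return true;
            default:
                return false;
        }
    }

    // True if every character is white space; an empty sequence yields true in this sketch.
    static boolean isWhiteSpace(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (!isWhiteSpace(text.charAt(i))) return false;
        }
        return true;
    }
}
```

This set differs from java.lang.Character.isWhitespace, which for example accepts the vertical tab (U+000B), so the two tests should not be treated as interchangeable.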
length | final public int length() | | Returns the length of the segment.
This is defined as the number of characters between the begin and end positions.
Returns: the length of the segment. |
subSequence | final public CharSequence subSequence(int beginIndex, int endIndex) | | Returns a new character sequence that is a subsequence of this sequence.
This is logically equivalent to toString().subSequence(beginIndex,endIndex)
for valid values of beginIndex and endIndex .
However because this implementation works directly on the underlying document source string,
it should not be assumed that an IndexOutOfBoundsException is thrown
for invalid argument values as described in the String.subSequence(int,int) method.
Parameters: beginIndex - the begin index, inclusive. Parameters: endIndex - the end index, exclusive. Returns: a new character sequence that is a subsequence of this sequence. |
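The caveat about IndexOutOfBoundsException for charAt and subSequence is easier to see with a view that simply offsets indices into a backing string. The sketch below is an assumption about the general technique, not the library's implementation; `SourceViewSketch` is a hypothetical class.

```java
final class SourceViewSketch implements CharSequence {
    private final String source; // the whole underlying document text
    private final int begin;     // inclusive
    private final int end;       // exclusive

    SourceViewSketch(String source, int begin, int end) {
        this.source = source;
        this.begin = begin;
        this.end = end;
    }

    // Number of characters between the begin and end positions.
    @Override
    public int length() {
        return end - begin;
    }

    // No check against length(): the index is only offset into the backing string,
    // so an index past the view may silently return a character outside it.
    @Override
    public char charAt(int index) {
        return source.charAt(begin + index);
    }

    // Same offsetting; only the bounds of the backing string are enforced.
    @Override
    public CharSequence subSequence(int beginIndex, int endIndex) {
        return source.substring(begin + beginIndex, begin + endIndex);
    }

    @Override
    public String toString() {
        return source.substring(begin, end);
    }
}
```

With the source text `<b>bold</b>` and a view over positions 3 to 7, charAt(4) is outside the view but returns the `<` of the end tag instead of throwing, whereas toString().charAt(4) would throw. Callers that need strict bounds checking would have to validate indices against length() themselves.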
toString | public String toString() | | Returns the source text of this segment as a String.
The returned String is newly created with every call to this method, unless this
segment is itself an instance of
Source.
Note that before version 2.0 this returned a representation of this object useful for debugging purposes,
which can now be obtained via the
Segment.getDebugInfo() method.
Returns: the source text of this segment as a String. |
|
|