au.id.jericho.lib.html |
Jericho HTML Parser
A simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while
reproducing verbatim any unrecognised or invalid HTML.
Also provides high-level HTML form manipulation functions.
For an introduction to the API, the documentation of the {@link au.id.jericho.lib.html.Source} class is the best place to start.
For a summary of features and sample applications, visit the homepage at
http://jerichohtml.sourceforge.net
For downloads, support and updates visit the SourceForge.net project page at
http://sourceforge.net/projects/jerichohtml/
The Jericho HTML Parser is an open source library released under the GNU Lesser General Public License (LGPL).
You are therefore free to use it in commercial applications subject to the terms detailed in the licence document.
|
Java Source File Name | Type | Comment |
Attribute.java | Class | Represents a single attribute name/value segment within a StartTag. |
Attributes.java | Class | Represents the list of Attribute objects present within a particular StartTag. |
AttributesOutputSegment.java | Class | Implements an
OutputSegment whose content is a list of attribute name/value pairs. |
BasicLogFormatter.java | Class | Provides basic formatting for log messages.
This class extends the java.util.logging.Formatter class, allowing it to be specified as a formatter for the java.util.logging system.
The static
BasicLogFormatter.format(String level,String message,String loggerName) method provides a means of using the same formatting
outside of the java.util.logging framework. |
BlankOutputSegment.java | Class | Implements an
OutputSegment whose content is a string of spaces with the same length as the segment. |
Cache.java | Class | Represents a cached map of character positions to tags. |
CharacterEntityReference.java | Class | Represents an HTML Character Entity Reference. |
CharacterReference.java | Class | Represents an HTML Character Reference,
implemented by the subclasses CharacterEntityReference and NumericCharacterReference. |
CharOutputSegment.java | Class | Implements an
OutputSegment whose content is a single character constant. |
CharStreamSource.java | Interface | Represents a character stream source. |
CharStreamSourceUtil.java | Class | Contains static utility methods for manipulating the way data is retrieved from a
CharStreamSource object. |
Config.java | Class | Encapsulates global configuration properties which determine the behaviour of various functions. |
Element.java | Class | Represents an element in a specific source document, which encompasses a start tag,
an optional end tag and all content in between.
Take the following HTML segment as an example:
<p>This is a sample paragraph.</p>
The whole segment is represented by an Element object. |
EncodingDetector.java | Class | |
EndTag.java | Class | Represents the end tag of an element in a specific source document. |
EndTagType.java | Class | Defines the syntax for an end tag type. |
EndTagTypeGenericImplementation.java | Class | Provides a generic implementation of the abstract
EndTagType class based on the most common end tag behaviour. |
EndTagTypeMasonComponentCalledWithContent.java | Class | |
EndTagTypeMasonNamedBlock.java | Class | |
EndTagTypeNormal.java | Class | |
EndTagTypeUnregistered.java | Class | |
FormControl.java | Class | Represents an HTML form control.
A FormControl consists of a single element that matches one of the form control types.
The term output element is used to describe the element that is output if this form control is replaced in an
OutputDocument.
A predefined value control is a form control for which
FormControl.getFormControlType().hasPredefinedValue() returns true. |
FormControlOutputStyle.java | Class | An enumerated type representing the three major output styles of a
form control's output element. |
FormControlType.java | Class | Represents the control type
of a
FormControl . |
FormField.java | Class | Represents a field in an HTML form,
a field being defined as the group of all form controls having the same name.
The
FormField.getFormControls() method can be used to obtain the collection of this field's constituent
FormControl objects.
The
FormFields class, which represents a collection of FormField objects, provides the highest level
interface for dealing with form fields and controls. |
FormFields.java | Class | Represents a collection of
FormField objects. |
HTMLElementName.java | Interface | Contains static fields representing the names of
all elements defined in the HTML 4.01 specification. |
HTMLElementNameSet.java | Class | |
HTMLElements.java | Class | Contains static methods which group HTML element names by the characteristics of their associated
elements.
An HTML element is a normal element with a name that matches one of the HTML element names (ignoring case).
This type of element spans the logical HTML element as described in the
HTML 4.01 specification section 3.2.1,
which may be implicitly terminated if it specifies an optional end tag.
The term Non-HTML element refers to a normal element
with a name that does not match one of the HTML element names.
This type of element must be either a single tag element or
explicitly terminated.
All of the sets returned by the methods in this class may be modified to customise the behaviour of the parser.
Care must be taken however to ensure that the sets only contain tag names in lower case.
Below is a table summarising the default characteristics of each HTML element. |
HTMLElementTerminatingTagNameSets.java | Class | |
IntStringHashMap.java | Class | This is an internal class used to efficiently map integers to strings, which is used in the CharacterEntityReference class. |
Logger.java | Interface | Defines the interface for handling log messages. |
LoggerDisabled.java | Class | |
LoggerFactory.java | Class | |
LoggerProvider.java | Interface | Defines the interface for a factory class to provide
Logger instances for each
Source object. |
LoggerProviderDisabled.java | Class | |
LoggerProviderJava.java | Class | |
LoggerProviderJCL.java | Class | |
LoggerProviderLog4J.java | Class | |
LoggerProviderSLF4J.java | Class | |
LoggerProviderSTDERR.java | Class | |
MasonTagTypes.java | Class | Contains tag types related to the Mason server platform. |
NumericCharacterReference.java | Class | Represents an HTML Numeric Character Reference. |
OutputDocument.java | Class | Represents a modified version of an original
Source document.
An OutputDocument represents an original source document that
has been modified by substituting segments of it with other text.
Each of these substitutions must be registered in the output document,
which is most commonly done using the various replace , remove or insert methods in this class.
These methods internally register one or more
OutputSegment objects to define each substitution.
After all of the substitutions have been registered, the modified text can be retrieved using the
OutputDocument.writeTo(Writer) or
OutputDocument.toString() methods.
The registered output segments may be adjacent, and as of version 2.5 may also overlap.
In most cases only output segments that have been
or
legitimately overlap each other. |
OutputSegment.java | Interface | Defines the interface for an output segment, which is used in an
OutputDocument to
replace segments of the source document with other text. |
OutputSegmentComparator.java | Class | |
OverlappingOutputSegmentsException.java | Class | Previously signalled the detection of overlapping output segments in an
OutputDocument. |
ParseText.java | Class | Represents the text from the source document that is to be parsed. |
PHPTagTypes.java | Class | Contains tag types related to the PHP server platform. |
RemoveOutputSegment.java | Class | Implements an
OutputSegment with no content. |
Renderer.java | Class | Performs a simple rendering of HTML markup into text. |
RowColumnVector.java | Class | Represents the row and column number of a character position in the source document. |
Segment.java | Class | Represents a segment of a
Source document. |
Source.java | Class | Represents a source HTML document.
The first step in parsing an HTML document is always to construct a Source object from the source data, which can be a
String , Reader , InputStream or URL .
Each constructor uses all the evidence available to determine the original encoding of the data.
Once the Source object has been created, you can immediately start searching for tags or elements
within the document using the tag search methods.
In certain circumstances you may be able to improve performance by calling the
Source.fullSequentialParse() method before calling any
tag search methods. |
SourceFormatter.java | Class | Formats HTML source by laying out each non-inline-level element on a new line with an appropriate indent.
Any indentation present in the original source text is removed.
Use one of the following methods to obtain the output:
The output text is functionally equivalent to the original source and should be rendered identically unless specified below.
The following points describe the process in general terms.
Any aspect of the algorithm not specifically mentioned here is subject to change without notice in future versions.
- Every element that is not an inline-level element appears on a new line
  with an indent corresponding to its depth in the document element hierarchy.
- The indent is formed by writing n repetitions of the string specified in the IndentString property
  (see SourceFormatter.setIndentString(String)), where n is the depth of the indentation.
- The content of an indented element starts on a new line and is indented at a depth one greater than that of the element,
  with the end tag appearing on a new line at the same depth as the start tag.
  If the content contains only text and inline-level elements,
  it may continue on the same line as the start tag.
|
StartTag.java | Class | Represents the start tag of an element in a specific source document. |
StartTagType.java | Class | Defines the syntax for a start tag type.
A start tag type is any TagType that starts with the character '<'
(as with all tag types), but whose second character is not '/'.
This includes types for many tags which stand alone, without a corresponding end tag,
and would not intuitively be categorised as a "start tag". |
StartTagTypeCDATASection.java | Class | |
StartTagTypeComment.java | Class | |
StartTagTypeDoctypeDeclaration.java | Class | |
StartTagTypeGenericImplementation.java | Class | Provides a generic implementation of the abstract
StartTagType class based on the most common start tag behaviour. |
StartTagTypeMarkupDeclaration.java | Class | |
StartTagTypeMasonComponentCall.java | Class | |
StartTagTypeMasonComponentCalledWithContent.java | Class | |
StartTagTypeMasonNamedBlock.java | Class | |
StartTagTypeNormal.java | Class | |
StartTagTypePHPScript.java | Class | |
StartTagTypePHPShort.java | Class | |
StartTagTypePHPStandard.java | Class | |
StartTagTypeServerCommon.java | Class | |
StartTagTypeUnregistered.java | Class | |
StartTagTypeXMLDeclaration.java | Class | |
StartTagTypeXMLProcessingInstruction.java | Class | |
StreamEncodingDetector.java | Class | |
StringOutputSegment.java | Class | Implements an
OutputSegment whose content is a CharSequence . |
SubCache.java | Class | Represents a cached map of character positions to tags for a particular tag type,
or for all tag types if the tagType field is null. |
Tag.java | Class | Represents either a StartTag or EndTag in a specific source document.
Take the following HTML segment as an example:
<p>This is a sample paragraph.</p>
The "<p>" is represented by a StartTag object, and the "</p>" is represented by an EndTag object,
both of which are subclasses of the Tag class.
The whole segment, including the start tag, its corresponding end tag and all of the content in between, is represented by an
Element object.
The following process describes how each tag is identified by the parser:
- Every '<' character found in the source document is considered to be the start of a tag.
  The characters following it are compared with the start delimiters of all the registered tag types,
  and a list of matching tag types is determined.
- A more detailed analysis of the source is performed according to the features of each matching tag type from the first step,
  in order of precedence, until a valid tag is able to be constructed.
  The analysis performed in relation to each candidate tag type is a two-stage process:
  - The position of the tag is checked to determine whether it is valid.
    In theory, a server tag is valid in any position, but a non-server tag is not valid inside another non-server tag.
    The TagType#isValidPosition(Source, int pos, int[] fullSequentialParseData) method is responsible for this check
    and has a common default implementation for all tag types
    (although custom tag types can override it if necessary).
    Its behaviour differs depending on whether or not a full sequential parse is performed.
    See the documentation of the isValidPosition method for full details.
  - A final analysis is performed by the TagType#constructTagAt(Source, int pos) method of the candidate tag type.
    This method returns a valid Tag object if all conditions of the candidate tag type are met, otherwise it returns
    null and the process continues with the next candidate tag type.
- If the source does not match the start delimiter or syntax of any registered tag type, the segment spanning it and the next
  '>' character is taken to be an unregistered tag.
  Some tag search methods ignore unregistered tags. |
TagType.java | Class | Defines the syntax for a tag type that can be recognised by the parser.
This class is the root abstract class common to all tag types, and contains methods to register and deregister
tag types as well as various methods to aid in their implementation.
Every tag type is represented by an instance of a class (usually a singleton) that must be a subclass of either
StartTagType or
EndTagType . |
TagTypeRegister.java | Class | |
TextExtractor.java | Class | Extracts the textual content from HTML markup. |
Util.java | Class | Contains miscellaneous utility methods not directly associated with the HTML Parser library. |
WriterLogger.java | Class | Provides an implementation of the
Logger interface that sends output to the specified java.io.Writer . |
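
The substitution model described for OutputDocument (register replacements against positions in the unmodified source, then write the result out in one pass) can be illustrated without the library. The following is a minimal sketch in plain Java, not the Jericho API: the class name SubstitutionSketch, its nested Substitution type and its method signatures are hypothetical stand-ins chosen for illustration. Unlike the library as of version 2.5, this sketch rejects overlapping substitutions rather than resolving them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the "register, then write" idea; NOT the Jericho OutputDocument class.
final class SubstitutionSketch {

    // One registered substitution: replace source[begin, end) with text.
    private static final class Substitution {
        final int begin;
        final int end;
        final String text;

        Substitution(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    private final String source;
    private final List<Substitution> substitutions = new ArrayList<>();

    SubstitutionSketch(String source) {
        this.source = source;
    }

    // All positions refer to the ORIGINAL source, regardless of earlier registrations.
    void replace(int begin, int end, String text) {
        if (begin < 0 || end < begin || end > source.length()) {
            throw new IndexOutOfBoundsException("invalid span " + begin + ".." + end);
        }
        substitutions.add(new Substitution(begin, end, text));
    }

    void remove(int begin, int end) {
        replace(begin, end, "");
    }

    void insert(int pos, String text) {
        replace(pos, pos, text);
    }

    // Builds the modified text by copying untouched source between substitutions.
    @Override
    public String toString() {
        List<Substitution> sorted = new ArrayList<>(substitutions);
        // List.sort is stable, so substitutions at the same position keep registration order.
        sorted.sort(Comparator.comparingInt((Substitution s) -> s.begin));
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Substitution s : sorted) {
            if (s.begin < pos) {
                throw new IllegalStateException("overlapping substitutions are not handled in this sketch");
            }
            out.append(source, pos, s.begin); // unchanged source text
            out.append(s.text);               // replacement text
            pos = s.end;
        }
        out.append(source, pos, source.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>This is a sample paragraph.</p>";
        SubstitutionSketch doc = new SubstitutionSketch(html);
        int begin = html.indexOf("sample");
        doc.replace(begin, begin + "sample".length(), "short");
        doc.insert(html.length(), "<hr>");
        System.out.println(doc); // prints "<p>This is a short paragraph.</p><hr>"
    }
}
```

Keeping every position relative to the original text means substitutions can be registered in any order, which is the property the OutputDocument description relies on when it says the modified text is retrieved only after all substitutions have been registered.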