au.id.jericho.lib.html

Jericho HTML Parser is a simple but powerful Java library for analysing and manipulating parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions. For an introduction to the API, the documentation of the au.id.jericho.lib.html.Source class is the best place to start. For a summary of features and sample applications, visit the homepage at http://jerichohtml.sourceforge.net. For downloads, support and updates, visit the SourceForge.net project page at http://sourceforge.net/projects/jerichohtml/. The Jericho HTML Parser is an open source library released under the GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in the licence document.
Java Source File Name	Type	Comment
Attribute.java	Class	Represents a single attribute name/value segment within a StartTag.
Attributes.java	Class	Represents the list of Attribute objects present within a particular StartTag.
AttributesOutputSegment.java	Class	Implements an OutputSegment whose content is a list of attribute name/value pairs.
BasicLogFormatter.java	Class	Provides basic formatting for log messages. This class extends the `java.util.logging.Formatter` class, allowing it to be specified as a formatter for the `java.util.logging` system. The static BasicLogFormatter.format(String level,String message,String loggerName) method provides a means of using the same formatting outside of the `java.util.logging` framework.
BlankOutputSegment.java	Class	Implements an OutputSegment whose content is a string of spaces with the same length as the segment.
Cache.java	Class	Represents a cached map of character positions to tags.
CharacterEntityReference.java	Class	Represents an HTML Character Entity Reference.
CharacterReference.java	Class	Represents an HTML Character Reference, implemented by the subclasses CharacterEntityReference and NumericCharacterReference.
CharOutputSegment.java	Class	Implements an OutputSegment whose content is a single character constant.
CharStreamSource.java	Interface	Represents a character stream source.
CharStreamSourceUtil.java	Class	Contains static utility methods for manipulating the way data is retrieved from a CharStreamSource object.
Config.java	Class	Encapsulates global configuration properties which determine the behaviour of various functions.
Element.java	Class	Represents an element in a specific document, which encompasses a start tag, an optional end tag and all content in between. Take the following HTML segment as an example: `<p>This is a sample paragraph.</p>` The whole segment is represented by an `Element` object.
EncodingDetector.java	Class
EndTag.java	Class	Represents the end tag of an element in a specific document.
EndTagType.java	Class	Defines the syntax for an end tag type.
EndTagTypeGenericImplementation.java	Class	Provides a generic implementation of the abstract EndTagType class based on the most common end tag behaviour.
EndTagTypeMasonComponentCalledWithContent.java	Class
EndTagTypeMasonNamedBlock.java	Class
EndTagTypeNormal.java	Class
EndTagTypeUnregistered.java	Class
FormControl.java	Class	Represents an HTML form control. A `FormControl` consists of a single element that matches one of the form control types. The term output element is used to describe the element that is output if this form control is replaced in an OutputDocument. A predefined value control is a form control for which `getFormControlType().hasPredefinedValue()` returns `true`.
FormControlOutputStyle.java	Class	An enumerated type representing the three major output styles of a form control's output element.
FormControlType.java	Class	Represents the control type of a FormControl.
FormField.java	Class	Represents a field in an HTML form, a field being defined as the group of all form controls having the same name. The FormField.getFormControls() method can be used to obtain the collection of this field's constituent FormControl objects. The FormFields class, which represents a collection of `FormField` objects, provides the highest level interface for dealing with form fields and controls.
FormFields.java	Class	Represents a collection of FormField objects.
HTMLElementName.java	Interface	Contains static fields representing the names of all elements defined in the HTML 4.01 specification.
HTMLElementNameSet.java	Class
HTMLElements.java	Class	Contains static methods which group HTML element names by the characteristics of their associated elements. An HTML element is a normal element with a name that matches one of the HTML element names (ignoring case). This type of element spans the logical HTML element as described in the HTML 4.01 specification section 3.2.1, which may be implicitly terminated if it specifies an optional end tag. The term Non-HTML element refers to a normal element with a name that does not match one of the HTML element names. This type of element must be either a single tag element or explicitly terminated. All of the sets returned by the methods in this class may be modified to customise the behaviour of the parser. Care must be taken, however, to ensure that the sets only contain tag names in lower case. The full class documentation includes a table summarising the default characteristics of each HTML element.
HTMLElementTerminatingTagNameSets.java	Class
IntStringHashMap.java	Class	This is an internal class used to efficiently map integers to strings, which is used in the CharacterEntityReference class.
Logger.java	Interface	Defines the interface for handling log messages.
LoggerDisabled.java	Class
LoggerFactory.java	Class
LoggerProvider.java	Interface	Defines the interface for a factory class to provide Logger instances for each Source object.
LoggerProviderDisabled.java	Class
LoggerProviderJava.java	Class
LoggerProviderJCL.java	Class
LoggerProviderLog4J.java	Class
LoggerProviderSLF4J.java	Class
LoggerProviderSTDERR.java	Class
MasonTagTypes.java	Class	Contains tag types related to the Mason server platform.
NumericCharacterReference.java	Class	Represents an HTML Numeric Character Reference.
OutputDocument.java	Class	Represents a modified version of an original Source document. An `OutputDocument` represents an original source document that has been modified by substituting segments of it with other text. Each of these substitutions must be registered in the output document, which is most commonly done using the various `replace`, `remove` or `insert` methods in this class. These methods internally register one or more OutputSegment objects to define each substitution. After all of the substitutions have been registered, the modified text can be retrieved using the OutputDocument.writeTo(Writer) or OutputDocument.toString() methods. The registered output segments may be adjacent, and as of version 2.5 may also overlap. In most cases only output segments that have been or legitimately overlap each other.
OutputSegment.java	Interface	Defines the interface for an output segment, which is used in an OutputDocument to replace segments of the source document with other text.
OutputSegmentComparator.java	Class
OverlappingOutputSegmentsException.java	Class	Previously signalled the detection of overlapping output segments in an OutputDocument.
ParseText.java	Class	Represents the text from the document that is to be parsed.
PHPTagTypes.java	Class	Contains tag types related to the PHP server platform.
RemoveOutputSegment.java	Class	Implements an OutputSegment with no content.
Renderer.java	Class	Performs a simple rendering of HTML markup into text.
RowColumnVector.java	Class	Represents the row and column number of a character position in the source document.
Segment.java	Class	Represents a segment of a Source document.
Source.java	Class	Represents a source HTML document. The first step in parsing an HTML document is always to construct a `Source` object from the source data, which can be a `String`, `Reader`, `InputStream` or `URL`. Each constructor uses all the evidence available to determine the original encoding of the data. Once the `Source` object has been created, you can immediately start searching for tags or elements within the document using the tag search methods. In certain circumstances you may be able to improve performance by calling the Source.fullSequentialParse() method before calling any tag search methods.
SourceFormatter.java	Class	Formats HTML source by laying out each non-inline-level element on a new line with an appropriate indent. Any indentation present in the original source text is removed. Use one of the following methods to obtain the output: SourceFormatter.writeTo(Writer), SourceFormatter.toString(), or CharStreamSourceUtil.getReader(this). The output text is functionally equivalent to the original source and should be rendered identically unless specified below. The following points describe the process in general terms. Any aspect of the algorithm not specifically mentioned here is subject to change without notice in future versions. Every element that is not an inline-level element appears on a new line with an indent corresponding to its depth in the document element hierarchy. The indent is formed by writing n repetitions of the string specified in the IndentString property (set with SourceFormatter.setIndentString(String)), where n is the depth of the indentation. The content of an indented element starts on a new line and is indented at a depth one greater than that of the element, with the end tag appearing on a new line at the same depth as the start tag. If the content contains only text and inline-level elements, it may continue on the same line as the start tag.
StartTag.java	Class	Represents the start tag of an element in a specific document.
StartTagType.java	Class	Defines the syntax for a start tag type. A start tag type is any TagType that starts with the character '`<`' (as with all tag types), but whose second character is not '`/`'. This includes types for many tags which stand alone, without a corresponding end tag, and would not intuitively be categorised as a "start tag".
StartTagTypeCDATASection.java	Class
StartTagTypeComment.java	Class
StartTagTypeDoctypeDeclaration.java	Class
StartTagTypeGenericImplementation.java	Class	Provides a generic implementation of the abstract StartTagType class based on the most common start tag behaviour.
StartTagTypeMarkupDeclaration.java	Class
StartTagTypeMasonComponentCall.java	Class
StartTagTypeMasonComponentCalledWithContent.java	Class
StartTagTypeMasonNamedBlock.java	Class
StartTagTypeNormal.java	Class
StartTagTypePHPScript.java	Class
StartTagTypePHPShort.java	Class
StartTagTypePHPStandard.java	Class
StartTagTypeServerCommon.java	Class
StartTagTypeUnregistered.java	Class
StartTagTypeXMLDeclaration.java	Class
StartTagTypeXMLProcessingInstruction.java	Class
StreamEncodingDetector.java	Class
StringOutputSegment.java	Class	Implements an OutputSegment whose content is a `CharSequence`.
SubCache.java	Class	Represents a cached map of character positions to tags for a particular tag type, or for all tag types if the tagType field is null.
Tag.java	Class	Represents either a StartTag or EndTag in a specific document. Take the following HTML segment as an example: `<p>This is a sample paragraph.</p>` The "`<p>`" is represented by a StartTag object, and the "`</p>`" is represented by an EndTag object, both of which are subclasses of the `Tag` class. The whole segment, including the start tag, its corresponding end tag and all of the content in between, is represented by an Element object. Tag Parsing Process: the following process describes how each tag is identified by the parser. (1) Every '`<`' character found in the source document is considered to be the start of a tag. The characters following it are compared with the start delimiters of all the registered tag types, and a list of matching tag types is determined. (2) A more detailed analysis of the source is performed according to the features of each matching tag type from the first step, in order of precedence, until a valid tag is able to be constructed. The analysis performed in relation to each candidate tag type is a two-stage process. (a) The position of the tag is checked to determine whether it is valid. In theory, a server tag is valid in any position, but a non-server tag is not valid inside another non-server tag. The TagType.isValidPosition(Source, int pos, int[] fullSequentialParseData) method is responsible for this check and has a common default implementation for all tag types (although custom tag types can override it if necessary). Its behaviour differs depending on whether or not a full sequential parse is performed. See the documentation of the isValidPosition method for full details. (b) A final analysis is performed by the TagType.constructTagAt(Source, int pos) method of the candidate tag type. This method returns a valid Tag object if all conditions of the candidate tag type are met, otherwise it returns `null` and the process continues with the next candidate tag type.
(3) If the source does not match the start delimiter or syntax of any registered tag type, the segment spanning it and the next '`>`' character is taken to be an unregistered tag. Some tag search methods ignore unregistered tags.
TagType.java	Class	Defines the syntax for a tag type that can be recognised by the parser. This class is the root abstract class common to all tag types, and contains methods to register and deregister tag types as well as various methods to aid in their implementation. Every tag type is represented by an instance of a class (usually a singleton) that must be a subclass of either StartTagType or EndTagType.
TagTypeRegister.java	Class
TextExtractor.java	Class	Extracts the textual content from HTML markup.
Util.java	Class	Contains miscellaneous utility methods not directly associated with the HTML Parser library.
WriterLogger.java	Class	Provides an implementation of the Logger interface that sends output to the specified `java.io.Writer`.
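
The tag parsing process summarised in the Tag.java entry above can be sketched in plain Java with no dependency on the library. This is an illustration under stated assumptions, not the parser's actual code: `TagParsingSketch`, `CandidateType` and `identifyTagAt` are hypothetical names, only two stand-in tag types (comment and normal) are registered, and the position validity check is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TagParsingSketch {
    /** Hypothetical stand-in for a registered tag type. */
    interface CandidateType {
        String startDelimiter();
        /** Returns the exclusive end position of a valid tag at pos, or -1 if none. */
        int constructTagAt(String source, int pos);
    }

    static final CandidateType COMMENT = new CandidateType() {
        public String startDelimiter() { return "<!--"; }
        public int constructTagAt(String source, int pos) {
            int end = source.indexOf("-->", pos + 4);
            return end < 0 ? -1 : end + 3;
        }
    };

    static final CandidateType NORMAL = new CandidateType() {
        public String startDelimiter() { return "<"; }
        public int constructTagAt(String source, int pos) {
            // Simplified rule: a letter must follow '<' and a '>' must close the tag.
            if (pos + 1 >= source.length() || !Character.isLetter(source.charAt(pos + 1))) return -1;
            int end = source.indexOf('>', pos);
            return end < 0 ? -1 : end + 1;
        }
    };

    // Listed in order of precedence: the more specific start delimiter comes first.
    static final List<CandidateType> REGISTERED = Arrays.asList(COMMENT, NORMAL);

    /** Returns the text of the tag starting at pos, or null if no tag starts there. */
    static String identifyTagAt(String source, int pos) {
        if (source.charAt(pos) != '<') return null;
        // Step 1: collect the tag types whose start delimiter matches at pos.
        List<CandidateType> candidates = new ArrayList<>();
        for (CandidateType type : REGISTERED) {
            if (source.startsWith(type.startDelimiter(), pos)) candidates.add(type);
        }
        // Step 2: try each candidate in order until one constructs a valid tag.
        for (CandidateType type : candidates) {
            int end = type.constructTagAt(source, pos);
            if (end >= 0) return source.substring(pos, end);
        }
        // Step 3: no candidate matched, so the span up to the next '>' is
        // treated as an unregistered tag.
        int end = source.indexOf('>', pos);
        return end < 0 ? null : source.substring(pos, end + 1);
    }

    public static void main(String[] args) {
        System.out.println(identifyTagAt("<p>This is a sample paragraph.</p>", 0));
        System.out.println(identifyTagAt("<!-- note --><p>", 0));
        System.out.println(identifyTagAt("<@foo> text", 0));
    }
}
```

Returning `-1` from `constructTagAt` plays the role that returning `null` plays in the description: the loop simply moves on to the next candidate type.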

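The substitution model described in the OutputDocument.java entry (register replacements against positions in the original source, then write out the modified text) can likewise be sketched with the standard library alone. `OutputSketch` and `Substitution` are hypothetical names chosen for this example; the sketch rejects overlapping substitutions rather than resolving them, so it does not reflect how the library treats overlaps.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class OutputSketch {
    /** Hypothetical record of one substitution: source[begin, end) becomes text. */
    static final class Substitution {
        final int begin;
        final int end;
        final String text;
        Substitution(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    private final String source;
    private final List<Substitution> registered = new ArrayList<>();

    OutputSketch(String source) { this.source = source; }

    // Each convenience method registers a single substitution.
    void replace(int begin, int end, String text) { registered.add(new Substitution(begin, end, text)); }
    void remove(int begin, int end) { replace(begin, end, ""); }
    void insert(int pos, String text) { replace(pos, pos, text); }

    /** Builds the modified text by copying unmodified source between substitutions. */
    @Override
    public String toString() {
        List<Substitution> sorted = new ArrayList<>(registered);
        sorted.sort(Comparator.comparingInt((Substitution s) -> s.begin)
                .thenComparingInt(s -> s.end));
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Substitution s : sorted) {
            if (s.begin < pos) {
                throw new IllegalStateException("overlapping substitutions are not handled in this sketch");
            }
            out.append(source, pos, s.begin); // untouched source text
            out.append(s.text);               // replacement text
            pos = s.end;
        }
        out.append(source, pos, source.length());
        return out.toString();
    }

    public static void main(String[] args) {
        OutputSketch doc = new OutputSketch("Hello World");
        doc.replace(0, 5, "Goodbye");
        doc.insert(11, "!");
        System.out.println(doc); // prints "Goodbye World!"
    }
}
```

Keeping the original source immutable and applying the substitutions only at output time means that all positions passed to `replace`, `remove` and `insert` refer to the original text, regardless of the order in which they are registered.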