Java Doc for Source.java (jericho-html, package au.id.jericho.lib.html)


java.lang.Object
   au.id.jericho.lib.html.Segment
      au.id.jericho.lib.html.Source

Source
public class Source extends Segment
Represents a source HTML document.

The first step in parsing an HTML document is always to construct a Source object from the source data, which can be a String, Reader, InputStream or URL. Each constructor uses all the evidence available to determine the original character encoding of the data.

Once the Source object has been created, you can immediately start searching for or within the document using the tag search methods.

In certain circumstances you may be able to improve performance by calling the Source.fullSequentialParse() method before calling any tag search methods. See the documentation of the Source.fullSequentialParse() method for details.

Any issues encountered while parsing are logged to a Logger object. The Source.setLogger(Logger) method can be used to explicitly set a Logger implementation for a particular Source instance, otherwise the static Config.LoggerProvider property determines how the logger is set by default for all Source instances. See the documentation of the Config.LoggerProvider property for information about how the default logging provider is determined.

Note that many of the useful functions which can be performed on the source document are defined in its superclass, Segment. The source object is itself a segment which spans the entire document.

Most of the methods defined in this class are useful for determining the elements and tags surrounding or neighbouring a particular character position in the document.

For information on how to create a modified version of this source document, see the OutputDocument class.
See Also:   Segment



Field Summary
final static String PACKAGE_NAME
List allStartTags
List allTags
Tag[] allTagsArray
final Cache cache
Logger logger
final String string
boolean useAllTypesCache
boolean useSpecialTypesCache
    

Constructor Summary
public  Source(CharSequence text)
     Constructs a new Source object from the specified text.
 Source(Reader reader, String encoding)
    
public  Source(Reader reader)
     Constructs a new Source object by loading the content from the specified Reader.
public  Source(InputStream inputStream)
     Constructs a new Source object by loading the content from the specified InputStream.
public  Source(URL url)
     Constructs a new Source object by loading the content from the specified URL.

The algorithm for detecting the character encoding of the source document is as follows:
(process termination is marked by ♦)

  1. If the HTTP headers received when opening a connection to the URL include a Content-Type header specifying a charset parameter, then use the encoding specified in the value of the charset parameter.

Method Summary
public void clearCache()
     Clears the tag cache of all tags.
public List findAllElements()
     Returns a list of all elements in this source document.
public List findAllStartTags()
     Returns a list of all start tags in this source document.
public List findAllTags()
     Returns a list of all tags in this source document.
public Element findEnclosingElement(int pos)
     Returns the most nested Element that encloses the specified position in the source document.

The specified position can be anywhere inside the start tag, end tag, or content of the element.

public Element findEnclosingElement(int pos, String name)
     Returns the most nested Element with the specified name that encloses the specified position in the source document.

The specified position can be anywhere inside the start tag, end tag, or content of the element.

public Tag findEnclosingTag(int pos)
     Returns the Tag that encloses the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document, may be out of bounds.

public Tag findEnclosingTag(int pos, TagType tagType)
     Returns the Tag of the specified type that encloses the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document, may be out of bounds.
Parameters:
  tagType - the TagType to search for.

public int findNameEnd(int pos)
     Returns the end position of the XML Name that starts at the specified position.

This implementation first checks that the character at the specified position is a valid XML Name start character as defined by the Tag.isXMLNameStartChar(char) method.

public CharacterReference findNextCharacterReference(int pos)
     Returns the CharacterReference beginning at or immediately following the specified position in the source document.

Character references positioned within an HTML comment are NOT ignored.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.

public Element findNextElement(int pos)
     Returns the Element beginning at or immediately following the specified position in the source document.

This is equivalent to findNextStartTag(pos).getElement(), assuming the result is not null.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.

public Element findNextElement(int pos, String name)
     Returns the Element with the specified name beginning at or immediately following the specified position in the source document.

This is equivalent to findNextStartTag(pos,name).getElement(), assuming the result is not null.

Specifying a null argument to the name parameter is equivalent to findNextElement(pos).

Specifying an argument to the name parameter that ends in a colon (:) searches for all elements in the specified XML namespace.

This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the element to search for.

public Element findNextElement(int pos, String attributeName, String value, boolean valueCaseSensitive)
     Returns the Element with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.

This is equivalent to findNextStartTag(pos,attributeName,value,valueCaseSensitive).getElement(), assuming the result is not null.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  attributeName - the attribute name (case insensitive) to search for, must not be null.
Parameters:
  value - the value of the specified attribute to search for, must not be null.
Parameters:
  valueCaseSensitive - specifies whether the attribute value matching is case sensitive.

public EndTag findNextEndTag(int pos)
     Returns the EndTag beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.

public EndTag findNextEndTag(int pos, String name)
     Returns the EndTag with the specified name beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the end tag to search for, must not be null.

public EndTag findNextEndTag(int pos, String name, EndTagType endTagType)
     Returns the EndTag with the specified name and type beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the end tag to search for, must not be null.
Parameters:
  endTagType - the type of the end tag to search for, must not be null.

public StartTag findNextStartTag(int pos)
     Returns the StartTag beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.

public StartTag findNextStartTag(int pos, String name)
     Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Specifying a null argument to the name parameter is equivalent to findNextStartTag(pos).

Specifying an argument to the name parameter that ends in a colon (:) searches for all start tags in the specified XML namespace.

This method also returns unregistered tags if the specified name is not a valid XML tag name.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the start tag to search for.

public StartTag findNextStartTag(int pos, String attributeName, String value, boolean valueCaseSensitive)
     Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  attributeName - the attribute name (case insensitive) to search for, must not be null.
Parameters:
  value - the value of the specified attribute to search for, must not be null.
Parameters:
  valueCaseSensitive - specifies whether the attribute value matching is case sensitive.

public Tag findNextTag(int pos)
     Returns the Tag beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Use Tag.findNextTag to find the tag immediately following another tag.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.

public Tag findNextTag(int pos, TagType tagType)
     Returns the Tag of the specified type beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  tagType - the TagType to search for.

public CharacterReference findPreviousCharacterReference(int pos)
     Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.

Character references positioned within an HTML comment are NOT ignored.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.

public EndTag findPreviousEndTag(int pos)
     Returns the EndTag beginning at or immediately preceding the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.

public EndTag findPreviousEndTag(int pos, String name)
     Returns the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the end tag to search for, must not be null.

public StartTag findPreviousStartTag(int pos)
     Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.

public StartTag findPreviousStartTag(int pos, String name)
     Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Specifying a null argument to the name parameter is equivalent to findPreviousStartTag(pos).

This method also returns unregistered tags if the specified name is not a valid XML tag name.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the start tag to search for.

public Tag findPreviousTag(int pos)
     Returns the Tag beginning at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.

public Tag findPreviousTag(int pos, TagType tagType)
     Returns the Tag of the specified type beginning at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  tagType - the TagType to search for.

public Tag[] fullSequentialParse()
     Parses all of the tags in this source document sequentially from beginning to end.

Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed.

Calling the Source.findAllTags(), Source.findAllStartTags(), Source.findAllElements() or Source.getChildElements() method on the Source object performs a full sequential parse automatically. There are however still circumstances where it should be called manually, such as when it is known that most or all of the tags in the document will need to be parsed, but none of the abovementioned methods are used, or are called only after calling one or more other tag search methods.

If this method is called manually, it should be called soon after the Source object is created, before any tag search methods are called.

By default, tags are parsed only as needed, which is referred to as parse on demand mode. In this mode, every call to a tag search method that is not returning previously cached tags must perform a relatively complex check to determine whether a potential tag is in a valid position.

Generally speaking, a tag is in a valid position if it does not appear inside any other tag. Server tags can appear anywhere in a document, including inside other tags, so this relates only to non-server tags. Theoretically, checking whether a specified position in the document is enclosed in another tag is only possible if every preceding tag has been parsed, otherwise it is impossible to tell whether one of the delimiters of the enclosing tag was in fact enclosed by some other tag before it, thereby invalidating it.

When this method is called, each tag is parsed in sequence starting from the beginning of the document, making it easy to check whether each potential tag is in a valid position. In parse on demand mode a compromise technique must be used for this check, since the theoretical requirement of having parsed all preceding tags is no longer practical.

String getBestGuessNewLine()

public String getCacheDebugInfo()
     Returns a string representation of the tag cache, useful for debugging purposes.
static String getCharsetParameterFromHttpHeaderValue(String httpHeaderValue)

public List getChildElements()
     Returns a list of the top-level elements in the document element hierarchy.

The objects in the list are all of type Element.

The term top-level element refers to an element that is not nested within any other element in the document.

The term document element hierarchy refers to the hierarchy of elements that make up this source document. The source document itself is not considered to be part of the hierarchy, meaning there is typically more than one top-level element. Even when the source represents an entire HTML document, the document type declaration and/or an XML declaration often exist as top-level elements along with the HTML element itself.

The Element.getChildElements method can be used to get the children of the top-level elements, with recursive use providing a means to visit every element in the document hierarchy.

The document element hierarchy differs from that of the Document Object Model in that it is only a representation of the elements that are physically present in the source text.

public int getColumn(int pos)
     Returns the column number of the specified character position in the source document.
Parameters:
  pos - the position in the source document.
public String getDocumentSpecifiedEncoding()
     Returns the document encoding specified within the text of the document.

The document encoding can be specified within the document text in two ways. They are referred to generically in this library as an encoding specification, and are listed below in order of precedence:

  1. An encoding declaration within the XML declaration of an XML document, which must be present if it has an encoding other than UTF-8 or UTF-16.
    <?xml version="1.0" encoding="ISO-8859-1" ?>
  2. A META declaration, which is in the form of a META tag with attribute http-equiv="Content-Type". The encoding is specified in the charset parameter of a Content-Type HTTP header value, which is placed in the value of the meta tag's content attribute. This META declaration should appear as early as possible in the HEAD element.
    <META http-equiv=Content-Type content="text/html; charset=iso-8859-1">

Both of these tags must only use characters in the range U+0000 to U+007F, and in the case of the META declaration must use ASCII encoding.
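The two encoding specification forms above can be illustrated with a small standalone sketch. It is not the library's implementation: the class name EncodingSpecSketch, the method find and the regular expressions are assumptions made for illustration, and the patterns are deliberately simplistic (they do not handle every legal attribute ordering or quoting style).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Jericho API.
final class EncodingSpecSketch {
    // Encoding declaration inside an XML declaration, e.g. <?xml version="1.0" encoding="ISO-8859-1" ?>
    private static final Pattern XML_DECL = Pattern.compile(
        "<\\?xml\\b[^>]*?\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // A META tag whose http-equiv attribute is Content-Type.
    private static final Pattern META = Pattern.compile(
        "<meta\\b[^>]*\\bhttp-equiv\\s*=\\s*[\"']?Content-Type[\"']?[^>]*>", Pattern.CASE_INSENSITIVE);

    // The charset parameter of a Content-Type header value.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
        "\\bcharset\\s*=\\s*[\"']?([^\\s;\"'>]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the encoding named in the text, or null if no specification is found. */
    static String find(String preview) {
        // The XML declaration takes precedence over the META declaration.
        Matcher xml = XML_DECL.matcher(preview);
        if (xml.find()) {
            return xml.group(1);
        }
        Matcher meta = META.matcher(preview);
        if (meta.find()) {
            Matcher charset = CHARSET_PARAM.matcher(meta.group());
            if (charset.find()) {
                return charset.group(1);
            }
        }
        return null;
    }
}
```

The XML declaration is checked first to mirror the order of precedence listed above; the name is returned exactly as written in the document, without checking whether the platform supports it.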

public Element getElementById(String id)
     Returns the Element with the specified id attribute value.

This simulates the script method getElementById defined in DOM HTML level 1.

This is equivalent to findNextStartTag(0,"id",id,true).getElement(), assuming that the element exists.

A well formed HTML document should have no more than one element with any given id attribute value.
Parameters:
  id - the id attribute value (case sensitive) to search for, must not be null.

public String getEncoding()
     Returns the character encoding scheme of the source byte stream used to create this object.

The encoding of a document defines how the original byte stream was encoded into characters. The HTTP specification section 3.4 uses the term "character set" to refer to the encoding, and the term "charset" is similarly used in Java (see the class java.nio.charset.Charset). This often causes confusion, as a modern "coded character set" such as Unicode can have several encodings, such as UTF-8, UTF-16, and UTF-32. See the Wikipedia character encoding article for an explanation of the terminology.

This method makes the best possible effort to return the name of the encoding used to decode the original source byte stream into character data.
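The distinction between a coded character set and its encodings can be seen with the JDK alone. The sketch below is illustrative only; the class name EncodingBytesSketch and the method hex are made up for this example. It encodes the same Unicode character under three encodings and prints the resulting bytes as hexadecimal.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Jericho API.
final class EncodingBytesSketch {
    /** Encodes the text with the given charset and returns the bytes as space-separated hex pairs. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

For the single character U+00E9, hex("\u00e9", StandardCharsets.UTF_8) gives "C3 A9", StandardCharsets.UTF_16BE gives "00 E9", and StandardCharsets.ISO_8859_1 gives "E9": one character, three different byte sequences, which is why the encoding must be known before the bytes can be decoded.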

public String getEncodingSpecificationInfo()
     Returns a concise description of how the encoding of the source document was determined.
public Writer getLogWriter()
     Returns the destination Writer for log messages.
public Logger getLogger()
     Returns the Logger that handles log messages.
public String getNewLine()
     Returns the newline character sequence used in the source document.
final public ParseText getParseText()
     Returns the parse text of this source document.
List getParsedTags()
     Gets a list of all the tags that have been parsed so far.
public String getPreliminaryEncodingInfo()
     Returns the preliminary encoding of the source document together with a concise description of how it was determined.
public int getRow(int pos)
     Returns the row number of the specified character position in the source document.
Parameters:
  pos - the position in the source document.
public RowColumnVector getRowColumnVector(int pos)
     Returns a RowColumnVector object representing the row and column number of the specified character position in the source document.
Parameters:
  pos - the position in the source document.
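As a rough illustration of what mapping a character position to a row and column involves, here is a minimal standalone sketch. It is not the library's code: the class name RowColumnSketch is hypothetical, and it assumes that rows and columns are counted from 1 and that "\n", "\r\n" and a lone "\r" each end a line, none of which is stated in this documentation.

```java
// Hypothetical helper, not part of the Jericho API.
final class RowColumnSketch {
    /** Returns {row, column} of the given position, both counted from 1 (an assumption). */
    static int[] rowColumn(CharSequence text, int pos) {
        if (pos < 0 || pos > text.length()) {
            throw new IndexOutOfBoundsException("pos=" + pos);
        }
        int row = 1;
        int lineStart = 0;
        for (int i = 0; i < pos; i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                row++;
                lineStart = i + 1;
            } else if (c == '\r') {
                // In a "\r\n" pair the '\n' branch does the counting on the next iteration.
                boolean followedByLineFeed = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                if (!followedByLineFeed) {
                    row++;
                    lineStart = i + 1;
                }
            }
        }
        return new int[] { row, pos - lineStart + 1 };
    }
}
```

This version rescans the text on every call; a real implementation would more likely cache line start offsets so repeated lookups are cheap.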
public SourceFormatter getSourceFormatter()
     Formats the HTML source by laying out each non-inline-level element on a new line with an appropriate indent.
final public Tag getTagAt(int pos)
     Returns the Tag at the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

This method also returns unregistered tags.
Parameters:
  pos - the position in the source document, may be out of bounds.

public void ignoreWhenParsing(int begin, int end)
     Causes the specified range of the source text to be ignored when parsing.
public void ignoreWhenParsing(Collection segments)
     Causes all of the segments in the specified collection to be ignored when parsing.
public CharStreamSource indent(String indentString, boolean tidyTags, boolean collapseWhiteSpace, boolean indentAllElements)
     Formats the HTML source by laying out each non-inline-level element on a new line with an appropriate indent.

This method has been deprecated as of version 2.4 and replaced with the Source.getSourceFormatter() method.
Parameters:
  indentString - the string to use for indentation.
Parameters:
  tidyTags - specifies whether to replace the original text of each tag with the output from its Tag.tidy method.
Parameters:
  collapseWhiteSpace - specifies whether to collapse the white space in the text between the tags.
Parameters:
  indentAllElements - specifies whether to indent all elements, including inline-level elements and those with preformatted contents.

public boolean isLoggingEnabled()
     Indicates whether logging is currently enabled.
public boolean isXML()
     Indicates whether the source document is likely to be XML.
public void log(String message)
     Writes the specified message to the log.
static Logger newLogger()

public Attributes parseAttributes(int pos, int maxEnd)
     Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes method should be used in normal situations.

The returned Attributes segment always begins at pos, and ends at the end of the last attribute before either maxEnd or the first occurrence of "/>" or ">" outside of a quoted attribute value, whichever comes first.

Only returns null if the segment contains a major syntactical error or more than the default maximum number of minor syntactical errors.

This is equivalent to parseAttributes(pos, maxEnd, Attributes.getDefaultMaxErrorCount()).
Parameters:
  pos - the position in the source document at the beginning of the attribute list, may be out of bounds.
Parameters:
  maxEnd - the maximum end position of the attribute list, or -1 if no maximum.
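The boundary rule described for parseAttributes (stop at maxEnd or at the first "/>" or ">" that is not inside a quoted attribute value) can be sketched without the library. The class AttributeBoundarySketch and its method findBoundary are hypothetical names, and the sketch only locates the boundary; it does not parse attributes or trim back to the end of the last attribute as the real method is documented to do.

```java
// Hypothetical helper, not part of the Jericho API.
final class AttributeBoundarySketch {
    /**
     * Returns the index at which an attribute list starting at pos must stop:
     * the first unquoted ">" or "/>", or maxEnd, whichever comes first.
     * A maxEnd of -1 means there is no maximum.
     */
    static int findBoundary(CharSequence text, int pos, int maxEnd) {
        int limit = (maxEnd == -1 || maxEnd > text.length()) ? text.length() : maxEnd;
        char openQuote = 0; // 0 means "not inside a quoted value"
        for (int i = pos; i < limit; i++) {
            char c = text.charAt(i);
            if (openQuote != 0) {
                if (c == openQuote) {
                    openQuote = 0; // closing quote found
                }
                continue;
            }
            if (c == '"' || c == '\'') {
                openQuote = c;
            } else if (c == '>') {
                return i;
            } else if (c == '/' && i + 1 < text.length() && text.charAt(i + 1) == '>') {
                return i;
            }
        }
        return limit;
    }
}
```

Tracking which quote character opened the value matters because a double-quoted value may legitimately contain a single quote, and vice versa.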

public Attributes parseAttributes(int pos, int maxEnd, int maxErrorCount)
     Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes method should be used in normal situations.

Only returns null if the segment contains a major syntactical error or more than the specified number of minor syntactical errors.

The maxErrorCount argument overrides the default maximum error count.

See Source.parseAttributes(int pos,int maxEnd) for more information.
Parameters:
  pos - the position in the source document at the beginning of the attribute list, may be out of bounds.
Parameters:
  maxEnd - the maximum end position of the attribute list, or -1 if no maximum.
Parameters:
  maxErrorCount - the maximum number of minor errors allowed while parsing.

public void setLogWriter(Writer writer)
     Sets the logger to an implementation that sends all output to a specified Writer.
public void setLogger(Logger logger)
     Sets the Logger that handles log messages.
public String toString()
     Returns the source text as a String.

Field Detail
PACKAGE_NAME
final static String PACKAGE_NAME

allStartTags
List allStartTags

allTags
List allTags

allTagsArray
Tag[] allTagsArray

cache
final Cache cache

logger
Logger logger

string
final String string

useAllTypesCache
boolean useAllTypesCache

useSpecialTypesCache
boolean useSpecialTypesCache




Constructor Detail
Source
public Source(CharSequence text)
Constructs a new Source object from the specified text.
Parameters:
  text - the source text.



Source
Source(Reader reader, String encoding) throws IOException



Source
public Source(Reader reader) throws IOException
Constructs a new Source object by loading the content from the specified Reader.

If the specified reader is an instance of InputStreamReader, the Source.getEncoding() method of the created source object returns the encoding from InputStreamReader.getEncoding().
Parameters:
  reader - the java.io.Reader from which to load the source text.
throws:
  java.io.IOException - if an I/O error occurs.




Source
public Source(InputStream inputStream) throws IOException
Constructs a new Source object by loading the content from the specified InputStream.

The algorithm for detecting the character encoding of the source document from the raw bytes of the specified input stream is the same as that for the Source(URL) constructor, except that the first step is not possible as there is no Content-Type header to check.
Parameters:
  inputStream - the java.io.InputStream from which to load the source text.
throws:
  java.io.IOException - if an I/O error occurs.
See Also:   Source.getEncoding()




Source
public Source(URL url) throws IOException
Constructs a new Source object by loading the content from the specified URL.

The algorithm for detecting the character encoding of the source document is as follows:
(process termination is marked by ♦)

  1. If the HTTP headers received when opening a connection to the URL include a Content-Type header specifying a charset parameter, then use the encoding specified in the value of the charset parameter. ♦
  2. Read the first four bytes of the input stream.
  3. If the input stream is empty, the created source document has zero length and its Source.getEncoding() method returns null. ♦
  4. If the input stream starts with a Unicode Byte Order Mark (BOM), then use the encoding signified by the BOM. ♦
    BOM Bytes      Encoding
    EF BB BF       UTF-8
    FF FE 00 00    UTF-32 (little-endian)
    00 00 FE FF    UTF-32 (big-endian)
    FF FE          UTF-16 (little-endian)
    FE FF          UTF-16 (big-endian)
    0E FE FF       SCSU
    2B 2F 76       UTF-7
    DD 73 66 73    UTF-EBCDIC
    FB EE 28       BOCU-1
  5. If the stream contains fewer than four bytes, then:
    1. If the stream contains either one or three bytes, then use the encoding ISO-8859-1. ♦
    2. If the stream starts with a zero byte, then use the encoding UTF-16BE. ♦
    3. If the second byte of the stream is zero, then use the encoding UTF-16LE. ♦
    4. Otherwise use the encoding ISO-8859-1. ♦
  6. Determine a preliminary encoding by examining the first four bytes of the input stream. See the Source.getPreliminaryEncodingInfo() method for details.
  7. Read the first 2048 bytes of the input stream and decode it using the preliminary encoding to create a "preview segment". If the detected preliminary encoding is not supported on this platform, create the preview segment using ISO-8859-1 instead (this incident is logged).
  8. Search the preview segment for an encoding specification, which should always appear at or near the top of the document.
  9. If an encoding specification is found:
    1. If the specified encoding is supported on this platform, use it. ♦
    2. If the specified encoding is not supported on this platform, use the encoding that was used to create the preview segment, which is normally the detected preliminary encoding. ♦
  10. If the document looks like XML, then use UTF-8. ♦
    Section 4.3.3 of the XML 1.0 specification states that an XML file that is not encoded in UTF-8 must contain either a UTF-16 BOM or an encoding declaration in its XML declaration. Since neither of these was detected, we can assume the encoding is UTF-8.
  11. Use the encoding that was used to create the preview segment, which is normally the detected preliminary encoding. ♦
    This is the best guess, in the absence of any explicit information about the encoding, based on the first four bytes of the stream. The HTTP protocol section 3.7.1 states that an encoding of ISO-8859-1 can be assumed if no charset parameter was included in the HTTP Content-Type header. This is consistent with the preliminary encoding detected in this scenario.

Parameters:
  url - the URL from which to load the source text.
throws:
  java.io.IOException - if an I/O error occurs.
See Also:   Source.getEncoding()




Method Detail
clearCache
public void clearCache()
Clears the tag cache of all tags.

This method may be useful after calling the Segment.ignoreWhenParsing method so that any tags previously found within the ignored segments will no longer be returned by the tag search methods.




findAllElements
public List findAllElements()
Returns a list of all elements in this source document.

Calling this method on the Source object performs a full sequential parse automatically.

The elements returned correspond exactly with the start tags returned in the Source.findAllStartTags() method.
Returns:
  a list of all elements in this source document.




findAllStartTags
public List findAllStartTags()
Returns a list of all start tags in this source document.

Calling this method on the Source object performs a full sequential parse automatically.

See the Tag class documentation for more details about the behaviour of this method.
Returns:
  a list of all start tags in this source document.




findAllTags
public List findAllTags()
Returns a list of all tags in this source document.

Calling this method on the Source object performs a full sequential parse automatically.

See the Tag class documentation for more details about the behaviour of this method.
Returns:
  a list of all tags in this source document.




findEnclosingElement
public Element findEnclosingElement(int pos)
Returns the most nested Element that encloses the specified position in the source document.

The specified position can be anywhere inside the start tag, end tag, or content of the element. There is no requirement that the returned element has an end tag, and it may be a server tag or HTML comment.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document, may be out of bounds.
Returns:
  the most nested Element that encloses the specified position in the source document, or null if the position is not within an element or is out of bounds.




findEnclosingElement
public Element findEnclosingElement(int pos, String name)
Returns the most nested Element with the specified name that encloses the specified position in the source document.

The specified position can be anywhere inside the start tag, end tag, or content of the element. There is no requirement that the returned element has an end tag, and it may be a server tag or HTML comment.

See the Tag class documentation for more details about the behaviour of this method.

This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
Parameters:
  pos - the position in the source document, may be out of bounds.
Parameters:
  name - the name of the element to search for.
Returns:
  the most nested Element with the specified name that encloses the specified position in the source document, or null if none exists or the specified position is out of bounds.




findEnclosingTag
public Tag findEnclosingTag(int pos)
Returns the Tag that encloses the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document, may be out of bounds.
Returns:
  the Tag that encloses the specified position in the source document, or null if the position is not within a tag or is out of bounds.




findEnclosingTag
public Tag findEnclosingTag(int pos, TagType tagType)
Returns the Tag of the specified type that encloses the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document, may be out of bounds.
Parameters:
  tagType - the TagType to search for.
Returns:
  the Tag of the specified type that encloses the specified position in the source document, or null if the position is not within a tag of the specified type or is out of bounds.




findNameEnd
public int findNameEnd(int pos)
Returns the end position of the XML Name that starts at the specified position.

This implementation first checks that the character at the specified position is a valid XML Name start character as defined by the Tag.isXMLNameStartChar(char) method. If this is not the case, the value -1 is returned.

Once the first character has been checked, subsequent characters are checked using the Tag.isXMLNameChar(char) method until one is found that is not a valid XML Name character or the end of the document is reached. This position is then returned.
Parameters:
  pos - the position in the source document of the first character of the XML Name.
Returns:
  the end position of the XML Name that starts at the specified position.
throws:
  IndexOutOfBoundsException - if the specified position is not within the bounds of the document.
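
The scanning procedure described for findNameEnd can be sketched in plain Java. This is a minimal illustration rather than the library's implementation: the class XmlNameScanSketch is hypothetical, and its two predicates are simplified stand-ins for Tag.isXMLNameStartChar(char) and Tag.isXMLNameChar(char) that recognise only an ASCII subset of XML Name characters.

```java
// Hypothetical sketch, not the library's code.
class XmlNameScanSketch {
    // Simplified stand-in for Tag.isXMLNameStartChar(char): ASCII letters, '_' and ':' only.
    static boolean isNameStartChar(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || ch == '_' || ch == ':';
    }

    // Simplified stand-in for Tag.isXMLNameChar(char): also accepts digits, '-' and '.'.
    static boolean isNameChar(char ch) {
        return isNameStartChar(ch) || (ch >= '0' && ch <= '9') || ch == '-' || ch == '.';
    }

    static int findNameEnd(CharSequence text, int pos) {
        // charAt throws IndexOutOfBoundsException when pos is outside the text
        if (!isNameStartChar(text.charAt(pos))) return -1;
        int end = pos + 1;
        // advance until a non-name character or the end of the text is reached
        while (end < text.length() && isNameChar(text.charAt(end))) end++;
        return end;
    }

    public static void main(String[] args) {
        String text = "<xsl:template match='/'>";
        System.out.println(findNameEnd(text, 1)); // position just after "xsl:template"
    }
}
```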




findNextCharacterReference
public CharacterReference findNextCharacterReference(int pos)
Returns the CharacterReference beginning at or immediately following the specified position in the source document.

Character references positioned within an HTML comment are NOT ignored.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
  the CharacterReference beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findNextElement
public Element findNextElement(int pos)
Returns the Element beginning at or immediately following the specified position in the source document.

This is equivalent to findNextStartTag(pos).getElement(), assuming the result is not null.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
  the Element beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findNextElement
public Element findNextElement(int pos, String name)
Returns the Element with the specified name beginning at or immediately following the specified position in the source document.

This is equivalent to findNextStartTag(pos,name).getElement(), assuming the result is not null.

Specifying a null argument to the name parameter is equivalent to findNextElement(pos).

Specifying an argument to the name parameter that ends in a colon (:) searches for all elements in the specified XML namespace.

This method also returns elements consisting of unregistered tags if the specified name is not a valid XML tag name.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the element to search for.
Returns:
  the Element with the specified name beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findNextElement
public Element findNextElement(int pos, String attributeName, String value, boolean valueCaseSensitive)
Returns the Element with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.

This is equivalent to findNextStartTag(pos,attributeName,value,valueCaseSensitive).getElement(), assuming the result is not null.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  attributeName - the attribute name (case insensitive) to search for, must not be null.
Parameters:
  value - the value of the specified attribute to search for, must not be null.
Parameters:
  valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
  the Element with the specified attribute name/value pair beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findNextEndTag
public EndTag findNextEndTag(int pos)
Returns the EndTag beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
  the EndTag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findNextEndTag
public EndTag findNextEndTag(int pos, String name)
Returns the EndTag with the specified name beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the end tag to search for, must not be null.
Returns:
  the EndTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findNextEndTag
public EndTag findNextEndTag(int pos, String name, EndTagType endTagType)
Returns the EndTag with the specified name and type beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the end tag to search for, must not be null.
Parameters:
  endTagType - the type of the end tag to search for, must not be null.
Returns:
  the EndTag with the specified name and type beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findNextStartTag
public StartTag findNextStartTag(int pos)
Returns the StartTag beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
  the StartTag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findNextStartTag
public StartTag findNextStartTag(int pos, String name)
Returns the StartTag with the specified name beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Specifying a null argument to the name parameter is equivalent to findNextStartTag(pos).

Specifying an argument to the name parameter that ends in a colon (:) searches for all start tags in the specified XML namespace.

This method also returns unregistered tags if the specified name is not a valid XML tag name.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the start tag to search for.
Returns:
  the StartTag with the specified name beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findNextStartTag
public StartTag findNextStartTag(int pos, String attributeName, String value, boolean valueCaseSensitive)
Returns the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  attributeName - the attribute name (case insensitive) to search for, must not be null.
Parameters:
  value - the value of the specified attribute to search for, must not be null.
Parameters:
  valueCaseSensitive - specifies whether the attribute value matching is case sensitive.
Returns:
  the StartTag with the specified attribute name/value pair beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findNextTag
public Tag findNextTag(int pos)
Returns the Tag beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Use Tag.findNextTag to find the tag immediately following another tag.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
  the Tag beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findNextTag
public Tag findNextTag(int pos, TagType tagType)
Returns the Tag of the specified type beginning at or immediately following the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  tagType - the TagType to search for.
Returns:
  the Tag of the specified type beginning at or immediately following the specified position in the source document, or null if none exists or the specified position is out of bounds.




findPreviousCharacterReference
public CharacterReference findPreviousCharacterReference(int pos)
Returns the CharacterReference at or immediately preceding (or enclosing) the specified position in the source document.

Character references positioned within an HTML comment are NOT ignored.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
  the CharacterReference beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.




findPreviousEndTag
public EndTag findPreviousEndTag(int pos)
Returns the EndTag beginning at or immediately preceding the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
  the EndTag beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.




findPreviousEndTag
public EndTag findPreviousEndTag(int pos, String name)
Returns the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the end tag to search for, must not be null.
Returns:
  the EndTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document, or null if none exists or the specified position is out of bounds.




findPreviousStartTag
public StartTag findPreviousStartTag(int pos)
Returns the StartTag at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
  the StartTag at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.




findPreviousStartTag
public StartTag findPreviousStartTag(int pos, String name)
Returns the StartTag with the specified name at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

Specifying a null argument to the name parameter is equivalent to findPreviousStartTag(pos).

This method also returns unregistered tags if the specified name is not a valid XML tag name.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  name - the name of the start tag to search for.
Returns:
  the StartTag with the specified name at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.




findPreviousTag
public Tag findPreviousTag(int pos)
Returns the Tag beginning at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Returns:
  the Tag beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.




findPreviousTag
public Tag findPreviousTag(int pos, TagType tagType)
Returns the Tag of the specified type beginning at or immediately preceding (or enclosing) the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.
Parameters:
  pos - the position in the source document from which to start the search, may be out of bounds.
Parameters:
  tagType - the TagType to search for.
Returns:
  the Tag of the specified type beginning at or immediately preceding the specified position in the source document, or null if none exists or the specified position is out of bounds.




fullSequentialParse
public Tag[] fullSequentialParse()
Parses all of the tags in this source document sequentially from beginning to end.

Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed.

Calling the Source.findAllTags() , Source.findAllStartTags() , Source.findAllElements() or Source.getChildElements() method on the Source object performs a full sequential parse automatically. There are however still circumstances where it should be called manually, such as when it is known that most or all of the tags in the document will need to be parsed, but none of the abovementioned methods are used, or are called only after calling one or more other tag search methods.

If this method is called manually, it should be called soon after the Source object is created, before any tag search methods are called.

By default, tags are parsed only as needed, which is referred to as parse on demand mode. In this mode, every call to a tag search method that is not returning previously cached tags must perform a relatively complex check to determine whether a potential tag is in a valid position.

Generally speaking, a tag is in a valid position if it does not appear inside any other tag. Server tags can appear anywhere in a document, including inside other tags, so this relates only to non-server tags. Theoretically, checking whether a specified position in the document is enclosed in another tag is only possible if every preceding tag has been parsed, otherwise it is impossible to tell whether one of the delimiters of the enclosing tag was in fact enclosed by some other tag before it, thereby invalidating it.

When this method is called, each tag is parsed in sequence starting from the beginning of the document, making it easy to check whether each potential tag is in a valid position. In parse on demand mode a compromise technique must be used for this check, since the theoretical requirement of having parsed all preceding tags is no longer practical. This compromise involves only checking whether the position is enclosed by other tags with . The added complexity of this technique makes parsing each tag slower compared to when a full sequential parse is performed, but when only a few tags need parsing this is an extremely beneficial trade-off.

The documentation of the TagType#isValidPosition(Source, int pos, int[] fullSequentialParseData) method, which is called internally by the parser to perform the valid position check, includes a more detailed explanation of the differences between the two modes of operation.

Calling this method a second or subsequent time has no effect.

This method returns the same list of tags as the Source.findAllTags() method, but as an array instead of a list.

If this method is called after any of the tag search methods are called, the tag cache is cleared of any previously found tags before being restocked via the full sequential parse. This is significant if the Segment.ignoreWhenParsing method has been called since the tags were first found, as any tags inside the ignored segments will no longer be returned by any of the tag search methods.

See also the Tag class documentation for more general details about how tags are parsed.
Returns:
  an array of all tags in this source document.




getBestGuessNewLine
String getBestGuessNewLine()



getCacheDebugInfo
public String getCacheDebugInfo()
Returns a string representation of the tag cache, useful for debugging purposes.
Returns:
  a string representation of the tag cache, useful for debugging purposes.



getCharsetParameterFromHttpHeaderValue
static String getCharsetParameterFromHttpHeaderValue(String httpHeaderValue)



getChildElements
public List getChildElements()
Returns a list of the top-level elements in the document element hierarchy.

The objects in the list are all of type Element .

The term top-level element refers to an element that is not nested within any other element in the document.

The term document element hierarchy refers to the hierarchy of elements that make up this source document. The source document itself is not considered to be part of the hierarchy, meaning there is typically more than one top-level element. Even when the source represents an entire HTML document, the document type declaration and/or an XML declaration often exist as top-level elements along with the HTML element itself.

The Element.getChildElements method can be used to get the children of the top-level elements, with recursive use providing a means to visit every element in the document hierarchy.

The document element hierarchy differs from that of the Document Object Model in that it is only a representation of the elements that are physically present in the source text. Unlike the DOM, it does not include any "implied" HTML elements such as TBODY if they are not present in the source text.

Elements formed from server tags are not included in the hierarchy at all.

Structural errors in this source document such as overlapping elements are reported in the log. When elements are found to overlap, the position of the start tag determines the location of the element in the hierarchy.

Calling this method on the Source object performs a full sequential parse automatically.

A visual representation of the document element hierarchy can be obtained by calling:
  getSourceFormatter().setIndentAllElements(true).setCollapseWhiteSpace(true).setTidyTags(true).toString()
Returns:
  a list of the top-level elements in the document element hierarchy, guaranteed not null.
See Also:   Element.getParentElement
See Also:   Element.getChildElements
See Also:   Element.getDepth




getColumn
public int getColumn(int pos)
Returns the column number of the specified character position in the source document.
Parameters:
  pos - the position in the source document.
Returns:
  the column number of the specified character position in the source document.
throws:
  IndexOutOfBoundsException - if the specified position is not within the bounds of the document.
See Also:   Source.getRow(int pos)
See Also:   Source.getRowColumnVector(int pos)



getDocumentSpecifiedEncoding
public String getDocumentSpecifiedEncoding()
Returns the document encoding specified within the text of the document.

The document encoding can be specified within the document text in two ways. They are referred to generically in this library as an encoding specification, and are listed below in order of precedence:

  1. An encoding declaration within the XML declaration of an XML document, which must be present if it has an encoding other than UTF-8 or UTF-16.
    <?xml version="1.0" encoding="ISO-8859-1" ?>
  2. A META declaration, which is in the form of a META tag with attribute http-equiv="Content-Type". The encoding is specified in the charset parameter of a Content-Type HTTP header value, which is placed in the value of the meta tag's content attribute. This META declaration should appear as early as possible in the HEAD element.
    <META http-equiv=Content-Type content="text/html; charset=iso-8859-1">

Both of these tags must only use characters in the range U+0000 to U+007F, and in the case of the META declaration must use ASCII encoding. This, along with the fact that they must occur at or near the beginning of the document, assists in their detection and decoding without the need to know the exact encoding of the full text.
Returns:
  the document encoding specified within the text of the document, or null if no encoding is specified.
See Also:   Source.getEncoding()
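
The two forms of encoding specification, and their order of precedence, can be illustrated with a small regular-expression sketch. This is an assumption-laden illustration and not how the library parses documents: the class EncodingSpecificationSketch, its patterns and the method findSpecifiedEncoding are hypothetical, and the META pattern only handles the common case where http-equiv appears before content.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's code.
class EncodingSpecificationSketch {
    // Encoding declaration inside an XML declaration at the very start of the text.
    private static final Pattern XML_DECLARATION = Pattern.compile(
        "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // charset parameter in a META tag whose http-equiv attribute is Content-Type.
    private static final Pattern META_DECLARATION = Pattern.compile(
        "<meta\\s[^>]*?http-equiv\\s*=\\s*[\"']?Content-Type[\"']?[^>]*?charset\\s*=\\s*([^\\s\"';>]+)",
        Pattern.CASE_INSENSITIVE);

    static String findSpecifiedEncoding(String text) {
        Matcher xml = XML_DECLARATION.matcher(text);
        if (xml.find()) return xml.group(1);    // the XML declaration takes precedence
        Matcher meta = META_DECLARATION.matcher(text);
        if (meta.find()) return meta.group(1);
        return null;                            // no encoding specification found
    }

    public static void main(String[] args) {
        System.out.println(findSpecifiedEncoding(
            "<META http-equiv=Content-Type content=\"text/html; charset=iso-8859-1\">"));
    }
}
```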




getElementById
public Element getElementById(String id)
Returns the Element with the specified id attribute value.

This simulates the script method getElementById defined in DOM HTML level 1.

This is equivalent to findNextStartTag(0,"id",id,true).getElement(), assuming that the element exists.

A well formed HTML document should have no more than one element with any given id attribute value.
Parameters:
  id - the id attribute value (case sensitive) to search for, must not be null.
Returns:
  the Element with the specified id attribute value, or null if no such element exists.




getEncoding
public String getEncoding()
Returns the character encoding scheme of the source byte stream used to create this object.

The encoding of a document defines how the original byte stream was encoded into characters. The HTTP specification section 3.4 uses the term "character set" to refer to the encoding, and the term "charset" is similarly used in Java (see the class java.nio.charset.Charset). This often causes confusion, as a modern "coded character set" such as Unicode can have several encodings, such as UTF-8, UTF-16, and UTF-32. See the Wikipedia character encoding article for an explanation of the terminology.
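
To make the distinction between a coded character set and its encodings concrete, the sketch below encodes a single Unicode character in three different encodings using only the standard Java Charset API. The class EncodingVersusCharacterSetSketch and the helper hexOf are hypothetical names introduced for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical sketch: one character of the Unicode coded character set,
// three different byte sequences depending on the encoding.
class EncodingVersusCharacterSetSketch {
    // Encodes the text with the named charset and renders the bytes as hex pairs.
    static String hexOf(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) hex.append(' ');
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // EURO SIGN, a single Unicode code point
        System.out.println("UTF-8:    " + hexOf(euro, "UTF-8"));    // E2 82 AC
        System.out.println("UTF-16BE: " + hexOf(euro, "UTF-16BE")); // 20 AC
        System.out.println("UTF-16LE: " + hexOf(euro, "UTF-16LE")); // AC 20
    }
}
```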

This method makes the best possible effort to return the name of the encoding used to decode the original source byte stream into character data. This decoding takes place in the constructor when a parameter based on a byte stream such as an InputStream or URL is used to specify the source text. The documentation of the Source.Source(InputStream) and Source.Source(URL) constructors describes how the return value of this method is determined in these cases. It is also possible in some circumstances for the encoding to be determined in the Source.Source(Reader) constructor.

If a constructor was used that specifies the source text directly in character form (not requiring the decoding of a byte sequence) then the document itself is searched for an encoding specification. In this case, this method returns the same value as the Source.getDocumentSpecifiedEncoding() method.

The Source.getEncodingSpecificationInfo() method returns a simple description of how the value of this method was determined.
Returns:
  the character encoding scheme of the source byte stream used to create this object, or null if the encoding is not known.
See Also:   Source.getEncodingSpecificationInfo()




getEncodingSpecificationInfo
public String getEncodingSpecificationInfo()
Returns a concise description of how the encoding of the source document was determined.

The description is intended for informational purposes only. It is not guaranteed to have any particular format and cannot be reliably parsed.
Returns:
  a concise description of how the encoding of the source document was determined.
See Also:   Source.getEncoding()




getLogWriter
public Writer getLogWriter()
Returns the destination Writer for log messages.

This method has been deprecated as of version 2.4 in favour of the more generic Source.getLogger() method.

Returns null if the logger is not an instance of WriterLogger.
Returns:
  the destination Writer for log messages, or null if the logger is not an instance of WriterLogger.
See Also:   WriterLogger
See Also:   Source.getLogger()
See Also:   WriterLogger.getWriter()




getLogger
public Logger getLogger()
Returns the Logger that handles log messages.

A logger instance is created automatically for each Source object using the LoggerProvider specified by the static Config.LoggerProvider property. This can be overridden by calling the Source.setLogger(Logger) method. The name used for all automatically created logger instances is "net.htmlparser.jericho".
Returns:
  the Logger that handles log messages, or null if logging is disabled.




getNewLine
public String getNewLine()
Returns the newline character sequence used in the source document.

If the document does not contain any newline characters, this method returns null.

The three possible return values (aside from null) are "\n", "\r\n" and "\r".
Returns:
  the newline character sequence used in the source document, or null if none is present.
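
One way to arrive at these return values is sketched below. It assumes, purely for illustration, that the first newline sequence found in the text decides the result; the documentation above does not state how the library chooses. The class NewLineDetectionSketch and the method detectNewLine are hypothetical.

```java
// Hypothetical sketch, not the library's code.
class NewLineDetectionSketch {
    // Returns "\n", "\r\n" or "\r" based on the first newline sequence in the text,
    // or null if the text contains no newline characters (assumed rule).
    static String detectNewLine(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '\n') return "\n";
            if (ch == '\r') {
                boolean followedByLineFeed = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return followedByLineFeed ? "\r\n" : "\r";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // prints true for a document that uses Windows line endings
        System.out.println("\r\n".equals(detectNewLine("<html>\r\n<body>\r\n")));
    }
}
```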




getParseText
final public ParseText getParseText()
Returns the parse text of this source document.

This method is normally only of interest to users who wish to create custom tag types.

The parse text is defined as the entire text of the source document in lower case, with all ignored segments replaced by space characters.
Returns:
  the parse text of this source document.
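
The definition of the parse text can be illustrated with a short sketch. It is not the library's ParseText class: ParseTextSketch, toParseText and the representation of ignored segments as {begin, end} pairs (end exclusive) are assumptions made for this example.

```java
// Hypothetical sketch, not the library's code.
class ParseTextSketch {
    // ignored holds hypothetical {begin, end} pairs, with end exclusive.
    static String toParseText(String source, int[][] ignored) {
        char[] chars = source.toCharArray();
        // lower-case one char at a time so every position stays aligned with the source
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toLowerCase(chars[i]);
        }
        // blank out each ignored segment with space characters
        for (int[] segment : ignored) {
            for (int i = segment[0]; i < segment[1]; i++) {
                chars[i] = ' ';
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String source = "<P>Hello <%= X %></P>";
        // the server tag occupies positions 9 to 16 inclusive
        System.out.println("[" + toParseText(source, new int[][] {{9, 17}}) + "]");
    }
}
```

Keeping the parse text the same length as the source means a position found in one can be used directly in the other.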




getParsedTags
List getParsedTags()
Gets a list of all the tags that have been parsed so far.

This information may be useful for debugging purposes. Execution of this method collects information from the internal cache and is relatively expensive.
Returns:
  a list of all the tags that have been parsed so far.
See Also:   Source.getCacheDebugInfo()




getPreliminaryEncodingInfo
public String getPreliminaryEncodingInfo()
Returns the preliminary encoding of the source document together with a concise description of how it was determined.

It is sometimes necessary for the Source.Source(InputStream) and Source.Source(URL) constructors to search the document for an encoding specification in order to determine the exact encoding of the source byte stream.

In order to search for the encoding specification before the exact encoding is known, a preliminary encoding is determined using the first four bytes of the input stream.

Because the encoding specification must only use characters in the range U+0000 to U+007F, the preliminary encoding need only have the following basic properties determined:

  • Code unit size (8-bit, 16-bit or 32-bit)
  • Byte order (big-endian or little-endian) if the code unit size is 16-bit or 32-bit
  • Basic encoding of characters in the range U+0000 to U+007F (current implementation only distinguishes between ASCII and EBCDIC)

The encodings used to represent the most commonly encountered combinations of these basic properties are:

  • ISO-8859-1: 8-bit ASCII-compatible encoding
  • Cp037: 8-bit EBCDIC-compatible encoding
  • UTF-16BE: 16-bit big-endian encoding
  • UTF-16LE: 16-bit little-endian encoding
  • UTF-32BE: 32-bit big-endian encoding (not supported on most Java platforms)
  • UTF-32LE: 32-bit little-endian encoding (not supported on most Java platforms)
Note: all encodings with a code unit size greater than 8 bits are assumed to use an ASCII-compatible low-order byte.

In some descriptions returned by this method, and the documentation below, a pattern is used to help demonstrate the contents of the first four bytes of the stream. The patterns use the characters "00" to signify a zero byte, "XX" to signify a non-zero byte, and "??" to signify a byte that can be either zero or non-zero.

The algorithm for determining the preliminary encoding is as follows:

  1. Byte pattern "00 00..." : If the stream starts with two zero bytes, the default 32-bit big-endian encoding UTF-32BE is used.
  2. Byte pattern "00 XX..." : If the stream starts with a single zero byte, the default 16-bit big-endian encoding UTF-16BE is used.
  3. Byte pattern "XX ?? 00 00..." : If the third and fourth bytes of the stream are zero, the default 32-bit little-endian encoding UTF-32LE is used.
  4. Byte pattern "XX 00..." or "XX ?? XX 00..." : If the second or fourth byte of the stream is zero, the default 16-bit little-endian encoding UTF-16LE is used.
  5. Byte pattern "XX XX 00 XX..." : If the third byte of the stream is zero, the default 16-bit big-endian encoding UTF-16BE is used (assumes the first character is > U+00FF).
  6. Byte pattern "4C XX XX XX..." : If the first four bytes are consistent with the EBCDIC encoding of an XML declaration ("<?xm") or a document type declaration ("<!DO"), or any other string starting with the EBCDIC character '<' followed by three non-ASCII characters (8th bit set), which is consistent with EBCDIC alphanumeric characters, the default EBCDIC-compatible encoding Cp037 is used.
  7. Byte pattern "XX XX XX XX..." : Otherwise, if all of the first four bytes of the stream are non-zero, the default 8-bit ASCII-compatible encoding ISO-8859-1 is used.
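
The seven steps above can be sketched in self-contained Java as follows. This is only an illustration of the documented rules, not the library's own code: the class and method names are hypothetical, the sketch assumes at least four bytes are available, and the EBCDIC byte values are the Cp037 codes for "<?xm" (4C 6F A7 94) and "<!DO" (4C 5A C4 D6).

```java
/** Hypothetical sketch of the preliminary encoding rules listed above. */
final class PreliminaryEncodingSketch {

    /** Guesses a preliminary encoding name from the first four bytes of a stream. */
    static String guess(byte[] first) {
        if (first.length < 4) {
            // Shorter inputs are not covered by the documented rules.
            throw new IllegalArgumentException("need at least four bytes");
        }
        int b0 = first[0] & 0xFF, b1 = first[1] & 0xFF;
        int b2 = first[2] & 0xFF, b3 = first[3] & 0xFF;

        if (b0 == 0 && b1 == 0) return "UTF-32BE";   // 1. "00 00..."
        if (b0 == 0) return "UTF-16BE";              // 2. "00 XX..."
        if (b2 == 0 && b3 == 0) return "UTF-32LE";   // 3. "XX ?? 00 00..."
        if (b1 == 0 || b3 == 0) return "UTF-16LE";   // 4. "XX 00..." or "XX ?? XX 00..."
        if (b2 == 0) return "UTF-16BE";              // 5. "XX XX 00 XX..."
        if (b0 == 0x4C) {                            // 6. EBCDIC '<'
            boolean xmlDeclaration = b1 == 0x6F && b2 == 0xA7 && b3 == 0x94; // "<?xm"
            boolean doctype = b1 == 0x5A && b2 == 0xC4 && b3 == 0xD6;        // "<!DO"
            boolean allHighBitSet = (b1 & b2 & b3 & 0x80) != 0;
            if (xmlDeclaration || doctype || allHighBitSet) return "Cp037";
        }
        return "ISO-8859-1";                         // 7. "XX XX XX XX..."
    }

    public static void main(String[] args) {
        byte[] utf16le = {0x3C, 0x00, 0x3F, 0x00};   // "<?" encoded as UTF-16LE
        System.out.println(guess(utf16le));          // prints UTF-16LE
    }
}
```

Note that an ASCII document starting with the letter 'L' (also byte 4C) falls through to step 7, because its following bytes match neither EBCDIC pattern.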

If it was not necessary to search for an encoding specification when determining the encoding of this source document from a byte stream, this method returns null.

See the documentation of the Source.Source(InputStream) and Source.Source(URL) constructors for more detailed information about when the detection of a preliminary encoding is required.

The description returned by this method is intended for informational purposes only. It is not guaranteed to have any particular format and cannot be reliably parsed.
Returns:
  the preliminary encoding of the source document together with a concise description of how it was determined, or null if no preliminary encoding was required.
See Also:   Source.getEncoding()




getRow
public int getRow(int pos)(Code)
Returns the row number of the specified character position in the source document.
Parameters:
  pos - the position in the source document.
Returns:
  the row number of the specified character position in the source document.
throws:
  IndexOutOfBoundsException - if the specified position is not within the bounds of the document.
See Also:   Source.getColumn(int pos)
See Also:   Source.getRowColumnVector(int pos)



getRowColumnVector
public RowColumnVector getRowColumnVector(int pos)(Code)
Returns a RowColumnVector object representing the row and column number of the specified character position in the source document.
Parameters:
  pos - the position in the source document.
Returns:
  a RowColumnVector object representing the row and column number of the specified character position in the source document.
throws:
  IndexOutOfBoundsException - if the specified position is not within the bounds of the document.
See Also:   Source.getRow(int pos)
See Also:   Source.getColumn(int pos)
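
As a rough illustration of what getRow and getRowColumnVector compute, the following self-contained sketch maps a character position to a row and column. It does not use the library. The class and method names are hypothetical, and three details are assumptions not stated in this section: rows and columns are counted from 1, only '\n' ends a row, and the position just past the last character is accepted.

```java
/** Hypothetical sketch of mapping a character position to a row and column. */
final class RowColumnSketch {

    /** Returns {row, column} for pos, both counted from 1 (an assumption of this sketch). */
    static int[] rowColumn(CharSequence text, int pos) {
        if (pos < 0 || pos > text.length()) {
            throw new IndexOutOfBoundsException("position " + pos + " is outside the document");
        }
        int row = 1;
        int rowStart = 0;                       // index of the first character of the current row
        for (int i = 0; i < pos; i++) {
            if (text.charAt(i) == '\n') {       // only '\n' ends a row in this sketch
                row++;
                rowStart = i + 1;
            }
        }
        return new int[] {row, pos - rowStart + 1};
    }

    public static void main(String[] args) {
        int[] rc = rowColumn("ab\ncd", 3);      // position of 'c'
        System.out.println(rc[0] + ":" + rc[1]); // prints 2:1
    }
}
```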



getSourceFormatter
public SourceFormatter getSourceFormatter()(Code)
Formats the HTML source by laying out each non-inline-level element on a new line with an appropriate indent.

The output format can be configured by setting any number of properties on the returned SourceFormatter instance before it is used to produce output.

A SourceFormatter instance can also be based on a Segment rather than an entire Source document.
Returns:
  an instance of SourceFormatter based on this source document.




getTagAt
final public Tag getTagAt(int pos)(Code)
Returns the Tag at the specified position in the source document.

See the Tag class documentation for more details about the behaviour of this method.

This method also returns tags.
Parameters:
  pos - the position in the source document, may be out of bounds.
Returns:
  the Tag at the specified position in the source document, or null if no tag exists at the specified position or it is out of bounds.




ignoreWhenParsing
public void ignoreWhenParsing(int begin, int end)(Code)
Causes the specified range of the source text to be ignored when parsing.

See the documentation of the Segment.ignoreWhenParsing method for more information.
Parameters:
  begin - the beginning character position in the source text.
Parameters:
  end - the end character position in the source text.




ignoreWhenParsing
public void ignoreWhenParsing(Collection segments)(Code)
Causes all of the segments in the specified collection to be ignored when parsing.

This is equivalent to calling Segment.ignoreWhenParsing on each segment in the collection.




indent
public CharStreamSource indent(String indentString, boolean tidyTags, boolean collapseWhiteSpace, boolean indentAllElements)(Code)
Formats the HTML source by laying out each non-inline-level element on a new line with an appropriate indent.

This method has been deprecated as of version 2.4 and replaced with the Source.getSourceFormatter() method.
Parameters:
  indentString - the string to use for indentation.
Parameters:
  tidyTags - specifies whether to replace the original text of each tag with the output from its Tag.tidy method.
Parameters:
  collapseWhiteSpace - specifies whether to collapse the white space in the text between the tags.
Parameters:
  indentAllElements - specifies whether to indent all elements, including inline-level elements and those with preformatted contents.
Returns:
  a CharStreamSource that produces the output.
See Also:   Source.getSourceFormatter()
See Also:   SourceFormatter.setIndentString(String)
See Also:   SourceFormatter.setTidyTags(boolean)
See Also:   SourceFormatter.setCollapseWhiteSpace(boolean)
See Also:   SourceFormatter.setIndentAllElements(boolean)




isLoggingEnabled
public boolean isLoggingEnabled()(Code)
Indicates whether logging is currently enabled.

This method has been deprecated as of version 2.4 as its purpose was to allow efficient use of the Source.log(String) method, which has been deprecated.
Returns:
  true if logging is currently enabled, otherwise false.
See Also:   Source.getLogger()




isXML
public boolean isXML()(Code)
Indicates whether the source document is likely to be XML.

The algorithm used to determine this is designed to be relatively inexpensive and to provide an accurate result in most normal situations. An exact determination of whether the source document is XML would require a much more complex analysis of the text.

The algorithm is as follows:

  1. If the document begins with an XML declaration, it is an XML document.
  2. If the document contains a document type declaration that contains the text "xhtml", it is an XHTML document, and hence also an XML document.
  3. If none of the above conditions are met, assume the document is normal HTML, and therefore not an XML document.
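
The heuristic above can be approximated with plain string checks. The sketch below is not the library's implementation. It reads the first condition as an XML declaration ("<?xml") and the second as a document type declaration ("<!DOCTYPE"), and it assumes case-insensitive matching and that the declaration ends at the first '>'. The class and method names are hypothetical.

```java
import java.util.Locale;

/** Hypothetical sketch of the "likely to be XML" heuristic described above. */
final class IsXmlSketch {

    static boolean looksLikeXml(String text) {
        // 1. The document begins with an XML declaration.
        if (text.startsWith("<?xml")) return true;

        // 2. The document contains a document type declaration containing "xhtml".
        String lower = text.toLowerCase(Locale.ROOT);
        int start = lower.indexOf("<!doctype");
        if (start >= 0) {
            int end = lower.indexOf('>', start);
            String declaration = end >= 0 ? lower.substring(start, end) : lower.substring(start);
            if (declaration.contains("xhtml")) return true;
        }

        // 3. Otherwise assume normal HTML.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeXml("<?xml version=\"1.0\"?><root/>")); // prints true
        System.out.println(looksLikeXml("<p>fragment</p>"));                // prints false
    }
}
```

Because only the declaration itself is inspected in step 2, the word "xhtml" appearing in body text does not change the result.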

As of version 2.5, this method no longer returns true if the document doesn't contain an HTML element (HTMLElementName.HTML). The library is often used to parse partial HTML documents, so the lack of an HTML element is not a reliable test for an XML document.
Returns:
  true if the source document is likely to be XML, otherwise false.




log
public void log(String message)(Code)
Writes the specified message to the log.

This method has been deprecated as of version 2.4 as logging is now performed via the Logger interface obtained via the Source.getLogger() method.
Parameters:
  message - the message to log
See Also:   Source.getLogger()




newLogger
static Logger newLogger()(Code)



parseAttributes
public Attributes parseAttributes(int pos, int maxEnd)(Code)
Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes method should be used in normal situations.

The returned Attributes segment always begins at pos, and ends at the end of the last attribute before either maxEnd or the first occurrence of "/>" or ">" outside of a quoted attribute value, whichever comes first.
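
The end-of-segment rule can be illustrated with a small scanner that finds the first "/>" or ">" outside of a quoted attribute value, bounded by maxEnd. This sketch only locates that limit; it does not parse attributes, and it is not the library's code. The class and method names are hypothetical, and returning -1 for an out-of-bounds start position is a choice made for this sketch.

```java
/** Hypothetical sketch of locating where an attribute list must stop. */
final class AttributeLimitSketch {

    /**
     * Returns the index of the first "/>" or ">" outside quotes at or after pos,
     * or the effective maximum end if none occurs first, or -1 if pos is out of bounds.
     */
    static int findLimit(CharSequence text, int pos, int maxEnd) {
        if (pos < 0 || pos > text.length()) return -1;
        int limit = (maxEnd == -1 || maxEnd > text.length()) ? text.length() : maxEnd;
        char quote = 0;                              // 0 while outside a quoted value
        for (int i = pos; i < limit; i++) {
            char ch = text.charAt(i);
            if (quote != 0) {
                if (ch == quote) quote = 0;          // closing quote
            } else if (ch == '"' || ch == '\'') {
                quote = ch;                          // opening quote
            } else if (ch == '>') {
                return i;                            // ">" outside quotes
            } else if (ch == '/' && i + 1 < text.length() && text.charAt(i + 1) == '>') {
                return i;                            // "/>" outside quotes
            }
        }
        return limit;
    }

    public static void main(String[] args) {
        String text = "id=\"a>b\" class='x'/> rest";
        System.out.println(findLimit(text, 0, -1));  // prints 18
    }
}
```

In the usage example, the ">" inside the quoted value of id is skipped, so the limit is the "/>" at index 18.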

Only returns null if the segment contains a major syntactical error or more than the default maximum number of minor syntactical errors.

This is equivalent to parseAttributes(pos, maxEnd, Attributes.getDefaultMaxErrorCount()).
Parameters:
  pos - the position in the source document at the beginning of the attribute list, may be out of bounds.
Parameters:
  maxEnd - the maximum end position of the attribute list, or -1 if no maximum.
Returns:
  the Attributes starting at the specified position, or null if too many errors occur while parsing or the specified position is out of bounds.
See Also:   StartTag.getAttributes
See Also:   Segment.parseAttributes




parseAttributes
public Attributes parseAttributes(int pos, int maxEnd, int maxErrorCount)(Code)
Parses any Attributes starting at the specified position. This method is only used in the unusual situation where attributes exist outside of a start tag. The StartTag.getAttributes method should be used in normal situations.

Only returns null if the segment contains a major syntactical error or more than the specified number of minor syntactical errors.

The maxErrorCount argument overrides the default maximum error count.

See Source.parseAttributes(int pos,int maxEnd) for more information.
Parameters:
  pos - the position in the source document at the beginning of the attribute list, may be out of bounds.
Parameters:
  maxEnd - the maximum end position of the attribute list, or -1 if no maximum.
Parameters:
  maxErrorCount - the maximum number of minor errors allowed while parsing.
Returns:
  the Attributes starting at the specified position, or null if too many errors occur while parsing or the specified position is out of bounds.
See Also:   StartTag.getAttributes
See Also:   Source.parseAttributes(int pos,int maxEnd)




setLogWriter
public void setLogWriter(Writer writer)(Code)
Sets the logger to an implementation that sends all output to the specified Writer.

This method has been deprecated as of version 2.4 in favour of the more generic Source.setLogger(Logger) method.
Parameters:
  writer - the destination java.io.Writer for log messages.
See Also:   Source.setLogger(Logger)
See Also:   WriterLogger.WriterLogger(Writer)




setLogger
public void setLogger(Logger logger)(Code)
Sets the Logger that handles log messages.

Specifying a null argument disables logging completely for operations performed on this Source object.

A logger instance is created automatically for each Source object using the LoggerProvider specified by the static Config.LoggerProvider property. The name used for all automatically created logger instances is "net.htmlparser.jericho".

Use of this method with a non-null argument is therefore not usually necessary, unless specifying an instance of WriterLogger or a user-defined Logger implementation.
Parameters:
  logger - the logger that will handle log messages, or null to disable logging.
See Also:   Config.LoggerProvider




toString
public String toString()(Code)
Returns the source text as a String.
Returns:
  the source text as a String.



Fields inherited from au.id.jericho.lib.html.Segment
final int begin(Code)(Java Doc)
List childElements(Code)(Java Doc)
final int end(Code)(Java Doc)
final Source source(Code)(Java Doc)

Methods inherited from au.id.jericho.lib.html.Segment
final static StringBuffer appendCollapseWhiteSpace(StringBuffer sb, CharSequence text)(Code)(Java Doc)
final public char charAt(int index)(Code)(Java Doc)
public int compareTo(Object o)(Code)(Java Doc)
final public boolean encloses(Segment segment)(Code)(Java Doc)
final public boolean encloses(int pos)(Code)(Java Doc)
final public boolean equals(Object object)(Code)(Java Doc)
public String extractText()(Code)(Java Doc)
public String extractText(boolean includeAttributes)(Code)(Java Doc)
public List findAllCharacterReferences()(Code)(Java Doc)
public List findAllElements()(Code)(Java Doc)
public List findAllElements(String name)(Code)(Java Doc)
public List findAllElements(StartTagType startTagType)(Code)(Java Doc)
public List findAllElements(String attributeName, String value, boolean valueCaseSensitive)(Code)(Java Doc)
public List findAllStartTags()(Code)(Java Doc)
public List findAllStartTags(String name)(Code)(Java Doc)
public List findAllStartTags(String attributeName, String value, boolean valueCaseSensitive)(Code)(Java Doc)
public List findAllTags()(Code)(Java Doc)
public List findAllTags(TagType tagType)(Code)(Java Doc)
public List findFormControls()(Code)(Java Doc)
public FormFields findFormFields()(Code)(Java Doc)
final public int getBegin()(Code)(Java Doc)
public List getChildElements()(Code)(Java Doc)
public String getDebugInfo()(Code)(Java Doc)
final public int getEnd()(Code)(Java Doc)
public Renderer getRenderer()(Code)(Java Doc)
public TextExtractor getTextExtractor()(Code)(Java Doc)
public int hashCode()(Code)(Java Doc)
public void ignoreWhenParsing()(Code)(Java Doc)
final public boolean isWhiteSpace()(Code)(Java Doc)
final public static boolean isWhiteSpace(char ch)(Code)(Java Doc)
final public int length()(Code)(Java Doc)
public Attributes parseAttributes()(Code)(Java Doc)
final public CharSequence subSequence(int beginIndex, int endIndex)(Code)(Java Doc)
public String toString()(Code)(Java Doc)

Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException(Code)(Java Doc)
public boolean equals(Object obj)(Code)(Java Doc)
protected void finalize() throws Throwable(Code)(Java Doc)
final native public Class getClass()(Code)(Java Doc)
native public int hashCode()(Code)(Java Doc)
final native public void notify()(Code)(Java Doc)
final native public void notifyAll()(Code)(Java Doc)
public String toString()(Code)(Java Doc)
final native public void wait(long timeout) throws InterruptedException(Code)(Java Doc)
final public void wait(long timeout, int nanos) throws InterruptedException(Code)(Java Doc)
final public void wait() throws InterruptedException(Code)(Java Doc)
