java.lang.Object
  net.matuschek.html.HtmlDocument

HtmlDocument | public class HtmlDocument | This class implements an HTML document. It uses JTidy to parse the given HTML code into an internal DOM representation.
author: Daniel Matuschek | version: $Id $
Constructor Summary
public | HtmlDocument(URL url, byte[] content) | Initializes an HTML document with the given content.
public | HtmlDocument(URL url, byte[] content, String newEncoding) | Initializes an HTML document with the given content in the given encoding.
public | HtmlDocument(URL url, String contentStr) | Initializes an HTML document from a String.
Method Summary
protected void | extractElements(Element element, String type, Vector<Element> elementList) | Extracts elements from the given DOM subtree and puts them into the given vector.
protected void | extractImageLinks(Element element, Vector<URL> links) | Extracts links to included images from the given DOM subtree and puts them into the given vector.
protected void | extractLinks(Element element, Vector<URL> links) | Extracts links from the given DOM subtree and puts them into the given vector.
public URL | getBaseURL() |
public Vector | getElements(String type) | Gets all Element nodes of a given type as a Vector. Parameters: type - the type of elements to return.
public Vector | getImageLinks() | Extracts all links to included images from this HTML document.
public Vector<URL> | getLinks() |
HtmlDocument | public HtmlDocument(URL url, byte[] content) | Initializes an HTML document with the given content.
Parameters: url - the URL of this document. Needed for link extraction.
Parameters: content - some HTML text as an array of bytes.
HtmlDocument | public HtmlDocument(URL url, byte[] content, String newEncoding) | Initializes an HTML document with the given content.
Parameters: url - the URL of this document. Needed for link extraction.
Parameters: content - some HTML text as an array of bytes.
Parameters: newEncoding - the encoding of the content.
HtmlDocument | public HtmlDocument(URL url, String contentStr) | Initializes an HTML document from a String. The string is converted to bytes using the default encoding.
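Because the String constructor is documented to use the default encoding, the bytes it produces can differ between platforms. The following is a minimal sketch of that effect using only the JDK; the class name EncodingSketch and its helper methods are hypothetical and are not part of HtmlDocument. Where the result must be predictable, it may be safer to encode the string yourself and pass the bytes together with the encoding name to the three-argument constructor.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for illustration only; not part of net.matuschek.html.
class EncodingSketch {

    // Encodes with the platform default charset, which is what the
    // String constructor is documented to do.
    static byte[] toDefaultBytes(String html) {
        return html.getBytes(Charset.defaultCharset());
    }

    // Encodes with an explicitly named charset, so the result is the
    // same on every platform.
    static byte[] toBytes(String html, String encodingName) {
        return html.getBytes(Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        // The e-acute needs two bytes in UTF-8 but only one in ISO-8859-1.
        String html = "<html><body><p>caf\u00e9</p></body></html>";
        System.out.println("UTF-8 length:      " + toBytes(html, "UTF-8").length);
        System.out.println("ISO-8859-1 length: " + toBytes(html, "ISO-8859-1").length);
        System.out.println("default (" + Charset.defaultCharset() + ") length: "
                + toDefaultBytes(html).length);
    }
}
```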
extractElements | protected void extractElements(Element element, String type, Vector<Element> elementList) | Extracts elements from the given DOM subtree and puts them into the given vector.
Parameters: element - the top-level DOM element of the DOM tree to parse.
Parameters: type - the HTML tag to extract (e.g. "a", "form", "head" ...).
Parameters: elementList - the vector that will store the elements.
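The kind of traversal described here can be sketched as a depth-first walk over an org.w3c.dom tree. This is an illustration under stated assumptions, not the library's actual implementation: the class ExtractElementsSketch and its parse helper are hypothetical, the JDK's XML parser stands in for JTidy (so the input must be well-formed XHTML), and the case-insensitive tag comparison is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Vector;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-alone sketch; not part of net.matuschek.html.
class ExtractElementsSketch {

    // Parses well-formed XHTML with the JDK parser (a stand-in for JTidy)
    // and returns the root element.
    static Element parse(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    // Depth-first walk: store the element if its tag name matches,
    // then recurse into every child that is itself an element.
    static void extractElements(Element element, String type, Vector<Element> elementList) {
        if (element.getNodeName().equalsIgnoreCase(type)) {
            elementList.add(element);
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                extractElements((Element) child, type, elementList);
            }
        }
    }

    public static void main(String[] args) {
        Element root = parse(
                "<html><body><a href='a.html'>A</a><p><a href='b.html'>B</a></p></body></html>");
        Vector<Element> anchors = new Vector<>();
        extractElements(root, "a", anchors);
        // Prints a.html, then b.html (document order).
        for (Element a : anchors) {
            System.out.println(a.getAttribute("href"));
        }
    }
}
```

The caller supplies the vector, so one vector can collect results from several subtrees.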
extractImageLinks | protected void extractImageLinks(Element element, Vector<URL> links) | Extracts links to included images from the given DOM subtree and puts them into the given vector.
Parameters: element - the top-level DOM element of the DOM tree to parse.
Parameters: links - the vector that will store the links.
extractLinks | protected void extractLinks(Element element, Vector<URL> links) | Extracts links from the given DOM subtree and puts them into the given vector.
Parameters: element - the top-level DOM element of the DOM tree to parse.
Parameters: links - the vector that will store the links.
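The constructors note that the document URL is needed for link extraction, presumably because attribute values such as href are often relative. How HtmlDocument resolves them is not stated here, so the following is only a sketch of one plausible approach using java.net.URI; the class LinkResolutionSketch, its resolve method, and the decision to skip unusable values silently are all assumptions.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Vector;

// Hypothetical stand-alone sketch; not part of net.matuschek.html.
class LinkResolutionSketch {

    // Resolves each raw attribute value against the document URL and
    // collects the results. Values that cannot be turned into a URL
    // (bad syntax, unknown protocol) are skipped.
    static Vector<URL> resolve(String documentUrl, String... rawLinks) {
        URI base = URI.create(documentUrl);
        Vector<URL> links = new Vector<>();
        for (String raw : rawLinks) {
            try {
                links.add(base.resolve(raw).toURL());
            } catch (IllegalArgumentException | MalformedURLException e) {
                // Assumption: unusable links are ignored rather than reported.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        Vector<URL> links = resolve(
                "http://example.com/docs/index.html",
                "page2.html",            // relative to the document's directory
                "../img/logo.png",       // parent-directory reference
                "javascript:void(0)");   // no URL handler for this scheme, skipped
        for (URL link : links) {
            System.out.println(link);
        }
    }
}
```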
getImageLinks | public Vector getImageLinks() | Extracts all links to included images from this HTML document.
Returns: a Vector of URLs containing the links to the included images.