| java.lang.Object websphinx.Region websphinx.Page
Page | public class Page extends Region (Code) | | A Web page. Although a Page can represent any MIME type, it mainly
supports HTML pages, which are automatically parsed. The parsing produces
a list of tags, a list of words, an HTML parse tree, and a list of links.
|
Constructor Summary | |
public | Page(Link link) Make a Page by downloading and parsing a Link. | public | Page(Link link, DownloadParameters dp) Make a Page by downloading a Link. | public | Page(Link link, DownloadParameters dp, HTMLParser parser) Make a Page by downloading a Link. | public | Page(URL url, String html) Make a Page from a URL and a string of HTML.
The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail. | public | Page(URL url, String html, HTMLParser parser) Make a Page from a URL and a string of HTML.
The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail. | public | Page(String content) Make a Page from a string of content. | public | Page(byte[] content) Make a Page from a byte array of content. |
Method Summary | |
public void | discardContent() Unlock the page's content (allowing it to be garbage-collected, to
save space during a Web crawl). | public void | download(DownloadParameters dp, HTMLParser parser) | void | downloadSafely() | public URL | getBase() Get the base URL, relative to which the page's links were interpreted.
The base URL defaults to the URL of the
Link that was used to download the page. | public String | getContent() Get the content of the page as a String. | public byte[] | getContentBytes() Get the content of the page as an array of bytes. | public String | getContentEncoding() Get content encoding of page.
Returns: the encoding type of page, such as "base-64", or null if not known. | public String | getContentType() Get MIME type of page.
Returns: the MIME type of page, such as "text/html", or null if not known. | public int | getDepth() Get depth of page in crawl. | public Element[] | getElements() Get the HTML elements in the page. | public long | getExpiration() Get expiration date of page.
Returns: the expiration date of the page, or 0 if not known. | public long | getLastModified() Get last-modified date of page.
Returns: the date when the page was last modified, or 0 if not known. | public Link[] | getLinks() Get the links found in the page. | public Link | getOrigin() Get the Link that points to this page. | public int | getResponseCode() Get response code returned by the Web server. | public String | getResponseMessage() Get response message returned by the Web server.
Returns: response message, such as "OK" or "Not Found". | public Element | getRootElement() Get the root HTML element of the page. | public Tag[] | getTags() Get the tag sequence of the page. | public String | getTitle() Get the title of the page. | public Region[] | getTokens() Get the token sequence of the page. | public URL | getURL() Get the URL. | public Text[] | getWords() Get the words in the page. | final public boolean | hasContent() Test if page content is available. | public boolean | isHTML() Test whether page is HTML. | public boolean | isImage() Test whether page is a GIF or JPEG image. | public boolean | isParsed() Test whether page has been parsed. | public void | keepContent() Lock the page's content (to prevent it from being discarded).
This method increments a lock counter, representing all the
callers interested in preserving the content. | public static void | main(String[] args) | public void | parse(HTMLParser parser) Parse the page. | public void | setContentEncoding(String encoding) Set content encoding of page.
Parameters: encoding - the encoding type of page, such as "base-64", or null if not known. | public void | setContentType(String type) Set MIME type of page.
Parameters: type - the MIME type of page, such as "text/html", or null if not known. | public void | setExpiration(long expire) Set expiration date of page.
Parameters: expire - the expiration date of the page, or 0 if not known. | public void | setLastModified(long last) Set last-modified date of page.
Parameters: last - the date when the page was last modified, or 0 if not known. | public String | substringCanonicalTags(int start, int end) Get canonicalized HTML tags found in a region.
A canonicalized tag looks like the following:
<tagname#index attr=value attr=value attr=value ...>
where tagname and attr are all lowercase, index is the tag's
index in the page's tokens array. | public String | substringContent(int start, int end) Get raw content found in a region. | public String | substringHTML(int start, int end) Get HTML found in a region. | public String | substringTags(int start, int end) Get HTML tags found in a region. | public String | substringText(int start, int end) Get tagless text found in a region. | public String | toDescription() Generate a human-readable description of the page. | public String | toString() Get page containing the region. | public String | toURL() |
TYPICAL_LENGTH | final static int TYPICAL_LENGTH(Code) | | |
contentBytes | byte[] contentBytes(Code) | | |
contentLock | int contentLock(Code) | | |
expiration | long expiration(Code) | | |
lastModified | long lastModified(Code) | | |
responseCode | int responseCode(Code) | | |
Page | public Page(Link link) throws IOException(Code) | | Make a Page by downloading and parsing a Link.
Parameters: link - Link to download |
Page | public Page(URL url, String html)(Code) | | Make a Page from a URL and a string of HTML.
The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
Parameters: url - URL to use as a base for relative links on the page Parameters: html - the HTML content of the page |
Page | public Page(URL url, String html, HTMLParser parser)(Code) | | Make a Page from a URL and a string of HTML.
The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
Parameters: url - URL to use as a base for relative links on the page Parameters: html - the HTML content of the page Parameters: parser - HTML parser to use |
Page | public Page(String content)(Code) | | Make a Page from a string of content. The content is not parsed.
The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
Parameters: content - HTML content of the page |
Page | public Page(byte[] content)(Code) | | Make a Page from a byte array of content. The content is not parsed.
The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
Parameters: content - byte content of the page |
discardContent | public void discardContent()(Code) | | Unlock the page's content (allowing it to be garbage-collected, to
save space during a Web crawl). This method decrements a lock counter.
If the counter falls to
0 (meaning no callers are interested in the content),
the content is released. At least the following
fields are discarded: content, tokens, tags, words, elements, and
root. After the content has been discarded, calling getContent()
(or getTokens(), getTags(), etc.) will force the page to be downloaded
again. Hopefully the download will come from the cache, however.
Links are not considered part of the content, and are not subject to
discarding by this method. Also, if the page was created from a string
(rather than by downloading), its content is not subject to discarding
(since there would be no way to recover it).
|
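The lock counting described for discardContent() and keepContent() can be sketched as follows. This is a minimal illustration, not the websphinx.Page implementation: the class ContentLockSketch, its fields, and its constructor flag are hypothetical stand-ins, and the real class also discards tokens, tags, words, and elements.

```java
/** Hypothetical stand-in for a page's content lock; NOT the websphinx.Page class. */
final class ContentLockSketch {
    private String content;            // becomes null once the content is released
    private int contentLock;           // number of callers interested in the content
    private final boolean downloaded;  // false when the page was built from a string

    ContentLockSketch(String content, boolean downloaded) {
        this.content = content;
        this.downloaded = downloaded;
        this.contentLock = 1;          // documented: the counter starts at 1 after download
    }

    /** Register one more caller that wants the content kept. */
    void keepContent() {
        contentLock++;
    }

    /** Drop one caller's interest; release the content when nobody is left. */
    void discardContent() {
        if (!downloaded) {
            return;                    // string-built pages could not be recovered, so keep them
        }
        if (contentLock == 0) {
            return;                    // already released
        }
        if (--contentLock == 0) {
            content = null;            // eligible for garbage collection
        }
    }

    boolean hasContent() {
        return content != null;
    }

    public static void main(String[] args) {
        ContentLockSketch page = new ContentLockSketch("<html></html>", true);
        page.keepContent();                     // counter: 2
        page.discardContent();                  // counter: 1, content kept
        System.out.println(page.hasContent());
        page.discardContent();                  // counter: 0, content released
        System.out.println(page.hasContent());
    }
}
```

Each keepContent() call should be paired with exactly one later discardContent() call, in addition to the one call that balances the initial lock taken at download time.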
downloadSafely | void downloadSafely()(Code) | | |
getBase | public URL getBase()(Code) | | Get the base URL, relative to which the page's links were interpreted.
The base URL defaults to the URL of the
Link that was used to download the page. If any redirects occur
while downloading the page, the final location becomes the new base
URL. Lastly, if a <BASE> element is found in the page, that
becomes the new base URL.
Returns: the page's base URL. |
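The precedence described for getBase() (link URL, then final redirect location, then a <BASE> element) can be sketched with java.net.URI. The class BaseUrlSketch and its method names are hypothetical, and resolving a relative <BASE> href against the current base is an assumption of this sketch rather than documented behavior.

```java
import java.net.URI;

/** Hypothetical illustration of base-URL precedence; NOT part of websphinx. */
final class BaseUrlSketch {
    /**
     * @param linkUrl          URL of the link used for the download (required)
     * @param redirectLocation final location after redirects, or null if none occurred
     * @param baseHref         href of a BASE element found in the page, or null
     */
    static URI effectiveBase(URI linkUrl, URI redirectLocation, String baseHref) {
        URI base = linkUrl;                         // default
        if (redirectLocation != null) {
            base = redirectLocation;                // redirects override the link URL
        }
        if (baseHref != null) {
            base = base.resolve(baseHref);          // a BASE element overrides both
        }
        return base;
    }

    public static void main(String[] args) {
        URI base = effectiveBase(URI.create("http://a.example/x/index.html"), null, null);
        // Relative links on the page are interpreted against the base.
        System.out.println(base.resolve("img/p.gif"));
    }
}
```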
getContent | public String getContent()(Code) | | Get the content of the page as a String. May not work properly for
binary data like images; use getContentBytes instead.
Returns: the String content of the page. |
getContentBytes | public byte[] getContentBytes()(Code) | | Get the content of the page as an array of bytes.
Returns: the content of the page in binary form. |
getContentEncoding | public String getContentEncoding()(Code) | | Get content encoding of page.
Returns: the encoding type of page, such as "base-64", or null if not known. |
getContentType | public String getContentType()(Code) | | Get MIME type of page.
Returns: the MIME type of page, such as "text/html", or null if not known. |
getDepth | public int getDepth()(Code) | | Get depth of page in crawl.
Returns: depth of page from root (the depth of a page is the same as the depth of its originating link) |
getElements | public Element[] getElements()(Code) | | Get the HTML elements in the page. All elements in the page
are included in the list, in the order they would appear in
an inorder traversal of the HTML parse tree.
Returns: HTML elements in the page in inorder traversal order, or null if the page hasn't been downloaded or parsed. |
getExpiration | public long getExpiration()(Code) | | Get expiration date of page.
Returns: the expiration date of the page, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT. |
getLastModified | public long getLastModified()(Code) | | Get last-modified date of page.
Returns: the date when the page was last modified, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT. |
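Because getExpiration() and getLastModified() use 0 to mean "not known", callers may want to convert the raw long into an optional timestamp before comparing dates. The sketch below assumes the unit is seconds, as stated in this documentation; if the value is actually passed through from java.net.URLConnection, which reports milliseconds, Instant.ofEpochMilli would be the right conversion instead. PageTimeSketch and its methods are hypothetical helpers.

```java
import java.time.Instant;
import java.util.Optional;

/** Hypothetical helpers for the 0-means-unknown timestamps; NOT part of websphinx. */
final class PageTimeSketch {
    /** Convert a documented "seconds since January 1, 1970 GMT" value; 0 means unknown. */
    static Optional<Instant> toInstant(long secondsSinceEpoch) {
        return secondsSinceEpoch == 0
                ? Optional.empty()
                : Optional.of(Instant.ofEpochSecond(secondsSinceEpoch));
    }

    /** A page with an unknown expiration is treated as not expired (an assumption). */
    static boolean isExpired(long expiration, Instant now) {
        return toInstant(expiration).map(now::isAfter).orElse(false);
    }

    public static void main(String[] args) {
        System.out.println(toInstant(0).isPresent());
        System.out.println(isExpired(86_400L, Instant.now()));
    }
}
```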
getLinks | public Link[] getLinks()(Code) | | Get the links found in the page.
Returns: links in the page, or null if the page hasn't been downloaded or parsed. |
getOrigin | public Link getOrigin()(Code) | | Get the Link that points to this page.
Returns: the Link object that was used to download this page. |
getResponseCode | public int getResponseCode()(Code) | | Get response code returned by the Web server. For a list of
possible values, see java.net.HttpURLConnection.
Returns: response code, such as 200 (for OK) or 404 (not found). Code is -1 if unknown. See Also: java.net.HttpURLConnection |
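A caller might interpret the value returned by getResponseCode() using the constants in java.net.HttpURLConnection, treating -1 as "unknown" as documented above. The class ResponseCodeSketch and the category names it returns are illustrative assumptions.

```java
import java.net.HttpURLConnection;

/** Hypothetical classifier for response codes; NOT part of websphinx. */
final class ResponseCodeSketch {
    static String describe(int code) {
        if (code == -1) {
            return "unknown";                          // documented sentinel value
        }
        if (code == HttpURLConnection.HTTP_OK) {       // 200
            return "ok";
        }
        if (code == HttpURLConnection.HTTP_NOT_FOUND) { // 404
            return "not found";
        }
        if (code >= 300 && code < 400) {
            return "redirect";
        }
        if (code >= 400) {
            return "error";
        }
        return "other";
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 301, 404, -1}) {
            System.out.println(code + " -> " + describe(code));
        }
    }
}
```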
getResponseMessage | public String getResponseMessage()(Code) | | Get response message returned by the Web server.
Returns: response message, such as "OK" or "Not Found". The response message is null if the page could not be fetched or the message is not known. |
getRootElement | public Element getRootElement()(Code) | | Get the root HTML element of the page.
Returns: first top-level HTML element in the page, or null if the page hasn't been downloaded or parsed. |
getTags | public Tag[] getTags()(Code) | | Get the tag sequence of the page.
Returns: tags in the page, or null if the page hasn't been downloaded or parsed. |
getTitle | public String getTitle()(Code) | | Get the title of the page.
Returns: the page's title, or null if the page hasn't been parsed. |
getTokens | public Region[] getTokens()(Code) | | Get the token sequence of the page. Tokens are tags and whitespace-delimited text.
Returns: token regions in the page, or null if the page hasn't been downloaded or parsed. |
getURL | public URL getURL()(Code) | | Get the URL.
Returns: the URL of the link that was used to download this page |
getWords | public Text[] getWords()(Code) | | Get the words in the page. Words are whitespace- and tag-delimited text.
Returns: words in the page, or null if the page hasn't been downloaded or parsed. |
hasContent | final public boolean hasContent()(Code) | | Test if page content is available.
Returns: true if content is downloaded and available, false if content has not been downloaded or has been discarded. |
isHTML | public boolean isHTML()(Code) | | Test whether page is HTML.
Returns: true if page is HTML. |
isImage | public boolean isImage()(Code) | | Test whether page is a GIF or JPEG image.
Returns: true if page is a GIF or JPEG image, false if not |
isParsed | public boolean isParsed()(Code) | | Test whether page has been parsed. A page is parsed during
download only if its MIME type is HTML or unspecified.
Returns: true if page was parsed, false if not |
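The MIME-type tests described for isHTML(), isImage(), and isParsed() can be approximated from a content-type string. This is only a sketch: the exact strings the library compares against are not stated here, so the literals "text/html", "image/gif", and "image/jpeg", the stripping of parameters after ';', and the class MimeChecksSketch are all assumptions.

```java
import java.util.Locale;

/** Hypothetical MIME-type checks; NOT the websphinx.Page implementation. */
final class MimeChecksSketch {
    /** Strip parameters such as "; charset=UTF-8" and lowercase the type. */
    private static String normalize(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semi = contentType.indexOf(';');
        String bare = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return bare.trim().toLowerCase(Locale.ROOT);
    }

    static boolean isHtml(String contentType) {
        return "text/html".equals(normalize(contentType));
    }

    static boolean isImage(String contentType) {
        String type = normalize(contentType);
        return "image/gif".equals(type) || "image/jpeg".equals(type);
    }

    /** Documented rule: parse during download only if the type is HTML or unspecified. */
    static boolean parsedOnDownload(String contentType) {
        return contentType == null || isHtml(contentType);
    }

    public static void main(String[] args) {
        System.out.println(parsedOnDownload("text/html; charset=UTF-8"));
        System.out.println(parsedOnDownload("image/gif"));
    }
}
```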
keepContent | public void keepContent()(Code) | | Lock the page's content (to prevent it from being discarded).
This method increments a lock counter, representing all the
callers interested in preserving the content. The lock
counter is set to 1 when the page is initially downloaded.
|
parse | public void parse(HTMLParser parser)(Code) | | Parse the page. Assumes the page has already been downloaded.
Parameters: parser - HTML parser to use Throws: RuntimeException - if an error occurs in downloading the page |
setContentEncoding | public void setContentEncoding(String encoding)(Code) | | Set content encoding of page.
Parameters: encoding - the encoding type of page, such as "base-64", or null if not known. |
setContentType | public void setContentType(String type)(Code) | | Set MIME type of page.
Parameters: type - the MIME type of page, such as "text/html", or null if not known. |
setExpiration | public void setExpiration(long expire)(Code) | | Set expiration date of page.
Parameters: expire - the expiration date of the page, or 0 if not known. The value is number of seconds since January 1, 1970 GMT. |
setLastModified | public void setLastModified(long last)(Code) | | Set last-modified date of page.
Parameters: last - the date when the page was last modified, or 0 if not known. The value is number of seconds since January 1, 1970 GMT |
substringCanonicalTags | public String substringCanonicalTags(int start, int end)(Code) | | Get canonicalized HTML tags found in a region.
A canonicalized tag looks like the following:
<tagname#index attr=value attr=value attr=value ...>
where tagname and attr are all lowercase, index is the tag's
index in the page's tokens array. Attributes are sorted in
increasing order by attribute name. Attributes without values
omit the entire "=value" portion. Values are delimited by a
space. All occurrences of <, >, space, and % characters
in a value are URL-encoded (e.g., space is converted to %20).
Thus the only occurrences of these characters in the canonical
tag are the tag delimiters.
For example, raw HTML that looks like:
<IMG SRC="http://foo.com/map<>.gif" ISMAP>Image</IMG>
would be canonicalized to:
<img ismap src=http://foo.com/map%3C%3E.gif></img>
Comment and declaration tags (whose tag name is !) are omitted
from the canonicalization.
Parameters: start - starting offset of region Parameters: end - ending offset of region Returns: canonicalized tags contained in the region |
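The canonicalization rules above (lowercase names, attributes sorted by name, valueless attributes without "=value", and <, >, space, and % URL-encoded in values) can be sketched for a single tag. This is an illustration under stated assumptions, not the library's code: CanonicalTagSketch is hypothetical, attributes are supplied as a map rather than parsed from HTML, and the output follows the <tagname#index ...> format line, whereas the worked example above omits the #index part.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical single-tag canonicalizer; NOT the websphinx implementation. */
final class CanonicalTagSketch {
    /** URL-encode only the four characters named in the documentation. */
    static String encode(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '<': sb.append("%3C"); break;
                case '>': sb.append("%3E"); break;
                case ' ': sb.append("%20"); break;
                case '%': sb.append("%25"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** A null attribute value means the attribute has no value (for example ISMAP). */
    static String canonicalTag(String name, int index, Map<String, String> attrs) {
        // TreeMap keeps the lowercased attribute names in increasing order.
        TreeMap<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sorted.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        StringBuilder sb = new StringBuilder("<");
        sb.append(name.toLowerCase(Locale.ROOT)).append('#').append(index);
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            sb.append(' ').append(e.getKey());
            if (e.getValue() != null) {
                sb.append('=').append(encode(e.getValue()));
            }
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new java.util.HashMap<>();
        attrs.put("SRC", "http://foo.com/map<>.gif");
        attrs.put("ISMAP", null);
        System.out.println(canonicalTag("IMG", 0, attrs));
    }
}
```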
substringContent | public String substringContent(int start, int end)(Code) | | Get raw content found in a region.
Parameters: start - starting offset of region Parameters: end - ending offset of region Returns: raw HTML contained in the region |
substringHTML | public String substringHTML(int start, int end)(Code) | | Get HTML found in a region.
Parameters: start - starting offset of region Parameters: end - ending offset of region Returns: representation of region as HTML |
substringTags | public String substringTags(int start, int end)(Code) | | Get HTML tags found in a region. Whitespace and text among the
tags are deleted.
Parameters: start - starting offset of region Parameters: end - ending offset of region Returns: tags contained in the region |
substringText | public String substringText(int start, int end)(Code) | | Get tagless text found in a region.
Runs of whitespace and tags are reduced to a single space character.
Parameters: start - starting offset of region Parameters: end - ending offset of region Returns: tagless text contained in the region |
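The difference between substringTags() and substringText() can be illustrated on a plain string. The sketch below swaps in regular expressions for the library's tokenizer, so it only approximates the documented behavior (for instance, it does not handle a '>' inside a quoted attribute value); RegionViewsSketch and its methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex approximation of the tag and text views; NOT part of websphinx. */
final class RegionViewsSketch {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Keep only the tags; text and whitespace between them are dropped. */
    static String tagsOnly(String html) {
        StringBuilder sb = new StringBuilder();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    /** Reduce every run of whitespace and tags to a single space character. */
    static String taglessText(String html) {
        return html.replaceAll("(?:<[^>]*>|\\s)+", " ");
    }

    public static void main(String[] args) {
        String html = "Hello <b>big</b>\n  world";
        System.out.println(tagsOnly(html));
        System.out.println(taglessText(html));
    }
}
```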
toDescription | public String toDescription()(Code) | | Generate a human-readable description of the page.
Returns: a description of the link, in the form "title [url]". |
toString | public String toString()(Code) | | Get page containing the region.
Returns: page containing the region |
toURL | public String toURL()(Code) | | Convert the link's URL to a String.
Returns: the URL represented as a string |