Java Doc for Page.java (Web Crawler » WebSPHINX » websphinx)



java.lang.Object
   websphinx.Region
      websphinx.Page

Page
public class Page extends Region
A Web page. Although a Page can represent any MIME type, it mainly supports HTML pages, which are automatically parsed. The parsing produces a list of tags, a list of words, an HTML parse tree, and a list of links.


Field Summary
final static int TYPICAL_LENGTH
URL base
String canonicalTags
String content
byte[] contentBytes
String contentEncoding
int contentLock
String contentType
Element[] elements
long expiration
long lastModified
Link[] links
Link origin
int responseCode
String responseMessage
Element root
Tag[] tags
String title
Region[] tokens
Text[] words

Constructor Summary
public Page(Link link)
     Make a Page by downloading and parsing a Link.
public Page(Link link, DownloadParameters dp)
     Make a Page by downloading a Link.
public Page(Link link, DownloadParameters dp, HTMLParser parser)
     Make a Page by downloading a Link.
public Page(URL url, String html)
     Make a Page from a URL and a string of HTML. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
public Page(URL url, String html, HTMLParser parser)
     Make a Page from a URL and a string of HTML. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
public Page(String content)
     Make a Page from a string of content.
public Page(byte[] content)
     Make a Page from a byte array of content.

Method Summary
public void discardContent()
     Unlock the page's content (allowing it to be garbage-collected, to save space during a Web crawl).
public void download(DownloadParameters dp, HTMLParser parser)
void downloadSafely()
public URL getBase()
     Get the base URL, relative to which the page's links were interpreted. The base URL defaults to the URL of the Link that was used to download the page.
public String getContent()
     Get the content of the page as a String.
public byte[] getContentBytes()
     Get the content of the page as an array of bytes.
public String getContentEncoding()
     Get content encoding of page. Returns the encoding type of the page, such as "base-64", or null if not known.
public String getContentType()
     Get MIME type of page. Returns the MIME type of the page, such as "text/html", or null if not known.
public int getDepth()
     Get depth of page in crawl.
public Element[] getElements()
     Get the HTML elements in the page.
public long getExpiration()
     Get expiration date of page. Returns the expiration date of the page, or 0 if not known.
public long getLastModified()
     Get last-modified date of page. Returns the date when the page was last modified, or 0 if not known.
public Link[] getLinks()
     Get the links found in the page.
public Link getOrigin()
     Get the Link that points to this page.
public int getResponseCode()
     Get response code returned by the Web server.
public String getResponseMessage()
     Get response message returned by the Web server. Returns the response message, such as "OK" or "Not Found".
public Element getRootElement()
     Get the root HTML element of the page.
public Tag[] getTags()
     Get the tag sequence of the page.
public String getTitle()
     Get the title of the page.
public Region[] getTokens()
     Get the token sequence of the page.
public URL getURL()
     Get the URL.
public Text[] getWords()
     Get the words in the page.
final public boolean hasContent()
     Test if page content is available.
public boolean isHTML()
     Test whether page is HTML.
public boolean isImage()
     Test whether page is a GIF or JPEG image.
public boolean isParsed()
     Test whether page has been parsed.
public void keepContent()
     Lock the page's content (to prevent it from being discarded). This method increments a lock counter, representing all the callers interested in preserving the content.
public static void main(String[] args)
public void parse(HTMLParser parser)
     Parse the page.
public void setContentEncoding(String encoding)
     Set content encoding of page.
     Parameters:
       encoding - the encoding type of page, such as "base-64", or null if not known.
public void setContentType(String type)
     Set MIME type of page.
     Parameters:
       type - the MIME type of page, such as "text/html", or null if not known.
public void setExpiration(long expire)
     Set expiration date of page.
     Parameters:
       expire - the expiration date of the page, or 0 if not known.
public void setLastModified(long last)
     Set last-modified date of page.
     Parameters:
       last - the date when the page was last modified, or 0 if not known.
public String substringCanonicalTags(int start, int end)
     Get canonicalized HTML tags found in a region. A canonicalized tag looks like the following:
      <tagname#index attr=value attr=value attr=value ...>
     where tagname and attr are all lowercase, and index is the tag's index in the page's tokens array.
public String substringContent(int start, int end)
     Get raw content found in a region.
public String substringHTML(int start, int end)
     Get HTML found in a region.
public String substringTags(int start, int end)
     Get HTML tags found in a region.
public String substringText(int start, int end)
     Get tagless text found in a region.
public String toDescription()
     Generate a human-readable description of the page.
public String toString()
     Get page containing the region.
public String toURL()

Field Detail
TYPICAL_LENGTH
final static int TYPICAL_LENGTH

base
URL base

canonicalTags
String canonicalTags

content
String content

contentBytes
byte[] contentBytes

contentEncoding
String contentEncoding

contentLock
int contentLock

contentType
String contentType

elements
Element[] elements

expiration
long expiration

lastModified
long lastModified

links
Link[] links

origin
Link origin

responseCode
int responseCode

responseMessage
String responseMessage

root
Element root

tags
Tag[] tags

title
String title

tokens
Region[] tokens

words
Text[] words




Constructor Detail
Page
public Page(Link link) throws IOException
Make a Page by downloading and parsing a Link.
Parameters:
  link - Link to download

Page
public Page(Link link, DownloadParameters dp) throws IOException
Make a Page by downloading a Link.
Parameters:
  link - Link to download
  dp - Download parameters to use

Page
public Page(Link link, DownloadParameters dp, HTMLParser parser) throws IOException
Make a Page by downloading a Link.
Parameters:
  link - Link to download
  dp - Download parameters to use
  parser - HTML parser to use

Page
public Page(URL url, String html)
Make a Page from a URL and a string of HTML. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
Parameters:
  url - URL to use as a base for relative links on the page
  html - the HTML content of the page

Page
public Page(URL url, String html, HTMLParser parser)
Make a Page from a URL and a string of HTML. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
Parameters:
  url - URL to use as a base for relative links on the page
  html - the HTML content of the page
  parser - HTML parser to use

Page
public Page(String content)
Make a Page from a string of content. The content is not parsed. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
Parameters:
  content - HTML content of the page

Page
public Page(byte[] content)
Make a Page from a byte array of content. The content is not parsed. The created page has no originating link, so calls to getURL(), getProtocol(), etc. will fail.
Parameters:
  content - byte content of the page




Method Detail
discardContent
public void discardContent()
Unlock the page's content (allowing it to be garbage-collected, to save space during a Web crawl). This method decrements a lock counter. If the counter falls to 0 (meaning no callers are interested in the content), the content is released. At least the following fields are discarded: content, tokens, tags, words, elements, and root. After the content has been discarded, calling getContent() (or getTokens(), getTags(), etc.) forces the page to be downloaded again, although that download will hopefully be served from the cache.

Links are not considered part of the content, and are not subject to discarding by this method. Also, if the page was created from a string (rather than by downloading), its content is not subject to discarding (since there would be no way to recover it).
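
The lock-counter behaviour described for discardContent() and keepContent() can be modelled in a few lines. The following is only a minimal sketch under stated assumptions: ContentLockSketch and its fromString flag are hypothetical names invented for illustration, and this is not the WebSPHINX implementation. In particular, ignoring discard requests for string-created pages is one guess at how "not subject to discarding" might be realized.

```java
/**
 * Hypothetical model of the content lock counter described above.
 * Not WebSPHINX code; every name here is illustrative.
 */
final class ContentLockSketch {
    private String content;           // stands in for content, tokens, tags, words, ...
    private int contentLock;          // number of callers interested in the content
    private final boolean fromString; // true if the content could not be re-downloaded

    ContentLockSketch(String content, boolean fromString) {
        this.content = content;
        this.fromString = fromString;
        this.contentLock = 1;         // documented: counter is 1 after the initial download
    }

    /** Register one more caller that wants the content preserved. */
    synchronized void keepContent() {
        contentLock++;
    }

    /** Release one interest; drop the content once nobody is left. */
    synchronized void discardContent() {
        if (fromString) {
            return;                   // assumption: unrecoverable content is never dropped
        }
        if (contentLock > 0 && --contentLock == 0) {
            content = null;           // now eligible for garbage collection
        }
    }

    /** True while the content is still held. */
    synchronized boolean hasContent() {
        return content != null;
    }
}
```

Under this model, a caller that needs the content beyond the crawler's own use would pair each keepContent() with exactly one discardContent().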




download
public void download(DownloadParameters dp, HTMLParser parser) throws IOException

downloadSafely
void downloadSafely()

getBase
public URL getBase()
Get the base URL, relative to which the page's links were interpreted. The base URL defaults to the URL of the Link that was used to download the page. If any redirects occur while downloading the page, the final location becomes the new base URL. Lastly, if a <base> element is found in the page, that becomes the new base URL.
Returns:
  the page's base URL.
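
To illustrate what "interpreted relative to the base URL" means, here is a small sketch that resolves relative links with the standard java.net.URI class. It does not use WebSPHINX at all; BaseUrlSketch and the example.com addresses are assumptions made for the example.

```java
import java.net.URI;

/** Illustrative only: resolves a link's href against a base URL. */
final class BaseUrlSketch {
    /**
     * Resolve href (possibly relative) against base, the way a browser
     * or crawler interprets links found in a page.
     */
    static URI resolve(String base, String href) {
        return URI.create(base).resolve(href);
    }

    public static void main(String[] args) {
        String base = "http://example.com/docs/index.html";
        System.out.println(resolve(base, "page2.html"));
        System.out.println(resolve(base, "../img/logo.gif"));
    }
}
```

If a redirect or a <base> element changes the base URL, the same relative href resolves to a different absolute URL, which is why the page tracks its base separately from its originating link.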



getContent
public String getContent()
Get the content of the page as a String. May not work properly for binary data like images; use getContentBytes instead.
Returns:
  the String content of the page.

getContentBytes
public byte[] getContentBytes()
Get the content of the page as an array of bytes.
Returns:
  the content of the page in binary form.

getContentEncoding
public String getContentEncoding()
Get content encoding of page.
Returns:
  the encoding type of page, such as "base-64", or null if not known.

getContentType
public String getContentType()
Get MIME type of page.
Returns:
  the MIME type of page, such as "text/html", or null if not known.

getDepth
public int getDepth()
Get depth of page in crawl.
Returns:
  depth of page from root (depth of page is same as depth of its originating link)

getElements
public Element[] getElements()
Get the HTML elements in the page. All elements in the page are included in the list, in the order they would appear in an inorder traversal of the HTML parse tree.
Returns:
  HTML elements in the page ordered by inorder, or null if the page hasn't been downloaded or parsed.

getExpiration
public long getExpiration()
Get expiration date of page.
Returns:
  the expiration date of the page, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.

getLastModified
public long getLastModified()
Get last-modified date of page.
Returns:
  the date when the page was last modified, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.
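
A caller will usually want these long values as real timestamps. The sketch below takes the documentation at its word that the unit is seconds since the epoch and that 0 means "not known"; PageTimestampSketch is a hypothetical helper, not part of WebSPHINX. Note that java.net.URLConnection reports its own expiration and last-modified values in milliseconds, so if a value looks implausible, Instant.ofEpochMilli may be the conversion to try instead.

```java
import java.time.Instant;
import java.util.Optional;

/** Illustrative helper for the date values described above. */
final class PageTimestampSketch {
    /**
     * Convert a documented "seconds since January 1, 1970 GMT" value
     * to an Instant, mapping the sentinel 0 to "not known".
     */
    static Optional<Instant> toInstant(long secondsSinceEpoch) {
        if (secondsSinceEpoch == 0L) {
            return Optional.empty();   // 0 is documented as "not known"
        }
        return Optional.of(Instant.ofEpochSecond(secondsSinceEpoch));
    }
}
```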



getLinks
public Link[] getLinks()
Get the links found in the page.
Returns:
  links in the page, or null if the page hasn't been downloaded or parsed.

getOrigin
public Link getOrigin()
Get the Link that points to this page.
Returns:
  the Link object that was used to download this page.

getResponseCode
public int getResponseCode()
Get response code returned by the Web server. For a list of possible values, see java.net.HttpURLConnection.
Returns:
  response code, such as 200 (for OK) or 404 (not found). Code is -1 if unknown.
See Also:   java.net.HttpURLConnection
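
Since the response codes are the ones defined by java.net.HttpURLConnection, a caller can compare against its named constants rather than bare numbers. The following sketch assumes only the standard library; ResponseCodeSketch and its describe method are hypothetical names, and the returned strings are illustrative rather than anything WebSPHINX produces.

```java
import java.net.HttpURLConnection;

/** Illustrative mapping from a response code to a short description. */
final class ResponseCodeSketch {
    static String describe(int code) {
        if (code == -1) {
            return "unknown";          // documented sentinel for "no code available"
        }
        switch (code) {
            case HttpURLConnection.HTTP_OK:        // 200
                return "OK";
            case HttpURLConnection.HTTP_NOT_FOUND: // 404
                return "Not Found";
            default:
                return "HTTP " + code;
        }
    }
}
```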



getResponseMessage
public String getResponseMessage()
Get response message returned by the Web server.
Returns:
  response message, such as "OK" or "Not Found". The response message is null if the page could not be fetched or the message is not known.

getRootElement
public Element getRootElement()
Get the root HTML element of the page.
Returns:
  first top-level HTML element in the page, or null if the page hasn't been downloaded or parsed.

getTags
public Tag[] getTags()
Get the tag sequence of the page.
Returns:
  tags in the page, or null if the page hasn't been downloaded or parsed.

getTitle
public String getTitle()
Get the title of the page.
Returns:
  the page's title, or null if the page hasn't been parsed.

getTokens
public Region[] getTokens()
Get the token sequence of the page. Tokens are tags and whitespace-delimited text.
Returns:
  token regions in the page, or null if the page hasn't been downloaded or parsed.

getURL
public URL getURL()
Get the URL.
Returns:
  the URL of the link that was used to download this page

getWords
public Text[] getWords()
Get the words in the page. Words are whitespace- and tag-delimited text.
Returns:
  words in the page, or null if the page hasn't been downloaded or parsed.

hasContent
final public boolean hasContent()
Test if page content is available.
Returns:
  true if content is downloaded and available, false if content has not been downloaded or has been discarded.

isHTML
public boolean isHTML()
Test whether page is HTML.
Returns:
  true if page is HTML.

isImage
public boolean isImage()
Test whether page is a GIF or JPEG image.
Returns:
  true if page is a GIF or JPEG image, false if not

isParsed
public boolean isParsed()
Test whether page has been parsed. A page is parsed during download only if its MIME type is HTML or unspecified.
Returns:
  true if page was parsed, false if not
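
The MIME-type rules mentioned for isImage() and isParsed() can be pictured as simple checks on the content-type string. This is a sketch under assumptions, not the library's logic: MimeTypeSketch, shouldParse, and the decision to ignore parameters such as a charset are all choices made for the example.

```java
import java.util.Locale;

/** Illustrative MIME-type checks; not taken from WebSPHINX. */
final class MimeTypeSketch {
    /** True for GIF or JPEG types, ignoring case and any parameters. */
    static boolean isImage(String contentType) {
        String type = baseType(contentType);
        return type.equals("image/gif") || type.equals("image/jpeg");
    }

    /** Parse only when the type is HTML or was never specified (null). */
    static boolean shouldParse(String contentType) {
        return contentType == null || baseType(contentType).equals("text/html");
    }

    /** Strip parameters such as "; charset=UTF-8" and normalize case. */
    private static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String bare = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return bare.trim().toLowerCase(Locale.ROOT);
    }
}
```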



keepContent
public void keepContent()
Lock the page's content (to prevent it from being discarded). This method increments a lock counter, representing all the callers interested in preserving the content. The lock counter is set to 1 when the page is initially downloaded.

main
public static void main(String[] args) throws Exception

parse
public void parse(HTMLParser parser)
Parse the page. Assumes the page has already been downloaded.
Parameters:
  parser - HTML parser to use
Throws:
  RuntimeException - if an error occurs in downloading the page

setContentEncoding
public void setContentEncoding(String encoding)
Set content encoding of page.
Parameters:
  encoding - the encoding type of page, such as "base-64", or null if not known.

setContentType
public void setContentType(String type)
Set MIME type of page.
Parameters:
  type - the MIME type of page, such as "text/html", or null if not known.

setExpiration
public void setExpiration(long expire)
Set expiration date of page.
Parameters:
  expire - the expiration date of the page, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.

setLastModified
public void setLastModified(long last)
Set last-modified date of page.
Parameters:
  last - the date when the page was last modified, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.



substringCanonicalTags
public String substringCanonicalTags(int start, int end)
Get canonicalized HTML tags found in a region. A canonicalized tag looks like the following:
 <tagname#index attr=value attr=value attr=value ...>

where tagname and attr are all lowercase, and index is the tag's index in the page's tokens array. Attributes are sorted in increasing order by attribute name. Attributes without values omit the entire "=value" portion. Values are delimited by a space. All occurrences of <, >, space, and % characters in a value are URL-encoded (e.g., space is converted to %20). Thus the only occurrences of these characters in the canonical tag are the tag delimiters.

For example, raw HTML that looks like:
 <IMG SRC="http://foo.com/map<>.gif" ISMAP>Image</IMG>

would be canonicalized to:
 <img ismap src=http://foo.com/map%3C%3E.gif></img>

Comment and declaration tags (whose tag name is !) are omitted from the canonicalization.
Parameters:
  start - starting offset of region
  end - ending offset of region
Returns:
  canonicalized tags contained in the region
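
The canonicalization rules can be sketched for a single start tag as follows. This is a hypothetical reconstruction from the description only, not WebSPHINX's code: CanonicalTagSketch, encodeValue, and canonicalize are invented names, the tag is assumed to be already split into a name and an attribute map, and the index 3 used in the usage note is arbitrary. The sketch follows the stated <tagname#index ...> format, whereas the documentation's own example output omits the #index part.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative canonicalization of one start tag; names are hypothetical. */
final class CanonicalTagSketch {
    /** Percent-encode the four characters the description singles out. */
    static String encodeValue(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '<': sb.append("%3C"); break;
                case '>': sb.append("%3E"); break;
                case ' ': sb.append("%20"); break;
                case '%': sb.append("%25"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Build "<tagname#index attr=value ...>" with lowercase names and
     * attributes sorted by name. A null value means "attribute without
     * a value", so its "=value" portion is omitted.
     */
    static String canonicalize(String tagName, int index, Map<String, String> attrs) {
        // TreeMap iterates keys in increasing order, which gives the sort.
        Map<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sorted.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        StringBuilder sb = new StringBuilder("<");
        sb.append(tagName.toLowerCase(Locale.ROOT)).append('#').append(index);
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            sb.append(' ').append(e.getKey());
            if (e.getValue() != null) {
                sb.append('=').append(encodeValue(e.getValue()));
            }
        }
        return sb.append('>').toString();
    }
}
```

Applied to the IMG tag from the example above (SRC mapped to its URL, ISMAP mapped to null, index 3), this sketch yields <img#3 ismap src=http://foo.com/map%3C%3E.gif>.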




substringContent
public String substringContent(int start, int end)
Get raw content found in a region.
Parameters:
  start - starting offset of region
  end - ending offset of region
Returns:
  raw HTML contained in the region

substringHTML
public String substringHTML(int start, int end)
Get HTML found in a region.
Parameters:
  start - starting offset of region
  end - ending offset of region
Returns:
  representation of region as HTML

substringTags
public String substringTags(int start, int end)
Get HTML tags found in a region. Whitespace and text among the tags are deleted.
Parameters:
  start - starting offset of region
  end - ending offset of region
Returns:
  tags contained in the region

substringText
public String substringText(int start, int end)
Get tagless text found in a region. Runs of whitespace and tags are reduced to a single space character.
Parameters:
  start - starting offset of region
  end - ending offset of region
Returns:
  tagless text contained in the region
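
The "runs of whitespace and tags are reduced to a single space" rule can be approximated with one regular expression. This is a rough sketch, not how WebSPHINX does it (the library works from its parsed token list): TaglessTextSketch is a hypothetical name, and the naive pattern will misbehave on comments or attribute values that contain a > character.

```java
/** Illustrative regex approximation of tagless text extraction. */
final class TaglessTextSketch {
    /**
     * Replace every maximal run of tags and/or whitespace with one space.
     * Leading and trailing runs also become a single space; whether the
     * real method trims them is not stated in the documentation.
     */
    static String taglessText(String html) {
        return html.replaceAll("(?:<[^>]*>|\\s)+", " ");
    }
}
```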



toDescription
public String toDescription()
Generate a human-readable description of the page.
Returns:
  a description of the link, in the form "title [url]".

toString
public String toString()
Get page containing the region.
Returns:
  page containing the region

toURL
public String toURL()
Convert the link's URL to a String.
Returns:
  the URL represented as a string



Fields inherited from websphinx.Region
final static int INITIAL_SIZE
final public static String TRUE
protected int end
protected Hashtable names
protected Page source
protected int start

Methods inherited from websphinx.Region
public Enumeration enumerateObjectLabels()
public static int findEnd(Region[] regions, int p)
public static int findStart(Region[] regions, int p)
public int getEnd()
public Region getField(String name)
public Region[] getFields(String name)
public String getLabel(String name)
public String getLabel(String name, String defaultValue)
public int getLength()
public Number getNumericLabel(String name, Number defaultValue)
public Object getObjectLabel(String name)
public String getObjectLabels()
public Element getRootElement()
public Page getSource()
public int getStart()
public boolean hasAllLabels(String expr)
public boolean hasAllLabels(String[] labels)
public boolean hasAnyLabels(String expr)
public boolean hasAnyLabels(String[] labels)
public boolean hasLabel(String name)
public void removeLabel(String name)
public void setField(String name, Region region)
public void setFields(String name, Region[] regions)
public void setLabel(String name, String value)
public void setLabel(String name)
public void setObjectLabel(String name, Object value)
public Region span(Region r)
public String toHTML()
public String toString()
public String toTags()
public String toText()

Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException
