Java Doc for Crawler.java (WebSPHINX » websphinx)


java.lang.Object
   websphinx.Crawler

All known subclasses: websphinx.searchengine.Search

Crawler
public class Crawler implements Runnable, Serializable
Web crawler.

To write a crawler, extend this class and override shouldVisit() and visit(); a minimal example follows this description.

To use a crawler:

  1. Initialize the crawler by calling setRoot() (or one of its variants) and setting other crawl parameters.
  2. Register any classifiers you need with addClassifier().
  3. Connect event listeners to monitor the crawler, such as websphinx.EventLog, websphinx.workbench.WebGraph, or websphinx.workbench.Statistics.
  4. Call run() to start the crawler.
A running crawler consists of a priority queue of Links waiting to be visited and a set of threads retrieving pages in parallel. When a page is downloaded, it is processed as follows:
  1. classify(): The page is passed to the classify() method of every registered classifier, in increasing order of their priority values. Classifiers typically attach informative labels to the page and its links, such as "homepage" or "root page".
  2. visit(): The page is passed to the crawler's visit() method for user-defined processing.
  3. expand(): The page is passed to the crawler's expand() method to be expanded. The default implementation tests every unvisited hyperlink on the page with shouldVisit(), and puts each link approved by shouldVisit() into the crawling queue.
By default, when expanding the links of a page, the crawler only considers hyperlinks (not applets or inline images, for instance) that point to Web pages (not mailto: links, for instance). If you want shouldVisit() to test every link on the page, use setLinkType(Crawler.ALL_LINKS).
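
A minimal sketch of the usage described above. The setRootHrefs(), setDomain(), setMaxDepth(), run(), shouldVisit(), and visit() calls are all documented on this page; the Link.getHost() and Page.getURL() accessors are assumed from the wider WebSPHINX API, and the URL and host are hypothetical.

    import java.net.MalformedURLException;
    import websphinx.Crawler;
    import websphinx.Link;
    import websphinx.Page;

    public class MyCrawler extends Crawler {
        // Follow only links that stay on the (hypothetical) example.com host.
        public boolean shouldVisit(Link l) {
            return l.getHost().endsWith("example.com");  // assumed accessor
        }

        // Process each downloaded page; the default visit() does nothing.
        public void visit(Page page) {
            System.out.println("Visited: " + page.getURL());  // assumed accessor
        }

        public static void main(String[] args) throws MalformedURLException {
            MyCrawler crawler = new MyCrawler();
            crawler.setRootHrefs("http://www.example.com/");  // step 1: set roots
            crawler.setDomain(Crawler.SERVER);                // stay on one server
            crawler.setMaxDepth(3);                           // at most 3 hops from the root
            crawler.run();                                    // step 4: start crawling
        }
    }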


Field Summary
final public static String[] ALL_LINKS
     Specify ALL_LINKS as the link type to allow the crawler to visit any kind of link.
final public static String[] HYPERLINKS
     Specify HYPERLINKS as the link type to allow the crawler to visit only hyperlinks (A, AREA, and FRAME tags which point to http:, ftp:, file:, or gopher: URLs).
final public static String[] HYPERLINKS_AND_IMAGES
     Specify HYPERLINKS_AND_IMAGES as the link type to allow the crawler to visit only hyperlinks and inline images.
final public static String[] SERVER
     Specify SERVER as the crawl domain to limit the crawler to visit only pages on the same Web server (hostname and port number) as the root link from which it started.
final public static String[] SUBTREE
     Specify SUBTREE as the crawl domain to limit the crawler to visit only pages which are descendants of the root link from which it started.
final public static String[] WEB
     Specify WEB as the crawl domain to allow the crawler to visit any page on the World Wide Web.

Constructor Summary
public Crawler()
     Make a new Crawler.

Method Summary
public void addClassifier(Classifier c)
     Adds a classifier to this crawler.
public void addCrawlListener(CrawlListener listen)
     Adds a listener to the set of CrawlListeners for this crawler.
public void addLinkListener(LinkListener listen)
     Adds a listener to the set of LinkListeners for this crawler.
public void addRoot(Link link)
     Add a root to the existing set of roots.
public void clear()
     Initialize the crawler for a fresh crawl.
protected void clearVisited()
     Clear the set of visited links.
public Enumeration enumerateClassifiers()
     Enumerates the set of classifiers.
public Enumeration enumerateQueue()
     Enumerate the crawling queue.
public void expand(Page page)
     Expand the crawl from a page.
void fetch(Worm w)
void fetchTimedOut(Worm w, int interval)
public Action getAction()
     Get action.
public int getActiveThreads()
     Get number of threads currently working.
public Classifier[] getClassifiers()
     Get the set of registered classifiers.
public Link[] getCrawledRoots()
     Get roots of last crawl.
public boolean getDepthFirst()
     Get depth-first search flag.
public String[] getDomain()
     Get crawl domain.
public DownloadParameters getDownloadParameters()
     Get download parameters.
public boolean getIgnoreVisitedLinks()
     Get ignore-visited-links flag.
public LinkPredicate getLinkPredicate()
     Get link predicate.
public String[] getLinkType()
     Get legal link types to crawl.
public int getLinksTested()
     Get number of links tested.
public int getMaxDepth()
     Get maximum depth.
public String getName()
     Get human-readable name of crawler.
public PagePredicate getPagePredicate()
     Get page predicate.
public int getPagesLeft()
     Get number of pages left to be visited.
public int getPagesVisited()
     Get number of pages visited.
public String getRootHrefs()
     Get starting points of crawl as a String of newline-delimited URLs.
public Link[] getRoots()
     Get starting points of crawl as an array of Link objects.
public int getState()
     Get state of crawler.
public boolean getSynchronous()
     Get synchronous flag.
public static void main(String[] args)
protected void markVisited(Link link)
     Register that a link has been visited.
public void pause()
     Pause the crawl in progress.
void process(Link link)
public void removeAllClassifiers()
     Clears the set of classifiers.
public void removeClassifier(Classifier c)
     Removes a classifier from the set of classifiers.
public void removeCrawlListener(CrawlListener listen)
     Removes a listener from the set of CrawlListeners.
public void removeLinkListener(LinkListener listen)
     Removes a listener from the set of LinkListeners.
public void run()
     Start crawling.
protected void sendCrawlEvent(int id)
     Send a CrawlEvent to all CrawlListeners registered with this crawler.
protected void sendLinkEvent(Link l, int id)
     Send a LinkEvent to all LinkListeners registered with this crawler.
protected void sendLinkEvent(Link l, int id, Throwable exception)
     Send an exceptional LinkEvent to all LinkListeners registered with this crawler.
public void setAction(Action act)
     Set the action.
public void setDepthFirst(boolean useDFS)
     Set depth-first search flag.
public void setDomain(String[] domain)
     Set crawl domain.
public void setDownloadParameters(DownloadParameters dp)
     Set download parameters.
public void setIgnoreVisitedLinks(boolean f)
     Set ignore-visited-links flag.
public void setLinkPredicate(LinkPredicate pred)
     Set link predicate.
public void setLinkType(String[] type)
     Set legal link types to crawl.
public void setMaxDepth(int maxDepth)
     Set maximum depth.
public void setName(String name)
     Set human-readable name of crawler.
public void setPagePredicate(PagePredicate pred)
     Set page predicate.
public void setRoot(Link link)
     Set starting point of crawl as a single Link.
public void setRootHrefs(String hrefs)
     Set starting points of crawl as a string of whitespace-delimited URLs.
public void setRoots(Link[] links)
     Set starting points of crawl as an array of Links.
public void setSynchronous(boolean f)
     Set synchronous flag.
public boolean shouldVisit(Link l)
     Callback for testing whether a link should be traversed. Default version returns true for all links.
public void stop()
     Stop the crawl in progress.
public void submit(Link link)
     Puts a link into the crawling queue.
public void submit(Link[] links)
     Submit an array of Links for crawling.
void timedOut()
public String toString()
     Convert the crawler to a String.
public void visit(Page page)
     Callback for visiting a page.
public boolean visited(Link link)
     Test whether the page corresponding to a link has been visited (or queued for visiting).

Field Detail
ALL_LINKS
final public static String[] ALL_LINKS
Specify ALL_LINKS as the link type to allow the crawler to visit any kind of link.



HYPERLINKS
final public static String[] HYPERLINKS
Specify HYPERLINKS as the link type to allow the crawler to visit only hyperlinks (A, AREA, and FRAME tags which point to http:, ftp:, file:, or gopher: URLs).



HYPERLINKS_AND_IMAGES
final public static String[] HYPERLINKS_AND_IMAGES
Specify HYPERLINKS_AND_IMAGES as the link type to allow the crawler to visit only hyperlinks and inline images.



SERVER
final public static String[] SERVER
Specify SERVER as the crawl domain to limit the crawler to visit only pages on the same Web server (hostname and port number) as the root link from which it started.



SUBTREE
final public static String[] SUBTREE
Specify SUBTREE as the crawl domain to limit the crawler to visit only pages which are descendants of the root link from which it started.



WEB
final public static String[] WEB
Specify WEB as the crawl domain to allow the crawler to visit any page on the World Wide Web.




Constructor Detail
Crawler
public Crawler()
Make a new Crawler.




Method Detail
addClassifier
public void addClassifier(Classifier c)
Adds a classifier to this crawler. If the classifier is already found in the set, does nothing.
Parameters:
  c - a classifier



addCrawlListener
public void addCrawlListener(CrawlListener listen)
Adds a listener to the set of CrawlListeners for this crawler. If the listener is already found in the set, does nothing.
Parameters:
  listen - a listener



addLinkListener
public void addLinkListener(LinkListener listen)
Adds a listener to the set of LinkListeners for this crawler. If the listener is already found in the set, does nothing.
Parameters:
  listen - a listener



addRoot
public void addRoot(Link link)
Add a root to the existing set of roots.
Parameters:
  link - starting point to add



clear
public void clear()
Initialize the crawler for a fresh crawl. Clears the crawling queue and sets all crawling statistics to 0. Stops the crawler if it is currently running.



clearVisited
protected void clearVisited()
Clear the set of visited links.



enumerateClassifiers
public Enumeration enumerateClassifiers()
Enumerates the set of classifiers.
Returns:
  an enumeration of the classifiers.



enumerateQueue
public Enumeration enumerateQueue()
Enumerate the crawling queue.
Returns:
  an enumeration of the Link objects which are waiting to be visited.



expand
public void expand(Page page)
Expand the crawl from a page. The default implementation of this method tests every link on the page using shouldVisit(), and submits the approved links with submit(). A subclass may want to override this method if it is inconvenient to consider the links individually with shouldVisit().
Parameters:
  page - Page to expand
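
A hedged sketch of an expand() override in a Crawler subclass, for cases where every link should be enqueued without testing each one individually. Page.getLinks() is assumed from the wider WebSPHINX API and is not documented on this page; visited() and submit() are documented below.

    public void expand(Page page) {
        // Enqueue every unvisited link on the page, bypassing shouldVisit().
        Link[] links = page.getLinks();   // assumed accessor
        if (links == null)
            return;
        for (int i = 0; i < links.length; i++) {
            if (!visited(links[i]))
                submit(links[i]);
        }
    }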



fetch
void fetch(Worm w)



fetchTimedOut
void fetchTimedOut(Worm w, int interval)



getAction
public Action getAction()
Get action.
Returns:
  the current action



getActiveThreads
public int getActiveThreads()
Get number of threads currently working.
Returns:
  number of threads downloading pages



getClassifiers
public Classifier[] getClassifiers()
Get the set of classifiers.
Returns:
  an array containing the registered classifiers.



getCrawledRoots
public Link[] getCrawledRoots()
Get roots of last crawl. May differ from getRoots() if new roots have been set.
Returns:
  array of Links from which the crawler started its last crawl, or null if the crawler was cleared.



getDepthFirst
public boolean getDepthFirst()
Get depth-first search flag. Default value is true.
Returns:
  true if the search is depth-first, false if it is breadth-first.



getDomain
public String[] getDomain()
Get crawl domain. Default value is WEB.
Returns:
  WEB, SERVER, or SUBTREE.



getDownloadParameters
public DownloadParameters getDownloadParameters()
Get download parameters (such as number of threads, timeouts, maximum page size, etc.).



getIgnoreVisitedLinks
public boolean getIgnoreVisitedLinks()
Get ignore-visited-links flag. Default value is true.
Returns:
  true if the search skips links whose URLs have already been visited (or queued for visiting).



getLinkPredicate
public LinkPredicate getLinkPredicate()
Get link predicate.
Returns:
  the current link predicate



getLinkType
public String[] getLinkType()
Get legal link types to crawl. Default value is HYPERLINKS.
Returns:
  HYPERLINKS, HYPERLINKS_AND_IMAGES, or ALL_LINKS.



getLinksTested
public int getLinksTested()
Get number of links tested.
Returns:
  number of links passed to shouldVisit() so far in this crawl



getMaxDepth
public int getMaxDepth()
Get maximum depth. Default value is 5.
Returns:
  maximum depth of crawl, in hops from the starting point.



getName
public String getName()
Get human-readable name of crawler. Default value is the class name, e.g., "Crawler". Useful for identifying the crawler in a user interface; also used as the default User-agent for identifying the crawler to a remote Web server. (The User-agent can be changed independently of the crawler name with setDownloadParameters().)
Returns:
  human-readable name of crawler



getPagePredicate
public PagePredicate getPagePredicate()
Get page predicate.
Returns:
  the current page predicate



getPagesLeft
public int getPagesLeft()
Get number of pages left to be visited.
Returns:
  number of links approved by shouldVisit() but not yet visited



getPagesVisited
public int getPagesVisited()
Get number of pages visited.
Returns:
  number of pages passed to visit() so far in this crawl



getRootHrefs
public String getRootHrefs()
Get starting points of crawl as a String of newline-delimited URLs.
Returns:
  URLs where the crawler will start, separated by newlines.



getRoots
public Link[] getRoots()
Get starting points of crawl as an array of Link objects.
Returns:
  array of Links from which the crawler will start its next crawl.



getState
public int getState()
Get state of crawler.
Returns:
  one of CrawlEvent.STARTED, CrawlEvent.PAUSED, CrawlEvent.STOPPED, or CrawlEvent.CLEARED.



getSynchronous
public boolean getSynchronous()
Get synchronous flag. Default value is false.
Returns:
  true if the crawler must visit pages in priority order; false if it can visit pages in any order.



main
public static void main(String[] args) throws Exception



markVisited
protected void markVisited(Link link)
Register that a link has been visited.
Parameters:
  link - Link that has been visited



pause
public void pause()
Pause the crawl in progress. If the crawler is running, it finishes processing the current page, then returns. The queues remain as-is, so calling run() again will resume the crawl exactly where it left off. pause() can be called from any thread.



process
void process(Link link)



removeAllClassifiers
public void removeAllClassifiers()
Clears the set of classifiers.



removeClassifier
public void removeClassifier(Classifier c)
Removes a classifier from the set of classifiers. If c is not found in the set, does nothing.
Parameters:
  c - a classifier



removeCrawlListener
public void removeCrawlListener(CrawlListener listen)
Removes a listener from the set of CrawlListeners. If it is not found in the set, does nothing.
Parameters:
  listen - a listener



removeLinkListener
public void removeLinkListener(LinkListener listen)
Removes a listener from the set of LinkListeners. If it is not found in the set, does nothing.
Parameters:
  listen - a listener



run
public void run()
Start crawling. Returns either when the crawl is done, or when pause() or stop() is called. Because this method implements the java.lang.Runnable interface, a crawler can be run in a background thread.
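
For example, a crawler can be started in the background and paused from another thread; a short sketch (MyCrawler is the hypothetical subclass from the earlier example):

    Crawler crawler = new MyCrawler();
    new Thread(crawler, "crawler").start();  // Crawler implements Runnable

    // ... later, from any thread:
    crawler.pause();                  // queues are preserved
    new Thread(crawler).start();      // calling run() again resumes the crawl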



sendCrawlEvent
protected void sendCrawlEvent(int id)
Send a CrawlEvent to all CrawlListeners registered with this crawler.
Parameters:
  id - Event id



sendLinkEvent
protected void sendLinkEvent(Link l, int id)
Send a LinkEvent to all LinkListeners registered with this crawler.
Parameters:
  l - Link related to event
  id - Event id



sendLinkEvent
protected void sendLinkEvent(Link l, int id, Throwable exception)
Send an exceptional LinkEvent to all LinkListeners registered with this crawler.
Parameters:
  l - Link related to event
  id - Event id
  exception - Exception associated with event



setAction
public void setAction(Action act)
Set the action. This is an alternative way to specify an action performed on every page. If act is non-null, then every page passed to visit() is also passed to this action.
Parameters:
  act - Action



setDepthFirst
public void setDepthFirst(boolean useDFS)
Set depth-first search flag. If neither depth-first nor breadth-first is desired, override shouldVisit() to set a custom priority on each link.
Parameters:
  useDFS - true if search should be depth-first, false if search should be breadth-first.



setDomain
public void setDomain(String[] domain)
Set crawl domain.
Parameters:
  domain - one of WEB, SERVER, or SUBTREE.



setDownloadParameters
public void setDownloadParameters(DownloadParameters dp)
Set download parameters (such as number of threads, timeouts, maximum page size, etc.).
Parameters:
  dp - Download parameters



setIgnoreVisitedLinks
public void setIgnoreVisitedLinks(boolean f)
Set ignore-visited-links flag.
Parameters:
  f - true if search skips links whose URLs have already been visited (or queued for visiting).



setLinkPredicate
public void setLinkPredicate(LinkPredicate pred)
Set link predicate. This is an alternative way to specify the links to walk. If the link predicate is non-null, then only links that satisfy the link predicate AND shouldVisit() are crawled.
Parameters:
  pred - Link predicate



setLinkType
public void setLinkType(String[] type)
Set legal link types to crawl.
Parameters:
  type - one of HYPERLINKS, HYPERLINKS_AND_IMAGES, or ALL_LINKS.



setMaxDepth
public void setMaxDepth(int maxDepth)
Set maximum depth.
Parameters:
  maxDepth - maximum depth of crawl, in hops from starting point



setName
public void setName(String name)
Set human-readable name of crawler.
Parameters:
  name - new name for crawler



setPagePredicate
public void setPagePredicate(PagePredicate pred)
Set page predicate. This is a way to filter the pages passed to visit(). If the page predicate is non-null, then only pages that satisfy it are passed to visit().
Parameters:
  pred - Page predicate



setRoot
public void setRoot(Link link)
Set starting point of crawl as a single Link.
Parameters:
  link - starting point



setRootHrefs
public void setRootHrefs(String hrefs) throws MalformedURLException
Set starting points of crawl as a string of whitespace-delimited URLs.
Parameters:
  hrefs - URLs of starting points, separated by space, \t, or \n
Throws:
  java.net.MalformedURLException - if any of the URLs is invalid, leaving starting points unchanged
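
A small usage sketch (the crawler variable and URLs are hypothetical); if any URL fails to parse, the exception is thrown and the previous starting points are kept:

    try {
        crawler.setRootHrefs("http://www.example.com/ http://www.example.org/docs/");
    } catch (MalformedURLException e) {
        System.err.println("Bad root URL: " + e.getMessage());
    }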



setRoots
public void setRoots(Link[] links)
Set starting points of crawl as an array of Links.
Parameters:
  links - starting points



setSynchronous
public void setSynchronous(boolean f)
Set synchronous flag.
Parameters:
  f - true if crawler must visit the pages in priority order; false if crawler can visit pages in any order.



shouldVisit
public boolean shouldVisit(Link l)
Callback for testing whether a link should be traversed. Default version returns true for all links. Override this method for more interesting behavior.
Parameters:
  l - Link encountered by the crawler
Returns:
  true if the link should be followed, false if it should be ignored.



stop
public void stop()
Stop the crawl in progress. If the crawler is running, it finishes processing the current page, then returns. Empties the crawling queue.



submit
public void submit(Link link)
Puts a link into the crawling queue. If the crawler is running, the link will eventually be retrieved and passed to visit().
Parameters:
  link - Link to put in queue



submit
public void submit(Link[] links)
Submit an array of Links for crawling. If the crawler is running, these links will eventually be retrieved and passed to visit().
Parameters:
  links - Links to put in queue



timedOut
void timedOut()



toString
public String toString()
Convert the crawler to a String.
Returns:
  human-readable name of crawler.



visit
public void visit(Page page)
Callback for visiting a page. Default version does nothing.
Parameters:
  page - Page retrieved by the crawler



visited
public boolean visited(Link link)
Test whether the page corresponding to a link has been visited (or queued for visiting).
Parameters:
  link - Link to test
Returns:
  true if the link has been passed to walk() during this crawl



Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException
