java.lang.Object
  websphinx.Crawler
All Known Subclasses: websphinx.searchengine.Search
Crawler | public class Crawler implements Runnable, Serializable(Code) | | Web crawler.
To write a crawler, extend this class and override
shouldVisit() and visit() to create your own crawler.
To use a crawler:
- Initialize the crawler by calling
setRoot() (or one of its variants) and setting other
crawl parameters.
- Register any classifiers you need with addClassifier().
- Connect event listeners to monitor the crawler,
such as websphinx.EventLog, websphinx.workbench.WebGraph,
or websphinx.workbench.Statistics.
- Call run() to start the crawler (a minimal sketch follows this list).
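Putting those steps together, here is a minimal sketch of a complete crawler. It uses only methods documented on this page, plus getURL() accessors on Link and Page (returning java.net.URL), which are assumed here rather than documented in this summary:

    import java.net.MalformedURLException;
    import websphinx.Crawler;
    import websphinx.Link;
    import websphinx.Page;

    // Minimal crawler: follows only links on example.com and prints each page.
    public class PrintingCrawler extends Crawler {

        // Approve a link only if its URL mentions the target host.
        public boolean shouldVisit(Link l) {
            return l.getURL().toString().indexOf("example.com") != -1; // getURL() assumed
        }

        // User-defined processing: print the URL of each visited page.
        public void visit(Page page) {
            System.out.println("visited: " + page.getURL()); // getURL() assumed
        }

        public static void main(String[] args) throws MalformedURLException {
            PrintingCrawler crawler = new PrintingCrawler();
            crawler.setRootHrefs("http://www.example.com/"); // whitespace-delimited URLs
            crawler.setMaxDepth(3);  // at most 3 hops from the starting point
            crawler.run();           // blocks until the crawl finishes
        }
    }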
A running crawler consists of a priority queue of
Links waiting to be visited and a set of threads
retrieving pages in parallel. When a page is downloaded,
it is processed as follows:
- classify(): The page is passed to the classify() method of
every registered classifier, in increasing order of
their priority values. Classifiers typically attach
informative labels to the page and its links, such as "homepage"
or "root page".
- visit(): The page is passed to the crawler's
visit() method for user-defined processing.
- expand(): The page is passed to the crawler's
expand() method to be expanded. The default implementation
tests every unvisited hyperlink on the page with shouldVisit(),
and puts
each link approved by shouldVisit() into the crawling queue.
By default, when expanding the links of a page, the crawler
only considers hyperlinks (not applets or inline images, for instance) that
point to Web pages (not mailto: links, for instance). If you want
shouldVisit() to test every link on the page, use setLinkType(Crawler.ALL_LINKS).
|
Field Summary | |
final public static String[] | ALL_LINKS Specify ALL_LINKS as the link type to allow the crawler to visit any kind of link. |
final public static String[] | HYPERLINKS Specify HYPERLINKS as the link type to allow the crawler to visit only hyperlinks (A, AREA, and FRAME tags which point to http:, ftp:, file:, or gopher: URLs). |
final public static String[] | HYPERLINKS_AND_IMAGES Specify HYPERLINKS_AND_IMAGES as the link type to allow the crawler to visit only hyperlinks and inline images. |
final public static String[] | SERVER Specify SERVER as the crawl domain to limit the crawler to visit only pages on the same Web server (hostname and port number) as the root link from which it started. |
final public static String[] | SUBTREE Specify SUBTREE as the crawl domain to limit the crawler to visit only pages which are descendants of the root link from which it started. |
final public static String[] | WEB Specify WEB as the crawl domain to allow the crawler to visit any page on the World Wide Web. |
Constructor Summary | |
public | Crawler() Make a new Crawler. |
ALL_LINKS | final public static String[] ALL_LINKS(Code) | | Specify ALL_LINKS as the link type to allow the crawler
to visit any kind of link.
|
HYPERLINKS | final public static String[] HYPERLINKS(Code) | | Specify HYPERLINKS as the link type to allow the crawler
to visit only hyperlinks (A, AREA, and FRAME tags which
point to http:, ftp:, file:, or gopher: URLs).
|
HYPERLINKS_AND_IMAGES | final public static String[] HYPERLINKS_AND_IMAGES(Code) | | Specify HYPERLINKS_AND_IMAGES as the link type to allow the crawler
to visit only hyperlinks and inline images.
|
SERVER | final public static String[] SERVER(Code) | | Specify SERVER as the crawl domain to limit the crawler
to visit only pages on the same Web server (hostname
and port number) as the root link from which it started.
|
SUBTREE | final public static String[] SUBTREE(Code) | | Specify SUBTREE as the crawl domain to limit the crawler
to visit only pages which are descendants of the root link
from which it started.
|
WEB | final public static String[] WEB(Code) | | Specify WEB as the crawl domain to allow the crawler
to visit any page on the World Wide Web.
|
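For example, the constants above can confine a crawl to its starting server while testing every kind of link; a short sketch using only the setters documented on this page:

    import java.net.MalformedURLException;
    import websphinx.Crawler;

    public class ServerCrawlSetup {
        public static void main(String[] args) throws MalformedURLException {
            Crawler crawler = new Crawler();         // base class; visit() does nothing by default
            crawler.setRootHrefs("http://www.example.com/");
            crawler.setDomain(Crawler.SERVER);       // stay on the root link's host and port
            crawler.setLinkType(Crawler.ALL_LINKS);  // pass every link to shouldVisit()
            crawler.setMaxDepth(2);
            crawler.run();
        }
    }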
Crawler | public Crawler()(Code) | | Make a new Crawler.
|
addClassifier | public void addClassifier(Classifier c)(Code) | | Adds a classifier to this crawler. If the
classifier is already found in the set, does nothing.
Parameters: c - a classifier |
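As an illustration, a classifier might label likely homepages. This sketch assumes Classifier is an interface declaring classify(Page) and getPriority(), and that Page offers getURL() and setLabel(String); none of these signatures are documented in this summary:

    import websphinx.Classifier;
    import websphinx.Page;

    // Hypothetical classifier that labels pages whose URL ends in "/".
    public class HomepageClassifier implements Classifier {

        public void classify(Page page) {
            if (page.getURL().toString().endsWith("/")) // getURL() assumed
                page.setLabel("homepage");              // setLabel() assumed
        }

        public float getPriority() {
            return 0.0F; // classifiers run in increasing order of priority
        }
    }

Register it with crawler.addClassifier(new HomepageClassifier()) before starting the crawl.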
addCrawlListener | public void addCrawlListener(CrawlListener listen)(Code) | | Adds a listener to the set of CrawlListeners for this crawler.
If the listener is already found in the set, does nothing.
Parameters: listen - a listener |
addLinkListener | public void addLinkListener(LinkListener listen)(Code) | | Adds a listener to the set of LinkListeners for this crawler.
If the listener is already found in the set, does nothing.
Parameters: listen - a listener |
addRoot | public void addRoot(Link link)(Code) | | Add a root to the existing set of roots.
Parameters: link - starting point to add |
clear | public void clear()(Code) | | Initialize the crawler for a fresh crawl. Clears the crawling queue
and sets all crawling statistics to 0. Stops the crawler
if it is currently running.
|
clearVisited | protected void clearVisited()(Code) | | Clear the set of visited links.
|
enumerateClassifiers | public Enumeration enumerateClassifiers()(Code) | | Enumerates the set of classifiers.
An enumeration of the classifiers. |
enumerateQueue | public Enumeration enumerateQueue()(Code) | | Enumerate crawling queue.
an enumeration of Link objects which are waiting to be visited. |
expand | public void expand(Page page)(Code) | | Expand the crawl from a page. The default implementation of this
method tests every link on the page using shouldVisit(), and
passes each approved link to submit(). A subclass may want to override
this method if it is inconvenient to consider the links individually
with shouldVisit().
Parameters: page - Page to expand |
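A sketch of such an override, assuming Page.getLinks() returns the page's links as a Link[] (an assumption; only submit(Link[]) below is documented):

    import websphinx.Crawler;
    import websphinx.Link;
    import websphinx.Page;

    public class BulkExpandCrawler extends Crawler {
        // Consider a page's links as a group: skip pages with too many links,
        // otherwise queue all of them at once via the documented submit(Link[]).
        public void expand(Page page) {
            Link[] links = page.getLinks(); // assumed accessor
            if (links != null && links.length <= 100)
                submit(links);
        }
    }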
fetch | void fetch(Worm w)(Code) | | |
fetchTimedOut | void fetchTimedOut(Worm w, int interval)(Code) | | |
getAction | public Action getAction()(Code) | | Get action.
current action |
getActiveThreads | public int getActiveThreads()(Code) | | Get number of threads currently working.
number of threads downloading pages |
getClassifiers | public Classifier[] getClassifiers()(Code) | | Get the set of classifiers.
An array containing the registered classifiers. |
getCrawledRoots | public Link[] getCrawledRoots()(Code) | | Get roots of last crawl. May differ from getRoots()
if new roots have been set.
array of Links from which crawler started its last crawl, or null if the crawler was cleared. |
getDepthFirst | public boolean getDepthFirst()(Code) | | Get depth-first search flag. Default value is true.
true if search is depth-first, false if search is breadth-first. |
getDomain | public String[] getDomain()(Code) | | Get crawl domain. Default value is WEB.
WEB, SERVER, or SUBTREE. |
getDownloadParameters | public DownloadParameters getDownloadParameters()(Code) | | Get download parameters (such as number of threads, timeouts, maximum
page size, etc.)
|
getIgnoreVisitedLinks | public boolean getIgnoreVisitedLinks()(Code) | | Get ignore-visited-links flag. Default value is true.
true if search skips links whose URLs have already been visited (or queued for visiting). |
getLinkPredicate | public LinkPredicate getLinkPredicate()(Code) | | Get link predicate.
current link predicate |
getLinkType | public String[] getLinkType()(Code) | | Get legal link types to crawl. Default value is HYPERLINKS.
HYPERLINKS, HYPERLINKS_AND_IMAGES, or ALL_LINKS. |
getLinksTested | public int getLinksTested()(Code) | | Get number of links tested.
number of links passed to shouldVisit() so far in this crawl |
getMaxDepth | public int getMaxDepth()(Code) | | Get maximum depth. Default value is 5.
maximum depth of crawl, in hops from starting point. |
getName | public String getName()(Code) | | Get human-readable name of crawler. Default value is the
class name, e.g., "Crawler". Useful for identifying the crawler in a
user interface; also used as the default User-agent for identifying
the crawler to a remote Web server. (The User-agent can be
changed independently of the crawler name with setDownloadParameters().)
human-readable name of crawler |
getPagePredicate | public PagePredicate getPagePredicate()(Code) | | Get page predicate.
current page predicate |
getPagesLeft | public int getPagesLeft()(Code) | | Get number of pages left to be visited.
number of links approved by shouldVisit() but not yet visited |
getPagesVisited | public int getPagesVisited()(Code) | | Get number of pages visited.
number of pages passed to visit() so far in this crawl |
getRootHrefs | public String getRootHrefs()(Code) | | Get starting points of crawl as a String of newline-delimited URLs.
URLs where crawler will start, separated by newlines. |
getRoots | public Link[] getRoots()(Code) | | Get starting points of crawl as an array of Link objects.
array of Links from which crawler will start its next crawl. |
getState | public int getState()(Code) | | Get state of crawler.
one of CrawlEvent.STARTED, CrawlEvent.PAUSED, CrawlEvent.STOPPED, or CrawlEvent.CLEARED. |
getSynchronous | public boolean getSynchronous()(Code) | | Get synchronous flag. Default value is false.
true if crawler must visit the pages in priority order; false if crawler can visit pages in any order. |
markVisited | protected void markVisited(Link link)(Code) | | Register that a link has been visited.
Parameters: link - Link that has been visited |
pause | public void pause()(Code) | | Pause the crawl in progress. If the crawler is running, then
it finishes processing the current page, then returns. The queues remain as-is,
so calling run() again will resume the crawl exactly where it left off.
pause() can be called from any thread.
|
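A sketch of pausing and resuming, with the crawl running in a worker thread; only methods documented on this page are used:

    import websphinx.Crawler;

    public class PauseDemo {
        public static void main(String[] args) throws Exception {
            Crawler crawler = new Crawler();
            crawler.setRootHrefs("http://www.example.com/");

            Thread worker = new Thread(crawler, "crawl");
            worker.start();              // run() executes in the worker thread

            Thread.sleep(5000);          // let it crawl for a while
            crawler.pause();             // finishes the current page, then run() returns
            worker.join();

            // The queues are untouched, so run() resumes exactly where it left off.
            new Thread(crawler, "crawl-resumed").start();
        }
    }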
removeAllClassifiers | public void removeAllClassifiers()(Code) | | Clears the set of classifiers.
|
removeClassifier | public void removeClassifier(Classifier c)(Code) | | Removes a classifier from the set of classifiers.
If c is not found in the set, does nothing.
Parameters: c - a classifier |
removeCrawlListener | public void removeCrawlListener(CrawlListener listen)(Code) | | Removes a listener from the set of CrawlListeners. If it is not found in the set,
does nothing.
Parameters: listen - a listener |
removeLinkListener | public void removeLinkListener(LinkListener listen)(Code) | | Removes a listener from the set of LinkListeners. If it is not found in the set,
does nothing.
Parameters: listen - a listener |
run | public void run()(Code) | | Start crawling. Returns either when the crawl is done, or
when pause() or stop() is called. Because Crawler implements the
java.lang.Runnable interface, a crawler can be run in a
background thread.
|
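Since Crawler implements Runnable, an instance can be handed directly to a java.lang.Thread; a minimal sketch:

    import websphinx.Crawler;

    public class BackgroundCrawl {
        public static void main(String[] args) throws Exception {
            Crawler crawler = new Crawler();
            crawler.setRootHrefs("http://www.example.com/");

            Thread background = new Thread(crawler, "websphinx-crawler");
            background.start();   // run() executes off the main thread
            // ... main thread remains free, e.g. for a user interface ...
            background.join();    // wait for the crawl to complete
        }
    }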
sendCrawlEvent | protected void sendCrawlEvent(int id)(Code) | | Send a CrawlEvent to all CrawlListeners registered with this crawler.
Parameters: id - Event id |
sendLinkEvent | protected void sendLinkEvent(Link l, int id)(Code) | | Send a LinkEvent to all LinkListeners registered with this crawler.
Parameters: l - Link related to event Parameters: id - Event id |
sendLinkEvent | protected void sendLinkEvent(Link l, int id, Throwable exception)(Code) | | Send an exceptional LinkEvent to all LinkListeners registered with this crawler.
Parameters: l - Link related to event Parameters: id - Event id Parameters: exception - Exception associated with event |
setAction | public void setAction(Action act)(Code) | | Set the action. This is an alternative way to specify
an action performed on every page. If act is non-null,
then every page passed to visit() is also passed to this
action.
Parameters: act - Action |
setDepthFirst | public void setDepthFirst(boolean useDFS)(Code) | | Set depth-first search flag. If neither depth-first nor breadth-first
is desired, then override shouldVisit() to set a custom priority on
each link.
Parameters: useDFS - true if search should be depth-first, false if search should be breadth-first. |
setDomain | public void setDomain(String[] domain)(Code) | | Set crawl domain.
Parameters: domain - one of WEB, SERVER, or SUBTREE. |
setDownloadParameters | public void setDownloadParameters(DownloadParameters dp)(Code) | | Set download parameters (such as number of threads, timeouts, maximum
page size, etc.)
Parameters: dp - Download parameters |
setIgnoreVisitedLinks | public void setIgnoreVisitedLinks(boolean f)(Code) | | Set ignore-visited-links flag.
Parameters: f - true if search skips links whose URLs have already been visited (or queued for visiting). |
setLinkPredicate | public void setLinkPredicate(LinkPredicate pred)(Code) | | Set link predicate. This is an alternative way to
specify the links to walk. If the link predicate is
non-null, then only links that satisfy
the link predicate AND shouldVisit() are crawled.
Parameters: pred - Link predicate |
setLinkType | public void setLinkType(String[] type)(Code) | | Set legal link types to crawl.
Parameters: type - one of HYPERLINKS, HYPERLINKS_AND_IMAGES, or ALL_LINKS. |
setMaxDepth | public void setMaxDepth(int maxDepth)(Code) | | Set maximum depth.
Parameters: maxDepth - maximum depth of crawl, in hops from starting point |
setName | public void setName(String name)(Code) | | Set human-readable name of crawler.
Parameters: name - new name for crawler |
setPagePredicate | public void setPagePredicate(PagePredicate pred)(Code) | | Set page predicate. This is a way to filter the pages
passed to visit(). If the page predicate is
non-null, then only pages that satisfy it are passed to visit().
Parameters: pred - Page predicate |
setRoot | public void setRoot(Link link)(Code) | | Set starting point of crawl as a single Link.
Parameters: link - starting point |
setRootHrefs | public void setRootHrefs(String hrefs) throws MalformedURLException(Code) | | Set starting points of crawl as a string of whitespace-delimited URLs.
Parameters: hrefs - URLs of starting points, separated by space, \t, or \n exception: java.net.MalformedURLException - if any of the URLs is invalid, leaving starting points unchanged |
setRoots | public void setRoots(Link[] links)(Code) | | Set starting points of crawl as an array of Links.
Parameters: links - starting points |
setSynchronous | public void setSynchronous(boolean f)(Code) | | Set synchronous flag.
Parameters: f - true if crawler must visit the pages in priority order; false if crawler can visit pages in any order. |
shouldVisit | public boolean shouldVisit(Link l)(Code) | | Callback for testing whether a link should be traversed.
Default version returns true for all links. Override this method
for more interesting behavior.
Parameters: l - Link encountered by the crawler true if link should be followed, false if it should be ignored. |
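For example, a subclass can prune binary files from the crawl; Link.getURL() returning a java.net.URL is assumed here, as it is not documented in this summary:

    import websphinx.Crawler;
    import websphinx.Link;

    public class NoBinariesCrawler extends Crawler {
        // Skip links to common binary file types; follow everything else.
        public boolean shouldVisit(Link l) {
            String path = l.getURL().getPath().toLowerCase(); // getURL() assumed
            return !(path.endsWith(".zip") || path.endsWith(".pdf")
                     || path.endsWith(".jpg") || path.endsWith(".gif"));
        }
    }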
stop | public void stop()(Code) | | Stop the crawl in progress. If the crawler is running, then
it finishes processing the current page, then returns.
Empties the crawling queue.
|
submit | public void submit(Link link)(Code) | | Puts a link into the crawling queue. If the crawler is running, the
link will eventually be retrieved and passed to visit().
Parameters: link - Link to put in queue |
submit | public void submit(Link[] links)(Code) | | Submit an array of Links for crawling. If the crawler is running,
these links will eventually be retrieved and passed to visit().
Parameters: links - Links to put in queue |
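A sketch of queueing extra work from inside visit(); the Link(String) constructor is an assumption, since only submit() itself is documented here:

    import java.net.MalformedURLException;
    import websphinx.Crawler;
    import websphinx.Link;
    import websphinx.Page;

    public class SeedingCrawler extends Crawler {
        private boolean seeded = false;

        // On the first visited page, also queue one extra URL for crawling.
        public void visit(Page page) {
            if (seeded)
                return;
            seeded = true;
            try {
                submit(new Link("http://www.example.com/sitemap.html")); // Link(String) assumed
            } catch (MalformedURLException e) {
                // a malformed seed URL is simply skipped
            }
        }
    }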
timedOut | void timedOut()(Code) | | |
toString | public String toString()(Code) | | Convert the crawler to a String.
Human-readable name of crawler. |
visit | public void visit(Page page)(Code) | | Callback for visiting a page. Default version does nothing.
Parameters: page - Page retrieved by the crawler |
visited | public boolean visited(Link link)(Code) | | Test whether the page corresponding to a link has been visited
(or queued for visiting).
Parameters: link - Link to test true if the link has been visited (or queued for visiting) during this crawl |