java.lang.Object
   java.lang.Thread
      bdd.search.spider.Crawler
Crawler

public class Crawler extends Thread

Written by Tim Macinta 1997. Distributed under the GNU Public License
(a copy of which is enclosed with the source).

Calling the Crawler's start() method will cause the Crawler to index
all of the sites in its queue and then replace the main index with the
updated index when it completes. The Crawler's queue should be filled
with the starting URLs before calling start().
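A minimal usage sketch is shown below. It assumes that EnginePrefs
lives in the bdd.search package and can be constructed with no
arguments; neither detail is documented here.

    import java.io.File;
    import java.net.URL;
    import bdd.search.EnginePrefs;        // assumed package
    import bdd.search.spider.Crawler;

    public class CrawlExample {
        public static void main(String[] args) throws Exception {
            EnginePrefs prefs = new EnginePrefs();  // assumed no-arg constructor

            // The working directory must be private to this Crawler
            // and its Indexer.
            Crawler crawler = new Crawler(new File("crawler_work"), prefs);

            // Fill the queue with all starting URLs before start().
            crawler.addURL(new URL("http://www.example.com/"));

            crawler.start();  // indexes the queue, then swaps in the new index
        }
    }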
Constructor Summary

public Crawler(File working_dir, EnginePrefs eng_prefs)
    "working_dir" should be a directory that only this Crawler and a
    given Indexer will be accessing.
Method Summary

public void addURL(URL url_to_queue)
    Takes "url_to_queue" and adds it to this Crawler's queue of URLs.
    This method should be used to add all of the desired starting URLs
    to the queue before the Crawler is started.

public static void main(String[] arg)
    This is the method that is called when this class is invoked from
    the command line.

public static void main(File file, EnginePrefs prefs)

public static void main(File file, EnginePrefs prefs, boolean exit)

public void run()
    This is where the actual crawling occurs.

URL simplify(URL url)
    Takes "url" and removes all references to "/./" and "/../".
Field Detail

boolean exit_when_done
Constructor Detail

Crawler

public Crawler(File working_dir, EnginePrefs eng_prefs)

"working_dir" should be a directory that only this Crawler and a given
Indexer will be accessing. This means that if several Crawlers are
running simultaneously, they should all be given different
"working_dir" directories. Also, no other threads should write to this
directory (except for the selected Indexer).
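For instance, two Crawlers started at the same time need disjoint
working directories (a sketch, with EnginePrefs assumed as above):

    import java.io.File;
    import bdd.search.EnginePrefs;        // assumed package
    import bdd.search.spider.Crawler;

    public class TwoCrawlers {
        public static void main(String[] args) {
            EnginePrefs prefs = new EnginePrefs();  // assumed no-arg constructor
            // One private working directory per Crawler; never shared.
            Crawler a = new Crawler(new File("work_a"), prefs);
            Crawler b = new Crawler(new File("work_b"), prefs);
            a.start();
            b.start();
        }
    }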
Method Detail

addURL

public void addURL(URL url_to_queue)

Takes "url_to_queue" and adds it to this Crawler's queue of URLs. This
method should be used to add all of the desired starting URLs to the
queue before the Crawler is started. If the URL has already been
processed, or if it is a disallowed URL, it is not added.
main

public static void main(String[] arg)

This is the method that is called when this class is invoked from the
command line. Calling this method will cause a Crawler to be created
and started, with the starting URLs listed in a file specified by the
first argument (arg[0]). The file listing the URLs should contain only
the URLs, with each URL on a line by itself. Blank lines are allowed,
and lines beginning with "#" are treated as comments and are ignored.
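A starting-URL file in the format described above might look like this
(hypothetical contents):

    # Seed URLs for the crawler; this line is ignored
    http://www.example.com/

    http://www.example.org/index.html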
run

public void run()

This is where the actual crawling occurs.
simplify

URL simplify(URL url)

Takes "url" and removes all references to "/./" and "/../". This can
be used to help eliminate looping. Also removes all anchors (i.e.,
everything after and including a '#').
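The sketch below shows one way the described normalization could work;
it is an illustration built from the description above, not the
class's actual package-private implementation.

    import java.net.MalformedURLException;
    import java.net.URL;

    public class UrlSimplifySketch {
        static URL simplify(URL url) throws MalformedURLException {
            String file = url.getFile();   // path (+ query); the '#' anchor
                                           // lives in getRef(), not here
            // Collapse every "/./" segment.
            while (file.contains("/./")) {
                file = file.replace("/./", "/");
            }
            // Resolve "/../" by removing the preceding path segment.
            int idx;
            while ((idx = file.indexOf("/../")) > 0) {
                int prev = file.lastIndexOf('/', idx - 1);
                if (prev < 0) break;       // nothing left to pop
                file = file.substring(0, prev) + file.substring(idx + 3);
            }
            // Rebuilding without a ref drops any '#anchor' from the result.
            return new URL(url.getProtocol(), url.getHost(), url.getPort(), file);
        }
    }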