| java.lang.Object net.matuschek.spider.NoRobots
NoRobots | public class NoRobots | | Implements the Robot Exclusion Standard.
The basic idea of the Robot Exclusion Standard is that each web server
can set up a single file called "/robots.txt" which contains pathnames
that robots should not look at.
See the full spec
for details.
Using this class is simple: create the object with your robot's
name and the HttpTool used to retrieve the data, then call ok() on
each URL. For efficiency, the class caches entries for servers you've
visited recently.
author: cn version: 0.1 |
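The class description mentions caching entries for recently visited servers but does not say how. One common way to do this in Java is a small least-recently-used map keyed by host name. The sketch below is only an illustration under that assumption; `HostRulesCacheSketch` and its methods are hypothetical names and are not part of `net.matuschek.spider`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of net.matuschek.spider: a bounded cache
// that keeps per-host robots.txt data for the most recently used hosts.
final class HostRulesCacheSketch<V> {
    private final Map<String, V> entries;

    HostRulesCacheSketch(final int maxEntries) {
        // accessOrder = true makes iteration order "least recently used first",
        // so the eldest entry is the one that was touched longest ago.
        this.entries = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                // Called after each put; evict once the cache grows past its limit.
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value for the host, or null if nothing is cached. */
    V get(String host) {
        return entries.get(host);
    }

    /** Stores the value for the host, evicting the least recently used entry if full. */
    void put(String host, V value) {
        entries.put(host, value);
    }

    int size() {
        return entries.size();
    }
}
```

With a cache like this, a crawler would fetch and parse robots.txt only on a cache miss for a host, then reuse the parsed rules for every later URL on that host until the entry is evicted.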
Field Summary | |
Category | log |
Method Summary | |
public void | finish() This method finishes the HttpTool. |
public boolean | getIgnore() Method getIgnore. |
protected static boolean | match(String pattern, String string) Method match. |
public boolean | ok(URL url) Check whether it's ok for this robot to fetch this URL. |
public void | setIgnore(boolean ignore) Method setIgnore. |
NoRobots | public NoRobots(String robotName, HttpTool inhttpTool) | | Constructor.
Parameters: robotName - the name of the robot Parameters: inhttpTool - the HttpTool instance used to download the robots.txt file |
finish | public void finish() | | This method finishes the HttpTool.
|
getIgnore | public boolean getIgnore() | | Method getIgnore.
Tells whether the Robot Exclusion Standard is being ignored.
Returns: true if the robots.txt check is skipped |
match | protected static boolean match(String pattern, String string) | | Method match.
Checks whether a string matches a given wildcard pattern.
Supports only the ? and * wildcards, and multiple patterns separated by |.
Parameters: pattern - the wildcard pattern Parameters: string - the string to test Returns: true if the string matches the pattern |
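To make the wildcard rules concrete, here is a minimal standalone sketch of a matcher that handles `?`, `*`, and alternatives separated by `|`. It is not the source of `NoRobots.match`; the class name `WildcardSketch` is hypothetical, and the sketch assumes that a pattern must match the whole string and that matching is case-sensitive, neither of which the documentation above states.

```java
// Hypothetical illustration of '?', '*' and '|' matching; not the library's code.
final class WildcardSketch {
    private WildcardSketch() {}

    /** Returns true if text matches at least one of the '|'-separated alternatives. */
    static boolean matches(String patterns, String text) {
        // "\\|" splits on a literal '|'; the -1 limit keeps empty alternatives.
        for (String alternative : patterns.split("\\|", -1)) {
            if (matchOne(alternative, 0, text, 0)) {
                return true;
            }
        }
        return false;
    }

    // Matches p[pi..] against s[si..]; the whole remainder of s must be consumed.
    private static boolean matchOne(String p, int pi, String s, int si) {
        while (pi < p.length()) {
            char c = p.charAt(pi);
            if (c == '*') {
                // '*' may absorb zero or more characters: try every split point.
                for (int k = si; k <= s.length(); k++) {
                    if (matchOne(p, pi + 1, s, k)) {
                        return true;
                    }
                }
                return false;
            }
            if (si >= s.length()) {
                return false; // pattern still needs a character, text is exhausted
            }
            if (c != '?' && c != s.charAt(si)) {
                return false; // literal mismatch ('?' accepts any single character)
            }
            pi++;
            si++;
        }
        return si == s.length();
    }
}
```

For example, under these assumptions `WildcardSketch.matches("*.gif|*.jpg", "photo.jpg")` is true because the second alternative matches, while `WildcardSketch.matches("a?c", "ac")` is false because `?` requires exactly one character.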
ok | public boolean ok(URL url) | | Checks whether it's ok for this robot to fetch this URL. Reads the
information in the robots.txt file on this host. If a robots.txt file is
present and it disallows the robot from retrieving the requested URL,
the method returns false.
Parameters: url - the URL we want to retrieve Returns: true if allowed to retrieve the URL, false otherwise |
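The check described for ok() can be illustrated with a simplified, standalone sketch of how robots.txt text might be turned into a list of disallowed path prefixes and then applied to a URL path. This is not how `NoRobots` is implemented; `RobotsRulesSketch`, its methods, and the robot names in the usage note are hypothetical. As a simplifying assumption, the sketch merges the rules of the `*` record with those of any record naming the robot, and it ignores every field other than `User-agent` and `Disallow`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt handling; not the library's implementation.
final class RobotsRulesSketch {
    private RobotsRulesSketch() {}

    /** Collects the Disallow prefixes from records that apply to robotName. */
    static List<String> disallowedPrefixes(String robotsTxt, String robotName) {
        List<String> prefixes = new ArrayList<>();
        String wanted = robotName.toLowerCase(Locale.ROOT);
        boolean applies = false;      // does the current record apply to this robot?
        boolean inAgentLines = false; // are we inside a run of User-agent lines?
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#'); // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                String agent = value.toLowerCase(Locale.ROOT);
                boolean match = agent.equals("*")
                        || (!agent.isEmpty() && wanted.contains(agent));
                // Several User-agent lines in a row share the rules that follow.
                applies = inAgentLines ? (applies || match) : match;
                inAgentLines = true;
            } else {
                inAgentLines = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** Returns false if the path starts with any disallowed prefix. */
    static boolean allowed(List<String> prefixes, String path) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Given a file whose `*` record disallows `/private/` and whose `OtherBot` record disallows `/`, a robot named `ExampleBot` would be allowed to fetch `/index.html` but not `/private/a.html`, while `OtherBot` would be refused both.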
setIgnore | public void setIgnore(boolean ignore) | | Method setIgnore.
Sets whether the Robot Exclusion Standard is ignored.
Parameters: ignore - if true, the Robot Exclusion Standard is ignored |