| org.archive.crawler.settings.ModuleType org.archive.crawler.datamodel.RobotsHonoringPolicy
RobotsHonoringPolicy | public class RobotsHonoringPolicy extends ModuleType | | RobotsHonoringPolicy represents the strategy used by the crawler
for determining how robots.txt files will be honored.
Five kinds of policies exist:
- classic:
- obey the first set of robots.txt directives that apply to your
current user-agent
- ignore:
- ignore robots.txt directives entirely
- custom:
- obey a specific operator-entered set of robots.txt directives
for a given host
- most-favored:
- obey the most liberal restrictions offered (if *any* crawler is
allowed to get a page, get it)
- most-favored-set:
- given some set of user-agent patterns, obey the most liberal
restriction offered to any
The last two policies can also adopt a different user-agent
to reflect the restrictions we've opted to use.
author: John Erik Halse |
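The five policies listed above can be illustrated with a small, self-contained sketch. This is not the Heritrix implementation: the class `RobotsPolicySketch`, its method names, and the representation of robots.txt as a map from user-agent token to disallowed path prefixes are all assumptions made for illustration, and real robots.txt matching is more involved.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of the five policies; not the Heritrix code. */
final class RobotsPolicySketch {

    /** True if any non-empty disallowed prefix matches the path. */
    static boolean blocked(List<String> disallowedPrefixes, String path) {
        for (String prefix : disallowedPrefixes) {
            // An empty Disallow value means "nothing is disallowed".
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** classic: use the rules for our own agent, falling back to "*". */
    static boolean classic(Map<String, List<String>> robots, String agent, String path) {
        List<String> rules = robots.get(agent);
        if (rules == null) {
            rules = robots.getOrDefault("*", Collections.<String>emptyList());
        }
        return !blocked(rules, path);
    }

    /** ignore: robots.txt is never consulted. */
    static boolean ignore(String path) {
        return true;
    }

    /** custom: classic matching against operator-entered rules for the host. */
    static boolean custom(Map<String, List<String>> operatorRobots, String agent, String path) {
        return classic(operatorRobots, agent, path);
    }

    /** most-favored: allowed if any agent named in robots.txt may fetch the path. */
    static boolean mostFavored(Map<String, List<String>> robots, String path) {
        if (robots.isEmpty()) {
            return true; // no rules at all, so nothing is restricted
        }
        for (List<String> rules : robots.values()) {
            if (!blocked(rules, path)) {
                return true;
            }
        }
        return false;
    }

    /** most-favored-set: allowed if any agent in the given set may fetch the path. */
    static boolean mostFavoredSet(Map<String, List<String>> robots,
                                  List<String> agents, String path) {
        for (String candidate : agents) {
            if (classic(robots, candidate, path)) {
                return true;
            }
        }
        return false; // in this sketch an empty set allows nothing
    }

    public static void main(String[] args) {
        Map<String, List<String>> robots = new HashMap<>();
        robots.put("*", Arrays.asList("/"));
        robots.put("friendlybot", Arrays.asList("/private/"));
        System.out.println("classic:      " + classic(robots, "mybot", "/page.html"));
        System.out.println("most-favored: " + mostFavored(robots, "/page.html"));
    }
}
```

In this model the only difference between the policies is which rule set is consulted and whose user-agent token is used for the lookup, which is why the last two can benefit from presenting a different user-agent.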
Method Summary | |
public String | getCustomRobots(CrawlerSettings settings)
public int | getType(Object context) Get the policy-type.
public StringList | getUserAgents(CrawlerSettings settings) If the policy-type is most-favored-set, this method gets a list of all user agents in that set.
public boolean | isType(Object o, int type) Check if the policy is of a certain type.
public boolean | shouldMasquerade(CrawlURI curi) Returns true if the crawler should masquerade as the user agent whose restrictions it opted to use. |
ATTR_CUSTOM_ROBOTS | final public static String ATTR_CUSTOM_ROBOTS |
ATTR_MASQUERADE | final public static String ATTR_MASQUERADE |
ATTR_USER_AGENTS | final public static String ATTR_USER_AGENTS |
CLASSIC | final public static int CLASSIC |
CUSTOM | final public static int CUSTOM |
IGNORE | final public static int IGNORE |
MOST_FAVORED | final public static int MOST_FAVORED |
MOST_FAVORED_SET | final public static int MOST_FAVORED_SET |
RobotsHonoringPolicy | public RobotsHonoringPolicy(String name) | | Creates a new instance of RobotsHonoringPolicy.
Parameters: name - the name of the RobotsHonoringPolicy attribute. |
RobotsHonoringPolicy | public RobotsHonoringPolicy() |
getCustomRobots | public String getCustomRobots(CrawlerSettings settings) | | Get the supplied custom robots.txt.
Returns: String with the content of the alternate robots.txt |
getUserAgents | public StringList getUserAgents(CrawlerSettings settings) | | If the policy-type is most-favored-set, this method
gets a list of all user agents in that set.
Returns: List of Strings with user agents |
isType | public boolean isType(Object o, int type) | | Check if the policy is of a certain type.
Parameters: o - an object that can be resolved into a settings object.
Parameters: type - the type to check against.
Returns: true if the policy is of the submitted type |
shouldMasquerade | public boolean shouldMasquerade(CrawlURI curi) | | Returns true if the crawler should masquerade as the user agent
whose restrictions it opted to use.
(Only relevant for policy-types: most-favored and most-favored-set).
Returns: true if we should masquerade |
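As a rough illustration of how a masquerade flag might interact with the policy type, the following sketch picks the user-agent string to present. It is an assumption-laden model, not the library's code: `MasqueradeSketch`, its nested `Policy` enum, and `effectiveUserAgent` are hypothetical names, and the real class exposes the policy types as `int` constants whose values are not given here.

```java
/** Hypothetical sketch of masquerading; not the Heritrix implementation. */
final class MasqueradeSketch {

    /** Stand-in for the int constants CLASSIC, IGNORE, CUSTOM, and so on. */
    enum Policy { CLASSIC, IGNORE, CUSTOM, MOST_FAVORED, MOST_FAVORED_SET }

    /**
     * Returns the user-agent to send. The favored agent is used only when
     * masquerading is enabled and the policy is one of the two
     * "most-favored" variants; otherwise the crawler's own agent is kept.
     */
    static String effectiveUserAgent(Policy policy, boolean masquerade,
                                     String ownAgent, String favoredAgent) {
        boolean favoredPolicy = policy == Policy.MOST_FAVORED
                || policy == Policy.MOST_FAVORED_SET;
        if (masquerade && favoredPolicy && favoredAgent != null) {
            return favoredAgent;
        }
        return ownAgent;
    }

    public static void main(String[] args) {
        System.out.println(effectiveUserAgent(
                Policy.MOST_FAVORED, true, "mybot", "friendlybot"));
        System.out.println(effectiveUserAgent(
                Policy.CLASSIC, true, "mybot", "friendlybot"));
    }
}
```

Under the classic, ignore, and custom policies the flag has no effect in this sketch, which matches the note that masquerading is only relevant for most-favored and most-favored-set.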