| org.archive.crawler.filter.URIRegExpFilter org.archive.crawler.filter.PathologicalPathFilter
PathologicalPathFilter | public class PathologicalPathFilter extends URIRegExpFilter | | Checks whether a URI contains a repeated pattern.
This filter checks whether a pattern is repeated a specified number of times.
Its purpose is to avoid crawler traps in which the server keeps appending the
same pattern to the requested URI, for example http://host/img/img/img/img...
The filter returns TRUE if the path is pathological and FALSE otherwise.
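The kind of trap described above can be illustrated without any regular expression. The following is only a sketch of the idea, not the filter's actual implementation (which is regexp-based, since the class extends URIRegExpFilter). The class name RepeatedSegmentSketch, its methods, and the maxRepetitions parameter are hypothetical names introduced for illustration. The sketch only detects a single path segment repeated back to back; it does not detect multi-segment patterns such as /a/b/a/b/.

```java
import java.net.URI;

// Hypothetical illustration; not part of Heritrix.
final class RepeatedSegmentSketch {

    /** Returns the length of the longest run of identical consecutive path segments. */
    static int longestRun(String uri) {
        // URI.create throws IllegalArgumentException for malformed input.
        String path = URI.create(uri).getPath();
        if (path == null) {
            return 0; // opaque URIs (e.g. mailto:) have no path
        }
        int best = 0;
        int run = 0;
        String previous = null;
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty strings produced by leading or doubled slashes
            }
            run = segment.equals(previous) ? run + 1 : 1;
            best = Math.max(best, run);
            previous = segment;
        }
        return best;
    }

    /** Assumed policy: a path is pathological if a segment repeats more than maxRepetitions times. */
    static boolean isPathological(String uri, int maxRepetitions) {
        return longestRun(uri) > maxRepetitions;
    }
}
```

With an assumed threshold of 3, a URI such as http://host/img/img/img/img/a.gif would be flagged, while http://host/img/css/img/a.gif would not, because the repeated segment is not consecutive.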
author: John Erik Halse
See also: DecidingFilter, DecideRule |
ATTR_REPETITIONS | final public static String ATTR_REPETITIONS | | |
DEFAULT_REPETITIONS | final public static Integer DEFAULT_REPETITIONS | | |
PathologicalPathFilter | public PathologicalPathFilter(String name) | | Constructs a new PathologicalPathFilter.
Parameters: name - the name of the filter. |
getFilterOffPosition | protected boolean getFilterOffPosition(CrawlURI curi) | | |
getRegexp | protected String getRegexp(Object o) | | Constructs the regexp string to be matched against the URI.
Parameters: o - an object to extract a URI from.
Returns: the regexp pattern. |
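As a rough illustration of how a regexp for repeated path patterns could be assembled from a repetition count, the sketch below uses a backreference. It is an assumption-laden example, not the body of getRegexp: the class RepetitionRegexpSketch, the method buildRegexp, and the int parameter are hypothetical, and how the real filter reads its repetitions setting is not shown in this page.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not the Heritrix implementation of getRegexp.
final class RepetitionRegexpSketch {

    /**
     * Builds a regexp that matches a path segment occurring at least
     * `repetitions` times in a row, e.g. /img/img/img/ for repetitions = 3.
     */
    static String buildRegexp(int repetitions) {
        if (repetitions < 2) {
            throw new IllegalArgumentException("repetitions must be at least 2");
        }
        // Group 1 captures one "segment/"; \1{n,} then requires n or more further copies.
        return "/([^/]+/)\\1{" + (repetitions - 1) + ",}";
    }

    /** Returns true if the URI contains such a repeated segment anywhere. */
    static boolean isPathological(String uri, int repetitions) {
        return Pattern.compile(buildRegexp(repetitions)).matcher(uri).find();
    }
}
```

Here find() is used so the pattern may match anywhere in the URI string. A filter that requires the whole URI to match would instead need leading and trailing wildcards around the same core expression.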