org.archive.crawler.filter |
|
Java Source File Name | Type | Comment |
ContentTypeRegExpFilter.java | Class | Compares the content-type of the passed CrawlURI to a regular expression. |
FilePatternFilter.java | Class | Compares the suffix of a passed CrawlURI, UURI, or String against a regular
expression pattern, accepting matches. |
FilePatternFilterTest.java | Class | Tests the FilePatternFilter default pattern (all default file extensions) and
separate subgroup patterns such as the images, audio, video, and
miscellaneous groups. |
HopsFilter.java | Class | Accepts (returns true for) all CandidateURIs passed in
with a link-hop-count greater than the max-link-hops
value. |
HTTPMidFetchUnchangedFilter.java | Class | A mid-fetch filter for HTTP fetcher processors. |
OrFilter.java | Class | OrFilter allows any number of other filters to be set up
inside it, as child elements. |
PathDepthFilter.java | Class | Accepts all URLs passed in with a path depth
less than or equal to the max-path-depth
value. |
PathologicalPathFilter.java | Class | Checks whether a URI contains a pattern that is repeated a specific number of times.
This helps avoid crawler traps where the server keeps adding the same pattern to
the requested URI, such as http://host/img/img/img/img.... |
PathologicalPathFilterTest.java | Class | |
SurtPrefixFilter.java | Class | A filter which tests a URI against a set of SURT
prefixes, and if the URI's prefix is in the set,
returns the chosen true/false accepts value. |
TransclusionFilter.java | Class | Filter which accepts CandidateURI/CrawlURI instances which contain more
than zero but fewer than max-trans-hops entries at the end of their
discovery path. |
URIListRegExpFilter.java | Class | Compares passed object -- a CrawlURI, UURI, or String --
against regular expressions, accepting matches. |
URIRegExpFilter.java | Class | Compares passed object -- a CrawlURI, UURI, or String --
against a regular expression, accepting matches. |
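
Several of the comments above describe simple string tests: a suffix match, a path-depth limit, and a repeated-pattern check. The following is a minimal, standalone sketch of those three ideas using only java.util.regex. It is not taken from the classes listed here: the class name FilterSketches, its method names, the image extension list, and the way path depth is counted (number of '/' characters in the path) are all assumptions made for illustration, and the real filters may count or match differently.

```java
import java.util.regex.Pattern;

// Illustrative helpers only; none of these names exist in org.archive.crawler.filter.
final class FilterSketches {

    // Hypothetical image-suffix pattern, NOT the actual FilePatternFilter default.
    static final Pattern IMAGE_SUFFIX =
            Pattern.compile("(?i).*\\.(gif|jpe?g|png)$");

    private FilterSketches() {}

    /** Suffix test: true when the string ends in one of the listed extensions. */
    static boolean hasImageSuffix(String uri) {
        return IMAGE_SUFFIX.matcher(uri).matches();
    }

    /** Path depth, assumed here to be the number of '/' characters in the path. */
    static int pathDepth(String path) {
        int depth = 0;
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == '/') {
                depth++;
            }
        }
        return depth;
    }

    /** Depth limit: true when the path depth is less than or equal to the maximum. */
    static boolean withinPathDepth(String path, int maxPathDepth) {
        return pathDepth(path) <= maxPathDepth;
    }

    /**
     * Repeated-pattern test: true when a single path segment (with its trailing
     * slash) occurs at least minRepeats times in a row, as in /img/img/img/.
     */
    static boolean hasRepeatedSegment(String uri, int minRepeats) {
        if (minRepeats < 2) {
            throw new IllegalArgumentException("minRepeats must be at least 2");
        }
        // One captured segment followed by (minRepeats - 1) or more copies of itself.
        Pattern repeated =
                Pattern.compile("/([^/]+/)\\1{" + (minRepeats - 1) + ",}");
        return repeated.matcher(uri).find();
    }

    public static void main(String[] args) {
        System.out.println(hasImageSuffix("http://host/a/logo.PNG"));
        System.out.println(withinPathDepth("/a/b/c.html", 3));
        System.out.println(hasRepeatedSegment("http://host/img/img/img/img/x.gif", 3));
    }
}
```

The repeated-segment check compiles its pattern on every call to keep the sketch short; a filter applied to every candidate URI would more likely compile the pattern once when the repetition limit is configured.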