org.archive.crawler.deciderules |
Provides classes for a simple decision rules framework.
Each 'step' in a decision rule set which can
affect an object's ultimate fate is called a DecideRule.
Each DecideRule renders a decision (possibly neutral) on the
passed object's fate.
Possible decisions are:
- ACCEPT means the object is ruled-in for further processing
- REJECT means the object is ruled-out for further processing
- PASS means this particular DecideRule has no opinion
Each DecideRule in the rule set is applied in turn;
the last one to express a non-PASS preference wins.
For example, if the rules are:
- AcceptDecideRule -- ACCEPTs all (establishing a default)
- TooManyHopsDecideRule(max-hops=3) -- REJECTs all with
hopsPath.length()>3, PASSes otherwise
- PrerequisiteAcceptDecideRule -- ACCEPTs any with 'P' as
last hop, PASSes otherwise (this allows 'LLL's which
need a 'LLLP' prerequisite a chance to complete)
Then, you have a crawl that will go 3 hops (of any type)
from the seeds, with a special affordance to get prerequisites
of 3-hop items (which may be 4 "hops" out).
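
The three-rule example above can be sketched as follows. This is a minimal illustration, not the package's actual API: the `Decision` enum, the `Rule` interface, and the `decide` and `exampleRules` helpers are hypothetical stand-ins, and each rule here looks only at a hops-path string rather than a full CrawlURI.

```java
import java.util.Arrays;
import java.util.List;

public class DecideSequenceSketch {
    /** The three possible decisions described in the text. */
    enum Decision { ACCEPT, REJECT, PASS }

    /** Hypothetical stand-in for a DecideRule: maps a hops-path string to a decision. */
    interface Rule {
        Decision decisionFor(String hopsPath);
    }

    /** Applies each rule in turn; the last rule to express a non-PASS decision wins. */
    static Decision decide(List<Rule> rules, String hopsPath) {
        Decision result = Decision.PASS;
        for (Rule rule : rules) {
            Decision d = rule.decisionFor(hopsPath);
            if (d != Decision.PASS) {
                result = d; // later non-PASS decisions override earlier ones
            }
        }
        return result;
    }

    /** The example rule list from the text, with max-hops=3. */
    static List<Rule> exampleRules() {
        final int maxHops = 3;
        // ACCEPTs everything, establishing a default.
        Rule acceptAll = hopsPath -> Decision.ACCEPT;
        // REJECTs anything more than maxHops hops out, PASSes otherwise.
        Rule tooManyHops = hopsPath ->
                hopsPath.length() > maxHops ? Decision.REJECT : Decision.PASS;
        // ACCEPTs anything whose last hop is 'P', PASSes otherwise.
        Rule prerequisiteAccept = hopsPath ->
                hopsPath.endsWith("P") ? Decision.ACCEPT : Decision.PASS;
        return Arrays.asList(acceptAll, tooManyHops, prerequisiteAccept);
    }

    public static void main(String[] args) {
        List<Rule> rules = exampleRules();
        for (String hopsPath : new String[] {"LLL", "LLLL", "LLLP"}) {
            System.out.println(hopsPath + " -> " + decide(rules, hopsPath));
        }
    }
}
```

In this sketch "LLLL" is rejected by the hop limit, while "LLLP" is first rejected by the same rule and then re-accepted by the later prerequisite rule, which is why rule order matters.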
To allow this style of decision processing to be plugged into the
existing Filter and Scope slots:
- There's a DecidingFilter which takes an (ordered) map of
DecideRules
- There's a DecidingScope which takes the same
See NewScopingModel
for background.
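
As a rough sketch of how an ordered set of rules could sit behind a yes/no Filter slot, the adapter below reduces the final decision to a boolean. The `Filter` and `Rule` interfaces and the `asFilter` and `exampleFilter` helpers are hypothetical, and treating only a final ACCEPT as "accepts" (so that PASS and REJECT both yield false) is an assumption of this sketch rather than documented behaviour of DecidingFilter or DecidingScope.

```java
import java.util.Arrays;
import java.util.List;

public class DecidingFilterSketch {
    enum Decision { ACCEPT, REJECT, PASS }

    /** Hypothetical stand-in for a DecideRule. */
    interface Rule {
        Decision decisionFor(Object candidate);
    }

    /** Hypothetical stand-in for the legacy Filter contract: a yes/no answer. */
    interface Filter {
        boolean accepts(Object candidate);
    }

    /**
     * Wraps an ordered rule list as a Filter.
     * Assumption: only a final ACCEPT counts as accepted.
     */
    static Filter asFilter(List<Rule> orderedRules) {
        return candidate -> {
            Decision result = Decision.PASS;
            for (Rule rule : orderedRules) {
                Decision d = rule.decisionFor(candidate);
                if (d != Decision.PASS) {
                    result = d; // last non-PASS decision wins
                }
            }
            return result == Decision.ACCEPT;
        };
    }

    /** Example: accept by default, reject hops-path strings longer than 3. */
    static Filter exampleFilter() {
        Rule acceptAll = candidate -> Decision.ACCEPT;
        Rule tooManyHops = candidate ->
                candidate.toString().length() > 3 ? Decision.REJECT : Decision.PASS;
        return asFilter(Arrays.asList(acceptAll, tooManyHops));
    }

    public static void main(String[] args) {
        Filter filter = exampleFilter();
        System.out.println("LL accepted: " + filter.accepts("LL"));
        System.out.println("LLLL accepted: " + filter.accepts("LLLL"));
    }
}
```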
|
Java Source File Name | Type | Comment |
AcceptDecideRule.java | Class | Rule which responds ACCEPT to anything passed in. |
AddRedirectFromRootServerToScope.java | Class | |
BeanShellDecideRule.java | Class | Rule which runs a BeanShell script to make its decision. |
ClassKeyMatchesRegExpDecideRule.java | Class | Rule applies configured decision to any CrawlURI class key -- i.e. |
ConfiguredDecideRule.java | Class | Rule which can be configured to ACCEPT or REJECT at
operator's option. |
ConfiguredDecideRuleTest.java | Class | |
ContentTypeMatchesRegExpDecideRule.java | Class | DecideRule whose decision is applied if the URI's content-type
is present and matches the supplied regular expression. |
ContentTypeNotMatchesRegExpDecideRule.java | Class | DecideRule whose decision is applied if the URI's content-type
is present and does not match the supplied regular expression. |
DecideRule.java | Class | Interface for rules which, given an object to evaluate,
respond with a decision:
DecideRule.ACCEPT,
DecideRule.REJECT, or
DecideRule.PASS. |
DecideRuleSequence.java | Class | RuleSequence represents a series of Rules, which are applied in turn
to give the final result. |
DecideRuleSequenceTest.java | Class | |
DecidingFilter.java | Class | DecidingFilter: a classic Filter which makes its accept/reject
decision based on whatever DecideRules have been set up inside it. |
DecidingScope.java | Class | DecidingScope: a Scope which makes its accept/reject decision based on
whatever DecideRules have been set up inside it. |
ExceedsDocumentLengthTresholdDecideRule.java | Class | |
ExternalGeoLocationDecideRule.java | Class | A rule that can be configured to take alternate implementations
of the ExternalGeoLookupInterface. |
ExternalGeoLookupInterface.java | Interface | Interface used by
ExternalGeoLocationDecideRule. |
ExternalImplDecideRule.java | Class | A rule that can be configured to take alternate implementations
of the ExternalImplInterface. |
ExternalImplInterface.java | Interface | Interface used by
ExternalImplDecideRule. |
FetchStatusDecideRule.java | Class | Rule applies the configured decision for any URI which has a
fetch status equal to the 'target-status' setting. |
FetchStatusMatchesRegExpDecideRule.java | Class | |
FetchStatusNotMatchesRegExpDecideRule.java | Class | |
FilterDecideRule.java | Class | FilterDecideRule wraps a legacy Filter for use in DecideRule
contexts. |
HasViaDecideRule.java | Class | Rule applies the configured decision for any URI which has a 'via'
(essentially, any URI that was not a seed or some kinds of mid-crawl adds). |
HopsPathMatchesRegExpDecideRule.java | Class | Rule applies configured decision to any CrawlURIs whose 'hops-path'
(string like "LLXE" etc.) matches the supplied regexp. |
MatchesFilePatternDecideRule.java | Class | Compares suffix of a passed CrawlURI, UURI, or String against a regular
expression pattern, applying its configured decision to all matches.
Several predefined patterns are available for convenience. |
MatchesListRegExpDecideRule.java | Class | Rule applies configured decision to any CrawlURIs whose String URI
matches the supplied regexps. |
MatchesRegExpDecideRule.java | Class | Rule applies configured decision to any CrawlURIs whose String URI
matches the supplied regexp. |
NotExceedsDocumentLengthTresholdDecideRule.java | Class | |
NotMatchesFilePatternDecideRule.java | Class | Rule applies configured decision to any URIs which do *not*
match the supplied (file-pattern) regexp. |
NotMatchesListRegExpDecideRule.java | Class | Rule applies configured decision to any URIs which do *not*
match the supplied regexps. |
NotMatchesRegExpDecideRule.java | Class | Rule applies configured decision to any URIs which do *not*
match the supplied regexp. |
NotOnDomainsDecideRule.java | Class | Rule applies configured decision to any URIs that
are *not* on one of the domains in the configured set of
domains, filled from the seed set. |
NotOnHostsDecideRule.java | Class | Rule applies configured decision to any URIs that
are *not* on one of the hosts in the configured set of
hosts, filled from the seed set. |
NotSurtPrefixedDecideRule.java | Class | Rule applies configured decision to any URIs that, when
expressed in SURT form, do *not* begin with one of the prefixes
in the configured set. |
OnDomainsDecideRule.java | Class | Rule applies configured decision to any URIs that
are on one of the domains in the configured set of
domains, filled from the seed set. |
OnHostsDecideRule.java | Class | Rule applies configured decision to any URIs that
are on one of the hosts in the configured set of
hosts, filled from the seed set. |
PathologicalPathDecideRule.java | Class | |
PredicatedDecideRule.java | Class | Rule which applies the configured decision only if a
test evaluates to true. |
PrerequisiteAcceptDecideRule.java | Class | Rule which ACCEPTs all 'prerequisite' URIs (those with a 'P' in
the last hopsPath position). |
RejectDecideRule.java | Class | Rule which answers REJECT to everything evaluated. |
ScopePlusOneDecideRule.java | Class | Rule allows one level of discovery beyond configured scope
(e.g. |
SeedAcceptDecideRule.java | Class | Rule which ACCEPTs all 'seed' URIs (those for which
isSeed is true). |
SurtPrefixedDecideRule.java | Class | Rule applies configured decision to any URIs that, when
expressed in SURT form, begin with one of the prefixes
in the configured set. |
TooManyHopsDecideRule.java | Class | Rule REJECTs any CrawlURIs whose total number of hops (length of the
hopsPath string, traversed links of any type) is over a threshold. |
TooManyPathSegmentsDecideRule.java | Class | Rule REJECTs any CrawlURIs whose total number of path-segments (as
indicated by the count of '/' characters not including the first '//')
is over a given threshold. |
TransclusionDecideRule.java | Class | Rule ACCEPTs any CrawlURIs whose path-from-seed ('hopsPath' -- see
CandidateURI.getPathFromSeed) ends
with at least one, but not more than the given number, of
non-navlink (non-'L') hops. |