org.archive.crawler.scope |
Java Source File Name | Type | Comment |
BroadScope.java | Class | A CrawlScope instance defines which URIs are "in"
a particular crawl.
It is essentially a Filter which determines, looking at
the totality of information available about a
CandidateURI/CrawlURI instance, if that URI should be
scheduled for crawling.
Dynamic information inherent in the discovery of the
URI -- such as the path by which it was discovered --
may be considered.
Dynamic information which requires the consultation
of external and potentially volatile information --
such as current robots.txt requests and the history
of attempts to crawl the same URI -- should NOT be
considered. |
ClassicScope.java | Class | ClassicScope: superclass with shared Scope behavior for
most common scopes. |
DomainScope.java | Class | A core CrawlScope suitable for the most common
crawl needs.
Roughly, its logic is that a URI is included if:
(( isSeed(uri) || focusFilter.accepts(uri) )
|| transitiveFilter.accepts(uri) )
&& ! excludeFilter.accepts(uri)
The focusFilter may be specified by either:
- adding a 'mode' attribute to the
scope element. |
DomainScopeTest.java | Class | Test the domain scope focus filter. |
HostScope.java | Class | A core CrawlScope suitable for the most common
crawl needs.
Roughly, its logic is that a URI is included if:
(( isSeed(uri) || focusFilter.accepts(uri) )
|| transitiveFilter.accepts(uri) )
&& ! excludeFilter.accepts(uri)
The focusFilter may be specified by either:
- adding a 'mode' attribute to the
scope element. |
PathScope.java | Class | A core CrawlScope suitable for the most common
crawl needs.
Roughly, its logic is that a URI is included if:
(( isSeed(uri) || focusFilter.accepts(uri) )
|| transitiveFilter.accepts(uri) )
&& ! excludeFilter.accepts(uri)
The focusFilter may be specified by either:
- adding a 'mode' attribute to the
scope element. |
RefinedScope.java | Class | Superclass for Scopes which make use of "additional focus"
to add items by pattern, or want to swap in an alternative
transitive filter. |
SeedCachingScope.java | Class | A CrawlScope that caches its seed list for the
convenience of scope-tests that are based on the
seeds. |
SeedCachingScopeTest.java | Class | Test SeedCachingScope. |
SeedFileIterator.java | Class | Iterator wrapper for seeds file on disk. |
SeedFileIteratorTest.java | Class | Test SeedFileIterator. |
SeedListener.java | Interface | Implemented by components which want notifications of
seed list changes from a Scope. |
SurtPrefixScope.java | Class | A specialized CrawlScope suitable for the most common crawl needs. |
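
The inclusion rule quoted for DomainScope, HostScope, and PathScope can be illustrated with a small standalone sketch. This is not the Heritrix implementation: the class name ScopeRuleSketch, the use of java.util.function.Predicate in place of the real Filter classes, the plain String URIs, and the example URLs are all assumptions made for illustration. Only the boolean expression itself is taken from the comments above.

```java
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in for a scope; it mirrors only the boolean rule
// quoted in the table and uses none of the real Heritrix types.
final class ScopeRuleSketch {
    private final Set<String> seeds;
    private final Predicate<String> focusFilter;
    private final Predicate<String> transitiveFilter;
    private final Predicate<String> excludeFilter;

    ScopeRuleSketch(Set<String> seeds,
                    Predicate<String> focusFilter,
                    Predicate<String> transitiveFilter,
                    Predicate<String> excludeFilter) {
        this.seeds = seeds;
        this.focusFilter = focusFilter;
        this.transitiveFilter = transitiveFilter;
        this.excludeFilter = excludeFilter;
    }

    // A URI counts as a seed here if it is literally in the seed set.
    boolean isSeed(String uri) {
        return seeds.contains(uri);
    }

    // ((isSeed || focus) || transitive) && !exclude, as in the comments.
    boolean accepts(String uri) {
        return ((isSeed(uri) || focusFilter.test(uri))
                || transitiveFilter.test(uri))
                && !excludeFilter.test(uri);
    }

    public static void main(String[] args) {
        ScopeRuleSketch scope = new ScopeRuleSketch(
                Set.of("http://example.org/"),
                uri -> uri.startsWith("http://example.org/"), // assumed focus rule
                uri -> uri.endsWith(".css"),                  // assumed transitive rule
                uri -> uri.contains("/private/"));            // assumed exclude rule

        System.out.println(scope.accepts("http://example.org/"));              // true
        System.out.println(scope.accepts("http://example.org/private/a"));     // false
        System.out.println(scope.accepts("http://cdn.example.net/style.css")); // true
        System.out.println(scope.accepts("http://other.example.com/page"));    // false
    }
}
```

Note that the exclude filter is applied last and overrides everything else, so even a seed is rejected if the exclude filter accepts it.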