| org.archive.crawler.framework.Filter org.archive.crawler.framework.CrawlScope
All known Subclasses: org.archive.crawler.scope.ClassicScope, org.archive.crawler.deciderules.DecidingScope
CrawlScope | public class CrawlScope extends Filter (Code) | | A CrawlScope instance defines which URIs are "in"
a particular crawl.
It is essentially a Filter which determines, looking at
the totality of information available about a
CandidateURI/CrawlURI instance, if that URI should be
scheduled for crawling.
Dynamic information inherent in the discovery of the
URI -- such as the path by which it was discovered --
may be considered.
Dynamic information which requires the consultation
of external and potentially volatile information --
such as current robots.txt requests and the history
of attempts to crawl the same URI -- should NOT be
considered. Those potentially high-latency decisions
should be made at another step.
author: gojomo |
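The distinction above, deciding only from information that arrives with the URI, can be illustrated with a minimal, self-contained sketch. It does not use any Heritrix classes: `DiscoveryScopeSketch`, its `maxHops` limit, and the one-character-per-hop `pathFromSeed` string are hypothetical stand-ins chosen for illustration. The decision looks only at the URI and its discovery path, and performs no robots.txt or crawl-history lookups.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a scope-like filter; not a Heritrix class.
final class DiscoveryScopeSketch {
    private final Set<String> seedHosts = new HashSet<>();
    private final int maxHops;

    DiscoveryScopeSketch(int maxHops, String... seeds) {
        this.maxHops = maxHops;
        for (String seed : seeds) {
            String host = URI.create(seed).getHost();
            if (host != null) {
                seedHosts.add(host.toLowerCase(Locale.ROOT));
            }
        }
    }

    /**
     * Decides using only the URI itself and the path by which it was
     * discovered (assumed here to be one character per hop, e.g. "LLE").
     * No network access and no crawl-history lookup takes place.
     */
    boolean accepts(String uri, String pathFromSeed) {
        if (pathFromSeed.length() > maxHops) {
            return false; // discovered too many hops away from a seed
        }
        String host = URI.create(uri).getHost();
        return host != null && seedHosts.contains(host.toLowerCase(Locale.ROOT));
    }
}
```

With a limit of two hops and the seed `http://example.org/`, this sketch accepts `http://example.org/a` discovered via `"L"` and rejects the same URI discovered via `"LLL"`.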
ATTR_REREAD_SEEDS_ON_CONFIG | final public static String ATTR_REREAD_SEEDS_ON_CONFIG(Code) | | Whether every configuration change should trigger a
rereading of the original seeds spec/file.
|
DEFAULT_REREAD_SEEDS_ON_CONFIG | final public static Boolean DEFAULT_REREAD_SEEDS_ON_CONFIG(Code) | | |
CrawlScope | public CrawlScope(String name)(Code) | | Constructs a new CrawlScope.
Parameters: name - ignored, since the name must always be the value of the constant ATT_NAME. |
CrawlScope | public CrawlScope()(Code) | | Default constructor.
|
addSeed | public boolean addSeed(CandidateURI curi)(Code) | | Add a new seed to scope. By default, simply appends
to seeds file, though subclasses may handle differently.
This method is *not* sufficient to get the new seed
scheduled in the Frontier for crawling -- it only
affects the Scope's seed record (and decisions which
flow from seeds).
Parameters: curi - CandidateURI to add. Returns: true if successful, false if the add failed for any reason |
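As a rough illustration of the default behaviour described for addSeed (append to the seeds file, report success as a boolean), here is a minimal sketch. `SeedAppendSketch` and `appendSeed` are hypothetical names, and the one-URI-per-line layout is an assumption; the real method takes a CandidateURI and may write additional text. As with addSeed itself, nothing here schedules the URI for crawling.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical helper; not part of Heritrix.
final class SeedAppendSketch {
    /** Appends one seed URI as a new line; returns false if the write fails. */
    static boolean appendSeed(File seedFile, String seedUri) {
        // The second argument 'true' opens the file in append mode,
        // so seeds already recorded in the file are preserved.
        try (FileWriter out = new FileWriter(seedFile, true)) {
            out.write(seedUri);
            out.write(System.lineSeparator());
            return true;
        } catch (IOException e) {
            return false; // mirror the "false if the add failed" contract
        }
    }
}
```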
checkClose | protected void checkClose(Iterator iter)(Code) | | Convenience method to close SeedFileIterator, if appropriate.
Parameters: iter - Iterator to check; it is closed if it is a SeedFileIterator that needs closing |
getSeedfile | public File getSeedfile()(Code) | | Returns: the seed list file, or null if there was a problem getting the settings file. |
isSameHost | protected boolean isSameHost(UURI a, UURI b)(Code) | | Parameters: a - first UURI to compare. Parameters: b - second UURI to compare. Returns: true if both UURIs have the same host. |
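A same-host comparison of this kind can be sketched with the standard `java.net.URI` class in place of UURI. This is an illustration under stated assumptions, not the actual implementation: treating a null or host-less URI as "not the same host" and comparing host names case-insensitively are choices made for this sketch, and `SameHostSketch` is a hypothetical name.

```java
import java.net.URI;

// Hypothetical helper using java.net.URI as a stand-in for UURI.
final class SameHostSketch {
    static boolean isSameHost(URI a, URI b) {
        if (a == null || b == null) {
            return false;
        }
        String hostA = a.getHost();
        String hostB = b.getHost();
        if (hostA == null || hostB == null) {
            return false; // e.g. mailto: URIs carry no host component
        }
        // Host names are case-insensitive; scheme, port and path are ignored.
        return hostA.equalsIgnoreCase(hostB);
    }
}
```

Under these assumptions `http://example.org/a` and `https://EXAMPLE.org:8443/b` count as the same host, while `http://example.org/` and `http://www.example.org/` do not.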
isSeed | protected boolean isSeed(Object o)(Code) | | Check if a URI is in the seeds.
Parameters: o - the URI to check. Returns: true if the URI is a seed. |
kickUpdate | public void kickUpdate()(Code) | | Take note of a situation (such as settings edit) where
involved reconfiguration (such as reading from external
files) may be necessary.
|
refreshSeeds | public void refreshSeeds()(Code) | | Refresh seeds.
|
seedsIterator | public Iterator<UURI> seedsIterator()(Code) | | Gets an iterator over all configured seeds. Subclasses
which cache seeds in memory can override with more
efficient implementation.
Returns: Iterator, perhaps over a disk file, of seeds |
seedsIterator | public Iterator<UURI> seedsIterator(Writer ignoredItemWriter)(Code) | | Gets an iterator over all configured seeds. Subclasses
which cache seeds in memory can override with more
efficient implementation.
Parameters: ignoredItemWriter - optional writer that receives a report of ignored seed items. Returns: Iterator, perhaps over a disk file, of seeds |
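The idea of iterating over seeds while reporting ignored items to an optional writer might look like the following self-contained sketch. Several details are assumptions made for illustration: `SeedListSketch` is a hypothetical name, `java.net.URI` stands in for UURI, blank lines and lines starting with `#` are skipped silently, and only unparseable or host-less items are reported. For brevity the sketch reads the whole source eagerly, whereas an iterator over a disk file would typically read lazily and need closing afterwards.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper; not part of Heritrix.
final class SeedListSketch {
    /** Returns the usable seeds; ignoredItemWriter may be null. */
    static Iterator<URI> seeds(Reader source, Writer ignoredItemWriter) throws IOException {
        List<URI> seeds = new ArrayList<>();
        BufferedReader lines = new BufferedReader(source);
        String line;
        while ((line = lines.readLine()) != null) {
            String item = line.trim();
            if (item.isEmpty() || item.startsWith("#")) {
                continue; // blank line or comment: skipped without a report
            }
            URI parsed = parse(item);
            if (parsed != null) {
                seeds.add(parsed);
            } else if (ignoredItemWriter != null) {
                ignoredItemWriter.write(item + "\n"); // report the ignored item
            }
        }
        return seeds.iterator();
    }

    /** Returns null when the item is not a URI with a host component. */
    private static URI parse(String item) {
        try {
            URI uri = new URI(item);
            return uri.getHost() != null ? uri : null;
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Passing a `StringWriter` as the second argument collects the rejected lines for later inspection; passing null discards them.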
Fields inherited from org.archive.crawler.framework.Filter | final public static String ATTR_ENABLED(Code)(Java Doc)
|