| org.archive.crawler.framework.Frontier
All known Subclasses: org.archive.crawler.frontier.AdaptiveRevisitFrontier, org.archive.crawler.frontier.WorkQueue, org.archive.crawler.frontier.AbstractFrontier
Field Summary | |
final public static String | ATTR_NAME All URI Frontiers should have the same 'name' attribute. |
Method Summary | |
public long | averageDepth() | public float | congestionRatio() | public void | considerIncluded(UURI u) Notify Frontier that it should consider the given UURI as if
already scheduled. | public long | deepestUri() | public long | deleteURIs(String match) Delete any URI that matches the given regular expression from the list
of discovered and pending URIs. | public void | deleted(CrawlURI curi) Notify Frontier that a CrawlURI has been deleted outside of the
normal next()/finished() lifecycle. | public long | discoveredUriCount() Number of discovered URIs.
That is any URI that has been confirmed to be within 'scope'
(i.e. the Frontier decides that it should be processed). | public long | disregardedUriCount() Number of URIs that were scheduled at one point but have been
disregarded.
Counts any URI that is scheduled only to be disregarded
because it is determined to lie outside the scope of the crawl. | public long | failedFetchCount() Number of URIs that failed to process.
URIs that could not be processed because of some error or failure in
the processing chain. | public void | finished(CrawlURI cURI) Report a URI being processed as having finished processing. | public long | finishedUriCount() Number of URIs that have finished processing.
Includes both those that were processed successfully and failed to be
processed (excluding those that failed but will be retried). | public String | getClassKey(CandidateURI cauri) Parameters: cauri - CandidateURI for which we're to calculate and set class key. | public FrontierJournal | getFrontierJournal() Return the instance of FrontierJournal that this Frontier is using. | public FrontierGroup | getGroup(CrawlURI curi) Get the 'frontier group' (usually queue) for the given
CrawlURI. | public FrontierMarker | getInitialMarker(String regexpr, boolean inCacheOnly) Get a URIFrontierMarker initialized with the given
regular expression at the 'start' of the Frontier.
Parameters: regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker. Parameters: inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. | public ArrayList | getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) Returns a list of all uncrawled URIs starting from a specified marker
until numberOfMatches is reached.
Any encountered URI that has not been successfully crawled, terminally
failed, disregarded or is currently being processed is included. | public void | importRecoverLog(String pathToLog, boolean retainFailures) Recover earlier state by reading a recovery log.
Some Frontiers are able to write detailed logs that can be loaded
after a system crash to recover the state of the Frontier prior to the
crash. | public void | initialize(CrawlController c) Initialize the Frontier.
This method is invoked by the CrawlController once it has
created the Frontier. | boolean | isEmpty() Returns true if the frontier contains no more URIs to crawl.
That is to say that there are no more URIs either currently available
(ready to be emitted), URIs belonging to deferred hosts or pending URIs
in the Frontier. | public void | kickUpdate() Notify Frontier that it should consider updating configuration
info that may have changed in external files. | public void | loadSeeds() Request that the Frontier load (or reload) crawl seeds,
typically by contacting the Scope. | CrawlURI | next() Get the next URI that should be processed. | public void | pause() Notify Frontier that it should not release any URIs, instead
holding all threads, until instructed otherwise. | public long | queuedUriCount() Number of URIs queued up and waiting for processing.
This includes any URIs that failed but will be retried. | public void | schedule(CandidateURI caURI) Schedules a CandidateURI.
This method accepts one URI and schedules it immediately. | public void | start() Request that Frontier allow crawling to begin. | public long | succeededFetchCount() Number of successfully processed URIs.
Any URI that was processed successfully. | public void | terminate() Notify Frontier that it should end the crawl, giving
any worker ToeThread that asks for a next() an
EndedException. | public long | totalBytesWritten() Total number of bytes contained in all URIs that have been processed. | public void | unpause() Resumes the release of URIs to crawl, allowing worker
ToeThreads to proceed. |
ATTR_NAME | final public static String ATTR_NAME(Code) | | All URI Frontiers should have the same 'name' attribute. This constant
defines that name. This is a name used to reference the Frontier being
used in a given crawl order and since there can only be one Frontier
per crawl order a fixed, unique name for Frontiers is optimal.
See Also: org.archive.crawler.settings.ModuleType.ModuleType(String) |
averageDepth | public long averageDepth()(Code) | | |
congestionRatio | public float congestionRatio()(Code) | | |
considerIncluded | public void considerIncluded(UURI u)(Code) | | Notify Frontier that it should consider the given UURI as if
already scheduled.
Parameters: u - UURI instance to add to the Already Included set. |
deepestUri | public long deepestUri()(Code) | | |
deleteURIs | public long deleteURIs(String match)(Code) | | Delete any URI that matches the given regular expression from the list
of discovered and pending URIs. This does not prevent them from being
rediscovered.
Any encountered URI that has not been successfully crawled, terminally
failed, disregarded or is currently being processed is considered to be
a pending URI.
Warning: It is unsafe to make changes to the frontier while
this method is executing. The crawler should be in a paused state before
invoking it.
Parameters: match - A regular expression; any URI that matches it will be deleted. Returns: The number of URIs deleted. |
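The regex-based deletion described above can be sketched as follows. This is a minimal illustration, not the Heritrix implementation: `PendingUriList` is a hypothetical stand-in that holds URIs as plain strings, and it assumes the expression must match the whole URI.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a frontier's pending list; not Heritrix code. */
final class PendingUriList {
    private final List<String> pending = new ArrayList<>();

    void add(String uri) {
        pending.add(uri);
    }

    /** Removes every pending URI that fully matches the regex; returns how many were removed. */
    long deleteURIs(String match) {
        Pattern pattern = Pattern.compile(match);
        long deleted = 0;
        for (Iterator<String> it = pending.iterator(); it.hasNext();) {
            if (pattern.matcher(it.next()).matches()) {
                it.remove(); // safe removal while iterating
                deleted++;
            }
        }
        return deleted;
    }

    int size() {
        return pending.size();
    }
}
```

Nothing in this sketch remembers the deleted URIs, which mirrors the note that deletion does not prevent them from being rediscovered.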
deleted | public void deleted(CrawlURI curi)(Code) | | Notify Frontier that a CrawlURI has been deleted outside of the
normal next()/finished() lifecycle.
Parameters: curi - Deleted CrawlURI. |
discoveredUriCount | public long discoveredUriCount()(Code) | | Number of discovered URIs.
That is any URI that has been confirmed to be within 'scope'
(i.e. the Frontier decides that it should be processed). This
includes those that have been processed, are being processed
and have finished processing. Does not include URIs that have
been 'forgotten' (deemed out of scope when trying to fetch,
most likely due to operator changing scope definition).
Note: This only counts discovered URIs. Since the same
URI can (at least in most frontiers) be fetched multiple times, this
number may be somewhat lower than the combined count of queued,
in-process and finished items, because duplicate
URIs may be queued and processed. This variance is likely to be especially
high in Frontiers implementing 'revisit' strategies.
Returns: Number of discovered URIs. |
disregardedUriCount | public long disregardedUriCount()(Code) | | Number of URIs that were scheduled at one point but have been
disregarded.
Counts any URI that is scheduled only to be disregarded
because it is determined to lie outside the scope of the crawl. Most
commonly this will be due to robots.txt exclusions.
Returns: The number of URIs that have been disregarded. |
failedFetchCount | public long failedFetchCount()(Code) | | Number of URIs that failed to process.
URIs that could not be processed because of some error or failure in
the processing chain. Can include failure to acquire prerequisites, to
establish a connection with the host and any number of other problems.
Does not count those that will be retried, only those that have
permanently failed.
Returns: Number of URIs that failed to process. |
finished | public void finished(CrawlURI cURI)(Code) | | Report a URI being processed as having finished processing.
ToeThreads will invoke this method once they have completed work on
their assigned URI.
This method is synchronized.
Parameters: cURI - The URI that has finished processing. |
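The hand-off between next() and finished() might look roughly like the sketch below. `LifecycleSketch` is a hypothetical, string-based stand-in for a Frontier. It assumes next() returns null when nothing is queued, which is a simplification made for this example and not a statement about the real interface.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for the next()/finished() lifecycle; not Heritrix code. */
final class LifecycleSketch {
    private final Deque<String> queued = new ArrayDeque<>();
    private final List<String> inProcess = new ArrayList<>();
    private long finishedCount = 0;

    synchronized void schedule(String uri) {
        queued.addLast(uri);
    }

    /** Hands out the next URI, or null when nothing is queued (an assumption of this sketch). */
    synchronized String next() {
        String uri = queued.pollFirst();
        if (uri != null) {
            inProcess.add(uri);
        }
        return uri;
    }

    /** Called by the worker once it has completed work on the URI it was given. */
    synchronized void finished(String uri) {
        if (!inProcess.remove(uri)) {
            throw new IllegalStateException("not in process: " + uri);
        }
        finishedCount++;
    }

    synchronized long finishedUriCount() {
        return finishedCount;
    }

    /** What a single worker does: take a URI, process it, report it finished. */
    static long drain(LifecycleSketch frontier) {
        String uri;
        while ((uri = frontier.next()) != null) {
            // ... fetching and processing of the URI would happen here ...
            frontier.finished(uri);
        }
        return frontier.finishedUriCount();
    }
}
```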
finishedUriCount | public long finishedUriCount()(Code) | | Number of URIs that have finished processing.
Includes both those that were processed successfully and failed to be
processed (excluding those that failed but will be retried). Does not
include those URIs that have been 'forgotten' (deemed out of scope when
trying to fetch, most likely due to operator changing scope definition).
Returns: Number of finished URIs. |
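As a rough illustration of how the counters described so far relate to one another, the hypothetical tally below follows the wording of the descriptions: finished URIs are those that succeeded plus those that failed permanently, while failures awaiting a retry are not counted as finished. `CounterSketch` and its record methods are invented for this example, and whether a real Frontier also counts disregarded URIs as finished is not stated here.

```java
/** Hypothetical tally mirroring the counter descriptions; not Heritrix code. */
final class CounterSketch {
    private long succeeded;
    private long failed;
    private long awaitingRetry;

    void recordSuccess() {
        succeeded++;
    }

    void recordPermanentFailure() {
        failed++;
    }

    /** A retryable failure goes back to the queue and is not counted as finished. */
    void recordRetryableFailure() {
        awaitingRetry++;
    }

    long succeededFetchCount() {
        return succeeded;
    }

    long failedFetchCount() {
        return failed;
    }

    long awaitingRetryCount() {
        return awaitingRetry;
    }

    /** Succeeded plus permanently failed, per the finishedUriCount() description. */
    long finishedUriCount() {
        return succeeded + failed;
    }
}
```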
getClassKey | public String getClassKey(CandidateURI cauri)(Code) | | Parameters: cauri - CandidateURI for which we're to calculate and set class key. Returns: Classkey for cauri. |
getGroup | public FrontierGroup getGroup(CrawlURI curi)(Code) | | Get the 'frontier group' (usually queue) for the given
CrawlURI.
Parameters: curi - CrawlURI to find matching group. Returns: FrontierGroup for the CrawlURI. |
getInitialMarker | public FrontierMarker getInitialMarker(String regexpr, boolean inCacheOnly)(Code) | | Get a URIFrontierMarker initialized with the given
regular expression at the 'start' of the Frontier.
Parameters: regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker. Parameters: inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they don't exist. This is useful for quick peeks at the top of the URI list. Returns: A URIFrontierMarker that is set for the 'start' of the frontier's URI list. |
getURIsList | public ArrayList getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) throws InvalidFrontierMarkerException(Code) | | Returns a list of all uncrawled URIs starting from a specified marker
until numberOfMatches is reached.
Any encountered URI that has not been successfully crawled, terminally
failed, disregarded or is currently being processed is included. As
there may be duplicates in the frontier, there may also be duplicates
in the report. Thus this includes both discovered and pending URIs.
The list is a set of strings containing the URI strings. If verbose is
true the string will include some additional information (path to URI
and parent).
The URIFrontierMarker will be advanced to the position at
which its maximum number of matches is reached. Reusing it for
subsequent calls will thus effectively get the 'next' batch. Making
any changes to the frontier can invalidate the marker.
While the order returned is consistent, it does not have any
explicit relation to the likely order in which they may be processed.
Warning: It is unsafe to make changes to the frontier while
this method is executing. The crawler should be in a paused state before
invoking it.
Parameters: marker - A marker specifying from what position in the Frontier the list should begin. Parameters: numberOfMatches - how many URIs to add at most to the list before returning it. Parameters: verbose - if set to true the strings returned will contain additional information about each URI beyond their names. Returns: a list of all pending URIs falling within the specification of the marker. throws: InvalidFrontierMarkerException - when the URIFrontierMarker does not match the internal state of the frontier. Tolerance for this can vary considerably from one URIFrontier implementation to the next. See Also: FrontierMarker See Also: Frontier.getInitialMarker(String,boolean) |
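The batch-by-batch behaviour of a marker can be sketched as below. `MarkerSketch` and its nested `Marker` are hypothetical stand-ins: the marker is modelled as a compiled regular expression plus the next list position to examine, and the inCacheOnly and verbose options are left out. A real Frontier may track its position quite differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for marker-based listing; not Heritrix code. */
final class MarkerSketch {
    /** Hypothetical marker: a compiled regex plus the next index to examine. */
    static final class Marker {
        final Pattern pattern;
        int nextIndex = 0;

        Marker(String regexpr) {
            this.pattern = Pattern.compile(regexpr);
        }
    }

    private final List<String> pending;

    MarkerSketch(List<String> pending) {
        this.pending = new ArrayList<>(pending);
    }

    /** Returns a marker positioned at the 'start' of the pending list. */
    Marker getInitialMarker(String regexpr) {
        return new Marker(regexpr);
    }

    /** Returns up to numberOfMatches matching URIs and advances the marker past them. */
    List<String> getURIsList(Marker marker, int numberOfMatches) {
        List<String> batch = new ArrayList<>();
        while (marker.nextIndex < pending.size() && batch.size() < numberOfMatches) {
            String uri = pending.get(marker.nextIndex++);
            if (marker.pattern.matcher(uri).matches()) {
                batch.add(uri);
            }
        }
        return batch;
    }
}
```

Calling getURIsList repeatedly with the same marker yields successive batches until an empty list comes back. Because the marker only stores a position, any change to the underlying list between calls would leave it pointing at the wrong place, which is the kind of invalidation the description warns about.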
importRecoverLog | public void importRecoverLog(String pathToLog, boolean retainFailures) throws IOException(Code) | | Recover earlier state by reading a recovery log.
Some Frontiers are able to write detailed logs that can be loaded
after a system crash to recover the state of the Frontier prior to the
crash. This method is the one used to achieve this.
Parameters: pathToLog - The name (with full path) of the recover log. Parameters: retainFailures - If true, failures in log should count as having been included. (If false, failures will be ignored, meaning the corresponding URIs will be retried in the recovered crawl.) throws: IOException - If problems occur reading the recover log. |
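The effect of retainFailures can be illustrated with a small sketch. The log format used here, one "OK <uri>" or "FAIL <uri>" entry per line, is invented purely for the example and is not the actual recovery log format. `RecoverSketch` is likewise a hypothetical helper.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of retainFailures; the "OK"/"FAIL" tags are invented. */
final class RecoverSketch {
    /** Returns the URIs to treat as already included after replaying the log lines. */
    static Set<String> alreadyIncluded(List<String> logLines, boolean retainFailures) {
        Set<String> included = new LinkedHashSet<>();
        for (String line : logLines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            boolean success = parts[0].equals("OK");
            boolean failure = parts[0].equals("FAIL");
            // Failures count as included only when the caller asks to retain them.
            if (success || (failure && retainFailures)) {
                included.add(parts[1]);
            }
        }
        return included;
    }
}
```

With retainFailures set to false, the failed URI is absent from the returned set, so a recovered crawl would be free to schedule it again.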
initialize | public void initialize(CrawlController c) throws FatalConfigurationException, IOException(Code) | | Initialize the Frontier.
This method is invoked by the CrawlController once it has
created the Frontier. The constructor of the Frontier should
only contain code for setting up its settings framework. This
method should contain all other 'startup' code.
Parameters: c - The CrawlController that created the Frontier. throws: FatalConfigurationException - If provided settings are illegal or otherwise unusable. throws: IOException - If there is a problem reading settings or seeds file from disk. |
isEmpty | boolean isEmpty()(Code) | | Returns true if the frontier contains no more URIs to crawl.
That is to say that there are no more URIs either currently available
(ready to be emitted), URIs belonging to deferred hosts or pending URIs
in the Frontier. Thus this method may return false even if there is no
currently available URI.
Returns: true if the frontier contains no more URIs to crawl. |
kickUpdate | public void kickUpdate()(Code) | | Notify Frontier that it should consider updating configuration
info that may have changed in external files.
|
loadSeeds | public void loadSeeds()(Code) | | Request that the Frontier load (or reload) crawl seeds,
typically by contacting the Scope.
|
pause | public void pause()(Code) | | Notify Frontier that it should not release any URIs, instead
holding all threads, until instructed otherwise.
|
queuedUriCount | public long queuedUriCount()(Code) | | Number of URIs queued up and waiting for processing.
This includes any URIs that failed but will be retried. Basically this
is any discovered URI that has not either been processed or is
being processed. The same discovered URI can be queued multiple times.
Returns: Number of queued URIs. |
schedule | public void schedule(CandidateURI caURI)(Code) | | Schedules a CandidateURI.
This method accepts one URI and schedules it immediately. This has
nothing to do with the priority of the URI being scheduled. Only that
it will be placed in its respective queue at once. For priority
scheduling see
CandidateURI.setSchedulingDirective(int). This method should be synchronized in all implementing classes.
Parameters: caURI - The URI to schedule. See Also: CandidateURI.setSchedulingDirective(int) |
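One way to picture "placed in its respective queue at once" is a map from class key to queue, as in the sketch below. `ScheduleSketch` is a hypothetical stand-in that works on URI strings, and it assumes the class key is simply the URI's host. Real implementations may derive the key differently.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for immediate, per-queue scheduling; not Heritrix code. */
final class ScheduleSketch {
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();

    /** Assumption of this sketch: the class key is the URI's host. */
    static String getClassKey(String uri) {
        String host = URI.create(uri).getHost();
        return host == null ? "" : host;
    }

    /** Places the URI at the tail of the queue for its class key, creating the queue if needed. */
    synchronized void schedule(String uri) {
        queues.computeIfAbsent(getClassKey(uri), key -> new ArrayDeque<>()).addLast(uri);
    }

    synchronized int queueSize(String classKey) {
        Deque<String> queue = queues.get(classKey);
        return queue == null ? 0 : queue.size();
    }
}
```

The method is declared synchronized here to follow the recommendation above that schedule be synchronized in implementing classes.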
start | public void start()(Code) | | Request that Frontier allow crawling to begin. Usually
just unpauses Frontier, if paused.
|
succeededFetchCount | public long succeededFetchCount()(Code) | | Number of successfully processed URIs.
Any URI that was processed successfully. This includes URIs that
returned 404s and other error codes that do not originate within the
crawler.
Returns: Number of successfully processed URIs. |
terminate | public void terminate()(Code) | | Notify Frontier that it should end the crawl, giving
any worker ToeThread that asks for a next() an
EndedException.
|
totalBytesWritten | public long totalBytesWritten()(Code) | | Total number of bytes contained in all URIs that have been processed.
Returns: The total number of bytes in all processed URIs. |
unpause | public void unpause()(Code) | | Resumes the release of URIs to crawl, allowing worker
ToeThreads to proceed.
|