| org.archive.crawler.settings.ModuleType org.archive.crawler.frontier.AdaptiveRevisitFrontier
AdaptiveRevisitFrontier | public class AdaptiveRevisitFrontier extends ModuleType implements Frontier, FetchStatusCodes, CoreAttributeConstants, AdaptiveRevisitAttributeConstants, CrawlStatusListener, HasUriReceiver | | A Frontier that will repeatedly visit all encountered URIs.
Wait time between visits is configurable and varies based on observed
changes of documents.
The Frontier borrows many things from HostQueuesFrontier, but implements
an entirely different strategy in issuing URIs and consequently in keeping a
record of discovered URIs.
author: Kristinn Sigurdsson |
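The class description says the wait time between visits varies with observed document changes. A minimal sketch of one way such an adaptive interval could be computed is given here, assuming a multiplicative adjustment clamped to a configured range. The class name, method name, factors and bounds are all hypothetical and are not taken from the AdaptiveRevisitFrontier source.

```java
/** Hypothetical helper illustrating an adaptive revisit interval; not part of Heritrix. */
final class AdaptiveWaitSketch {
    private AdaptiveWaitSketch() {}

    /**
     * Returns the next wait interval in milliseconds.
     * Assumption: a changed document shortens the wait (divide by changedFactor),
     * an unchanged document lengthens it (multiply by unchangedFactor),
     * and the result is clamped to [minWaitMillis, maxWaitMillis].
     */
    static long nextWaitMillis(long currentWaitMillis, boolean contentChanged,
                               double changedFactor, double unchangedFactor,
                               long minWaitMillis, long maxWaitMillis) {
        double adjusted = contentChanged
                ? currentWaitMillis / changedFactor
                : currentWaitMillis * unchangedFactor;
        long next = Math.round(adjusted);
        return Math.max(minWaitMillis, Math.min(maxWaitMillis, next));
    }
}
```

Under these assumptions, a document that keeps changing converges toward the minimum wait, while a static document drifts toward the maximum.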
Method Summary | |
public long | averageDepth() | |
protected void | batchFlush() | |
protected void | batchSchedule(CandidateURI caUri) | |
protected long | calculateSnoozeTime(CrawlURI curi) | Calculates how long a host queue needs to be snoozed following the crawling of a URI. |
protected String | canonicalize(UURI uuri) | Canonicalize passed uuri. |
protected String | canonicalize(CandidateURI cauri) | Canonicalize passed CandidateURI. |
public float | congestionRatio() | |
public void | considerIncluded(UURI u) | |
public void | crawlCheckpoint(File checkpointDir) | |
public void | crawlEnded(String sExitMessage) | |
public void | crawlEnding(String sExitMessage) | |
public void | crawlPaused(String statusMessage) | |
public void | crawlPausing(String statusMessage) | |
public void | crawlResuming(String statusMessage) | |
public void | crawlStarted(String message) | |
protected UriUniqFilter | createAlreadyIncluded() | Create a UriUniqFilter that will serve as record of already seen URIs. |
public long | deepestUri() | |
public synchronized long | deleteURIs(String match) | |
public synchronized void | deleted(CrawlURI curi) | |
public synchronized long | discoveredUriCount() | |
protected void | disregardDisposition(CrawlURI curi) | |
public long | disregardedUriCount() | |
public long | failedFetchCount() | |
protected void | failureDisposition(CrawlURI curi) | The CrawlURI has encountered a problem, and will not be retried. |
public synchronized void | finished(CrawlURI curi) | |
public long | finishedUriCount() | |
public String | getClassKey(CandidateURI cauri) | |
public FrontierJournal | getFrontierJournal() | |
public FrontierGroup | getGroup(CrawlURI curi) | |
protected AdaptiveRevisitHostQueue | getHQ(CrawlURI curi) | Get the AdaptiveRevisitHostQueue for the given CrawlURI, creating it if necessary. |
public synchronized FrontierMarker | getInitialMarker(String regexpr, boolean inCacheOnly) | |
public String[] | getReports() | |
protected CrawlServer | getServer(CrawlURI curi) | |
public synchronized ArrayList | getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) | |
public void | importRecoverLog(String pathToLog) | Method is not supported by this Frontier implementation. |
public void | importRecoverLog(String pathToLog, boolean retainFailures) | |
public synchronized void | initialize(CrawlController c) | |
protected synchronized void | innerFinished(CrawlURI curi) | |
protected void | innerSchedule(CandidateURI caUri) | |
protected boolean | isDisregarded(CrawlURI curi) | |
public boolean | isEmpty() | |
public void | kickUpdate() | |
public void | loadSeeds() | |
protected boolean | needsPromptRetry(CrawlURI curi) | |
protected boolean | needsRetrying(CrawlURI curi) | |
public synchronized CrawlURI | next() | |
public synchronized void | pause() | |
public synchronized long | queuedUriCount() | |
public void | receive(CandidateURI item) | |
public void | reportTo(PrintWriter writer) | |
public synchronized void | reportTo(String name, PrintWriter writer) | |
protected void | reschedule(CrawlURI curi, boolean errorWait) | Put near top of relevant hostQueue (but behind anything recently scheduled 'high'). Parameters: curi - CrawlURI to reschedule. |
public void | schedule(CandidateURI caURI) | |
protected boolean | shouldBeForgotten(CrawlURI curi) | Some URIs, if they recur, deserve another chance at consideration: they might not be too many hops away via another path, or the scope may have been updated to allow them passage. |
public String | singleLineLegend() | |
public String | singleLineReport() | |
public synchronized void | singleLineReportTo(PrintWriter w) | |
public void | start() | |
public long | succeededFetchCount() | |
protected void | successDisposition(CrawlURI curi) | The CrawlURI has been successfully crawled. |
public synchronized void | terminate() | |
public long | totalBytesWritten() | |
public synchronized void | unpause() | |
ACCEPTABLE_FORCE_QUEUE | final protected static String ACCEPTABLE_FORCE_QUEUE | | Acceptable characters in forced queue names:
word chars, dash, period, comma, colon.
|
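The character set listed for ACCEPTABLE_FORCE_QUEUE reads like a regular-expression character class. The sketch below shows how such a check could look, assuming the constant holds a pattern equivalent to `[-\w.,:]*`. The actual value of the constant is not shown on this page, so both the pattern and the helper class are assumptions.

```java
import java.util.regex.Pattern;

/** Hypothetical validator for forced queue names; the pattern is an assumed equivalent. */
final class ForceQueueNameSketch {
    // Assumed pattern: word characters, dash, period, comma, colon, any number of times.
    static final Pattern ACCEPTABLE = Pattern.compile("[-\\w.,:]*");

    private ForceQueueNameSketch() {}

    /** Returns true if every character of the name is in the acceptable set. */
    static boolean isAcceptable(String name) {
        return ACCEPTABLE.matcher(name).matches();
    }
}
```

With this pattern a name such as `news-queue.example,1:80` is accepted, while a name containing a space or a slash is rejected.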
ATTR_DELAY_FACTOR | final public static String ATTR_DELAY_FACTOR | | How many multiples of the last fetch's elapsed time to wait before recontacting
the same server.
|
ATTR_FORCE_QUEUE | final public static String ATTR_FORCE_QUEUE | | Queue assignment to force on CrawlURIs. Intended to be used
via overrides.
|
ATTR_HOST_VALENCE | final public static String ATTR_HOST_VALENCE | | Maximum simultaneous requests in process to a host (queue).
|
ATTR_MAX_DELAY | final public static String ATTR_MAX_DELAY | | Never wait more than this long, regardless of the delay-factor multiple.
|
ATTR_MAX_RETRIES | final public static String ATTR_MAX_RETRIES | | Maximum times to emit a CrawlURI without final disposition.
|
ATTR_MIN_DELAY | final public static String ATTR_MIN_DELAY | | Always wait this long after one completion before recontacting
the same server, regardless of the delay-factor multiple.
|
ATTR_PREFERENCE_EMBED_HOPS | final public static String ATTR_PREFERENCE_EMBED_HOPS | | Number of hops of embeds (ERX) to bump to front of host queue.
|
ATTR_QUEUE_IGNORE_WWW | final public static String ATTR_QUEUE_IGNORE_WWW | | Whether the queue assignment should ignore www in hostnames, effectively
stripping it away.
|
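To illustrate the effect of ATTR_QUEUE_IGNORE_WWW, the following sketch derives a queue key from a hostname. It assumes the key is simply the lower-cased host with a leading `www.` removed when the option is on. The helper name and the exact stripping rule are hypothetical and the real getClassKey logic may differ.

```java
import java.util.Locale;

/** Hypothetical queue-key derivation; not the actual getClassKey implementation. */
final class QueueKeySketch {
    private QueueKeySketch() {}

    /**
     * Returns a queue key for the host.
     * Assumption: only a literal leading "www." is stripped when ignoreWww is true.
     */
    static String queueKey(String host, boolean ignoreWww) {
        String key = host.toLowerCase(Locale.ROOT);
        if (ignoreWww && key.startsWith("www.")) {
            key = key.substring("www.".length());
        }
        return key;
    }
}
```

With the option enabled, `www.archive.org` and `archive.org` map to the same key and therefore share one host queue and one politeness budget.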
ATTR_RETRY_DELAY | final public static String ATTR_RETRY_DELAY | | For retryable problems, seconds to wait before a retry.
|
ATTR_USE_URI_UNIQ_FILTER | final public static String ATTR_USE_URI_UNIQ_FILTER | | Whether the Frontier should use a separate 'already included' data structure
or rely on the queues'.
|
DEFAULT_FORCE_QUEUE | final protected static String DEFAULT_FORCE_QUEUE | | |
DEFAULT_QUEUE_IGNORE_WWW | final protected static Boolean DEFAULT_QUEUE_IGNORE_WWW | | |
DEFAULT_USE_URI_UNIQ_FILTER | final protected static Boolean DEFAULT_USE_URI_UNIQ_FILTER | | |
AdaptiveRevisitFrontier | public AdaptiveRevisitFrontier(String name) | | |
AdaptiveRevisitFrontier | public AdaptiveRevisitFrontier(String name, String description) | | |
averageDepth | public long averageDepth() | | |
batchFlush | protected void batchFlush() | | |
calculateSnoozeTime | protected long calculateSnoozeTime(CrawlURI curi) | | Calculates how long a host queue needs to be snoozed following the
crawling of a URI.
Parameters: curi - The CrawlURI. Returns: How long to snooze. |
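The descriptions of ATTR_DELAY_FACTOR, ATTR_MIN_DELAY and ATTR_MAX_DELAY suggest how a snooze time could be derived. The sketch below is one plausible reading, assuming the wait is the last fetch duration times the delay factor, clamped between the minimum and maximum delay. It is not the body of calculateSnoozeTime, and the class and parameter names are hypothetical.

```java
/** Hypothetical politeness-delay calculation; not the actual calculateSnoozeTime body. */
final class SnoozeTimeSketch {
    private SnoozeTimeSketch() {}

    /**
     * Returns the snooze time in milliseconds.
     * Assumption: wait = fetchDuration * delayFactor, never below minDelayMillis
     * and never above maxDelayMillis.
     */
    static long snoozeMillis(long fetchDurationMillis, double delayFactor,
                             long minDelayMillis, long maxDelayMillis) {
        long wait = Math.round(fetchDurationMillis * delayFactor);
        if (wait < minDelayMillis) {
            wait = minDelayMillis;
        }
        if (wait > maxDelayMillis) {
            wait = maxDelayMillis;
        }
        return wait;
    }
}
```

Scaling the wait by the observed fetch time means a slow server is contacted less often than a fast one, while the two bounds keep the delay within a predictable range.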
canonicalize | protected String canonicalize(UURI uuri) | | Canonicalize passed uuri. It would be sweeter if this canonicalize
function was encapsulated by that which it canonicalizes, but because
settings change with context -- i.e. there may be overrides in operation
for a particular URI -- it's not so easy: each CandidateURI would need a
reference to the settings system. That's awkward to pass in.
Parameters: uuri - Candidate URI to canonicalize. Returns: Canonicalized version of passed uuri. |
canonicalize | protected String canonicalize(CandidateURI cauri) | | Canonicalize passed CandidateURI. This method differs from
AdaptiveRevisitFrontier.canonicalize(UURI) in that it takes a look at
the CandidateURI context, possibly overriding any canonicalization effect if
it could make us miss content. If canonicalization produces a URL that
was 'alreadyseen', but the entry in the 'alreadyseen' database did
nothing but redirect to the current URL, we won't get the current URL;
we'll think we've already seen it. Examples would be archive.org
redirecting to www.archive.org or the inverse, www.netarkivet.net
redirecting to netarkivet.net (assuming stripWWW rule enabled).
Note: under some circumstances this method sets the forceFetch flag.
Parameters: cauri - CandidateURI to examine. Returns: Canonicalized cauri. |
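The redirect problem described for canonicalize(CandidateURI) can be made concrete with a small sketch. It assumes a toy canonicalizer that only strips a leading `www.` after the scheme, and a hypothetical `shouldForceFetch` check that fires when a URL reached by redirect canonicalizes to the same string as the URL that redirected to it. Neither helper is taken from the Heritrix source.

```java
/** Hypothetical illustration of the redirect/already-seen collision; not Heritrix code. */
final class RedirectCanonicalizationSketch {
    private RedirectCanonicalizationSketch() {}

    /** Toy stripWWW-style rule: removes "www." directly after http:// or https://. */
    static String canonicalize(String url) {
        return url.replaceFirst("^(https?://)www\\.", "$1");
    }

    /**
     * Returns true when the URL would be wrongly treated as already seen:
     * it was reached via a redirect and has the same canonical form as its referrer.
     */
    static boolean shouldForceFetch(String url, String viaUrl, boolean reachedByRedirect) {
        return reachedByRedirect
                && viaUrl != null
                && canonicalize(url).equals(canonicalize(viaUrl));
    }
}
```

In the archive.org example, `http://archive.org/` redirecting to `http://www.archive.org/` yields identical canonical forms, so without a forced fetch the redirect target would be skipped.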
congestionRatio | public float congestionRatio() | | |
considerIncluded | public void considerIncluded(UURI u) | | |
crawlEnding | public void crawlEnding(String sExitMessage) | | |
crawlPaused | public void crawlPaused(String statusMessage) | | |
crawlPausing | public void crawlPausing(String statusMessage) | | |
crawlResuming | public void crawlResuming(String statusMessage) | | |
createAlreadyIncluded | protected UriUniqFilter createAlreadyIncluded() throws IOException | | Create a UriUniqFilter that will serve as record
of already seen URIs.
Returns: A UURISet that will serve as a record of already seen URIs. throws: IOException - |
deepestUri | public long deepestUri() | | |
deleteURIs | public synchronized long deleteURIs(String match) | | |
discoveredUriCount | public synchronized long discoveredUriCount() | | |
disregardDisposition | protected void disregardDisposition(CrawlURI curi) | | |
disregardedUriCount | public long disregardedUriCount() | | |
failedFetchCount | public long failedFetchCount() | | |
failureDisposition | protected void failureDisposition(CrawlURI curi) | | The CrawlURI has encountered a problem, and will not
be retried.
Parameters: curi - The CrawlURI |
finishedUriCount | public long finishedUriCount() | | |
getServer | protected CrawlServer getServer(CrawlURI curi) | | Parameters: curi - the CrawlURI. Returns: the CrawlServer to be associated with this CrawlURI. |
importRecoverLog | public void importRecoverLog(String pathToLog) throws IOException | | Method is not supported by this Frontier implementation.
Parameters: pathToLog - throws: IOException - |
importRecoverLog | public void importRecoverLog(String pathToLog, boolean retainFailures) throws IOException | | This method is not supported by this Frontier implementation.
Parameters: pathToLog - Parameters: retainFailures - throws: IOException - |
innerFinished | protected synchronized void innerFinished(CrawlURI curi) | | |
innerSchedule | protected void innerSchedule(CandidateURI caUri) | | Parameters: caUri - The URI to schedule. |
isEmpty | public boolean isEmpty() | | |
kickUpdate | public void kickUpdate() | | |
loadSeeds | public void loadSeeds() | | Loads the seeds.
This method is called by initialize() and kickUpdate().
|
needsPromptRetry | protected boolean needsPromptRetry(CrawlURI curi) throws AttributeNotFoundException | | Checks if a recently completed CrawlURI that did not finish successfully
needs to be retried immediately (processed again as soon as politeness
allows).
Parameters: curi - The CrawlURI to check. Returns: True if we need to retry promptly. throws: AttributeNotFoundException - If problems occur trying to read the maximum number of retries from the settings framework. |
needsRetrying | protected boolean needsRetrying(CrawlURI curi) throws AttributeNotFoundException | | Checks if a recently completed CrawlURI that did not finish successfully
needs to be retried (processed again after some time elapses).
Parameters: curi - The CrawlURI to check. Returns: True if we need to retry. throws: AttributeNotFoundException - If problems occur trying to read the maximum number of retries from the settings framework. |
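The two retry checks, together with ATTR_MAX_RETRIES and ATTR_RETRY_DELAY, imply a simple decision rule. The sketch below assumes a URI is retried only while its attempt count is below the maximum, and that a prompt retry is eligible immediately while an ordinary retry waits for the retry delay. Which fetch statuses count as retryable is not stated on this page, so it is passed in as a boolean. All names are hypothetical.

```java
/** Hypothetical retry decision; not the actual needsRetrying/needsPromptRetry logic. */
final class RetrySketch {
    private RetrySketch() {}

    /** Assumption: retry only retryable failures that have not used up maxRetries attempts. */
    static boolean needsRetrying(boolean retryableFailure, int fetchAttempts, int maxRetries) {
        return retryableFailure && fetchAttempts < maxRetries;
    }

    /**
     * Returns the earliest time (epoch millis) of the next attempt.
     * A prompt retry is eligible immediately (politeness permitting);
     * otherwise the retry delay, given in seconds, is added.
     */
    static long nextAttemptMillis(long nowMillis, long retryDelaySeconds, boolean promptRetry) {
        return promptRetry ? nowMillis : nowMillis + retryDelaySeconds * 1000L;
    }
}
```

Note that the retry delay is described in seconds, so it is converted to milliseconds before being added to the current time.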
pause | public synchronized void pause() | | |
queuedUriCount | public synchronized long queuedUriCount() | | |
reschedule | protected void reschedule(CrawlURI curi, boolean errorWait) throws AttributeNotFoundException | | Put near top of relevant hostQueue (but behind anything recently
scheduled 'high').
Parameters: curi - CrawlURI to reschedule. Its time of next processing is not modified. Parameters: errorWait - signals if there should be a wait before retrying. throws: AttributeNotFoundException - |
shouldBeForgotten | protected boolean shouldBeForgotten(CrawlURI curi) | | Some URIs, if they recur, deserve another
chance at consideration: they might not be too
many hops away via another path, or the scope
may have been updated to allow them passage.
Parameters: curi - the CrawlURI. Returns: True if curi should be forgotten. |
start | public void start() | | |
succeededFetchCount | public long succeededFetchCount() | | |
successDisposition | protected void successDisposition(CrawlURI curi) | | The CrawlURI has been successfully crawled.
Parameters: curi - The CrawlURI |
terminate | public synchronized void terminate() | | |
totalBytesWritten | public long totalBytesWritten() | | |
unpause | public synchronized void unpause() | | |
|
|