applySpecialHandling(CrawlURI curi) Perform any special handling of the CrawlURI, such as promoting its URI
to seed status, or giving it preference because it is an embed.
Canonicalize passed uuri. It would be sweeter if this canonicalize
function was encapsulated by that which it canonicalizes, but because
settings change with context -- i.e. there may be overrides in operation
for a particular URI -- it's not so easy: each CandidateURI would need a
reference to the settings system. That's awkward to pass in.
Parameters: uuri - Candidate URI to canonicalize. Returns: Canonicalized version of passed uuri.
Canonicalize passed CandidateURI. This method differs from
AbstractFrontier.canonicalize(UURI) in that it looks at the
CandidateURI's context, possibly overriding the effect of canonicalization
if it could make us miss content. If canonicalization produces a URL that
was 'alreadyseen', but the entry in the 'alreadyseen' database did
nothing but redirect to the current URL, we won't get the current URL;
we'll think we've already seen it. Examples would be archive.org
redirecting to www.archive.org or the inverse, www.netarkivet.net
redirecting to netarkivet.net (assuming stripWWW rule enabled).
Note: under some circumstances this method sets the forceFetch flag.
Parameters: cauri - CandidateURI to examine. Returns: Canonicalized cauri.
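
The override described above can be sketched roughly as follows. This is a minimal illustration, not the frontier's actual code: the Candidate class, its fields (uri, via, redirectTarget, forceFetch), and the regex-based stripWWW rule are all hypothetical stand-ins. It assumes the check is "the candidate is a redirect target, differs from the URI that redirected to it, yet canonicalizes to the same string".

```java
class CanonicalizeSketch {
    // Hypothetical stand-in for a CandidateURI: the URI itself, the URI
    // that led to it (via), and whether it was reached by a redirect.
    static final class Candidate {
        final String uri;
        final String via;
        final boolean redirectTarget;
        boolean forceFetch = false;

        Candidate(String uri, String via, boolean redirectTarget) {
            this.uri = uri;
            this.via = via;
            this.redirectTarget = redirectTarget;
        }
    }

    // Assumed canonicalization rule: strip a leading "www." from the host.
    static String canonicalize(String uri) {
        return uri.replaceFirst("^(https?://)www\\.", "$1");
    }

    // Context-aware variant: if a redirect leads to a different URI whose
    // canonical form equals that of its via, the already-seen check would
    // wrongly skip it, so mark the candidate for a forced fetch.
    static String canonicalize(Candidate c) {
        String canon = canonicalize(c.uri);
        if (c.redirectTarget && c.via != null
                && !c.uri.equals(c.via)
                && canonicalize(c.via).equals(canon)) {
            c.forceFetch = true;
        }
        return canon;
    }
}
```

With these assumptions, a candidate http://www.archive.org/ reached by redirect from http://archive.org/ canonicalizes to the same string as its via, so forceFetch is set and the content is still fetched.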
Increment the running count of queued URIs. Synchronized because
operations on longs are not atomic.
Parameters: increment - amount to increment the queued count
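
As an illustration of why the method is synchronized, here is a minimal sketch of a guarded long counter. The class and member names are hypothetical, not the frontier's own. An unsynchronized `count += increment` is a read-modify-write, so concurrent callers could lose updates.

```java
// Hypothetical stand-in for the frontier's queued-URI counter.
class QueuedCounter {
    private long queuedUriCount = 0;

    // synchronized: the read-modify-write on the long must not interleave
    // with another thread's update.
    synchronized void incrementQueuedUriCount(long increment) {
        queuedUriCount += increment;
    }

    // Reads are synchronized too, so callers see the latest value.
    synchronized long queuedUriCount() {
        return queuedUriCount;
    }
}
```

A java.util.concurrent.atomic.AtomicLong would be an alternative design; the synchronized form is shown because it matches the description above.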
Checks if a recently completed CrawlURI that did not finish successfully
needs to be retried (processed again after some time elapses).
Parameters: curi - The CrawlURI to check. Returns: True if we need to retry.
Perform fixups on a CrawlURI about to be returned via next().
Parameters: curi - CrawlURI about to be returned by next(); q - the queue from which the CrawlURI came
Update any scheduling structures with the new information in this
CrawlURI. Chiefly, this means arranging for no other URIs at the same
host to be visited within the appropriate politeness window.
Parameters: curi - The CrawlURI. Returns: millisecond politeness delay.
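
One way to picture the politeness window is a per-host "not before" timestamp. The sketch below is an assumption-laden illustration, not the frontier's implementation: the class, the method names, and the idea of passing the delay in as a parameter are all hypothetical, and how the real delay is computed is not specified here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-host scheduling structure.
class PolitenessSketch {
    // host -> earliest time (ms) at which the host may be visited again
    private final Map<String, Long> nextAllowedMs = new HashMap<>();

    // Record that no URI at this host may be visited before nowMs + delayMs,
    // and return the delay that was applied.
    long updateScheduling(String host, long nowMs, long delayMs) {
        nextAllowedMs.put(host, nowMs + delayMs);
        return delayMs;
    }

    // True if the host has no pending window, or the window has elapsed.
    boolean mayVisit(String host, long nowMs) {
        Long notBefore = nextAllowedMs.get(host);
        return notBefore == null || nowMs >= notBefore;
    }
}
```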
Utility method to return a scratch dir for the given key's temp files.
Every key gets its own subdir. To avoid having any one directory with
thousands of files, there are also two levels of enclosing directory
named by the least-significant hex digits of the key string's Java
hash code.
Parameters: key - key whose temp files need a scratch dir. Returns: File representing scratch directory.
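
The directory layout described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name, the explicit base-directory parameter, the use of two hex digits per level, and the order of the two levels are all hypothetical choices, not confirmed details of the actual method.

```java
import java.io.File;

class ScratchDirs {
    // Returns base/<last two hex digits>/<preceding two hex digits>/<key>.
    static File scratchDirFor(File base, String key) {
        String hex = Integer.toHexString(key.hashCode());
        // Left-pad so there are always at least four hex digits to slice.
        while (hex.length() < 4) {
            hex = "0" + hex;
        }
        int len = hex.length();
        String outer = hex.substring(len - 2);          // least-significant pair
        String inner = hex.substring(len - 4, len - 2); // next pair up
        return new File(new File(new File(base, outer), inner), key);
    }
}
```

For the key "a" (hash code 97, hex 61, padded to 0061) this yields the path base/61/00/a. The sketch only builds the File; it does not create the directory, and it assumes the key contains no path separators.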