| A simple crawl splitter/mapper, dividing up CandidateURIs/CrawlURIs
between crawlers by diverting some range of URIs to local log files
(which can then be imported to other crawlers).
May operate on a CrawlURI (typically early in the processing chain) or
its CandidateURI outlinks (late in the processing chain, after
LinksScoper), or both (if inserted and configured in both places).
Uses lexical comparisons of classKeys to map URIs to crawlers. The
'map' is specified via either a local or HTTP-fetchable file. Each
line of this file should contain two space-separated tokens, the
first a key and the second a crawler node name (which should be
legal as part of a filename). All URIs will be mapped to the crawler
node name associated with the nearest mapping key equal to or subsequent
to the URI's own classKey. If there are no mapping keys equal to or
after the classKey, the mapping 'wraps around' to the first mapping key.
One crawler name is distinguished as the 'local name'; URIs mapped to
this name are not diverted, but continue to be processed normally.
For example, assume a SurtAuthorityQueueAssignmentPolicy and
a simple mapping file:
d crawlerA
~ crawlerB
All URIs with "com," classKeys will find the 'd' key as the nearest
subsequent mapping key, and thus be mapped to 'crawlerA'. If that's
the 'local name', the URIs will be processed normally; otherwise, the
URIs will be written to a diversion log destined for 'crawlerA'.
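The lookup rule described above (nearest key equal to or after the classKey, wrapping around to the first key) can be illustrated with a sorted map. This is only a minimal sketch, not the actual LexicalCrawlMapper code: the class name LexicalMapSketch, the helpers parse and nodeFor, and the sample classKeys are assumptions made for illustration, and String's natural ordering stands in for the lexical comparison.

```java
import java.util.Map;
import java.util.TreeMap;

public class LexicalMapSketch {

    // Hypothetical helper: turn "key nodeName" lines into a sorted map.
    public static TreeMap<String, String> parse(String mapText) {
        TreeMap<String, String> map = new TreeMap<>();
        for (String line : mapText.split("\n")) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length == 2) {        // skip blank or malformed lines
                map.put(tokens[0], tokens[1]);
            }
        }
        return map;
    }

    // Hypothetical helper: find the node name for a classKey.
    public static String nodeFor(TreeMap<String, String> map, String classKey) {
        if (map.isEmpty()) {
            throw new IllegalStateException("mapping is empty");
        }
        // Nearest mapping key equal to or after the classKey...
        Map.Entry<String, String> entry = map.ceilingEntry(classKey);
        if (entry == null) {
            // ...or wrap around to the first mapping key.
            entry = map.firstEntry();
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, String> map = parse("d crawlerA\n~ crawlerB\n");
        String localName = "crawlerA";       // assumed local name

        // Illustrative classKeys; the last one sorts after '~' and wraps around.
        String[] classKeys = { "com,example,", "org,example,", "\u00e9xample," };
        for (String classKey : classKeys) {
            String node = nodeFor(map, classKey);
            String action = node.equals(localName) ? "process locally" : "divert";
            System.out.println(classKey + " -> " + node + " (" + action + ")");
        }
    }
}
```

With this mapping, a "com," classKey resolves to 'crawlerA' and an "org," classKey resolves to 'crawlerB'; a classKey sorting after '~' wraps around to 'crawlerA'.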
When using the JMX importUris operation to import URIs dropped by
a LexicalCrawlMapper instance, use the recoveryLog style.
author: gojomo version: $Date: 2006-09-26 20:38:48 +0000 (Tue, 26 Sep 2006) $, $Revision: 4667 $ |