java.lang.Object
   org.archive.crawler.util.FPMergeUriUniqFilter
All known Subclasses: org.archive.crawler.util.DiskFPMergeUriUniqFilter, org.archive.crawler.util.MemFPMergeUriUniqFilter
FPMergeUriUniqFilter | abstract public class FPMergeUriUniqFilter implements UriUniqFilter(Code) | | UriUniqFilter based on merging FP arrays (in memory or from disk).
Inspired by the approach in Najork and Heydon, "High-Performance
Web Crawling" (2001), section 3.2, "Efficient Duplicate URL
Eliminators".
author: gojomo |
Inner Class :public class PendingItem implements Comparable | |
DEFAULT_MAX_PENDING | final public static int DEFAULT_MAX_PENDING(Code) | | |
FLUSH_DELAY_FACTOR | final public static long FLUSH_DELAY_FACTOR(Code) | | |
maxPending | protected int maxPending(Code) | | size at which to force flush of pending items
|
mergeDupAtLast | protected long mergeDupAtLast(Code) | | |
mergeDuplicateCount | protected long mergeDuplicateCount(Code) | | |
nextFlushAllowableAfter | protected long nextFlushAllowableAfter(Code) | | time-based throttle on flush-merge operations
|
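The page does not say how nextFlushAllowableAfter and FLUSH_DELAY_FACTOR interact. One plausible reading, sketched below purely as an assumption, is that each flush pushes the next allowable flush out by a multiple of how long the flush took, so costly merges happen less often. The class FlushThrottle and its methods are hypothetical and not part of the documented API.

```java
/** Hypothetical sketch of a time-based flush throttle; not the library's code. */
final class FlushThrottle {
    private final long delayFactor;            // plays the role of FLUSH_DELAY_FACTOR (assumed meaning)
    private long nextFlushAllowableAfter = 0;  // timestamp in milliseconds

    FlushThrottle(long delayFactor) {
        this.delayFactor = delayFactor;
    }

    /** True if a non-forced flush may run at the given time. */
    boolean mayFlush(long nowMillis) {
        return nowMillis > nextFlushAllowableAfter;
    }

    /** Records a finished flush and delays the next one in proportion to its duration. */
    void recordFlush(long startMillis, long endMillis) {
        long duration = endMillis - startMillis;
        nextFlushAllowableAfter = endMillis + delayFactor * duration;
    }
}
```

Passing timestamps in explicitly, rather than calling System.currentTimeMillis() inside the class, keeps the sketch easy to exercise in a test.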
pendDupAtLast | protected long pendDupAtLast(Code) | | |
pendDuplicateCount | protected long pendDuplicateCount(Code) | | |
pendingSet | protected TreeSet<PendingItem> pendingSet(Code) | | items awaiting merge
TODO: consider only sorting just pre-merge
TODO: consider using a fastutil long->Object class
TODO: consider actually writing items to disk file,
as in Najork/Heydon
|
quickDupAtLast | protected long quickDupAtLast(Code) | | |
quickDuplicateCount | protected long quickDuplicateCount(Code) | | |
receiver | protected HasUriReceiver receiver(Code) | | |
FPMergeUriUniqFilter | public FPMergeUriUniqFilter()(Code) | | |
addNewFp | abstract protected void addNewFp(long fp)(Code) | | Add an FP (which may be an old or new FP) to the new complete
list. Should only be called after beginFpMerge() and before
finishFpMerge().
Parameters: fp - the FP to add |
beginFpMerge | abstract protected LongIterator beginFpMerge()(Code) | | Begin merging pending candidates with complete list. Return an
Iterator which will return all previously-known FPs in turn.
Returns: Iterator over all previously-known FPs |
close | public void close()(Code) | | |
createFp | public static long createFp(CharSequence key)(Code) | | Create a fingerprint from the given key
Parameters: key - CharSequence (URI) to fingerprint Returns: long fingerprint |
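To illustrate the shape of createFp, here is a minimal sketch that maps a CharSequence to a 64-bit value. It uses FNV-1a as a stand-in hash, which is an assumption made for illustration; the page does not state which fingerprint function the real method uses. The class name FingerprintSketch is hypothetical.

```java
/** Hypothetical stand-in for createFp using 64-bit FNV-1a; not the library's hash. */
final class FingerprintSketch {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    /** Returns a deterministic 64-bit fingerprint of the given key. */
    static long createFp(CharSequence key) {
        long hash = FNV_OFFSET_BASIS;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            // Feed the low byte, then the high byte, of each UTF-16 code unit.
            hash ^= (c & 0xff);
            hash *= FNV_PRIME;
            hash ^= (c >>> 8);
            hash *= FNV_PRIME;
        }
        return hash;
    }
}
```

Because the parameter is a CharSequence, a String and a StringBuilder holding the same characters yield the same fingerprint.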
finishFpMerge | abstract protected void finishFpMerge()(Code) | | Complete the merge of candidate and previously-known FPs (closing
files/iterators as appropriate).
|
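The three abstract methods beginFpMerge(), addNewFp() and finishFpMerge() form a small protocol: open the old list for reading, write every surviving FP to a new list, then swap the new list in. The sketch below shows one way an in-memory store could honor that contract. It is not the code of MemFPMergeUriUniqFilter; the class InMemoryFpStore is hypothetical, and java.util.PrimitiveIterator.OfLong stands in for the LongIterator type named in the signature.

```java
import java.util.Arrays;
import java.util.PrimitiveIterator;

/** Hypothetical in-memory store following the begin/add/finish merge protocol. */
final class InMemoryFpStore {
    private long[] fps = new long[0]; // complete sorted list from the last merge
    private long[] next;              // list being built during the current merge
    private int nextSize;

    /** Starts a merge and returns an iterator over all previously-known FPs. */
    PrimitiveIterator.OfLong beginFpMerge() {
        next = new long[Math.max(16, fps.length)];
        nextSize = 0;
        return Arrays.stream(fps).iterator();
    }

    /** Appends an FP (old or new) to the list under construction. */
    void addNewFp(long fp) {
        if (next == null) {
            throw new IllegalStateException("addNewFp called outside a merge");
        }
        if (nextSize == next.length) {
            next = Arrays.copyOf(next, next.length * 2);
        }
        next[nextSize++] = fp;
    }

    /** Completes the merge by replacing the old list with the new one. */
    void finishFpMerge() {
        fps = Arrays.copyOf(next, nextSize);
        next = null;
    }

    /** Returns a copy of the current complete list. */
    long[] snapshot() {
        return fps.clone();
    }
}
```

A disk-backed variant would presumably follow the same sequence with files in place of arrays, which is why finishFpMerge() is described as closing files and iterators as appropriate.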
flush | public synchronized long flush()(Code) | | Perform a merge of all 'pending' items to the overall fingerprint list.
If a pending item is new and has an associated CandidateURI, pass that
URI along to the 'receiver' (frontier) for queueing.
Returns: number of pending items actually added |
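The flush described above amounts to a single pass over two sorted sequences: the pending FPs and the previously-known FPs. A minimal sketch follows, under several assumptions: URIs are plain Strings rather than CandidateURI objects, the receiver is a java.util.function.Consumer rather than a HasUriReceiver, and the known list is a long[] held in memory. The class FlushSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Hypothetical sketch of a sorted merge of pending FPs into the known list. */
final class FlushSketch {
    private long[] known = new long[0];                              // sorted, no duplicates
    private final TreeMap<Long, String> pending = new TreeMap<>();  // fp -> URI (may be null)
    private final Consumer<String> receiver;                         // stand-in for the frontier

    FlushSketch(Consumer<String> receiver) {
        this.receiver = receiver;
    }

    void pend(long fp, String uri) {
        pending.putIfAbsent(fp, uri);
    }

    /** Merges all pending items; returns the number actually added. */
    long flush() {
        long[] merged = new long[known.length + pending.size()];
        int k = 0, n = 0;
        long added = 0;
        for (Map.Entry<Long, String> e : pending.entrySet()) {
            long fp = e.getKey();
            // Copy every known FP smaller than the current pending FP.
            while (k < known.length && known[k] < fp) {
                merged[n++] = known[k++];
            }
            if (k < known.length && known[k] == fp) {
                continue; // already known: a duplicate, so drop the pending item
            }
            merged[n++] = fp;
            added++;
            if (e.getValue() != null) {
                receiver.accept(e.getValue()); // new URI goes on for queueing
            }
        }
        while (k < known.length) {
            merged[n++] = known[k++];
        }
        known = Arrays.copyOf(merged, n);
        pending.clear();
        return added;
    }

    long count() {
        return known.length;
    }
}
```

Each flush touches every known FP once, so batching many pending items per flush spreads that cost across them.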
pend | protected void pend(long fp, CandidateURI value)(Code) | | Place the given FP/CandidateURI pair into the pending set, awaiting
a merge to determine if it's actually accepted.
Parameters: fp - long fingerprint Parameters: value - CandidateURI or null, if fp only needs merging (as when CandidateURI was already forced in) |
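Since pendingSet is a TreeSet of Comparable PendingItem objects, ordering items by fingerprint lets the set detect a repeated FP at insert time. The sketch below assumes that comparison is by fp alone and that a rejected insert bumps pendDuplicateCount; neither detail is confirmed by this page. A String stands in for CandidateURI, the class PendingSketch is hypothetical, and the forced flush at maxPending is left out.

```java
import java.util.TreeSet;

/** Hypothetical sketch of the pending set; not the library's code. */
final class PendingSketch {
    /** An FP/URI pair ordered by fingerprint only (assumed ordering). */
    static final class PendingItem implements Comparable<PendingItem> {
        final long fp;
        final String uri; // stand-in for CandidateURI; may be null

        PendingItem(long fp, String uri) {
            this.fp = fp;
            this.uri = uri;
        }

        @Override
        public int compareTo(PendingItem other) {
            return Long.compare(fp, other.fp);
        }
    }

    final TreeSet<PendingItem> pendingSet = new TreeSet<>();
    long pendDuplicateCount = 0;

    /** Queues the pair for the next merge, counting FPs already pending. */
    void pend(long fp, String uri) {
        // TreeSet treats items whose compareTo returns 0 as equal,
        // so a second item with the same fp is rejected.
        if (!pendingSet.add(new PendingItem(fp, uri))) {
            pendDuplicateCount++;
        }
    }
}
```

Keeping the set sorted at all times means the later merge can walk it in fingerprint order without a separate sort step.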
pending | public long pending()(Code) | | |
requestFlush | public synchronized long requestFlush()(Code) | | |
setDestination | public void setDestination(HasUriReceiver receiver)(Code) | | |
setMaxPending | public void setMaxPending(int max)(Code) | | |
setProfileLog | public void setProfileLog(File logfile)(Code) | | |