| java.lang.Object org.archive.crawler.util.SetBasedUriUniqFilter org.archive.crawler.util.BloomUriUniqFilter
BloomUriUniqFilter | public class BloomUriUniqFilter extends SetBasedUriUniqFilter implements Serializable(Code) | | An MG4J BloomFilter-based implementation of an AlreadySeen list.
This implementation performs adequately without blowing out
the heap through to very large numbers of URIs. See
AlreadySeen.
It is inherent to Bloom filters that as they get 'saturated', their
false-positive rate rises. The default parameters used by this class
attempt to maintain a 1-in-4 million (1 in 2^22) false-positive chance
through 125 million unique inserts, which creates a filter structure
about 495MB in size.
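As a quick check of the figure quoted above, the following minimal sketch converts a hash count d into the nominal "1 in 2^d" odds. The class and method names are hypothetical and are not part of Heritrix; the 2^-d relationship is the one stated in the constructor documentation, and it only holds while the filter contains no more than the expected number of elements.

```java
// Illustrative sketch only: BloomFalsePositiveOdds is a hypothetical helper,
// not a Heritrix class.
final class BloomFalsePositiveOdds {

    /** Returns x such that the nominal false-positive chance is 1 in x (that is, 2^d). */
    static long oneIn(int d) {
        if (d < 0 || d > 62) {
            throw new IllegalArgumentException("d out of range: " + d);
        }
        return 1L << d;
    }

    public static void main(String[] args) {
        // Default hash-count of 22, as documented for this class.
        System.out.println("d=22 -> 1 in " + oneIn(22));
    }
}
```

With the default of 22 hash functions this gives 1 in 4,194,304, which is the "1-in-4 million" figure rounded.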
You may use the following system properties to tune the size and
false-positive rate of the Bloom filter structure used by this class:
org.archive.crawler.util.BloomUriUniqFilter.expected-size (default 125000000)
org.archive.crawler.util.BloomUriUniqFilter.hash-count (default 22)
The resulting filter will take up approximately...
1.44 * expected-size * hash-count / 8
...bytes.
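The size formula above can be evaluated before starting a crawl. The sketch below reads the two documented system properties with their documented defaults and applies the formula. The class name, method name, and constant names are hypothetical, and this is not how BloomUriUniqFilter itself reads the properties; it only reproduces the arithmetic.

```java
// Illustrative sketch only: BloomSizeEstimate is a hypothetical helper,
// not a Heritrix class.
final class BloomSizeEstimate {

    // Property names as documented for BloomUriUniqFilter.
    static final String EXPECTED_SIZE_PROP =
            "org.archive.crawler.util.BloomUriUniqFilter.expected-size";
    static final String HASH_COUNT_PROP =
            "org.archive.crawler.util.BloomUriUniqFilter.hash-count";

    /** Approximate filter size in bytes: 1.44 * expected-size * hash-count / 8. */
    static long approxBytes(long expectedSize, int hashCount) {
        return Math.round(1.44 * expectedSize * hashCount / 8);
    }

    public static void main(String[] args) {
        // Fall back to the documented defaults when the properties are unset.
        int n = Integer.getInteger(EXPECTED_SIZE_PROP, 125_000_000);
        int d = Integer.getInteger(HASH_COUNT_PROP, 22);
        System.out.println("expected-size=" + n + ", hash-count=" + d
                + " -> approx. " + approxBytes(n, d) + " bytes");
    }
}
```

With the defaults this yields 495,000,000 bytes, matching the "about 495MB" figure. The properties would be supplied on the JVM command line with `-D`, for example `-Dorg.archive.crawler.util.BloomUriUniqFilter.hash-count=20`.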
The default size is very close to the maximum practical size of the
Bloom filter implementation, BloomFilter32bitSplit, created in the
initialize() method, due to integer arithmetic limits.
If you need a larger filter, you should edit the initialize()
method to instantiate a BloomFilter64bit instead.
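To see why the default sits near an integer arithmetic limit, the sketch below computes the approximate bit count for the default parameters and compares it with two 32-bit bounds. The class and method names are hypothetical. Which bound BloomFilter32bitSplit actually runs into is an assumption here; the documentation above says only that the limit comes from integer arithmetic.

```java
// Illustrative sketch only: BloomBitBudget is a hypothetical helper,
// not a Heritrix class.
final class BloomBitBudget {

    /** Approximate number of bits in the filter: 1.44 * n * d. */
    static long approxBits(long n, int d) {
        return Math.round(1.44 * n * d);
    }

    public static void main(String[] args) {
        long bits = approxBits(125_000_000L, 22);
        long twoTo32 = 1L << 32;
        System.out.println("approx. bits for defaults: " + bits);
        System.out.println("2^32:                      " + twoTo32);
        // Assumption: 32-bit indexing is the relevant constraint.
        System.out.println("below 2^32:                " + (bits < twoTo32));
        System.out.println("above Integer.MAX_VALUE:   " + (bits > Integer.MAX_VALUE));
    }
}
```

The default configuration needs roughly 3.96 billion bits, which already exceeds Integer.MAX_VALUE and leaves less than 10% headroom below 2^32.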
author: gojomo version: $Date: 2006-09-22 18:39:39 +0000 (Fri, 22 Sep 2006) $, $Revision: 4647 $ |
EXPECTED_SIZE_KEY | final protected static String EXPECTED_SIZE_KEY(Code) | | |
HASH_COUNT_KEY | final protected static String HASH_COUNT_KEY(Code) | | |
expected_n | protected int expected_n(Code) | | |
BloomUriUniqFilter | public BloomUriUniqFilter()(Code) | | Default constructor
|
BloomUriUniqFilter | public BloomUriUniqFilter(int n, int d)(Code) | | Constructor.
Parameters: n - the expected number of elements; d - the number of hash functions. If the filter holds no more than n elements, false positives will occur with probability 2^-d. |
initialize | protected void initialize(int n, int d)(Code) | | Initializer shared by constructors.
Parameters: n - the expected number of elements; d - the number of hash functions. If the filter holds no more than n elements, false positives will occur with probability 2^-d. |
setCount | protected long setCount()(Code) | | |
|
|