A Bloom filter.
SLIGHTLY ADAPTED VERSION OF MG4J it.unimi.dsi.mg4j.util.BloomFilter
KEY CHANGES:
- Adapted to use 32bit ops as much as possible... may be slightly
faster on 32bit hardware/OS
- NUMBER_OF_WEIGHTS is 2083, to better avoid collisions between
similar strings
- Removed dependence on cern.colt MersenneTwister (replaced with
SecureRandom) and QuickBitVector (replaced with local methods).
Instances of this class represent a set of character sequences (with false positives)
using a Bloom filter. Because of the way Bloom filters work,
you cannot remove elements.
Bloom filters have an expected error rate, depending on the number
of hash functions used, on the filter size and on the number of elements in the filter. This implementation
uses a variable optimal number of hash functions, depending on the expected
number of elements. More precisely, a Bloom
filter for n character sequences with d hash functions will use
ln 2 dn ≈ 1.44 dn bits;
false positives will happen with probability 2-d.
Hash functions are generated at creation time using universal hashing. Each hash function
uses
BloomFilter32bitSplit.NUMBER_OF_WEIGHTS random integers, which are cyclically multiplied by
the character codes in a character sequence. The resulting integers are XOR-ed together.
This class exports access methods that are very similar to those of
java.util.Set ,
but it does not implement that interface, as too many non-optional methods
would be unimplementable (e.g., iterators).
author: Sebastiano Vigna |