| java.lang.Object org.archive.crawler.util.SetBasedUriUniqFilter org.archive.crawler.util.BdbUriUniqFilter
BdbUriUniqFilter | public class BdbUriUniqFilter extends SetBasedUriUniqFilter implements Serializable(Code) | | A BDB implementation of an AlreadySeen list.
This implementation performs adequately without exhausting
the heap. See
AlreadySeen.
Keys are built so that URIs from the same server sit close to each other.
Mercator and section 2.3.5, 'Eliminating Already-Visited URLs', of 'Mining
the Web' by Soumen Chakrabarti describe a two-level key in which the first
24 bits are a hash of the host plus port and the last 40 bits are a hash of
the path. Testing showed that adopting such a scheme halved lookup times.
(This implementation actually uses scheme + host for the first 24 bits and
path + query for the trailing 40 bits.)
author: stack version: $Date: 2007-02-21 10:18:39 +0000 (Wed, 21 Feb 2007) $, $Revision: 4927 $ |
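The two-level key layout described above can be illustrated with a minimal sketch. This is not the class's own code: the class name `TwoLevelKeySketch`, the method `sketchKey`, and the use of a 64-bit FNV-1a hash are all assumptions made for illustration, and the fingerprint function used by BdbUriUniqFilter itself may differ. The sketch only shows how 24 bits derived from scheme + host and 40 bits derived from path + query can be packed into one long so that keys for the same server share a common prefix.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a 24-bit + 40-bit two-level key.
// FNV-1a is a stand-in hash chosen for this sketch only.
final class TwoLevelKeySketch {
    private static final long FNV_OFFSET = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    // Plain 64-bit FNV-1a over the UTF-8 bytes of the input.
    static long fnv1a64(CharSequence s) {
        long h = FNV_OFFSET;
        for (byte b : s.toString().getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= FNV_PRIME;
        }
        return h;
    }

    // Packs a hash of scheme + host into the top 24 bits and a hash of
    // path + query into the low 40 bits.
    static long sketchKey(CharSequence uri) {
        String url = uri.toString();
        int schemeEnd = url.indexOf("://");
        // First '/' after the authority marks the start of the path.
        int pathStart = url.indexOf('/', schemeEnd >= 0 ? schemeEnd + 3 : 0);
        if (pathStart < 0) {
            pathStart = url.length(); // no path component at all
        }
        String schemeAndHost = url.substring(0, pathStart);
        String pathAndQuery = url.substring(pathStart);

        long high24 = fnv1a64(schemeAndHost) & 0xFFFFFF0000000000L;
        long low40 = fnv1a64(pathAndQuery) >>> 24;
        return high24 | low40;
    }
}
```

Because the top 24 bits depend only on scheme + host, URIs from the same server get keys that sort next to each other in an ordered store, which is the locality the description credits for the faster lookups.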
Constructor Summary | |
protected | BdbUriUniqFilter() Shutdown default constructor. | public | BdbUriUniqFilter(Environment environment) Constructor. | public | BdbUriUniqFilter(File bdbEnv) Constructor.
Parameters: bdbEnv - The directory that holds the bdb environment. | public | BdbUriUniqFilter(File bdbEnv, int cacheSizePercentage) Constructor.
Parameters: bdbEnv - The directory that holds the bdb environment. |
ZERO_LENGTH_ENTRY | protected static DatabaseEntry ZERO_LENGTH_ENTRY(Code) | | |
alreadySeen | protected transient Database alreadySeen(Code) | | |
count | protected long count(Code) | | |
createdEnvironment | protected boolean createdEnvironment(Code) | | |
lastCacheMiss | protected long lastCacheMiss(Code) | | |
lastCacheMissDiff | protected long lastCacheMissDiff(Code) | | |
BdbUriUniqFilter | protected BdbUriUniqFilter()(Code) | | Shutdown default constructor.
|
BdbUriUniqFilter | public BdbUriUniqFilter(Environment environment) throws IOException(Code) | | Constructor.
Parameters: environment - A bdb environment ready-configured. throws: IOException - |
BdbUriUniqFilter | public BdbUriUniqFilter(File bdbEnv) throws IOException(Code) | | Constructor.
Parameters: bdbEnv - The directory that holds the bdb environment. Will make a database under here if one doesn't already exist. Otherwise reopens any existing dbs. throws: IOException - |
BdbUriUniqFilter | public BdbUriUniqFilter(File bdbEnv, int cacheSizePercentage) throws IOException(Code) | | Constructor.
Parameters: bdbEnv - The directory that holds the bdb environment. Will make a database under here if one doesn't already exist. Otherwise reopens any existing dbs. Parameters: cacheSizePercentage - Percentage of JVM memory that bdb allocates as its cache. Pass -1 to get default cache size. throws: IOException - |
close | public synchronized void close()(Code) | | |
createKey | public static long createKey(CharSequence uri)(Code) | | Create fingerprint.
Public access so test code can access createKey.
Parameters: uri - URI to fingerprint. Returns: Fingerprint of the passed URI. |
flush | public long flush()(Code) | | |
getCacheMisses | public synchronized long getCacheMisses() throws DatabaseException(Code) | | |
getDatabaseConfig | protected DatabaseConfig getDatabaseConfig()(Code) | | DatabaseConfig to use |
getLastCacheMissDiff | public long getLastCacheMissDiff()(Code) | | |
initialize | protected void initialize(Environment env) throws DatabaseException(Code) | | Method shared by constructors.
Parameters: env - Environment to use. throws: DatabaseException - |
open | protected void open(Environment env, DatabaseConfig dbConfig) throws DatabaseException(Code) | | |
reopen | public void reopen(Environment env) throws DatabaseException(Code) | | Call after deserializing an instance of this class. Opens the
already-seen database in the passed environment.
Parameters: env - DB Environment to use. throws: DatabaseException - |
setCount | protected long setCount()(Code) | | |