| java.lang.Object org.archive.crawler.frontier.BdbMultipleWorkQueues
BdbMultipleWorkQueues | public class BdbMultipleWorkQueues (Code) | | A BerkeleyDB-database-backed structure for holding ordered
groupings of CrawlURIs. Reading the groupings from specific
per-grouping (per-classKey/per-Host) starting points allows
this to act as a collection of independent queues.
For how the bdb keys are made, see
BdbMultipleWorkQueues.calculateInsertKey(CrawlURI) .
TODO: refactor, improve naming.
author: gojomo |
Constructor Summary | |
public | BdbMultipleWorkQueues(Environment env, StoredClassCatalog classCatalog, boolean recycle) Create the multi queue in the given environment. |
Method Summary | |
public void | addCap(byte[] origin) Add a dummy 'cap' entry at the given insertion key. |
static DatabaseEntry | calculateInsertKey(CrawlURI curi) Calculate the insertKey that places a CrawlURI in the desired spot. |
static byte[] | calculateOriginKey(String classKey) Calculate the 'origin' key for a virtual queue of items with the given classKey. |
public void | close() |
public void | delete(CrawlURI item) Delete the given CrawlURI from persistent store. |
public long | deleteMatchingFromQueue(String match, String queue, DatabaseEntry headKey) Delete all CrawlURIs matching the given expression. |
public CrawlURI | get(DatabaseEntry headKey) Get the next nearest item after the given key. |
protected DatabaseEntry | getFirstKey() |
public List | getFrom(FrontierMarker m, int maxMatches) |
public FrontierMarker | getInitialMarker(String regexpr) |
protected OperationStatus | getNextNearestItem(DatabaseEntry headKey, DatabaseEntry result) |
public void | put(CrawlURI curi, boolean overwriteIfPresent) Put the given CrawlURI in at the appropriate place. |
void | sync() Method used by BdbFrontier during checkpointing. The backing bdbje database has been marked deferred write so we save on writes to disk. |
BdbMultipleWorkQueues | public BdbMultipleWorkQueues(Environment env, StoredClassCatalog classCatalog, boolean recycle) throws DatabaseException(Code) | | Create the multi queue in the given environment.
Parameters: env - bdb environment to use Parameters: classCatalog - Class catalog to use. Parameters: recycle - True if we are to reuse db content if any. throws: DatabaseException - |
addCap | public void addCap(byte[] origin)(Code) | | Add a dummy 'cap' entry at the given insertion key. Prevents
'seeks' to queue heads from holding lock on last item of
'preceding' queue. See:
http://sourceforge.net/tracker/index.php?func=detail&aid=1262665&group_id=73833&atid=539102
Parameters: origin - key at which to insert the cap |
calculateInsertKey | static DatabaseEntry calculateInsertKey(CrawlURI curi)(Code) | | Calculate the insertKey that places a CrawlURI in the
desired spot. First bytes are always classKey (usu. host)
based -- ensuring grouping by host -- terminated by a zero
byte. Then 8 bytes of data ensuring desired ordering
within that 'queue' are used. The first byte of these 8 is
priority -- allowing 'immediate' and 'soon' items to
sort above regular. Next 1 byte is 'cost'. Last 6 bytes
are ordinal serial number, ensuring earlier-discovered
URIs sort before later.
NOTE: Dangers here are:
(1) priorities or costs over 2^7 (signed byte comparison)
(2) ordinals over 2^48
Package access & static for testing purposes.
Parameters: curi - CrawlURI to calculate the key for. Returns: a DatabaseEntry key for the CrawlURI |
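The key layout described for calculateInsertKey can be sketched as follows. This is a minimal illustration and not the Heritrix source: the class name InsertKeySketch, the method insertKey, the UTF-8 encoding of the classKey, the big-endian ordinal, and the sample priority values are all assumptions made for the example. It also shows danger (1): a priority above 127 wraps to a negative byte and would be misordered under a signed comparison.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: mirrors the layout described above, not the actual implementation.
class InsertKeySketch {

    /** Builds: classKey bytes, a zero terminator, then 8 ordering bytes. */
    static byte[] insertKey(String classKey, int priority, int cost, long ordinal) {
        byte[] prefix = classKey.getBytes(StandardCharsets.UTF_8); // encoding is an assumption
        byte[] key = new byte[prefix.length + 1 + 8];
        System.arraycopy(prefix, 0, key, 0, prefix.length);
        int p = prefix.length;
        key[p++] = 0;               // terminator between classKey and ordering bytes
        key[p++] = (byte) priority; // 1 byte: priority
        key[p++] = (byte) cost;     // 1 byte: cost
        for (int shift = 40; shift >= 0; shift -= 8) {
            key[p++] = (byte) (ordinal >>> shift); // 6 bytes: ordinal, most significant first
        }
        return key;
    }

    public static void main(String[] args) {
        // Hypothetical priority values: lower numbers are assumed to sort first.
        byte[] soon = insertKey("example.com", 1, 0, 500L);
        byte[] regular = insertKey("example.com", 3, 0, 10L);
        System.out.println("soon sorts before regular: "
                + (Arrays.compareUnsigned(soon, regular) < 0));

        // Priority 200 does not fit in a signed byte; a signed comparison misorders it.
        byte[] over = insertKey("example.com", 200, 0, 1L);
        System.out.println("signed compare puts 200 before 3: "
                + (Arrays.compare(over, regular) < 0));
    }
}
```

Within one classKey, keys compare first on priority, then cost, then ordinal, so earlier-discovered URIs of equal priority and cost come first.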
calculateOriginKey | static byte[] calculateOriginKey(String classKey)(Code) | | Calculate the 'origin' key for a virtual queue of items
with the given classKey. This origin key will be a
prefix of the keys for all items in the queue.
Parameters: classKey - String key to derive origin byte key from. Returns: a byte array key |
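The prefix property can be illustrated with a short sketch. It assumes the origin key is the classKey bytes followed by the zero terminator mentioned under calculateInsertKey; that assumption, the UTF-8 encoding, and the names OriginKeySketch, originKey, itemKey, and hasPrefix are hypothetical and chosen only for this example. Under that assumption, the terminator keeps the queue for "example.com" from also matching keys that belong to "example.com.au".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: assumes origin key = classKey bytes + one zero byte.
class OriginKeySketch {

    static byte[] originKey(String classKey) {
        byte[] raw = classKey.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(raw, raw.length + 1); // the padded final byte is 0
    }

    /** Stand-in item key: origin key followed by 8 ordering bytes. */
    static byte[] itemKey(String classKey, long ordering) {
        byte[] origin = originKey(classKey);
        byte[] key = Arrays.copyOf(origin, origin.length + 8);
        for (int i = 0; i < 8; i++) {
            key[origin.length + i] = (byte) (ordering >>> (56 - 8 * i));
        }
        return key;
    }

    static boolean hasPrefix(byte[] key, byte[] prefix) {
        return key.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(key, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] origin = originKey("example.com");
        System.out.println(hasPrefix(itemKey("example.com", 7L), origin));
        System.out.println(hasPrefix(itemKey("example.com.au", 7L), origin));
    }
}
```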
close | public void close()(Code) | | clean up
|
delete | public void delete(CrawlURI item) throws DatabaseException(Code) | | Delete the given CrawlURI from persistent store. Requires
the key under which it was stored be available.
Parameters: item - throws: DatabaseException - |
deleteMatchingFromQueue | public long deleteMatchingFromQueue(String match, String queue, DatabaseEntry headKey) throws DatabaseException(Code) | | Delete all CrawlURIs matching the given expression.
Parameters: match - Parameters: queue - Parameters: headKey - Returns: count of deleted items throws: DatabaseException - |
get | public CrawlURI get(DatabaseEntry headKey) throws DatabaseException(Code) | | Get the next nearest item after the given key. Relies on
external discipline -- we'll look at the queue's count of how many
items it has -- to avoid asking for something from a
range where there are no associated items --
otherwise could get first item of next 'queue' by mistake.
TODO: hold within a queue's range
Parameters: headKey - Key prefix that demarks the beginning of the range in pendingUrisDB we're interested in. Returns: CrawlURI. throws: DatabaseException - |
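The "first item of next queue by mistake" hazard can be reproduced with an in-memory sorted map standing in for the database. This is only a sketch: java.util.TreeMap replaces the BerkeleyDB store, and the class NextNearestSketch, its helper methods, and the shortened key layout (classKey, zero byte, one ordering byte) are assumptions made for illustration. The guarded variant shows one way the TODO could be approached, by checking that the found key still starts with the head key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a TreeMap stands in for the sorted key space of the database.
class NextNearestSketch {
    static final Comparator<byte[]> UNSIGNED = Arrays::compareUnsigned;
    final TreeMap<byte[], String> store = new TreeMap<>(UNSIGNED);

    /** Head key: classKey bytes plus a zero terminator (assumed layout). */
    static byte[] head(String classKey) {
        byte[] raw = classKey.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(raw, raw.length + 1);
    }

    /** Simplified item key: head key plus a single ordering byte. */
    static byte[] key(String classKey, int ordinal) {
        byte[] h = head(classKey);
        byte[] k = Arrays.copyOf(h, h.length + 1);
        k[k.length - 1] = (byte) ordinal;
        return k;
    }

    /** Unguarded lookup: nearest entry at or after headKey, whichever queue owns it. */
    String get(byte[] headKey) {
        Map.Entry<byte[], String> e = store.ceilingEntry(headKey);
        return e == null ? null : e.getValue();
    }

    /** Guarded lookup: returns null unless the found key starts with headKey. */
    String getWithinQueue(byte[] headKey) {
        Map.Entry<byte[], String> e = store.ceilingEntry(headKey);
        if (e == null) {
            return null;
        }
        byte[] found = e.getKey();
        boolean inRange = found.length >= headKey.length
                && Arrays.equals(Arrays.copyOf(found, headKey.length), headKey);
        return inRange ? e.getValue() : null;
    }

    public static void main(String[] args) {
        NextNearestSketch q = new NextNearestSketch();
        q.store.put(key("b.example", 1), "http://b.example/1");
        // The a.example queue is empty, so the unguarded lookup spills into b.example.
        System.out.println(q.get(head("a.example")));
        System.out.println(q.getWithinQueue(head("a.example")));
    }
}
```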
getFirstKey | protected DatabaseEntry getFirstKey() throws DatabaseException(Code) | | Returns: the key to the first item in the database. throws: DatabaseException - |
getFrom | public List getFrom(FrontierMarker m, int maxMatches) throws DatabaseException(Code) | | Parameters: m - marker Parameters: maxMatches - Returns: list of matches starting from marker position. throws: DatabaseException - |
getInitialMarker | public FrontierMarker getInitialMarker(String regexpr)(Code) | | Get a marker for beginning a scan over all contents
Parameters: regexpr - Returns: a marker pointing to the first item |
getNextNearestItem | protected OperationStatus getNextNearestItem(DatabaseEntry headKey, DatabaseEntry result) throws DatabaseException(Code) | | |
put | public void put(CrawlURI curi, boolean overwriteIfPresent) throws DatabaseException(Code) | | Put the given CrawlURI in at the appropriate place.
Parameters: curi - throws: DatabaseException - |
sync | void sync()(Code) | | Method used by BdbFrontier during checkpointing.
The backing bdbje database has been marked deferred write so we save
on writes to disk. This means there is no guarantee the disk will have what's in memory
unless sync is called (calling sync on the bdbje Environment is not
sufficient).
Package access only because only Frontiers of this package would ever
need access.
See Also: Deferred Write Databases |