| java.lang.Object org.archive.crawler.frontier.AdaptiveRevisitHostQueue
AdaptiveRevisitHostQueue | public class AdaptiveRevisitHostQueue implements AdaptiveRevisitAttributeConstants,FrontierGroup(Code) | | A priority based queue of CrawlURIs. Each queue should represent
one host (although this is not enforced in this class). Items are ordered
by the scheduling directive and time of next processing (in that order)
and also indexed by the URI.
The HQ does no calculations on the 'time of next processing.' It always
relies on values already set on the CrawlURI.
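The ordering described above (scheduling directive first, then time of next processing) can be illustrated with a plain comparator. This is only a sketch and not the class's actual BDB key layout: the `QueuedUri` type, its field names, and the assumption that a lower directive value means higher priority are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class QueueOrderSketch {
    // Hypothetical stand-in for a queued CrawlURI; not part of Heritrix.
    static final class QueuedUri {
        final String uri;
        final int schedulingDirective;  // assumption: lower value = more urgent
        final long nextProcessingTime;  // milliseconds

        QueuedUri(String uri, int schedulingDirective, long nextProcessingTime) {
            this.uri = uri;
            this.schedulingDirective = schedulingDirective;
            this.nextProcessingTime = nextProcessingTime;
        }
    }

    // Order by scheduling directive first, then by time of next processing.
    static final Comparator<QueuedUri> ORDER =
            Comparator.<QueuedUri>comparingInt(q -> q.schedulingDirective)
                      .thenComparingLong(q -> q.nextProcessingTime);

    // Returns the URI strings in queue order without modifying the input list.
    static List<String> sortedUris(List<QueuedUri> items) {
        List<QueuedUri> copy = new ArrayList<>(items);
        copy.sort(ORDER);
        List<String> out = new ArrayList<>();
        for (QueuedUri q : copy) {
            out.add(q.uri);
        }
        return out;
    }

    public static void main(String[] args) {
        List<QueuedUri> items = new ArrayList<>();
        items.add(new QueuedUri("http://example.com/b", 3, 1_000L));
        items.add(new QueuedUri("http://example.com/a", 3, 500L));
        items.add(new QueuedUri("http://example.com/c", 1, 9_000L));
        System.out.println(sortedUris(items));
    }
}
```

The URI with the more urgent directive sorts first even though its processing time is later; ties on the directive fall back to the time.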
Note: This class is not thread safe. In a multithreaded environment the caller
must ensure that two threads do not make overlapping calls.
Any BDB DatabaseException will be converted to an IOException by public
methods. This includes preserving the original stacktrace, in favor of the
one created for the IOException, so that the true source of the exception
is not lost.
author: Kristinn Sigurdsson |
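The exception conversion described above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: `StorageException` is a hypothetical stand-in for BDB's DatabaseException so that the example compiles without the Berkeley DB library, and `toIOException` is an invented helper name.

```java
import java.io.IOException;
import java.util.Arrays;

final class ExceptionConversionSketch {
    // Hypothetical stand-in for BDB's DatabaseException.
    static class StorageException extends Exception {
        StorageException(String message) {
            super(message);
        }
    }

    // Builds an IOException that carries the original exception's stack trace
    // instead of the trace captured when the IOException was created.
    static IOException toIOException(StorageException e) {
        IOException converted = new IOException(e.getMessage());
        converted.setStackTrace(e.getStackTrace());
        return converted;
    }

    public static void main(String[] args) {
        StorageException original = new StorageException("simulated database failure");
        IOException converted = toIOException(original);
        // Both traces point at the place where the original failure was created.
        System.out.println(Arrays.equals(converted.getStackTrace(), original.getStackTrace()));
    }
}
```

Copying the trace with `setStackTrace` is one way to keep the true source visible; passing the original as the cause of the IOException is a common alternative.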
Field Summary | |
final public static int | HQSTATE_BUSY HQ has the maximum number of CrawlURIs currently being processed. | final public static int | HQSTATE_EMPTY HQ contains no queued CrawlURIs. | final public static int | HQSTATE_READY | final public static int | HQSTATE_SNOOZED | protected StoredClassCatalog | classCatalog | protected EntryBinding | crawlURIBinding | final String | hostName | long | inProcessing Number of URIs belonging to this queue that are being processed at the
moment. | long | nextReadyTime Time (in milliseconds) when the HQ will next be ready to issue a URI
for processing. | protected EntryBinding | primaryKeyBinding | protected Database | primaryUriDB Database containing the URI priority queue, indexed by the
URI string. | protected Database | processingUriDB A database containing those URIs that are currently being processed. | protected SecondaryDatabase | secondaryUriDB Secondary index into
primaryUriDB (the primary DB), with URIs indexed
by the time when they can next be processed again. | long | size Size of queue. | int | state Last known state of HQ -- ALL methods should use getState() to read
this value, never read it directly. | protected CrawlSubstats | substats | int | valence Number of simultaneous connections permitted to this host. | long[] | wakeUpTime Time (in milliseconds) when each URI 'slot' becomes available again.
Any positive value larger than the current time signifies a taken slot
where the URI has completed processing but the politeness wait has not
ended. |
Constructor Summary | |
public | AdaptiveRevisitHostQueue(String hostName, Environment env, StoredClassCatalog catalog, int valence) Constructor
Parameters: hostName - Name of the host this queue represents. |
Method Summary | |
public void | add(CrawlURI curi, boolean overrideSetTimeOnDups) Add a CrawlURI to this host queue.
Callers can optionally choose to have the time of next processing value
override existing values for the URI if the existing values are 'later'
than the new ones. | protected void | addInProcessing(CrawlURI curi) Adds a CrawlURI to the list of CrawlURIs that belong to this HQ and are
being processed at the moment. | public void | close() Cleanup all open Berkeley Database objects. | protected long | countCrawlURIs() Count all entries in both primaryUriDB and processingUriDB. | protected void | deleteInProcessing(String uri) Removes a URI from the list of URIs that belong to this HQ and are
currently being processed. | protected void | flushProcessingURIs() Flush any CrawlURIs in the processingUriDB into the primaryUriDB. | protected CrawlURI | getCrawlURI(String uri) Returns the CrawlURI associated with the specified URI (string) or null
if no such CrawlURI is queued in this HQ. | public String | getHostName() | public long | getNextReadyTime() Returns the time when the HQ will next be ready to issue a URI. | public long | getSize() Returns the size of the HQ. | public int | getState() Returns the current state of the HQ. | public String | getStateByName() Same as
getState() except this method returns a
human readable name for the state instead of its constant integer value. | public CrawlSubstats | getSubstats() | protected boolean | inProcessing(String uri) Returns true if this HQ has a CrawlURI matching the uri string currently
being processed. | public CrawlURI | next() Returns the 'top' URI in the AdaptiveRevisitHostQueue. | public CrawlURI | peek() Returns the URI with the earliest time of next processing. | protected void | reorder() Method is called whenever something has been done that might have
changed the value of the 'published' time of next ready. | public String | report(int max) Returns a report detailing the status of this HQ.
Parameters: max - Maximum number of URIs to show. | protected void | setNextReadyTime(long newTime) | public void | setOwner(AdaptiveRevisitQueueList owner) Set the AdaptiveRevisitQueueList object that contains this HQ. | protected OperationStatus | strictAdd(CrawlURI curi, boolean overrideDuplicates) An internal method for adding URIs to the queue. | public void | update(CrawlURI curi, boolean needWait, long wakeupTime) Update CrawlURI that has completed processing.
Parameters: curi - The CrawlURI. | public void | update(CrawlURI curi, boolean needWait, long wakeupTime, boolean forgetURI) Update CrawlURI that has completed processing.
Parameters: curi - The CrawlURI. |
HQSTATE_BUSY | final public static int HQSTATE_BUSY(Code) | | HQ has the maximum number of CrawlURIs currently being processed. This number
is either equal to the 'valence' (maximum number of simultaneous
connections to a host) or (if smaller) the total number of CrawlURIs
in the HQ.
|
HQSTATE_EMPTY | final public static int HQSTATE_EMPTY(Code) | | HQ contains no queued CrawlURIs. This state only occurs after
queue creation before the first add. After the first item is added the
state can never become empty again.
|
HQSTATE_READY | final public static int HQSTATE_READY(Code) | | HQ has a CrawlURI ready for processing
|
HQSTATE_SNOOZED | final public static int HQSTATE_SNOOZED(Code) | | HQ is in a suspended state until it can be woken back up
|
classCatalog | protected StoredClassCatalog classCatalog(Code) | | For BDB serialization of objects
|
crawlURIBinding | protected EntryBinding crawlURIBinding(Code) | | A binding for the CrawlURIARWrapper object
|
hostName | final String hostName(Code) | | Name of the host that this AdaptiveRevisitHostQueue represents
|
inProcessing | long inProcessing(Code) | | Number of URIs belonging to this queue that are being processed at the
moment. This number will always be in the range of 0 - valence
|
primaryKeyBinding | protected EntryBinding primaryKeyBinding(Code) | | A binding for the serialization of the primary key (URI string)
|
primaryUriDB | protected Database primaryUriDB(Code) | | Database containing the URI priority queue, indexed by the
URI string.
|
processingUriDB | protected Database processingUriDB(Code) | | A database containing those URIs that are currently being processed.
|
size | long size(Code) | | Size of queue. That is, the number of CrawlURIs that have been added to
it, including any that are currently being processed.
|
state | int state(Code) | | Last known state of HQ -- ALL methods should use getState() to read
this value, never read it directly.
|
valence | int valence(Code) | | Number of simultaneous connections permitted to this host. I.e. this
many URIs can be issued before state of HQ becomes busy until one of
them is returned via the update method.
|
wakeUpTime | long[] wakeUpTime(Code) | | Time (in milliseconds) when each URI 'slot' becomes available again.
Any positive value larger than the current time signifies a taken slot
where the URI has completed processing but the politeness wait has not
ended.
A zero or positive value smaller than the current time in milliseconds
signifies an empty slot.
Any negative value signifies a slot for a URI that is being processed.
Methods should never write directly to this array; instead use the
updateWakeUpTimeSlot(long) and
useWakeUpTimeSlot() methods as needed.
|
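The three slot conditions listed for wakeUpTime can be summarised in a small classifier. This is a minimal sketch: the `SlotState` enum and the `classify` method are hypothetical names, and treating a value exactly equal to the current time as an empty slot is an assumption, since the description does not cover that case.

```java
final class WakeUpSlotSketch {
    // Hypothetical labels for the three conditions described for a slot value.
    enum SlotState { IN_PROCESS, POLITENESS_WAIT, FREE }

    // Classifies one wakeUpTime entry relative to the current time (milliseconds).
    static SlotState classify(long slotValue, long nowMillis) {
        if (slotValue < 0) {
            return SlotState.IN_PROCESS;       // negative: URI is being processed
        }
        if (slotValue > nowMillis) {
            return SlotState.POLITENESS_WAIT;  // future time: politeness wait not over
        }
        return SlotState.FREE;                 // zero or past time (assumption: also "now")
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long[] wakeUpTime = { -1L, now + 5_000L, 0L };
        for (long slot : wakeUpTime) {
            System.out.println(slot + " -> " + classify(slot, now));
        }
    }
}
```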
AdaptiveRevisitHostQueue | public AdaptiveRevisitHostQueue(String hostName, Environment env, StoredClassCatalog catalog, int valence) throws IOException(Code) | | Constructor
Parameters: hostName - Name of the host this queue represents. This name must be unique for all HQs in the same Environment. Parameters: env - Berkeley DB Environment. All BDB databases created will use it. Parameters: catalog - Db for bdb class serialization. Parameters: valence - The total number of simultaneous URIs that the HQ can issue for processing. Once this many URIs have been issued for processing, the HQ will go into the busy state (HQSTATE_BUSY) until at least one of the URIs is updated (see update(CrawlURI,boolean,long)). Value should be larger than zero. Zero and negative values will be treated the same as 1. throws: IOException - if an error occurs opening/creating the database |
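The valence rule for the constructor, together with the HQSTATE_BUSY description, reduces to two small calculations. The sketch below is illustrative only: `effectiveValence` and `busyThreshold` are hypothetical helper names, not methods of this class.

```java
final class ValenceSketch {
    // Zero and negative requested values are treated the same as 1.
    static int effectiveValence(int requested) {
        return Math.max(1, requested);
    }

    // Number of URIs in processing at which the HQ counts as busy:
    // the valence, or the total number of CrawlURIs in the HQ if that is smaller.
    static long busyThreshold(int requestedValence, long queueSize) {
        return Math.min(effectiveValence(requestedValence), queueSize);
    }

    public static void main(String[] args) {
        System.out.println(effectiveValence(0));
        System.out.println(busyThreshold(4, 2L));
    }
}
```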
add | public void add(CrawlURI curi, boolean overrideSetTimeOnDups) throws IOException(Code) | | Add a CrawlURI to this host queue.
Callers can optionally choose to have the time of next processing value
override existing values for the URI if the existing values are 'later'
than the new ones.
Parameters: curi - The CrawlURI to add. Parameters: overrideSetTimeOnDups - If true then the time of next processing for the supplied URI will override any existing time for it already stored in the HQ. If false, then no changes will be made to any existing values of the URI. Note: Will never override with a later time. throws: IOException - When an error occurs accessing the database |
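The duplicate-handling rule for add can be modelled with an in-memory map. This is a simplified sketch, not the BDB-backed implementation: it tracks only URI strings and their time of next processing, ignores URIs that are currently being processed, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class AddOverrideSketch {
    // URI string -> time of next processing (milliseconds).
    private final Map<String, Long> nextProcessingTime = new HashMap<>();

    // New URIs are always stored. For duplicates, the stored time changes only
    // when overriding is requested AND the new time is earlier than the old one.
    void add(String uri, long time, boolean overrideSetTimeOnDups) {
        Long existing = nextProcessingTime.get(uri);
        if (existing == null) {
            nextProcessingTime.put(uri, time);
        } else if (overrideSetTimeOnDups && time < existing) {
            nextProcessingTime.put(uri, time);
        }
    }

    // Returns the stored time for the URI, or null if it was never added.
    Long timeOf(String uri) {
        return nextProcessingTime.get(uri);
    }

    public static void main(String[] args) {
        AddOverrideSketch queue = new AddOverrideSketch();
        queue.add("http://example.com/", 2_000L, false);
        queue.add("http://example.com/", 1_000L, false); // ignored: no override requested
        queue.add("http://example.com/", 1_000L, true);  // applied: earlier time
        queue.add("http://example.com/", 5_000L, true);  // ignored: later time
        System.out.println(queue.timeOf("http://example.com/"));
    }
}
```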
addInProcessing | protected void addInProcessing(CrawlURI curi) throws DatabaseException, IllegalStateException(Code) | | Adds a CrawlURI to the list of CrawlURIs that belong to this HQ and are
being processed at the moment.
Parameters: curi - The CrawlURI to add to the list. throws: DatabaseException - throws: IllegalStateException - if the CrawlURI is already in the list of URIs being processed. |
close | public void close() throws IOException(Code) | | Cleanup all open Berkeley Database objects.
Does not close the Environment.
throws: IOException - if an error occurs closing a database object |
countCrawlURIs | protected long countCrawlURIs() throws DatabaseException(Code) | | Count all entries in both primaryUriDB and processingUriDB.
This method is needed since BDB does not provide a simple way of counting
entries.
Note: This is an expensive operation, requires a loop through the entire
queue!
Returns: the number of distinct CrawlURIs in the HQ. throws: DatabaseException - |
deleteInProcessing | protected void deleteInProcessing(String uri) throws DatabaseException(Code) | | Removes a URI from the list of URIs that belong to this HQ and are
currently being processed.
The method is declared void, so failure is not signalled by a return value; a URI that is not on the list results in an IllegalStateException.
Parameters: uri - The URI string of the CrawlURI to delete. throws: DatabaseException - throws: IllegalStateException - if the URI was not on the list |
flushProcessingURIs | protected void flushProcessingURIs() throws DatabaseException(Code) | | Flush any CrawlURIs in the processingUriDB into the primaryUriDB. URIs
flushed will have their 'time of next fetch' maintained and the
nextReadyTime will be updated if needed.
No change is made to the list of available slots.
throws: DatabaseException - if one occurs while flushing |
getCrawlURI | protected CrawlURI getCrawlURI(String uri) throws DatabaseException(Code) | | Returns the CrawlURI associated with the specified URI (string) or null
if no such CrawlURI is queued in this HQ. If a CrawlURI is being processed
it is not considered to be queued and this method will return
null for any such URIs.
Parameters: uri - A string representing the URI. Returns: the CrawlURI associated with the specified URI (string) or null if no such CrawlURI is queued in this HQ. throws: DatabaseException - if an error occurs reading the database |
getHostName | public String getHostName()(Code) | | Returns the HQ's name
Returns: the HQ's name |
getNextReadyTime | public long getNextReadyTime()(Code) | | Returns the time when the HQ will next be ready to issue a URI.
If the queue is in a
snoozed state (HQSTATE_SNOOZED) then this
time will be in the future and reflects either the time when the HQ will
again be able to issue URIs for processing because politeness constraints
have ended, or when a URI next becomes available for visit, whichever is
larger.
If the queue is in a
ready state (HQSTATE_READY) this time will
be in the past and reflect the earliest time when the HQ had a URI ready
for processing, taking time spent snoozed for politeness concerns into
account.
If the HQ is in any other state then the return value of this method is
equal to Long.MAX_VALUE.
This value may change each time a URI is added, issued or updated.
Returns: the time when the HQ will next be ready to issue a URI |
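The rules for getNextReadyTime can be sketched as a single function over the queue state. This is a hedged illustration, not the actual implementation: the `State` enum, the parameter names, and the decision to apply the same "whichever is larger" formula to the ready state are assumptions made for the example.

```java
final class NextReadyTimeSketch {
    // Hypothetical mirror of the four HQSTATE_* constants.
    enum State { EMPTY, READY, SNOOZED, BUSY }

    // politenessEndsAt: when the host may be contacted again (milliseconds).
    // earliestUriTime:  time of next processing of the head URI (milliseconds).
    static long nextReadyTime(State state, long politenessEndsAt, long earliestUriTime) {
        switch (state) {
            case SNOOZED:
                // Documented rule: whichever of the two times is larger.
                return Math.max(politenessEndsAt, earliestUriTime);
            case READY:
                // Assumption: same formula; both inputs are then in the past.
                return Math.max(politenessEndsAt, earliestUriTime);
            default:
                // Any other state reports Long.MAX_VALUE.
                return Long.MAX_VALUE;
        }
    }

    public static void main(String[] args) {
        System.out.println(nextReadyTime(State.SNOOZED, 5_000L, 3_000L));
        System.out.println(nextReadyTime(State.BUSY, 5_000L, 3_000L) == Long.MAX_VALUE);
    }
}
```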
getSize | public long getSize()(Code) | | Returns the size of the HQ. That is, the number of URIs queued,
including any that are currently being processed.
Returns: the size of the HQ. |
getStateByName | public String getStateByName()(Code) | | Same as
getState() except this method returns a
human readable name for the state instead of its constant integer value.
Should only be used for reports, error messages and other strings
intended for human eyes.
Returns: the human readable name of the current state |
inProcessing | protected boolean inProcessing(String uri) throws DatabaseException(Code) | | Returns true if this HQ has a CrawlURI matching the uri string currently
being processed. False otherwise.
Parameters: uri - URI to check. Returns: true if this HQ has a CrawlURI matching the uri string currently being processed, false otherwise. throws: DatabaseException - |
peek | public CrawlURI peek() throws IllegalStateException, IOException(Code) | | Returns the URI with the earliest time of next processing, i.e. the URI
at the head of this host based priority queue.
Note: This method will return the head CrawlURI regardless of whether it
is safe to start processing it or not. The CrawlURI will remain in the queue.
The returned CrawlURI should only be used for queue inspection; it cannot
be updated and returned to the queue. To get URIs ready for
processing use next().
Returns: the URI with the earliest time of next processing, or null if the queue is empty or all URIs are currently being processed. throws: IllegalStateException - throws: IOException - if an error occurs reading from the database |
reorder | protected void reorder()(Code) | | Method is called whenever something has been done that might have
changed the value of the 'published' time of next ready. If an owner
has been specified it will be notified that the value may have changed.
|
report | public String report(int max)(Code) | | Returns a report detailing the status of this HQ.
Parameters: max - Maximum number of URIs to show. 0 equals no limit. Returns: a report detailing the status of this HQ. |
setNextReadyTime | protected void setNextReadyTime(long newTime)(Code) | | Updates nextReadyTime (if smaller) with the supplied value
Parameters: newTime - the new value of nextReadyTime |
strictAdd | protected OperationStatus strictAdd(CrawlURI curi, boolean overrideDuplicates) throws DatabaseException(Code) | | An internal method for adding URIs to the queue.
Parameters: curi - The CrawlURI to add. Parameters: overrideDuplicates - If true then any existing CrawlURI in the DB will be overwritten. If false, insertion into the queue is only performed if the key doesn't already exist. Returns: The OperationStatus object returned by the put method. throws: DatabaseException - |
update | public void update(CrawlURI curi, boolean needWait, long wakeupTime) throws IllegalStateException, IOException(Code) | | Update CrawlURI that has completed processing.
Parameters: curi - The CrawlURI. This must be a CrawlURI issued by this HQ's next() method. Parameters: needWait - If true then the URI was processed successfully, requiring a period of suspended action on that host. If valence is > 1 then separate times are maintained for each slot. Parameters: wakeupTime - If the new state is snoozed (HQSTATE_SNOOZED) then this parameter should contain the time (in milliseconds) when it will be safe to wake the HQ up again. Otherwise this parameter will be ignored. throws: IllegalStateException - if the CrawlURI does not match a CrawlURI issued for crawling by this HQ's next(). throws: IOException - if an error occurs accessing the database |
update | public void update(CrawlURI curi, boolean needWait, long wakeupTime, boolean forgetURI) throws IllegalStateException, IOException(Code) | | Update CrawlURI that has completed processing.
Parameters: curi - The CrawlURI. This must be a CrawlURI issued by this HQ's next() method. Parameters: needWait - If true then the URI was processed successfully, requiring a period of suspended action on that host. If valence is > 1 then separate times are maintained for each slot. Parameters: wakeupTime - If the new state is snoozed (HQSTATE_SNOOZED) then this parameter should contain the time (in milliseconds) when it will be safe to wake the HQ up again. Otherwise this parameter will be ignored. Parameters: forgetURI - If true, the URI will be deleted from the queue. throws: IllegalStateException - if the CrawlURI does not match a CrawlURI issued for crawling by this HQ's next(). throws: IOException - if an error occurs accessing the database |
|
|