| java.lang.Object org.archive.crawler.admin.CrawlJobHandler
All known Subclasses: org.archive.crawler.selftest.SelfTestCrawlJobHandler
CrawlJobHandler | public class CrawlJobHandler implements CrawlStatusListener(Code) | | This class manages CrawlJobs. Submitted crawl jobs are queued up and run
in order when the crawler is running.
Basically this provides a layer between any potential user interface and
the CrawlJobs. It keeps the lists of completed jobs, pending jobs, etc.
The jobs managed by the handler can be divided into the following:
- Pending - Jobs that are ready to run and are waiting their
turn. These can be edited, viewed, deleted, etc.
- Running - Only one job can be running at a time; there may
be no job running. The running job can be viewed
and edited to some extent. It can also be
terminated. This job should have a
StatisticsTracking module attached to it for more
details on the crawl.
- Completed - Jobs that have finished crawling, or have been
deleted from the pending queue or terminated
while running. They can not be edited but can be
viewed. They retain the StatisticsTracking
module from their run.
- New job - At any given time there can be one 'new job'. The
new job is not considered ready to run. It can
be edited or discarded (in which case it will be
totally destroyed, including any files on disk).
Once an operator deems the job ready to run it
can be moved to the pending queue.
- Profiles - Jobs under profiles are not actual jobs. They can
be edited normally but can not be submitted to
the pending queue. New jobs can be created
using a profile as their template.
author: Kristinn Sigurdsson See Also: org.archive.crawler.admin.CrawlJob |
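The pending/running/completed flow described above can be sketched as a minimal state manager. All names below are illustrative, not the real Heritrix API; the real CrawlJobHandler adds persistence, profiles, priorities, and crawl-status listeners:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of the pending -> running -> completed job flow.
// Hypothetical names; illustrative only.
public class JobQueueSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private final List<String> completed = new ArrayList<>();
    private String running; // at most one job runs at a time

    public void addJob(String job) { pending.addLast(job); }

    // Start the next pending job, if none is currently running.
    public boolean startNextJob() {
        if (running != null || pending.isEmpty()) return false;
        running = pending.removeFirst();
        return true;
    }

    // Terminate the running job; it moves to the completed list.
    public void terminateCurrentJob() {
        if (running != null) { completed.add(running); running = null; }
    }

    public String getCurrentJob() { return running; }
    public List<String> getCompletedJobs() { return completed; }
}
```

Note how startNextJob refuses to start a second job while one is running, mirroring the "only one job can be running at a time" rule above.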
Constructor Summary | |
public | CrawlJobHandler(File jobsDir) Constructor. | public | CrawlJobHandler(File jobsDir, boolean loadJobs, boolean loadProfiles) Constructor allowing for optional loading of profiles and jobs. |
Method Summary | |
public CrawlJob | addJob(CrawlJob job) Submit a job to the handler. | public synchronized void | addProfile(CrawlJob profile) | protected void | checkDirectory(File dir) | public void | checkpointJob() Cause the current job to write a checkpoint to disk. | public void | crawlCheckpoint(File checkpointDir) | public void | crawlEnded(String sExitMessage) | public void | crawlEnding(String sExitMessage) | public void | crawlPaused(String statusMessage) | public void | crawlPausing(String statusMessage) | public void | crawlResuming(String statusMessage) | public void | crawlStarted(String message) | protected CrawlJob | createNewJob(File orderFile, String name, String description, String seeds, int priority) | protected XMLSettingsHandler | createSettingsHandler(File orderFile, String name, String description, String seeds, File newSettingsDir, CrawlJobErrorHandler errorHandler, String filename, String seedfile) Creates a new settings handler based on an existing job. | public void | deleteJob(String jobUID) The specified job will be removed from the pending queue or aborted if
currently running. | public synchronized void | deleteProfile(CrawlJob cj) | public long | deleteURIsFromPending(String regexpr) Delete any URIs from the frontier of the current (paused) job that match
the specified regular expression. | public void | discardNewJob() Discard the handler's 'new job'. | protected void | doFlush() If it's a HostQueuesFrontier, it needs to be flushed for the queued URIs. | public static CrawlJob | ensureNewJobWritten(CrawlJob newJob, String metaname, String description) Ensure order file with new name/desc is written.
See '[ 1066573 ] sometimes job based-on other job uses older job name'.
Parameters: newJob - Newly created job. Parameters: metaname - Metaname for new job. Parameters: description - Description for new job. | public List<CrawlJob> | getCompletedJobs() | public CrawlJob | getCurrentJob() | public synchronized CrawlJob | getDefaultProfile() Returns the default profile. | public FrontierMarker | getInitialMarker(String regexpr, boolean inCacheOnly) Returns a URIFrontierMarker for the current, paused, job. | public CrawlJob | getJob(String jobUID) Return a job with the given UID.
Doesn't matter if it's pending, currently running, has finished running,
is new, or is a profile.
Parameters: jobUID - The unique ID of the job. | public CrawlJob | getNewJob() | public String | getNextJobUID() Returns a unique job ID.
No two calls to this method (on the same instance of this class) can ever
return the same value. | public List<CrawlJob> | getPendingJobs() | public ArrayList | getPendingURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) Returns the frontiers URI list based on the provided marker. | public synchronized List<CrawlJob> | getProfiles() Returns a List of all known profiles. | protected File | getStateJobFile(File jobDir) Find the state.job file in the job directory.
Parameters: jobDir - Directory to look in. | public void | importUri(String uri, boolean forceFetch, boolean isSeed) Schedule a uri. | public void | importUri(String str, boolean forceFetch, boolean isSeed, boolean isFlush) Schedule a uri.
Parameters: str - String that can be: 1. | public String | importUris(String file, String style, String force) | public String | importUris(String fileOrUrl, String style, boolean forceRevisit) | protected int | importUris(InputStream is, String style, boolean forceRevisit) | public boolean | isCrawling() | public boolean | isRunning() Is the crawler accepting crawl jobs to run?
True if the next available CrawlJob will be crawled. | public void | kickUpdate() Forward a 'kick' update to current job if any. | protected void | loadJob(File job) Loads a job given a specific job file. | public static ArrayList<String> | loadOptions(String file) Loads options from a file. | protected boolean | loadProfile(File profile) Load one profile.
Parameters: profile - Profile to load. | public CrawlJob | newJob(CrawlJob baseOn, String recovery, String name, String description, String seeds, int priority) Creates a new job. | public CrawlJob | newJob(File orderFile, String name, String description, String seeds) Creates a new job. | public CrawlJob | newProfile(CrawlJob baseOn, String name, String description, String seeds) Creates a new profile. | public void | pauseJob() Cause the current job to pause. | public void | requestCrawlStop() | public void | resumeJob() Cause the current job to resume crawling if it was paused. | public void | setDefaultProfile(CrawlJob profile) Set the default profile.
Parameters: profile - The new default profile. | public void | startCrawler() Allow jobs to be crawled. | final protected void | startNextJob() Start next crawl job. | protected void | startNextJobInternal() | public void | stop() | public void | stopCrawler() Stop future jobs from being crawled. | public boolean | terminateCurrentJob() | protected void | updateRecoveryPaths(File recover, SettingsHandler sh, String jobName) Parameters: recover - Source to use recovering. |
DEFAULT_PROFILE | final public static String DEFAULT_PROFILE(Code) | | Default profile name.
|
DEFAULT_PROFILE_NAME | final public static String DEFAULT_PROFILE_NAME(Code) | | Name of system property whose specification overrides default profile
used.
|
ORDER_FILE_NAME | final public static String ORDER_FILE_NAME(Code) | | |
PROFILES_DIR_NAME | final public static String PROFILES_DIR_NAME(Code) | | Name of the profiles directory.
|
RECOVER_LOG | final public static String RECOVER_LOG(Code) | | String to indicate recovery should be based on the recovery log, not
based on checkpointing.
|
CrawlJobHandler | public CrawlJobHandler(File jobsDir)(Code) | | Constructor.
Parameters: jobsDir - Jobs directory. |
CrawlJobHandler | public CrawlJobHandler(File jobsDir, boolean loadJobs, boolean loadProfiles)(Code) | | Constructor allowing for optional loading of profiles and jobs.
Parameters: jobsDir - Jobs directory. Parameters: loadJobs - If true then any applicable jobs will be loaded. Parameters: loadProfiles - If true then any applicable profiles will be loaded. |
addJob | public CrawlJob addJob(CrawlJob job)(Code) | | Submit a job to the handler. Job will be scheduled for crawling. At
present it will not take the job's priority into consideration.
Parameters: job - A new job for the handler CrawlJob that was added or null. |
addProfile | public synchronized void addProfile(CrawlJob profile)(Code) | | Add a new profile
Parameters: profile - The new profile |
crawlEnding | public void crawlEnding(String sExitMessage)(Code) | | |
crawlPaused | public void crawlPaused(String statusMessage)(Code) | | |
crawlPausing | public void crawlPausing(String statusMessage)(Code) | | |
crawlResuming | public void crawlResuming(String statusMessage)(Code) | | |
createSettingsHandler | protected XMLSettingsHandler createSettingsHandler(File orderFile, String name, String description, String seeds, File newSettingsDir, CrawlJobErrorHandler errorHandler, String filename, String seedfile) throws FatalConfigurationException(Code) | | Creates a new settings handler based on an existing job. Basically all
the settings file for the 'based on' will be copied to the specified
directory.
Parameters: orderFile - Order file to base new order file on. Cannot be null. Parameters: name - Name for the new settings Parameters: description - Description of the new settings. Parameters: seeds - The contents of the new settings' seed file. Parameters: newSettingsDir - Parameters: errorHandler - Parameters: filename - Name of new order file. Parameters: seedfile - Name of new seeds file. The new settings handler. throws: FatalConfigurationException - If there are problems reading the 'based on' configuration, or writing the new configuration or its seed file. |
deleteJob | public void deleteJob(String jobUID)(Code) | | The specified job will be removed from the pending queue or aborted if
currently running. It will be placed in the list of completed jobs with
appropriate status info. If the job is already in the completed list or
no job with the given UID is found, no action will be taken.
Parameters: jobUID - The UID (unique ID) of the job that is to be deleted. |
deleteURIsFromPending | public long deleteURIsFromPending(String regexpr)(Code) | | Delete any URI from the frontier of the current (paused) job that match
the specified regular expression. If the current job is not paused (or
there is no current job) nothing will be done.
Parameters: regexpr - Regular expression to delete URIs by. the number of URIs deleted |
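The contract of deleteURIsFromPending can be illustrated with a self-contained sketch that applies a regular expression to a plain list standing in for the frontier's pending queue (the real method operates on the paused job's frontier, not a List):

```java
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of regex-based deletion from a pending-URI collection.
// Illustrative only; not the Heritrix frontier code.
public class FrontierDeleteSketch {
    // Remove every URI that fully matches the expression;
    // return how many were removed.
    public static long deleteMatching(List<String> pendingUris, String regexpr) {
        Pattern p = Pattern.compile(regexpr);
        long deleted = 0;
        for (Iterator<String> it = pendingUris.iterator(); it.hasNext();) {
            if (p.matcher(it.next()).matches()) {
                it.remove();
                deleted++;
            }
        }
        return deleted;
    }
}
```

Full-string matching is assumed here, so an expression like `.*example\.com.*` is needed to match URIs containing a substring.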
discardNewJob | public void discardNewJob()(Code) | | Discard the handler's 'new job'. This will remove any files/directories
written to disk.
|
doFlush | protected void doFlush()(Code) | | If it's a HostQueuesFrontier, it needs to be flushed for the queued URIs.
|
ensureNewJobWritten | public static CrawlJob ensureNewJobWritten(CrawlJob newJob, String metaname, String description)(Code) | | Ensure order file with new name/desc is written.
See '[ 1066573 ] sometimes job based-on other job uses older job name'.
Parameters: newJob - Newly created job. Parameters: metaname - Metaname for new job. Parameters: description - Description for new job. newJob |
getCompletedJobs | public List<CrawlJob> getCompletedJobs()(Code) | | A List of all finished jobs. |
getCurrentJob | public CrawlJob getCurrentJob()(Code) | | The job currently being crawled. |
getDefaultProfile | public synchronized CrawlJob getDefaultProfile()(Code) | | Returns the default profile. If no default profile has been set it will
return the first profile that was set/loaded and still exists. If no
profiles exist it will return null.
the default profile. |
getJob | public CrawlJob getJob(String jobUID)(Code) | | Return a job with the given UID.
Doesn't matter if it's pending, currently running, has finished running,
is new, or is a profile.
Parameters: jobUID - The unique ID of the job. The job with the UID or null if no such job is found |
getNewJob | public CrawlJob getNewJob()(Code) | | Get the handler's 'new job'
the handler's 'new job' |
getNextJobUID | public String getNextJobUID()(Code) | | Returns a unique job ID.
No two calls to this method (on the same instance of this class) can ever
return the same value.
Currently implemented to return a time stamp. That is subject to change
though.
A unique job ID. See Also: ArchiveUtils.TIMESTAMP17 |
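The documented guarantee (timestamp-based, never the same value twice on one instance) can be sketched as follows. The 17-digit `yyyyMMddHHmmssSSS` format is an assumption in the spirit of ArchiveUtils.TIMESTAMP17; the bump-on-collision trick is illustrative, not necessarily how Heritrix does it:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch of a timestamp-based unique job UID. If two calls land on the
// same millisecond, the stored value is bumped by one millisecond so no
// two calls on the same instance ever return the same UID.
public class JobUidSketch {
    private long lastUid = 0;

    public synchronized String getNextJobUID() {
        long now = System.currentTimeMillis();
        lastUid = Math.max(now, lastUid + 1);
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmssSSS");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(lastUid));
    }
}
```

Because uniqueness is per-instance state, two separate handler instances could still collide; the documented guarantee is explicitly scoped to "the same instance of this class".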
getPendingJobs | public List<CrawlJob> getPendingJobs()(Code) | | A List of all pending jobs. No promises are made about the order of the list. |
getProfiles | public synchronized List<CrawlJob> getProfiles()(Code) | | Returns a List of all known profiles.
a List of all known profiles. |
getStateJobFile | protected File getStateJobFile(File jobDir)(Code) | | Find the state.job file in the job directory.
Parameters: jobDir - Directory to look in. Full path to 'state.job' file or null if none found. |
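A minimal sketch of the state.job lookup, assuming the file may sit either directly in the job directory or one level down (the search depth is an assumption; the real method's traversal may differ):

```java
import java.io.File;

// Sketch: locate a 'state.job' file under a job directory, also checking
// one level of subdirectories. Illustrative, not the Heritrix code.
public class StateJobSketch {
    public static File findStateJobFile(File jobDir) {
        File direct = new File(jobDir, "state.job");
        if (direct.exists()) return direct;
        File[] subs = jobDir.listFiles(File::isDirectory);
        if (subs != null) {
            for (File sub : subs) {
                File f = new File(sub, "state.job");
                if (f.exists()) return f;
            }
        }
        return null; // none found
    }
}
```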
importUri | public void importUri(String uri, boolean forceFetch, boolean isSeed) throws URIException(Code) | | Schedule a uri.
Parameters: uri - Uri to schedule. Parameters: forceFetch - Should it be forcefetched. Parameters: isSeed - True if seed. throws: URIException - |
importUri | public void importUri(String str, boolean forceFetch, boolean isSeed, boolean isFlush) throws URIException(Code) | | Schedule a uri.
Parameters: str - String that can be: 1. a UURI, 2. a snippet of the crawl.log line, or 3. a snippet from the recover log. See CrawlJobHandler.importUris(InputStream,String,boolean) for how it subparses the lines from crawl.log and recover.log. Parameters: forceFetch - Should it be forcefetched. Parameters: isSeed - True if seed. Parameters: isFlush - If true, flush the frontier IF it implements flushing. throws: URIException - |
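The three accepted input shapes (bare URI, crawl.log line, recovery-journal line) suggest a sub-parser that pulls the URI token out of a whitespace-separated line. The heuristic below (first token containing `://`) is a simplified stand-in; the real parsing lives in CrawlJobHandler.importUris(InputStream, String, boolean):

```java
// Sketch of the kind of sub-parsing importUri describes: accept a bare
// URI, a crawl.log line, or a recovery-journal line (e.g. "F+ <uri>"),
// and extract the URI. Heuristic and illustrative only.
public class UriLineSketch {
    public static String extractUri(String str) {
        for (String token : str.trim().split("\\s+")) {
            if (token.contains("://")) {
                return token; // first URI-like token wins
            }
        }
        return null; // no URI-like token found
    }
}
```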
importUris | public String importUris(String fileOrUrl, String style, boolean forceRevisit)(Code) | | Parameters: fileOrUrl - Name of file w/ seeds. Parameters: style - What style of seeds -- crawl log (crawlLog style) or recovery journal (recoveryJournal style), or seeds file style (Pass default style). Parameters: forceRevisit - Should we revisit even if seen before? A display string that has a count of all added. |
isCrawling | public boolean isCrawling()(Code) | | Is a crawl job being crawled?
True if a job is actually being crawled (even if it is paused). False if no job is being crawled. |
isRunning | public boolean isRunning()(Code) | | Is the crawler accepting crawl jobs to run?
True if the next available CrawlJob will be crawled. False otherwise. |
kickUpdate | public void kickUpdate()(Code) | | Forward a 'kick' update to current job if any.
|
loadJob | protected void loadJob(File job)(Code) | | Loads a job given a specific job file. The loaded job will be placed in
the list of completed jobs or pending queue depending on its status.
Running jobs will have their status set to 'finished abnormally' and put
into the completed list.
Parameters: job - The job file of the job to load. |
loadOptions | public static ArrayList<String> loadOptions(String file) throws IOException(Code) | | Loads options from a file. Typically these are a list of available
modules that can be plugged into some part of the configuration.
For examples Processors, Frontiers, Filters etc. Leading and trailing
spaces are trimmed from each line.
Options are loaded from the CLASSPATH.
Parameters: file - the name of the option file (without path!) The option file with each option line as a separate entry in the ArrayList. throws: IOException - when there is trouble reading the file. |
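The documented contract of loadOptions (one trimmed entry per line) can be sketched with a self-contained variant. To stay runnable without a CLASSPATH resource, this sketch takes a Reader instead of a file name, and wraps the checked IOException; the real method resolves the name on the CLASSPATH and declares IOException:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;

// Sketch of loadOptions' contract: read an options file line by line,
// trimming leading and trailing spaces from each entry.
public class OptionsSketch {
    public static ArrayList<String> loadOptions(Reader in) {
        ArrayList<String> options = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(in)) {
            String line;
            while ((line = r.readLine()) != null) {
                options.add(line.trim());
            }
        } catch (IOException e) {
            // The real method propagates IOException instead.
            throw new UncheckedIOException(e);
        }
        return options;
    }
}
```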
loadProfile | protected boolean loadProfile(File profile)(Code) | | Load one profile.
Parameters: profile - Profile to load. True if loaded profile was the default profile. |
newJob | public CrawlJob newJob(CrawlJob baseOn, String recovery, String name, String description, String seeds, int priority) throws FatalConfigurationException(Code) | | Creates a new job. The new job will be returned and also registered as
the handler's 'new job'. The new job will be based on the settings
provided but created in a new location on disk.
Parameters: baseOn - A CrawlJob (with a valid settingshandler) to use as the template for the new job. Parameters: recovery - Whether to preinitialize the new job as a recovery of the baseOn job. String holds RECOVER_LOG if we are to do the recovery based off the recover.gz log -- see RecoveryJournal in the frontier package -- or it holds the name of the checkpoint we're to use recovering. Parameters: name - The name of the new job. Parameters: description - Description of the job. Parameters: seeds - The contents of the new settings' seed file. Parameters: priority - The priority of the new job. The new crawl job. throws: FatalConfigurationException - If a problem occurs creating the settings. |
newJob | public CrawlJob newJob(File orderFile, String name, String description, String seeds) throws FatalConfigurationException(Code) | | Creates a new job. The new job will be returned and also registered as
the handler's 'new job'. The new job will be based on the settings
provided but created in a new location on disk.
Parameters: orderFile - Order file to use as the template for the new job. Parameters: name - The name of the new job. Parameters: description - Description of the job. Parameters: seeds - The contents of the new settings' seed file. The new crawl job. throws: FatalConfigurationException - If a problem occurs creating the settings. |
newProfile | public CrawlJob newProfile(CrawlJob baseOn, String name, String description, String seeds) throws FatalConfigurationException, IOException(Code) | | Creates a new profile. The new profile will be returned and also
registered as the handler's 'new job'. The new profile will be based on
the settings provided but created in a new location on disk.
Parameters: baseOn - A CrawlJob (with a valid settingshandler) to use as the template for the new profile. Parameters: name - The name of the new profile. Parameters: description - Description of the new profile. Parameters: seeds - The contents of the new profile's seed file. The new profile. throws: FatalConfigurationException - throws: IOException - |
pauseJob | public void pauseJob()(Code) | | Cause the current job to pause. If no current job is crawling this
method will have no effect.
|
requestCrawlStop | public void requestCrawlStop()(Code) | | |
resumeJob | public void resumeJob()(Code) | | Cause the current job to resume crawling if it was paused. Will have no
effect if the current job was not paused or if there is no current job.
If the current job is still waiting to pause, this will not take effect
until the job has actually paused, at which time it will immediately
resume crawling.
|
setDefaultProfile | public void setDefaultProfile(CrawlJob profile)(Code) | | Set the default profile.
Parameters: profile - The new default profile. The following must apply to it: profile.isProfile() should return true and this.getProfiles() should contain it. |
startCrawler | public void startCrawler()(Code) | | Allow jobs to be crawled.
|
startNextJob | final protected void startNextJob()(Code) | | Start next crawl job.
If a job is already running this method will do nothing.
|
startNextJobInternal | protected void startNextJobInternal()(Code) | | |
stopCrawler | public void stopCrawler()(Code) | | Stop future jobs from being crawled.
This action will not affect the current job.
|
terminateCurrentJob | public boolean terminateCurrentJob()(Code) | | True if we terminated a current job (False if no job to terminate) |