| java.lang.Object org.archive.crawler.admin.CrawlJobHandler
All known Subclasses: org.archive.crawler.selftest.SelfTestCrawlJobHandler
CrawlJobHandler | public class CrawlJobHandler implements CrawlStatusListener(Code) | | This class manages CrawlJobs. Submitted crawl jobs are queued up and run
in order when the crawler is running.
Basically this provides a layer between any potential user interface and
the CrawlJobs. It keeps the lists of completed jobs, pending jobs, etc.
The jobs managed by the handler can be divided into the following:
- Pending - Jobs that are ready to run and are waiting their
turn. These can be edited, viewed, deleted, etc.
- Running - Only one job can be running at a time; there may
be no job running. The running job can be viewed
and edited to some extent. It can also be
terminated. This job should have a
StatisticsTracking module attached to it for more
details on the crawl.
- Completed - Jobs that have finished crawling, or have been
deleted from the pending queue or terminated
while running. They can not be edited but can be
viewed. They retain the StatisticsTracking
module from their run.
- New job - At any given time there can be one 'new job'. The
new job is not considered ready to run. It can
be edited or discarded (in which case it will be
totally destroyed, including any files on disk).
Once an operator deems the job ready to run it
can be moved to the pending queue.
- Profiles - Jobs under profiles are not actual jobs. They can
be edited normally but can not be submitted to
the pending queue. New jobs can be created
using a profile as their template.
author: Kristinn Sigurdsson See Also: org.archive.crawler.admin.CrawlJob |
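The pending/running/completed flow described above can be sketched as a minimal state manager. All names below are illustrative, not the real Heritrix API; the real CrawlJobHandler adds persistence, profiles, priorities, and crawl-status listeners:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of the pending -> running -> completed job flow.
// Hypothetical names; illustrative only.
public class JobQueueSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private final List<String> completed = new ArrayList<>();
    private String running; // at most one job runs at a time

    public void addJob(String job) { pending.addLast(job); }

    // Start the next pending job, if none is currently running.
    public boolean startNextJob() {
        if (running != null || pending.isEmpty()) return false;
        running = pending.removeFirst();
        return true;
    }

    // Terminate the running job; it moves to the completed list.
    public void terminateCurrentJob() {
        if (running != null) { completed.add(running); running = null; }
    }

    public String getCurrentJob() { return running; }
    public List<String> getCompletedJobs() { return completed; }
}
```

Note how startNextJob refuses to start a second job while one is running, mirroring the "only one job can be running at a time" rule above.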
Constructor Summary | |
public | CrawlJobHandler(File jobsDir) Constructor. | public | CrawlJobHandler(File jobsDir, boolean loadJobs, boolean loadProfiles) Constructor allowing for optional loading of profiles and jobs. |
Method Summary | |
public CrawlJob | addJob(CrawlJob job) Submit a job to the handler. | public synchronized void | addProfile(CrawlJob profile) | protected void | checkDirectory(File dir) | public void | checkpointJob() Cause the current job to write a checkpoint to disk. | public void | crawlCheckpoint(File checkpointDir) | public void | crawlEnded(String sExitMessage) | public void | crawlEnding(String sExitMessage) | public void | crawlPaused(String statusMessage) | public void | crawlPausing(String statusMessage) | public void | crawlResuming(String statusMessage) | public void | crawlStarted(String message) | protected CrawlJob | createNewJob(File orderFile, String name, String description, String seeds, int priority) | protected XMLSettingsHandler | createSettingsHandler(File orderFile, String name, String description, String seeds, File newSettingsDir, CrawlJobErrorHandler errorHandler, String filename, String seedfile) Creates a new settings handler based on an existing job. | public void | deleteJob(String jobUID) The specified job will be removed from the pending queue or aborted if
currently running. | public synchronized void | deleteProfile(CrawlJob cj) | public long | deleteURIsFromPending(String regexpr) Delete any URIs from the frontier of the current (paused) job that match
the specified regular expression. | public void | discardNewJob() Discard the handler's 'new job'. | protected void | doFlush() If it's a HostQueuesFrontier, it needs to be flushed for the queued URIs. | public static CrawlJob | ensureNewJobWritten(CrawlJob newJob, String metaname, String description) Ensure order file with new name/desc is written.
See '[ 1066573 ] sometimes job based-on other job uses older job name'.
Parameters: newJob - Newly created job. Parameters: metaname - Metaname for new job. Parameters: description - Description for new job. | public List<CrawlJob> | getCompletedJobs() | public CrawlJob | getCurrentJob() | public synchronized CrawlJob | getDefaultProfile() Returns the default profile. | public FrontierMarker | getInitialMarker(String regexpr, boolean inCacheOnly) Returns a URIFrontierMarker for the current, paused, job. | public CrawlJob | getJob(String jobUID) Return a job with the given UID.
Doesn't matter if it's pending, currently running, has finished running,
is new, or is a profile.
Parameters: jobUID - The unique ID of the job. | public CrawlJob | getNewJob() | public String | getNextJobUID() Returns a unique job ID.
No two calls to this method (on the same instance of this class) can ever
return the same value. | public List<CrawlJob> | getPendingJobs() | public ArrayList | getPendingURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) Returns the frontiers URI list based on the provided marker. | public synchronized List<CrawlJob> | getProfiles() Returns a List of all known profiles. | protected File | getStateJobFile(File jobDir) Find the state.job file in the job directory.
Parameters: jobDir - Directory to look in. | public void | importUri(String uri, boolean forceFetch, boolean isSeed) Schedule a uri. | public void | importUri(String str, boolean forceFetch, boolean isSeed, boolean isFlush) Schedule a uri.
Parameters: str - String that can be: 1. | public String | importUris(String file, String style, String force) | public String | importUris(String fileOrUrl, String style, boolean forceRevisit) | protected int | importUris(InputStream is, String style, boolean forceRevisit) | public boolean | isCrawling() | public boolean | isRunning() Is the crawler accepting crawl jobs to run?
True if the next available CrawlJob will be crawled. | public void | kickUpdate() Forward a 'kick' update to current job if any. | protected void | loadJob(File job) Loads a job given a specific job file. | public static ArrayList<String> | loadOptions(String file) Loads options from a file. | protected boolean | loadProfile(File profile) Load one profile.
Parameters: profile - Profile to load. | public CrawlJob | newJob(CrawlJob baseOn, String recovery, String name, String description, String seeds, int priority) Creates a new job. | public CrawlJob | newJob(File orderFile, String name, String description, String seeds) Creates a new job. | public CrawlJob | newProfile(CrawlJob baseOn, String name, String description, String seeds) Creates a new profile. | public void | pauseJob() Cause the current job to pause. | public void | requestCrawlStop() | public void | resumeJob() Cause the current job to resume crawling if it was paused. | public void | setDefaultProfile(CrawlJob profile) Set the default profile.
Parameters: profile - The new default profile. | public void | startCrawler() Allow jobs to be crawled. | final protected void | startNextJob() Start next crawl job. | protected void | startNextJobInternal() | public void | stop() | public void | stopCrawler() Stop future jobs from being crawled. | public boolean | terminateCurrentJob() | protected void | updateRecoveryPaths(File recover, SettingsHandler sh, String jobName) Parameters: recover - Source to use recovering. |
DEFAULT_PROFILE | final public static String DEFAULT_PROFILE(Code) | | Default profile name.
|
DEFAULT_PROFILE_NAME | final public static String DEFAULT_PROFILE_NAME(Code) | | Name of system property whose specification overrides default profile
used.
|
ORDER_FILE_NAME | final public static String ORDER_FILE_NAME(Code) | | |
PROFILES_DIR_NAME | final public static String PROFILES_DIR_NAME(Code) | | Name of the profiles directory.
|
RECOVER_LOG | final public static String RECOVER_LOG(Code) | | String to indicate recovery should be based on the recovery log, not
based on checkpointing.
|
CrawlJobHandler | public CrawlJobHandler(File jobsDir)(Code) | | Constructor.
Parameters: jobsDir - Jobs directory. |
CrawlJobHandler | public CrawlJobHandler(File jobsDir, boolean loadJobs, boolean loadProfiles)(Code) | | Constructor allowing for optional loading of profiles and jobs.
Parameters: jobsDir - Jobs directory. Parameters: loadJobs - If true then any applicable jobs will be loaded. Parameters: loadProfiles - If true then any applicable profiles will be loaded. |
addJob | public CrawlJob addJob(CrawlJob job)(Code) | | Submit a job to the handler. Job will be scheduled for crawling. At
present it will not take the job's priority into consideration.
Parameters: job - A new job for the handler CrawlJob that was added or null. |
addProfile | public synchronized void addProfile(CrawlJob profile)(Code) | | Add a new profile
Parameters: profile - The new profile |
crawlEnding | public void crawlEnding(String sExitMessage)(Code) | | |
crawlPaused | public void crawlPaused(String statusMessage)(Code) | | |
crawlPausing | public void crawlPausing(String statusMessage)(Code) | | |
crawlResuming | public void crawlResuming(String statusMessage)(Code) | | |
createSettingsHandler | protected XMLSettingsHandler createSettingsHandler(File orderFile, String name, String description, String seeds, File newSettingsDir, CrawlJobErrorHandler errorHandler, String filename, String seedfile) throws FatalConfigurationException(Code) | | Creates a new settings handler based on an existing job. Basically all
the settings file for the 'based on' will be copied to the specified
directory.
Parameters: orderFile - Order file to base new order file on. Cannot be null. Parameters: name - Name for the new settings Parameters: description - Description of the new settings. Parameters: seeds - The contents of the new settings' seed file. Parameters: newSettingsDir - Parameters: errorHandler - Parameters: filename - Name of new order file. Parameters: seedfile - Name of new seeds file. The new settings handler. throws: FatalConfigurationException - If there are problems reading the 'based on' configuration, or writing the new configuration or its seed file. |
deleteJob | public void deleteJob(String jobUID)(Code) | | The specified job will be removed from the pending queue or aborted if
currently running. It will be placed in the list of completed jobs with
appropriate status info. If the job is already in the completed list or
no job with the given UID is found, no action will be taken.
Parameters: jobUID - The UID (unique ID) of the job that is to be deleted. |
deleteURIsFromPending | public long deleteURIsFromPending(String regexpr)(Code) | | Delete any URI from the frontier of the current (paused) job that match
the specified regular expression. If the current job is not paused (or
there is no current job) nothing will be done.
Parameters: regexpr - Regular expression to delete URIs by. the number of URIs deleted |
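The contract of deleteURIsFromPending can be illustrated with a self-contained sketch that applies a regular expression to a plain list standing in for the frontier's pending queue (the real method operates on the paused job's frontier, not a List):

```java
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of regex-based deletion from a pending-URI collection.
// Illustrative only; not the Heritrix frontier code.
public class FrontierDeleteSketch {
    // Remove every URI that fully matches the expression;
    // return how many were removed.
    public static long deleteMatching(List<String> pendingUris, String regexpr) {
        Pattern p = Pattern.compile(regexpr);
        long deleted = 0;
        for (Iterator<String> it = pendingUris.iterator(); it.hasNext();) {
            if (p.matcher(it.next()).matches()) {
                it.remove();
                deleted++;
            }
        }
        return deleted;
    }
}
```

Full-string matching is assumed here, so an expression like `.*example\.com.*` is needed to match URIs containing a substring.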
discardNewJob | public void discardNewJob()(Code) | | Discard the handler's 'new job'. This will remove any files/directories
written to disk.
|
doFlush | protected void doFlush()(Code) | | If it's a HostQueuesFrontier, it needs to be flushed for the queued URIs.
|
ensureNewJobWritten | public static CrawlJob ensureNewJobWritten(CrawlJob newJob, String metaname, String description)(Code) | | Ensure order file with new name/desc is written.
See '[ 1066573 ] sometimes job based-on other job uses older job name'.
Parameters: newJob - Newly created job. Parameters: metaname - Metaname for new job. Parameters: description - Description for new job. newJob |
getCompletedJobs | public List<CrawlJob> getCompletedJobs()(Code) | | A List of all finished jobs. |
getCurrentJob | public CrawlJob getCurrentJob()(Code) | | The job currently being crawled. |
getDefaultProfile | public synchronized CrawlJob getDefaultProfile()(Code) | | Returns the default profile. If no default profile has been set it will
return the first profile that was set/loaded and still exists. If no
profiles exist it will return null.
the default profile. |
getJob | public CrawlJob getJob(String jobUID)(Code) | | Return a job with the given UID.
Doesn't matter if it's pending, currently running, has finished running,
is new, or is a profile.
Parameters: jobUID - The unique ID of the job. The job with the UID or null if no such job is found |
getNewJob | public CrawlJob getNewJob()(Code) | | Get the handler's 'new job'
the handler's 'new job' |
getNextJobUID | public String getNextJobUID()(Code) | | Returns a unique job ID.
No two calls to this method (on the same instance of this class) can ever
return the same value.
Currently implemented to return a time stamp. That is subject to change
though.
A unique job ID. See Also: ArchiveUtils.TIMESTAMP17 |
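The documented guarantee (timestamp-based, never the same value twice on one instance) can be sketched as follows. The 17-digit `yyyyMMddHHmmssSSS` format is an assumption in the spirit of ArchiveUtils.TIMESTAMP17; the bump-on-collision trick is illustrative, not necessarily how Heritrix does it:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch of a timestamp-based unique job UID. If two calls land on the
// same millisecond, the stored value is bumped by one millisecond so no
// two calls on the same instance ever return the same UID.
public class JobUidSketch {
    private long lastUid = 0;

    public synchronized String getNextJobUID() {
        long now = System.currentTimeMillis();
        lastUid = Math.max(now, lastUid + 1);
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmssSSS");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(lastUid));
    }
}
```

Because uniqueness is per-instance state, two separate handler instances could still collide; the documented guarantee is explicitly scoped to "the same instance of this class".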
getPendingJobs | public List<CrawlJob> getPendingJobs()(Code) | | A List of all pending jobs. No promises are made about the order of the list. |
getProfiles | public synchronized List<CrawlJob> getProfiles()(Code) | | Returns a List of all known profiles.
a List of all known profiles. |
getStateJobFile | protected File getStateJobFile(File jobDir)(Code) | | Find the state.job file in the job directory.
Parameters: jobDir - Directory to look in. Full path to 'state.job' file or null if none found. |
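A minimal sketch of the state.job lookup, assuming the file may sit either directly in the job directory or one level down (the search depth is an assumption; the real method's traversal may differ):

```java
import java.io.File;

// Sketch: locate a 'state.job' file under a job directory, also checking
// one level of subdirectories. Illustrative, not the Heritrix code.
public class StateJobSketch {
    public static File findStateJobFile(File jobDir) {
        File direct = new File(jobDir, "state.job");
        if (direct.exists()) return direct;
        File[] subs = jobDir.listFiles(File::isDirectory);
        if (subs != null) {
            for (File sub : subs) {
                File f = new File(sub, "state.job");
                if (f.exists()) return f;
            }
        }
        return null; // none found
    }
}
```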
importUri | public void importUri(String uri, boolean forceFetch, boolean isSeed) throws URIException(Code) | | Schedule a uri.
Parameters: uri - Uri to schedule. Parameters: forceFetch - Should it be forcefetched. Parameters: isSeed - True if seed. throws: URIException - |
importUri | public void importUri(String str, boolean forceFetch, boolean isSeed, boolean isFlush) throws URIException(Code) | | Schedule a uri.
Parameters: str - String that can be: 1. a UURI, 2. a snippet of the crawl.log line, or 3. a snippet from the recover log. See CrawlJobHandler.importUris(InputStream,String,boolean) for how it subparses the lines from crawl.log and recover.log. Parameters: forceFetch - Should it be forcefetched. Parameters: isSeed - True if seed. Parameters: isFlush - If true, flush the frontier IF it implements flushing. throws: URIException - |
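The three accepted input shapes (bare URI, crawl.log line, recovery-journal line) suggest a sub-parser that pulls the URI token out of a whitespace-separated line. The heuristic below (first token containing `://`) is a simplified stand-in; the real parsing lives in CrawlJobHandler.importUris(InputStream, String, boolean):

```java
// Sketch of the kind of sub-parsing importUri describes: accept a bare
// URI, a crawl.log line, or a recovery-journal line (e.g. "F+ <uri>"),
// and extract the URI. Heuristic and illustrative only.
public class UriLineSketch {
    public static String extractUri(String str) {
        for (String token : str.trim().split("\\s+")) {
            if (token.contains("://")) {
                return token; // first URI-like token wins
            }
        }
        return null; // no URI-like token found
    }
}
```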
importUris | public String importUris(String fileOrUrl, String style, boolean forceRevisit)(Code) | | Parameters: fileOrUrl - Name of file w/ seeds. Parameters: style - What style of seeds -- crawl log (crawlLog style) or recovery journal (recoveryJournal style), or seeds file style (Pass default style). Parameters: forceRevisit - Should we revisit even if seen before? A display string that has a count of all added. |
isCrawling | public boolean isCrawling()(Code) | | Is a crawl job being crawled?
True if a job is actually being crawled (even if it is paused). False if no job is being crawled. |
isRunning | public boolean isRunning()(Code) | | Is the crawler accepting crawl jobs to run?
True if the next available CrawlJob will be crawled. False otherwise. |
kickUpdate | public void kickUpdate()(Code) | | Forward a 'kick' update to current job if any.
|
loadJob | protected void loadJob(File job)(Code) | | Loads a job given a specific job file. The loaded job will be placed in
the list of completed jobs or pending queue depending on its status.
Running jobs will have their status set to 'finished abnormally' and put
into the completed list.
Parameters: job - The job file of the job to load. |
loadOptions | public static ArrayList<String> loadOptions(String file) throws IOException(Code) | | Loads options from a file. Typically these are a list of available
modules that can be plugged into some part of the configuration.
For examples Processors, Frontiers, Filters etc. Leading and trailing
spaces are trimmed from each line.
Options are loaded from the CLASSPATH.
Parameters: file - the name of the option file (without path!) The option file with each option line as a separate entry in the ArrayList. throws: IOException - when there is trouble reading the file. |
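The documented contract of loadOptions (one trimmed entry per line) can be sketched with a self-contained variant. To stay runnable without a CLASSPATH resource, this sketch takes a Reader instead of a file name, and wraps the checked IOException; the real method resolves the name on the CLASSPATH and declares IOException:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;

// Sketch of loadOptions' contract: read an options file line by line,
// trimming leading and trailing spaces from each entry.
public class OptionsSketch {
    public static ArrayList<String> loadOptions(Reader in) {
        ArrayList<String> options = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(in)) {
            String line;
            while ((line = r.readLine()) != null) {
                options.add(line.trim());
            }
        } catch (IOException e) {
            // The real method propagates IOException instead.
            throw new UncheckedIOException(e);
        }
        return options;
    }
}
```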
loadProfile | protected boolean loadProfile(File profile)(Code) | | Load one profile.
Parameters: profile - Profile to load. True if loaded profile was the default profile. |
newJob | public CrawlJob newJob(CrawlJob baseOn, String recovery, String name, String description, String seeds, int priority) throws FatalConfigurationException(Code) | | Creates a new job. The new job will be returned and also registered as
the handler's 'new job'. The new job will be based on the settings
provided but created in a new location on disk.
Parameters: baseOn - A CrawlJob (with a valid settingshandler) to use as the template for the new job. Parameters: recovery - Whether to preinitialize the new job as a recovery of the baseOn job. String holds RECOVER_LOG if we are to do the recovery based off the recover.gz log -- see RecoveryJournal in the frontier package -- or it holds the name of the checkpoint we're to use recovering. Parameters: name - The name of the new job. Parameters: description - Description of the job. Parameters: seeds - The contents of the new settings' seed file. Parameters: priority - The priority of the new job. The new crawl job. throws: FatalConfigurationException - If a problem occurs creating the settings. |
newJob | public CrawlJob newJob(File orderFile, String name, String description, String seeds) throws FatalConfigurationException(Code) | | Creates a new job. The new job will be returned and also registered as
the handler's 'new job'. The new job will be based on the settings
provided but created in a new location on disk.
Parameters: orderFile - Order file to use as the template for the new job. Parameters: name - The name of the new job. Parameters: description - Description of the job. Parameters: seeds - The contents of the new settings' seed file. The new crawl job. throws: FatalConfigurationException - If a problem occurs creating the settings. |
newProfile | public CrawlJob newProfile(CrawlJob baseOn, String name, String description, String seeds) throws FatalConfigurationException, IOException(Code) | | Creates a new profile. The new profile will be returned and also
registered as the handler's 'new job'. The new profile will be based on
the settings provided but created in a new location on disk.
Parameters: baseOn - A CrawlJob (with a valid settingshandler) to use as the template for the new profile. Parameters: name - The name of the new profile. Parameters: description - Description of the new profile. Parameters: seeds - The contents of the new profile's seed file. The new profile. throws: FatalConfigurationException - throws: IOException - |
pauseJob | public void pauseJob()(Code) | | Cause the current job to pause. If no current job is crawling this
method will have no effect.
|
requestCrawlStop | public void requestCrawlStop()(Code) | | |
resumeJob | public void resumeJob()(Code) | | Cause the current job to resume crawling if it was paused. Will have no
effect if the current job was not paused or if there is no current job.
If the current job is still waiting to pause, this will not take effect
until the job has actually paused, at which time it will immediately
resume crawling.
|
setDefaultProfile | public void setDefaultProfile(CrawlJob profile)(Code) | | Set the default profile.
Parameters: profile - The new default profile. The following must apply to it: profile.isProfile() should return true and this.getProfiles() should contain it. |
startCrawler | public void startCrawler()(Code) | | Allow jobs to be crawled.
|
startNextJob | final protected void startNextJob()(Code) | | Start next crawl job.
If a job is already running this method will do nothing.
|
startNextJobInternal | protected void startNextJobInternal()(Code) | | |
stopCrawler | public void stopCrawler()(Code) | | Stop future jobs from being crawled.
This action will not affect the current job.
|
terminateCurrentJob | public boolean terminateCurrentJob()(Code) | | True if we terminated a current job (False if no job to terminate) |