| java.lang.Object javax.management.NotificationBroadcasterSupport org.archive.crawler.admin.CrawlJob
Constructor Summary | |
protected | CrawlJob() A shutdown Constructor. | public | CrawlJob(String UID, String name, XMLSettingsHandler settingsHandler, CrawlJobErrorHandler errorHandler, int priority, File dir) A constructor for jobs.
Create, ready to crawl, jobs.
Parameters: UID - A unique ID for this job. | protected | CrawlJob(String UIDandName, XMLSettingsHandler settingsHandler, CrawlJobErrorHandler errorHandler) A constructor for profiles.
Any job created with this constructor will be
considered a profile. | public | CrawlJob(String UID, String name, XMLSettingsHandler settingsHandler, CrawlJobErrorHandler errorHandler, int priority, File dir, String status, boolean isProfile, boolean isNew) | protected | CrawlJob(File jobFile, CrawlJobErrorHandler errorHandler) A constructor for reloading jobs from disk. |
Method Summary | |
protected void | addBdbjeAttributes(List<OpenMBeanAttributeInfo> attributes, List<MBeanAttributeInfo> bdbjeAttributes, List<String> bdbjeNamesToAdd) | protected void | addBdbjeOperations(List<OpenMBeanOperationInfo> operations, List<MBeanOperationInfo> bdbjeOperations, List<String> bdbjeNamesToAdd) | protected void | addCrawlOrderAttributes(ComplexType type, List<OpenMBeanAttributeInfo> attributes) | protected OpenMBeanInfoSupport | buildMBeanInfo() Build up the MBean info for Heritrix main. | protected void | checkpoint() | public void | crawlCheckpoint(File checkpointDir) | public void | crawlEnded(String sExitMessage) | public void | crawlEnding(String sExitMessage) | public void | crawlPaused(String statusMessage) | public void | crawlPausing(String statusMessage) | public void | crawlResuming(String statusMessage) | public void | crawlStarted(String message) | protected CrawlController | createCrawlController() | public long | deleteURIsFromPending(String regexpr) Delete any URI from the frontier of the current (paused) job that match
the specified regular expression. | protected void | flush() If it's a HostQueuesFrontier, it needs to be flushed for the queued. | public Object | getAttribute(String attribute_name) | public AttributeList | getAttributes(String[] attributeNames) | public CrawlController | getController() | protected Object | getCrawlOrderAttribute(String attribute_name) | protected Object | getCrawlOrderAttribute(String attribute_name, ComplexType ct) | public String | getCrawlStatus() | public File | getDirectory() Returns the path of the job's base directory. | public String | getDisplayName() Return the combination of given name and UID most commonly
used in administrative interface. | public CrawlJobErrorHandler | getErrorHandler() | public String | getErrorMessage() Get the error message associated with this job. | public String | getFrontierOneLine() | public String | getFrontierReport(String reportName) Parameters: reportName - Name of report to write. | protected Heritrix | getHostingHeritrix() | public String | getIgnoredSeeds() Utility method to get the stored list of ignored seed items (if any),
from the last time the seeds were imported to the frontier. | public FrontierMarker | getInitialMarker(String regexpr, boolean inCacheOnly) Returns a URIFrontierMarker for the current, paused, job. | public String | getJmxJobName() | public String | getJobName() Returns this job's 'name'. | public int | getJobPriority() Get this job's level of priority. | public String | getLogPath(String log) Returns the absolute path of the specified log. | public MBeanInfo | getMBeanInfo() | protected ObjectName | getMbeanName() | protected static int | getNotificationsSequenceNumber() | public int | getNumberOfJournalEntries() | public ArrayList | getPendingURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) Returns the frontier's URI list based on the provided marker. | public String | getProcessorsReport() Get the Processors report for the running crawl. | public String | getSettingsDirectory() Returns the directory where the configuration files for this job are
located. | public XMLSettingsHandler | getSettingsHandler() Returns the settings handler for this job. | public StatisticsTracking | getStatisticsTracking() | public String | getStatus() | public String | getThreadOneLine() | public String | getThreadsReport() Get the CrawlController's ToeThreads report for the running crawl. | public String | getUID() Returns this job's unique ID (UID) that was issued by the
CrawlJobHandler() when this job was first created. | public void | importUri(String uri, boolean forceFetch, boolean isSeed) Schedule a uri. | public void | importUri(String str, boolean forceFetch, boolean isSeed, boolean isFlush) Schedule a uri.
Parameters: str - String that can be: 1. | public String | importUris(String file, String style, String force) | public String | importUris(String fileOrUrl, String style, boolean forceRevisit) | public String | importUris(String fileOrUrl, String style, boolean forceRevisit, boolean areSeeds) | protected int | importUris(InputStream is, String style, boolean forceRevisit) | protected int | importUris(InputStream is, String style, boolean forceRevisit, boolean areSeeds) Import URIs.
Parameters: is - Stream to use as URI source. Parameters: style - Style in which URIs are rendered. | public Object | invoke(String operationName, Object[] params, String[] signature) | public boolean | isCheckpointing() | public boolean | isCrawling() | public boolean | isNew() | public boolean | isProfile() | public boolean | isReadOnly() | public boolean | isRunning() Returns true if the job is being crawled. | public void | kickUpdate() Forward a 'kick' update to current controller if any. | public void | killThread(int threadNumber, boolean replace) Kills a thread. | public void | mustBeCrawling() | protected void | pause() | public void | postDeregister() | public void | postRegister(Boolean registrationDone) | public void | preDeregister() | public ObjectName | preRegister(MBeanServer server, ObjectName on) | protected void | resume() | public Collection | scanCheckpoints() | public void | setAttribute(Attribute attribute) | public AttributeList | setAttributes(AttributeList attributes) | protected void | setCrawlOrderAttribute(String attribute_name, ComplexType ct, Attribute attribute) | public void | setErrorMessage(String string) Set an error message for this job. | public void | setJobPriority(int priority) Set this job's level of priority. | public void | setNew(boolean b) Set if the job is considered a new job or not. | public void | setNumberOfJournalEntries(int numberOfJournalEntries) | public void | setReadOnly() Once called no changes can be made to the settings for this job. | protected void | setRunning(boolean b) Set if job is being crawled. | public void | setStatus(String status) Set the status of this CrawlJob. | protected CrawlController | setupCrawlController() | public void | setupForCrawlStart() | public void | stopCrawling() | protected void | unregisterMBean() | public void | writeFrontierReport(String reportName, PrintWriter writer) | public void | writeThreadsReport(String reportName, PrintWriter writer) |
PRIORITY_AVERAGE | final public static int PRIORITY_AVERAGE(Code) | | average
|
PRIORITY_CRITICAL | final public static int PRIORITY_CRITICAL(Code) | | highest
|
PRIORITY_HIGH | final public static int PRIORITY_HIGH(Code) | | high
|
PRIORITY_LOW | final public static int PRIORITY_LOW(Code) | | low
|
PRIORITY_MINIMAL | final public static int PRIORITY_MINIMAL(Code) | | lowest
|
STATUS_ABORTED | final public static String STATUS_ABORTED(Code) | | Job was terminated by user input while crawling
|
STATUS_CHECKPOINTING | final public static String STATUS_CHECKPOINTING(Code) | | Job is being checkpointed. When finished checkpointing, job is set
back to STATUS_PAUSED (Job must be first paused before checkpointing
will run).
|
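The checkpointing rule above (a job must be paused first, and returns to the paused state afterwards) can be pictured as a small state machine. This is only a sketch: the `CheckpointStateSketch` class and its `State` enum are hypothetical stand-ins, since the actual string values of the STATUS constants are not shown on this page.

```java
/** Hypothetical model of the pause/checkpoint rule; not a Heritrix class. */
final class CheckpointStateSketch {
    /** Stand-ins for STATUS_RUNNING, STATUS_PAUSED and STATUS_CHECKPOINTING. */
    enum State { RUNNING, PAUSED, CHECKPOINTING }

    private State state = State.RUNNING;

    State state() {
        return state;
    }

    void pause() {
        state = State.PAUSED;
    }

    /** Checkpointing may only start from the paused state. */
    void beginCheckpoint() {
        if (state != State.PAUSED) {
            throw new IllegalStateException("Job must be paused before checkpointing");
        }
        state = State.CHECKPOINTING;
    }

    /** When checkpointing finishes, the job is set back to paused. */
    void endCheckpoint() {
        if (state != State.CHECKPOINTING) {
            throw new IllegalStateException("No checkpoint in progress");
        }
        state = State.PAUSED;
    }
}
```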
STATUS_CREATED | final public static String STATUS_CREATED(Code) | | Initial value. May not be ready to run/incomplete.
|
STATUS_DELETED | final public static String STATUS_DELETED(Code) | | Job was deleted by user, will not be displayed in UI.
|
STATUS_FINISHED | final public static String STATUS_FINISHED(Code) | | Job finished normally having completed its crawl.
|
STATUS_FINISHED_ABNORMAL | final public static String STATUS_FINISHED_ABNORMAL(Code) | | Something went very wrong
|
STATUS_FINISHED_DATA_LIMIT | final public static String STATUS_FINISHED_DATA_LIMIT(Code) | | Job finished normally when the specified amount of
data (MB) had been downloaded
|
STATUS_FINISHED_DOCUMENT_LIMIT | final public static String STATUS_FINISHED_DOCUMENT_LIMIT(Code) | | Job finished normally when the specified number of documents had been
fetched.
|
STATUS_FINISHED_TIME_LIMIT | final public static String STATUS_FINISHED_TIME_LIMIT(Code) | | Job finished normally when the specified time limit was hit.
|
STATUS_MISCONFIGURED | final public static String STATUS_MISCONFIGURED(Code) | | Job could not be launched due to an InitializationException
|
STATUS_PAUSED | final public static String STATUS_PAUSED(Code) | | Job was temporarily stopped. State is kept so it can be resumed
|
STATUS_PENDING | final public static String STATUS_PENDING(Code) | | Job has been successfully submitted to a CrawlJobHandler
|
STATUS_PREPARING | final public static String STATUS_PREPARING(Code) | | |
STATUS_PROFILE | final public static String STATUS_PROFILE(Code) | | Job is actually a profile
|
STATUS_RUNNING | final public static String STATUS_RUNNING(Code) | | Job is being crawled
|
STATUS_WAITING_FOR_PAUSE | final public static String STATUS_WAITING_FOR_PAUSE(Code) | | Job is going to be temporarily stopped after active threads are finished.
|
CrawlJob | protected CrawlJob()(Code) | | A shutdown Constructor.
|
CrawlJob | public CrawlJob(String UID, String name, XMLSettingsHandler settingsHandler, CrawlJobErrorHandler errorHandler, int priority, File dir)(Code) | | A constructor for jobs.
Create, ready to crawl, jobs.
Parameters: UID - A unique ID for this job. Typically emitted by the CrawlJobHandler. Parameters: name - The name of the job. Parameters: settingsHandler - The associated settings. Parameters: errorHandler - The crawl job's settings error handler. null means none is set. Parameters: priority - Job priority. Parameters: dir - The directory that is considered this job's working directory. |
CrawlJob | protected CrawlJob(String UIDandName, XMLSettingsHandler settingsHandler, CrawlJobErrorHandler errorHandler)(Code) | | A constructor for profiles.
Any job created with this constructor will be
considered a profile. Profiles are not stored on disk (only their
settings files are stored on disk). This is because their data is
predictable given any settings files.
Parameters: UIDandName - A unique ID for this job. For profiles this is the same as name. Parameters: settingsHandler - The associated settings. Parameters: errorHandler - The crawl job's settings error handler. null means none is set. |
CrawlJob | protected CrawlJob(File jobFile, CrawlJobErrorHandler errorHandler) throws InvalidJobFileException, IOException(Code) | | A constructor for reloading jobs from disk. Jobs (not profiles) have
their data written to persistent storage in the file system. This method
is used to load the job from such storage. This is done by the
CrawlJobHandler.
Proper structure of a job file (TODO: Maybe one day make this an XML file)
Line 1. UID
Line 2. Job name (string)
Line 3. Job status (string)
Line 4. is job read only (true/false)
Line 5. is job running (true/false)
Line 6. job priority (int)
Line 7. number of journal entries
Line 8. setting file (with path)
Line 9. statistics tracker file (with path)
Line 10-?. error message (String, empty for null), can be many lines
Parameters: jobFile - a file containing information about the job to load. Parameters: errorHandler - The crawl job's settings error handler. null means none is set. throws: InvalidJobFileException - if the specified file does not refer to a valid job file. throws: IOException - if I/O operations fail |
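The job file layout listed above can be read with a few lines of plain Java. The sketch below is not the Heritrix implementation: `JobFileSketch` and its field names are hypothetical, and it signals malformed input with `IllegalArgumentException` rather than the `InvalidJobFileException` the real constructor declares.

```java
import java.util.List;

/** Hypothetical holder for the fields of a job file; not a Heritrix class. */
final class JobFileSketch {
    String uid;
    String name;
    String status;
    boolean readOnly;
    boolean running;
    int priority;
    int journalEntries;
    String settingsFile;
    String statisticsFile;
    String errorMessage; // null when the file carries no message

    /** Maps lines 1-9 to fields and joins any remaining lines into the error message. */
    static JobFileSketch parse(List<String> lines) {
        if (lines.size() < 9) {
            throw new IllegalArgumentException("Expected at least 9 lines, got " + lines.size());
        }
        JobFileSketch job = new JobFileSketch();
        job.uid = lines.get(0).trim();
        job.name = lines.get(1).trim();
        job.status = lines.get(2).trim();
        job.readOnly = Boolean.parseBoolean(lines.get(3).trim());
        job.running = Boolean.parseBoolean(lines.get(4).trim());
        job.priority = Integer.parseInt(lines.get(5).trim());
        job.journalEntries = Integer.parseInt(lines.get(6).trim());
        job.settingsFile = lines.get(7).trim();
        job.statisticsFile = lines.get(8).trim();
        // Line 10 onward: the error message, which may span many lines.
        String message = String.join("\n", lines.subList(9, lines.size()));
        job.errorMessage = message.isEmpty() ? null : message;
        return job;
    }
}
```

A caller could feed it the result of `java.nio.file.Files.readAllLines(path)`.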
crawlEnding | public void crawlEnding(String sExitMessage)(Code) | | |
crawlPaused | public void crawlPaused(String statusMessage)(Code) | | |
crawlPausing | public void crawlPausing(String statusMessage)(Code) | | |
crawlResuming | public void crawlResuming(String statusMessage)(Code) | | |
deleteURIsFromPending | public long deleteURIsFromPending(String regexpr)(Code) | | Delete any URI from the frontier of the current (paused) job that match
the specified regular expression. If the current job is not paused (or
there is no current job) nothing will be done.
Parameters: regexpr - Regular expression to delete URIs by. Returns: the number of URIs deleted |
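To illustrate the contract described above (delete by regular expression, do nothing unless the job is paused, report the number deleted), here is a minimal sketch over a plain list. `PendingUriSketch` is hypothetical and a `List<String>` stands in for the frontier. Whether Heritrix requires the expression to match the whole URI is not stated on this page, so the full-match behavior below is an assumption.

```java
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a frontier's pending queue; not a Heritrix class. */
final class PendingUriSketch {
    /** Removes every URI matching the expression and returns how many were removed. */
    static long deleteMatching(List<String> pending, boolean paused, String regexpr) {
        if (!paused) {
            return 0; // nothing is done unless the job is paused
        }
        Pattern pattern = Pattern.compile(regexpr);
        long deleted = 0;
        for (Iterator<String> it = pending.iterator(); it.hasNext();) {
            // Assumption: the expression must match the entire URI.
            if (pattern.matcher(it.next()).matches()) {
                it.remove();
                deleted++;
            }
        }
        return deleted;
    }
}
```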
flush | protected void flush()(Code) | | If it's a HostQueuesFrontier, it needs to be flushed for the queued.
|
getCrawlOrderAttribute | protected Object getCrawlOrderAttribute(String attribute_name)(Code) | | |
getCrawlStatus | public String getCrawlStatus()(Code) | | Status of the crawler (Used by JMX). |
getDirectory | public File getDirectory()(Code) | | Returns the path of the job's base directory. For profiles this is always
equal to new File(getSettingsDirectory()).
the path of the job's base directory. |
getDisplayName | public String getDisplayName()(Code) | | Return the combination of given name and UID most commonly
used in administrative interface.
Job's name with UID notation |
getErrorMessage | public String getErrorMessage()(Code) | | Get the error message associated with this job. Will return null if there
is no error message.
the error message associated with this job |
getFrontierOneLine | public String getFrontierOneLine()(Code) | | One-line Frontier report. |
getFrontierReport | public String getFrontierReport(String reportName)(Code) | | Parameters: reportName - Name of report to write. Returns: A report of the frontier's status. |
getHostingHeritrix | protected Heritrix getHostingHeritrix()(Code) | | Heritrix that is hosting this job. |
getIgnoredSeeds | public String getIgnoredSeeds()(Code) | | Utility method to get the stored list of ignored seed items (if any),
from the last time the seeds were imported to the frontier.
String of all ignored seed items, or null if none |
getJmxJobName | public String getJmxJobName()(Code) | | Unique name for job that is safe to use in jmx (Like displayname but without spaces). |
getJobName | public String getJobName()(Code) | | Returns this job's 'name'. The name comes from the settings for this job,
need not be unique and may change. For a unique identifier use
CrawlJob.getUID().
The name corresponds to the value of the 'name' tag in the 'meta' section
of the settings file.
This job's 'name' |
getMBeanInfo | public MBeanInfo getMBeanInfo()(Code) | | Our mbean info (Needed for CrawlJob to qualify as a DynamicMBean). |
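For readers unfamiliar with why `getMBeanInfo()` is needed, the sketch below shows the smallest useful `javax.management.DynamicMBean`: the interface requires `getMBeanInfo()` alongside the attribute and `invoke` methods that also appear in CrawlJob's method list. `StatusMBeanSketch`, its single read-only `Status` attribute, and the value `"created"` are hypothetical. CrawlJob itself builds an `OpenMBeanInfoSupport` in `buildMBeanInfo()`, which this sketch does not attempt to reproduce.

```java
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;

/** Hypothetical read-only MBean exposing one attribute; not a Heritrix class. */
final class StatusMBeanSketch implements DynamicMBean {
    private volatile String status = "created"; // illustrative value only

    @Override
    public Object getAttribute(String name) throws AttributeNotFoundException {
        if ("Status".equals(name)) {
            return status;
        }
        throw new AttributeNotFoundException(name);
    }

    @Override
    public void setAttribute(Attribute attribute) throws AttributeNotFoundException {
        throw new AttributeNotFoundException("No writable attribute: " + attribute.getName());
    }

    @Override
    public AttributeList getAttributes(String[] names) {
        AttributeList list = new AttributeList();
        for (String name : names) {
            if ("Status".equals(name)) {
                list.add(new Attribute(name, status));
            }
        }
        return list; // unknown names are silently skipped
    }

    @Override
    public AttributeList setAttributes(AttributeList attributes) {
        return new AttributeList(); // nothing is writable in this sketch
    }

    @Override
    public Object invoke(String operation, Object[] params, String[] signature) {
        throw new UnsupportedOperationException(operation);
    }

    /** Describes the bean to the MBean server: one readable String attribute. */
    @Override
    public MBeanInfo getMBeanInfo() {
        MBeanAttributeInfo statusInfo = new MBeanAttributeInfo(
                "Status", "java.lang.String", "Current status", true, false, false);
        return new MBeanInfo(getClass().getName(), "Status sketch",
                new MBeanAttributeInfo[] { statusInfo }, null, null, null);
    }
}
```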
getNotificationsSequenceNumber | protected static int getNotificationsSequenceNumber()(Code) | | Notification sequence number (Does increment after each access). |
getNumberOfJournalEntries | public int getNumberOfJournalEntries()(Code) | | Returns the number of journal entries. |
getProcessorsReport | public String getProcessorsReport()(Code) | | Get the Processors report for the running crawl.
The Processors report for the running crawl. |
getSettingsDirectory | public String getSettingsDirectory()(Code) | | Returns the directory where the configuration files for this job are
located.
the directory where the configuration files for this job are located |
getSettingsHandler | public XMLSettingsHandler getSettingsHandler()(Code) | | Returns the settings handler for this job. It will have been initialized.
the settings handler for this job. |
getStatisticsTracking | public StatisticsTracking getStatisticsTracking()(Code) | | the statistics tracking instance (or null if none yet available). |
getStatus | public String getStatus()(Code) | | Get the current status of this CrawlJob
The current status of this CrawlJob (see constants defined here beginning with STATUS) |
getThreadOneLine | public String getThreadOneLine()(Code) | | One-line threads report. |
getThreadsReport | public String getThreadsReport()(Code) | | Get the CrawlController's ToeThreads report for the running crawl.
The CrawlController's ToeThreads report |
importUri | public void importUri(String uri, boolean forceFetch, boolean isSeed) throws URIException(Code) | | Schedule a uri.
Parameters: uri - Uri to schedule. Parameters: forceFetch - Should it be forcefetched. Parameters: isSeed - True if seed. throws: URIException - |
importUri | public void importUri(String str, boolean forceFetch, boolean isSeed, boolean isFlush) throws URIException(Code) | | Schedule a uri.
Parameters: str - String that can be: 1. a UURI, 2. a snippet of the crawl.log line, or 3. a snippet from recover log. See CrawlJob.importUris(InputStream,String,boolean) for how it subparses the lines from crawl.log and recover.log. Parameters: forceFetch - Should it be forcefetched. Parameters: isSeed - True if seed. Parameters: isFlush - If true, flush the frontier IF it implements flushing. throws: URIException - |
importUris | public String importUris(String fileOrUrl, String style, boolean forceRevisit, boolean areSeeds)(Code) | | Parameters: fileOrUrl - Name of file with seeds. Parameters: style - What style of seeds: crawl log, recovery journal, or seeds file. Parameters: forceRevisit - Should we revisit even if seen before? Parameters: areSeeds - Is the file exclusively seeds? Returns: A display string that has a count of all added. |
importUris | protected int importUris(InputStream is, String style, boolean forceRevisit, boolean areSeeds)(Code) | | Import URIs.
Parameters: is - Stream to use as URI source. Parameters: style - Style in which URIs are rendered. Currently supports recoveryJournal, crawlLog, and seeds file format (i.e. default), where the default style is a UURI per line (comments allowed). Parameters: forceRevisit - Whether we should revisit this URI even if we've visited it previously. Parameters: areSeeds - Are the imported URIs seeds? Returns: Count of added URIs. |
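The default style (one UURI per line, comments allowed) is simple enough to sketch. The helper below is hypothetical and covers only that style; it does not handle the crawlLog or recoveryJournal formats. The page does not say what marks a comment, so treating lines that start with `#` as comments is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the default seeds style; not a Heritrix class. */
final class SeedLineSketch {
    /** Returns the URIs found in the text, one per non-blank, non-comment line. */
    static List<String> parseDefaultStyle(String text) {
        List<String> uris = new ArrayList<>();
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            // Assumption: '#' introduces a comment line.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            uris.add(line);
        }
        return uris;
    }
}
```

The size of the returned list plays the role of the "count of added URIs" that the real method reports.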
isCheckpointing | public boolean isCheckpointing()(Code) | | True if checkpointing. |
isCrawling | public boolean isCrawling()(Code) | | |
isNew | public boolean isNew()(Code) | | Is this a new job?
True if is new. |
isProfile | public boolean isProfile()(Code) | | Is the job considered to be a profile?
True if is a profile. |
isReadOnly | public boolean isReadOnly()(Code) | | Is job read only?
false until setReadOnly has been invoked, after that it returns true. |
isRunning | public boolean isRunning()(Code) | | Returns true if the job is being crawled.
true if the job is being crawled |
mustBeCrawling | public void mustBeCrawling()(Code) | | |
pause | protected void pause()(Code) | | |
postDeregister | public void postDeregister()(Code) | | |
postRegister | public void postRegister(Boolean registrationDone)(Code) | | |
resume | protected void resume()(Code) | | |
scanCheckpoints | public Collection scanCheckpoints()(Code) | | Read all the checkpoints found in the job's checkpoints
directory into Checkpoint instances
Collection containing list of all checkpoints. |
setErrorMessage | public void setErrorMessage(String string)(Code) | | Set an error message for this job. Generally this only occurs if the job
is misconfigured.
Parameters: string - the error message associated with this job |
setNew | public void setNew(boolean b)(Code) | | Set if the job is considered a new job or not.
Parameters: b - Is the job considered to be new. |
setNumberOfJournalEntries | public void setNumberOfJournalEntries(int numberOfJournalEntries)(Code) | | Parameters: numberOfJournalEntries - The number of journal entries to set. |
setReadOnly | public void setReadOnly()(Code) | | Once called no changes can be made to the settings for this job.
Typically this is done once a crawl is completed and further changes
to the crawl order are therefore meaningless.
|
setRunning | protected void setRunning(boolean b)(Code) | | Set if job is being crawled.
Parameters: b - Is job being crawled. |
setStatus | public void setStatus(String status)(Code) | | Set the status of this CrawlJob.
Parameters: status - Current status of CrawlJob (see constants defined here beginning with STATUS) |
stopCrawling | public void stopCrawling()(Code) | | |
unregisterMBean | protected void unregisterMBean()(Code) | | |
writeFrontierReport | public void writeFrontierReport(String reportName, PrintWriter writer)(Code) | | Write the requested frontier report to the given PrintWriter
Parameters: reportName - Name of report to write. Parameters: writer - Where to write to. |
writeThreadsReport | public void writeThreadsReport(String reportName, PrintWriter writer)(Code) | | Write the requested threads report to the given PrintWriter
Parameters: reportName - Name of report to write. Parameters: writer - Where to write to. |