| java.lang.Object org.archive.crawler.Heritrix
Heritrix | public class Heritrix implements DynamicMBean,MBeanRegistration(Code) | | Main class for Heritrix crawler.
Heritrix is usually launched by a shell script that backgrounds heritrix
and redirects all stdout and stderr emitted by heritrix to a log file. So
that startup messages emitted subsequent to the redirection of stdout and
stderr show on the console, this class prints usage or startup output
such as where the web UI can be found, etc., to a STARTLOG that the shell
script is waiting on. As soon as the shell script sees output in this file,
it prints its content and breaks out of its wait.
See ${HERITRIX_HOME}/bin/heritrix.
Heritrix can also be embedded or launched by webapp initialization or
by JMX bootstrapping. So far I count 4 methods of instantiation:
- From this class's main -- the method usually used;
- From the Heritrix UI (the local-instances.jsp page);
- A creation by a JMX agent at the behest of a remote JMX client; and
- A container such as tomcat or jboss.
author: gojomo author: Kristinn Sigurdsson author: Stack |
Method Summary | |
public String | addCrawlJob(String orderPathOrUrl, String name, String description, String seeds) This method is called when we have an order file to hand that we want
to base a job on. | protected String | addCrawlJob(URL url, HttpURLConnection connection, String name, String description, String seeds) | protected String | addCrawlJob(File order, String name, String description, String seeds) | protected CrawlJob | addCrawlJob(CrawlJob job) | public String | addCrawlJobBasedOn(String jobUidOrProfile, String name, String description, String seeds) | protected CrawlJob | addCrawlJobBasedOn(File orderFile, String name, String description, String seeds) | protected String | addCrawlJobBasedonJar(File jarFile, String name, String description, String seeds) Undo jar file and use as basis for a new job. | protected static ObjectName | addGuiPort(ObjectName name) | protected static ObjectName | addVitals(ObjectName name) Add vital stats to passed in ObjectName.
Parameters: name - ObjectName to add to. | protected OpenMBeanInfoSupport | buildMBeanInfo() Build up the MBean info for Heritrix main. | protected String | checkForEmptyPlaceHolder(String str) If passed str has placeholder for the empty string, return the empty
string else return original.
Dumb jmx clients can't pass empty string so they'll pass a representation
of empty string such as ' ' or '-'. | protected static void | configureTrustStore() Configure our trust store.
If system property is defined, then use it for our truststore. | protected static void | containerInitialization() Run setup tasks for this 'container'. | protected static CrawlJob | createCrawlJob(CrawlJobHandler handler, File crawlOrderFile, String name) | protected CrawlJob | createCrawlJobBasedOn(File orderFile, String name, String description, String seeds) | protected static void | deregisterJndi(ObjectName name) | public void | destroy() Do inverse of construction. | protected static String | doCmdLineArgs(String[] args) | protected String | doOneCrawl(String crawlOrderFile) Launch the crawler without a web UI and run the passed crawl only. | protected String | doOneCrawl(String crawlOrderFile, CrawlStatusListener listener) Launch the crawler without a web UI and run passed crawl only. | public SinkHandlerLogRecord | getAlert(String id) | public Vector | getAlerts() | public int | getAlertsCount() | public Object | getAttribute(String attribute_name) | public AttributeList | getAttributes(String[] attributeNames) | public static File | getConfdir() Get the configuration directory. | public static File | getConfdir(boolean fail) Get the configuration directory.
Parameters: fail - Throw IOE if can't find directory if true, else just return null. | protected String | getCrawlendReport(String jobUid, String reportName) Return named crawl end report for job with passed uid.
Crawler makes reports when it's finished its crawl. | protected static File | getHeritrixHome() Exploit -Dheritrix.home if available to us. | public static String | getHeritrixOut() | public static SimpleHttpServer | getHttpServer() Returns the httpServer. | public static Map | getInstances() | public static ObjectName | getJmxObjectName() | public static ObjectName | getJmxObjectName(String name) | public static ObjectName | getJmxObjectName(String name, String type) | protected static ObjectName | getJndiContainerName() Jndi container name -- the name to use for the 'container' that can host zero or more heritrix instances (Return a JMX ObjectName. | protected static Context | getJndiContext() | public CrawlJobHandler | getJobHandler() | public static File | getJobsdir() The directory into which we put jobs. | public MBeanInfo | getMBeanInfo() | public ObjectName | getMBeanName() | public static MBeanServer | getMBeanServer() Get MBeanServer.
Currently uses first MBeanServer found. | public Vector | getNewAlerts() | public int | getNewAlertsCount() | protected String | getNoJmxName() | protected static InputStream | getPropertiesInputStream() | protected static Thread | getShutdownThread(boolean sysexit, int exitCode, String name) | public static Heritrix | getSingleInstance() | public String | getStatus() | protected static File | getSubDir(String subdirName) Get and check for existence of expected subdir.
If development flag set, then look for dir under src dir.
Parameters: subdirName - Dir to look for. | protected static File | getSubDir(String subdirName, boolean fail) Get and optionally check for existence of subdir.
If development flag set, then look for dir under src dir.
Parameters: subdirName - Dir to look for. Parameters: fail - True if we are to fail if directory does not exist; false if we are to return null if the directory does not exist. | public static String | getVersion() Get the heritrix version.
The heritrix version. | public static File | getWarsdir() | public String | interrupt(String threadName) | public Object | invoke(String operationName, Object[] params, String[] signature) | public static boolean | isCommandLine() | protected static boolean | isDevelopment() | public static boolean | isSingleInstance() | public boolean | isStarted() | protected static boolean | isValidLoginPasswordString(String str) Test string is valid login/password string.
A valid login/password string has the login and password compounded
w/ a ':' delimiter.
Parameters: str - String to test. | public String | launch() Launch the crawler for a web UI. | public String | launch(String crawlOrderFile, boolean runMode) Launch the crawler for a web UI.
Crawler hangs around waiting on jobs.
Parameters: crawlOrderFile - File to crawl. | protected static Properties | loadProperties() Load the heritrix.properties file. | public static void | main(String[] args) Launch program.
Optionally will launch a web server to host UI. | protected TabularData | makeJobsTabularData(List jobs) | protected static void | patchLogging() If the user hasn't altered the default logging parameters, tighten them
up somewhat: some of our libraries are way too verbose at the INFO or
WARNING levels.
This might be a problem running inside someone else's
container. | public static void | performHeritrixShutDown() Exit program. | public static void | performHeritrixShutDown(int exitCode) Exit program. | public void | postDeregister() | public void | postRegister(Boolean registrationDone) | public void | preDeregister() | public ObjectName | preRegister(MBeanServer server, ObjectName name) | public static void | prepareHeritrixShutDown() Prepares for program shutdown. | public void | readAlert(String id) | protected static void | registerContainerJndi() | protected static void | registerHeritrix(Heritrix h, String name, boolean jmxregister) Register Heritrix with JNDI, JMX, and with the static hashtable of all
Heritrix instances known to this JVM.
If launched from cmdline, register Heritrix MBean if there is an agent to register
ourselves with. | protected static void | registerJndi(ObjectName name) | public static MBeanServer | registerMBean(Object objToRegister, String name, String type) | public static MBeanServer | registerMBean(MBeanServer server, Object objToRegister, String name, String type) | public static MBeanServer | registerMBean(MBeanServer server, Object objToRegister, ObjectName objName) | public void | removeAlert(String id) | public static void | resetAuthentication(String newUsername, String newPassword) Replace existing administrator login info with new info. | protected static String | selftest(String oneSelfTestName, int port) | public void | setAttribute(Attribute attribute) | public AttributeList | setAttributes(AttributeList attributes) | public static void | shutdown(int exitCode) Shutdown all running heritrix instances and the JVM. | public static void | shutdown() | public void | start() Start Heritrix.
Used by JMX and webapp initialization for starting Heritrix.
Not by the cmdline launched Heritrix. | public void | startCrawling() | protected static String | startEmbeddedWebserver(int port, boolean lho, String adminLoginPassword) Start up the embedded Jetty webserver instance. | protected static String | startEmbeddedWebserver(Collection<String> hosts, int port, String adminLoginPassword) Start up the embedded Jetty webserver instance. | public void | stop() Stop Heritrix. | public void | stopCrawling() | protected static void | unregisterHeritrix(Heritrix h) | public static void | unregisterMBean(MBeanServer server, String name, String type) | public static void | unregisterMBean(MBeanServer server, ObjectName name) |
DEFAULT_ENCODING | final public static String DEFAULT_ENCODING(Code) | | Default encoding.
Used for content when fetching if none specified.
|
Heritrix | public Heritrix() throws IOException(Code) | | Constructor.
Does not register the created instance with JMX. It is assumed that this
constructor is used by something such as a JMX agent creating an instance of
Heritrix at the command of a remote client (in this case Heritrix will
be registered by the invoking agent).
throws: IOException - |
Heritrix | public Heritrix(String name, boolean jmxregister) throws IOException(Code) | | Constructor.
Parameters: name - If null, we bring up the default Heritrix instance. Parameters: jmxregister - True if we are to register this instance with JMX agent. throws: IOException - |
Heritrix | public Heritrix(String name, boolean jmxregister, CrawlJobHandler cjh) throws IOException(Code) | | Constructor.
Parameters: name - If null, we bring up the default Heritrix instance. Parameters: jmxregister - True if we are to register this instance with JMX agent. Parameters: cjh - CrawlJobHandler to use. throws: IOException - |
addCrawlJob | public String addCrawlJob(String orderPathOrUrl, String name, String description, String seeds) throws IOException, FatalConfigurationException(Code) | | This method is called when we have an order file to hand that we want
to base a job on. It leaves the order file in place and just starts up
a job that uses everything the order points to for locations of logs, etc.
Parameters: orderPathOrUrl - Path to an order file or to a seeds file. Parameters: name - Name to use for this job. Parameters: description - Parameters: seeds - A status string. throws: IOException - throws: FatalConfigurationException - |
buildMBeanInfo | protected OpenMBeanInfoSupport buildMBeanInfo()(Code) | | Build up the MBean info for Heritrix main.
Return created mbean info instance. |
checkForEmptyPlaceHolder | protected String checkForEmptyPlaceHolder(String str)(Code) | | If passed str has placeholder for the empty string, return the empty
string else return original.
Dumb jmx clients can't pass empty string so they'll pass a representation
of empty string such as ' ' or '-'. Convert such strings to empty
string.
Parameters: str - String to check. Original str or empty string if str contains a placeholder for the empty-string (e.g. '-', or ' '). |
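The conversion described above can be sketched as follows. This is an illustration only, not the Heritrix implementation: the class name EmptyPlaceholderSketch is hypothetical, and the placeholder set (exactly a single space or a single hyphen) is an assumption taken from the examples in the description.

```java
/** Illustrative only: maps assumed placeholder strings to the empty string. */
final class EmptyPlaceholderSketch {
    private EmptyPlaceholderSketch() {}

    /**
     * Returns "" when str is one of the assumed placeholders (" " or "-");
     * otherwise returns str unchanged (null is passed through as null).
     */
    static String checkForEmptyPlaceHolder(String str) {
        if (" ".equals(str) || "-".equals(str)) {
            return "";
        }
        return str;
    }

    public static void main(String[] args) {
        // Brackets make the empty result visible on the console.
        System.out.println("[" + checkForEmptyPlaceHolder("-") + "]");
        System.out.println("[" + checkForEmptyPlaceHolder("seeds.txt") + "]");
    }
}
```

A JMX client that cannot send an empty argument would pass "-" for, say, the description, and the receiving method would normalize it before use.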
configureTrustStore | protected static void configureTrustStore()(Code) | | Configure our trust store.
If system property is defined, then use it for our truststore. Otherwise
use the heritrix truststore under conf directory if it exists.
If we're not launched from the command-line, we will not be able
to find our truststore. The truststore is not normally used, so this
should rarely be a problem (in the case where we don't find our trust
store, we'll use the 'default' -- either the JVM's or the container's).
|
containerInitialization | protected static void containerInitialization() throws IOException(Code) | | Run setup tasks for this 'container'. Idempotent.
throws: IOException - |
destroy | public void destroy()(Code) | | Do inverse of construction. Used by anyone who does a 'new Heritrix' when
they want to cleanup the instance.
Of note, there may be Heritrix threads still hanging around after the
call to destroy completes. They'll eventually go down after they've
finished their cleanup routines. In particular, if you are watching
Heritrix via JMX, you can see the Heritrix instance JMX bean unregister
ahead of the CrawlJob JMX bean that it's hosting.
|
getAlertsCount | public int getAlertsCount()(Code) | | |
getConfdir | public static File getConfdir() throws IOException(Code) | | Get the configuration directory.
The conf directory under HERITRIX_HOME or null if none can be found. throws: IOException - |
getConfdir | public static File getConfdir(boolean fail) throws IOException(Code) | | Get the configuration directory.
Parameters: fail - Throw IOE if can't find directory if true, else just return null. The conf directory under HERITRIX_HOME or null (or an IOE) if it can't be found. throws: IOException - |
getCrawlendReport | protected String getCrawlendReport(String jobUid, String reportName) throws IOException(Code) | | Return named crawl end report for job with passed uid.
Crawler makes reports when it's finished its crawl. Use this method
to get a String version of one of these files.
Parameters: jobUid - The unique ID for the job whose reports you want to see (must be a completed job). Parameters: reportName - Name of report minus '.txt' (e.g. crawl-report). String version of the on-disk report. throws: IOException - |
getHeritrixHome | protected static File getHeritrixHome() throws IOException(Code) | | Exploit -Dheritrix.home if available to us.
Is current working dir if no heritrix.home property supplied.
Heritrix home directory. throws: IOException - |
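A minimal sketch of the lookup order described above: use the heritrix.home system property when it is supplied, otherwise fall back to the current working directory. The class and method names are hypothetical, and the sketch omits any existence checks (and the IOException) that the real method may perform.

```java
import java.io.File;

/** Illustrative only: resolves a home directory from -Dheritrix.home or the cwd. */
final class HeritrixHomeSketch {
    private HeritrixHomeSketch() {}

    /** Returns the directory named by heritrix.home, else the working directory. */
    static File resolveHome() {
        String home = System.getProperty("heritrix.home");
        if (home == null || home.isEmpty()) {
            // "user.dir" is the JVM's current working directory.
            home = System.getProperty("user.dir");
        }
        return new File(home);
    }

    public static void main(String[] args) {
        System.out.println("Home resolved to: " + resolveHome());
    }
}
```

Running with `java -Dheritrix.home=/some/dir HeritrixHomeSketch` would print that directory; running without the flag prints the working directory.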
getHeritrixOut | public static String getHeritrixOut()(Code) | | The file we dump stdout and stderr into. |
getHttpServer | public static SimpleHttpServer getHttpServer()(Code) | | Returns the httpServer. May be null if one was not started. |
getInstances | public static Map getInstances()(Code) | | Return all registered instances of Heritrix (rarely is there more than one). |
getJobHandler | public CrawlJobHandler getJobHandler()(Code) | | Get the job handler
The CrawlJobHandler being used. |
getJobsdir | public static File getJobsdir() throws IOException(Code) | | The directory into which we put jobs. If the system property 'heritrix.jobsdir' is set, we will use its value in place of the default 'jobs' directory in the current working directory. throws: IOException - |
getMBeanName | public ObjectName getMBeanName()(Code) | | Name under which this instance is registered in JMX (only available after JMX registration). |
getMBeanServer | public static MBeanServer getMBeanServer()(Code) | | Get MBeanServer.
Currently uses first MBeanServer found. This will definitely not be what's
always wanted. TODO: Make which server settable. Also, if none, put up
our own MBeanServer.
An MBeanServer to register with or null. |
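The "first MBeanServer found, or null" behaviour can be sketched with the standard javax.management API. MBeanServerFactory.findMBeanServer(null) returns every MBeanServer registered in the JVM; whether Heritrix uses exactly this call is an assumption, and the class name FirstMBeanServerSketch is hypothetical.

```java
import java.util.List;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;

/** Illustrative only: picks the first MBeanServer known to the factory. */
final class FirstMBeanServerSketch {
    private FirstMBeanServerSketch() {}

    /** Returns the first registered MBeanServer, or null if there is none. */
    static MBeanServer firstOrNull() {
        // A null agent id asks for all MBeanServers registered in this JVM.
        List<MBeanServer> servers = MBeanServerFactory.findMBeanServer(null);
        return servers.isEmpty() ? null : servers.get(0);
    }

    public static void main(String[] args) {
        MBeanServer server = firstOrNull();
        System.out.println(server == null
                ? "No MBeanServer found"
                : "Default domain: " + server.getDefaultDomain());
    }
}
```

Callers need to handle the null case, for example by skipping JMX registration altogether.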
getNewAlertsCount | public int getNewAlertsCount()(Code) | | |
getNoJmxName | protected String getNoJmxName()(Code) | | Name to use when no JMX agent available. |
getShutdownThread | protected static Thread getShutdownThread(boolean sysexit, int exitCode, String name)(Code) | | |
getSingleInstance | public static Heritrix getSingleInstance()(Code) | | Returns single instance or null if no instance or multiple. |
getSubDir | protected static File getSubDir(String subdirName) throws IOException(Code) | | Get and check for existence of expected subdir.
If development flag set, then look for dir under src dir.
Parameters: subdirName - Dir to look for. The extant subdir. Otherwise null if we're running in a webapp context where there is no conf directory available. throws: IOException - if unable to find expected subdir. |
getSubDir | protected static File getSubDir(String subdirName, boolean fail) throws IOException(Code) | | Get and optionally check for existence of subdir.
If development flag set, then look for dir under src dir.
Parameters: subdirName - Dir to look for. Parameters: fail - True if we are to fail if directory does not exist; false if we are to return null if the directory does not exist. The extant subdir. Otherwise null if we're running in a webapp context where there is no subdir directory available. throws: IOException - if unable to find expected subdir. |
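The effect of the fail flag can be illustrated with a small stand-alone sketch. The class SubDirSketch and the explicit home parameter are hypothetical, and the development-flag lookup under the src dir is deliberately left out.

```java
import java.io.File;
import java.io.IOException;

/** Illustrative only: look up a subdirectory, failing or returning null if absent. */
final class SubDirSketch {
    private SubDirSketch() {}

    /**
     * Returns home/subdirName if it is an existing directory. If it is missing,
     * throws IOException when fail is true and returns null when fail is false.
     */
    static File getSubDir(File home, String subdirName, boolean fail) throws IOException {
        File dir = new File(home, subdirName);
        if (dir.isDirectory()) {
            return dir;
        }
        if (fail) {
            throw new IOException("Cannot find subdir: " + dir.getAbsolutePath());
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        File cwd = new File(System.getProperty("user.dir"));
        // With fail=false a missing directory yields null rather than an exception.
        System.out.println("conf dir: " + getSubDir(cwd, "conf", false));
    }
}
```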
getVersion | public static String getVersion()(Code) | | Get the heritrix version.
The heritrix version. May be null. |
getWarsdir | public static File getWarsdir() throws IOException(Code) | | Returns the directory under which reside the WAR files we're to load into the servlet container. throws: IOException - |
isCommandLine | public static boolean isCommandLine()(Code) | | Returns true if Heritrix was launched from the command line. (When launched from command line, we do stuff like put up a web server to manage our web interface and we register ourselves with the first available jmx agent). |
isDevelopment | protected static boolean isDevelopment()(Code) | | |
isSingleInstance | public static boolean isSingleInstance()(Code) | | True if only one instance of Heritrix. |
isStarted | public boolean isStarted()(Code) | | True if heritrix has been started. |
isValidLoginPasswordString | protected static boolean isValidLoginPasswordString(String str)(Code) | | Test string is valid login/password string.
A valid login/password string has the login and password compounded
w/ a ':' delimiter.
Parameters: str - String to test. True if valid password/login string. |
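A minimal sketch of such a check follows. It assumes that both the login and the password must be non-empty and that the first ':' is the delimiter; the real method may apply different rules. The class name LoginPasswordSketch is hypothetical.

```java
/** Illustrative only: validates a "login:password" compound string. */
final class LoginPasswordSketch {
    private LoginPasswordSketch() {}

    /**
     * True if str has a non-empty login, a ':' delimiter, and a non-empty
     * password. Non-empty parts are an assumption of this sketch.
     */
    static boolean isValidLoginPasswordString(String str) {
        if (str == null) {
            return false;
        }
        int colon = str.indexOf(':');
        // colon > 0: login present; colon < length - 1: password present.
        return colon > 0 && colon < str.length() - 1;
    }

    public static void main(String[] args) {
        System.out.println(isValidLoginPasswordString("admin:letmein"));
        System.out.println(isValidLoginPasswordString("admin"));
    }
}
```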
launch | public String launch() throws Exception(Code) | | Launch the crawler for a web UI.
Crawler hangs around waiting on jobs.
A status string describing how the launch went. throws: Exception - |
launch | public String launch(String crawlOrderFile, boolean runMode) throws Exception(Code) | | Launch the crawler for a web UI.
Crawler hangs around waiting on jobs.
Parameters: crawlOrderFile - File to crawl. May be null. Parameters: runMode - Whether crawler should be set to run mode. A status string describing how the launch went. throws: Exception - |
loadProperties | protected static Properties loadProperties() throws IOException(Code) | | Load the heritrix.properties file.
Adds any property that starts with
HERITRIX_PROPERTIES_PREFIX
or ARCHIVE_PACKAGE
into system properties (except logging '.level' directives).
Loaded properties. throws: IOException - |
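The filtering rule above can be sketched as follows. The prefix values "heritrix." and "org.archive." are assumed stand-ins for HERITRIX_PROPERTIES_PREFIX and ARCHIVE_PACKAGE, whose actual values are not given here, and "logging '.level' directives" is interpreted as keys ending in ".level". To keep the example side-effect free it copies into a caller-supplied Properties object rather than into the system properties.

```java
import java.util.Properties;

/** Illustrative only: copies selected loaded properties into a target set. */
final class PropertyPromotionSketch {
    // Assumed values; the real constants are defined by Heritrix itself.
    static final String HERITRIX_PROPERTIES_PREFIX = "heritrix.";
    static final String ARCHIVE_PACKAGE = "org.archive.";

    private PropertyPromotionSketch() {}

    /**
     * Copies every property whose key starts with one of the two prefixes,
     * skipping keys that end in ".level". Returns the number copied.
     */
    static int promote(Properties loaded, Properties target) {
        int copied = 0;
        for (String key : loaded.stringPropertyNames()) {
            boolean matches = key.startsWith(HERITRIX_PROPERTIES_PREFIX)
                    || key.startsWith(ARCHIVE_PACKAGE);
            if (matches && !key.endsWith(".level")) {
                target.setProperty(key, loaded.getProperty(key));
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        Properties loaded = new Properties();
        loaded.setProperty("heritrix.jobsdir", "jobs");
        loaded.setProperty("org.archive.crawler.level", "INFO");
        Properties target = new Properties();
        System.out.println("Copied " + promote(loaded, target) + " properties");
    }
}
```

Passing System.getProperties() as the target would mirror the "into system properties" behaviour described above.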
main | public static void main(String[] args) throws Exception(Code) | | Launch program.
Optionally will launch a web server to host UI. Will also register
Heritrix MBean with first found JMX Agent (Usually the 1.5.0 JVM
Agent).
Parameters: args - Command line arguments. throws: Exception - |
patchLogging | protected static void patchLogging() throws SecurityException, IOException(Code) | | If the user hasn't altered the default logging parameters, tighten them
up somewhat: some of our libraries are way too verbose at the INFO or
WARNING levels.
This might be a problem running inside someone else's
container. Containers seem to prefer commons logging, so we
aren't interfering with them by doing the below.
throws: IOException - throws: SecurityException - |
performHeritrixShutDown | public static void performHeritrixShutDown()(Code) | | Exit program. Recommended that prepareHeritrixShutDown() be invoked
prior to this method.
|
performHeritrixShutDown | public static void performHeritrixShutDown(int exitCode)(Code) | | Exit program. Recommended that prepareHeritrixShutDown() be invoked
prior to this method.
Parameters: exitCode - Code to pass System.exit. |
postDeregister | public void postDeregister()(Code) | | |
postRegister | public void postRegister(Boolean registrationDone)(Code) | | |
prepareHeritrixShutDown | public static void prepareHeritrixShutDown()(Code) | | Prepares for program shutdown. This method does its best to prepare the
program so that it can exit normally. It will kill the httpServer and
terminate any running job.
It is advisable to wait a short time (~1000 milliseconds) after calling this method
and before calling performHeritrixShutDown() to allow as many threads as
possible to finish what they are doing.
|
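The recommended prepare, wait, perform sequence can be sketched generically. The two steps are passed in as Runnable stand-ins so the example runs without Heritrix on the classpath; the class name ShutdownSequenceSketch and the method shutDownGracefully are hypothetical.

```java
/** Illustrative only: run a prepare step, pause, then run the final step. */
final class ShutdownSequenceSketch {
    private ShutdownSequenceSketch() {}

    /**
     * Runs prepare, sleeps graceMillis so worker threads can wind down,
     * then runs perform. An interrupt shortens the wait but does not skip perform.
     */
    static void shutDownGracefully(Runnable prepare, Runnable perform, long graceMillis) {
        prepare.run();
        try {
            Thread.sleep(graceMillis);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for callers and carry on.
            Thread.currentThread().interrupt();
        }
        perform.run();
    }

    public static void main(String[] args) {
        shutDownGracefully(
                () -> System.out.println("preparing"),
                () -> System.out.println("performing"),
                1000L);
    }
}
```

With Heritrix available, the stand-ins would presumably be calls to prepareHeritrixShutDown() and performHeritrixShutDown(); note that the latter exits the program.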
registerHeritrix | protected static void registerHeritrix(Heritrix h, String name, boolean jmxregister) throws MalformedObjectNameException, InstanceAlreadyExistsException, MBeanRegistrationException, NotCompliantMBeanException(Code) | | Register Heritrix with JNDI, JMX, and with the static hashtable of all
Heritrix instances known to this JVM.
If launched from cmdline, register Heritrix MBean if there is an agent to register
ourselves with. Usually this method will only have effect if we're
running in a 1.5.0 JDK and command line options such as
'-Dcom.sun.management.jmxremote.port=8082
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false' are supplied.
See Monitoring
and Management Using JMX
for more on the command line options and how to connect to the
Heritrix bean using the JDK 1.5.0 jconsole tool. We register currently
with first server we find (TODO: Make configurable).
If we register successfully with a JMX agent, then part of the
registration will include our registering ourselves with JNDI.
Finally, add the heritrix instance to the hashtable of all the
Heritrix instances floating in the current VM. This latter registration
happens whether or not there is a JMX agent to register with. This is
a list we keep out of convenience so it's easy iterating over
all instances calling stop when main application is going down.
Parameters: h - Instance of heritrix to register. Parameters: name - Name to use for this Heritrix instance. Parameters: jmxregister - True if we are to register this instance with JMX. throws: NullPointerException - throws: MalformedObjectNameException - throws: NotCompliantMBeanException - throws: MBeanRegistrationException - throws: InstanceAlreadyExistsException - |
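The JMX half of this registration can be sketched with the standard javax.management API. Everything specific here is hypothetical: the stand-in bean CrawlerStub, its CrawlerStubMBean interface, and the ObjectName domain org.example.sketch. Heritrix itself is a DynamicMBean and builds its own names via getJmxObjectName, so this shows only the general shape of registering a bean under a name, not the actual code path.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Illustrative only: registers a stand-in bean with an MBeanServer. */
final class JmxRegistrationSketch {
    private JmxRegistrationSketch() {}

    /** Standard MBeans need a public interface named after the class plus "MBean". */
    public interface CrawlerStubMBean {
        String getStatus();
    }

    /** Hypothetical stand-in for a crawler instance. */
    public static final class CrawlerStub implements CrawlerStubMBean {
        @Override
        public String getStatus() {
            return "idle";
        }
    }

    /** Registers bean under a name in a made-up domain and returns that name. */
    static ObjectName register(MBeanServer server, Object bean, String name)
            throws JMException {
        ObjectName objectName =
                new ObjectName("org.example.sketch:type=CrawlerStub,name=" + name);
        server.registerMBean(bean, objectName);
        return objectName;
    }

    public static void main(String[] args) throws JMException {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = register(server, new CrawlerStub(), "demo");
        System.out.println("Registered: " + server.isRegistered(name));
        server.unregisterMBean(name);
    }
}
```

JMException is the common superclass of the four checked exceptions listed in the signature above, which is why the sketch declares only that one type.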
resetAuthentication | public static void resetAuthentication(String newUsername, String newPassword)(Code) | | Replace existing administrator login info with new info.
Parameters: newUsername - new administrator login username Parameters: newPassword - new administrator login password |
selftest | protected static String selftest(String oneSelfTestName, int port) throws Exception(Code) | | Run the selftest
Parameters: oneSelfTestName - Name of a test if we are to run one only rather than the default running all tests. Parameters: port - Port number to use for web UI. Status of how selftest startup went. throws: Exception - |
shutdown | public static void shutdown(int exitCode)(Code) | | Shutdown all running heritrix instances and the JVM.
Assumes stop has already been called.
Parameters: exitCode - Exit code to pass system exit. |
shutdown | public static void shutdown()(Code) | | |
start | public void start()(Code) | | Start Heritrix.
Used by JMX and webapp initialization for starting Heritrix.
Not by the cmdline launched Heritrix. Idempotent.
If start is called by JMX, then new instance of Heritrix is automatically
registered w/ JMX Agent. If started by webapp, need to register the new
Heritrix instance.
|
startCrawling | public void startCrawling()(Code) | | |
startEmbeddedWebserver | protected static String startEmbeddedWebserver(int port, boolean lho, String adminLoginPassword) throws Exception(Code) | | Start up the embedded Jetty webserver instance.
This is done when we're run from the command-line.
Parameters: port - Port number to use for web UI. Parameters: adminLoginPassword - Compound of login and password. throws: Exception - Status on webserver startup. |
startEmbeddedWebserver | protected static String startEmbeddedWebserver(Collection<String> hosts, int port, String adminLoginPassword) throws Exception(Code) | | Start up the embedded Jetty webserver instance.
This is done when we're run from the command-line.
Parameters: hosts - a list of IP addresses or hostnames to bind to, or an empty collection to bind to all available network interfaces Parameters: port - Port number to use for web UI. Parameters: adminLoginPassword - Compound of login and password. throws: Exception - Status on webserver startup. |
stop | public void stop()(Code) | | Stop Heritrix.
Used by JMX and webapp initialization for stopping Heritrix.
|
stopCrawling | public void stopCrawling()(Code) | | |
|
|