Java Documentation: WebRobot.java (JoBo Web Crawler)
Package: net.matuschek.spider


java.lang.Object
   net.matuschek.spider.WebRobot

WebRobot
public class WebRobot implements Runnable, Cloneable(Code)
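
A minimal usage sketch, based only on methods documented on this page. MyDocManager is a placeholder for any net.matuschek.http.HttpDocManager implementation; the start URL and values are illustrative assumptions:

    import net.matuschek.http.HttpDocManager;
    import net.matuschek.spider.WebRobot;

    public class CrawlExample {
        public static void main(String[] args) throws Exception {
            WebRobot robot = new WebRobot();
            robot.setStart("http://localhost/test/"); // start URL, as a String
            robot.setMaxDepth(2);                     // maximal search depth
            robot.setSleepTime(1);                    // wait 1 second between documents
            robot.setAgentName("MyRobot/1.0");        // name sent in the User-Agent header
            robot.setDocManager(new MyDocManager());  // placeholder HttpDocManager; without one, documents are simply forgotten
            robot.work();                             // travel the web and retrieve documents
        }
    }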


Field Summary
protected boolean activatedContentHistory
protected boolean activatedNewTasks
protected boolean activatedUrlHistory
protected boolean allowCaching
protected boolean allowWholeDomain
protected boolean allowWholeHost
protected Vector allowedURLs
protected HashMap content2UrlMap
long countCache
long countNoRefresh
long countRefresh
long countWeb
protected HttpDocManager docManager
protected boolean duplicateCheck
protected RobotExceptionHandler exceptionHandler
protected int expectedDocumentCount
protected long expirationAge
    Expiration age of documents in cache; documents older than expirationAge will be removed, a negative value means no limit.
protected FilterChain filters
protected boolean flexibleHostCheck
protected FormFiller formFiller
boolean hasFormHandlers
protected HttpTool httpTool
protected boolean ignoreRobotsTxt
protected int iteration
protected Category log
protected int maxDepth
protected long maxDocumentAge
protected int maxRetries
protected NoRobots robCheck
protected boolean sleep
protected int sleepTime
protected String startDir
protected String startReferer
protected long startTime
protected URL startURL
protected boolean stopIt
protected TaskList todo
protected URLCheck urlCheck
protected Vector visitMany
protected TaskList visited
protected boolean walkToOtherHosts
protected Vector wasteParameters
protected WebRobotCallback webRobotCallback

Constructor Summary
public WebRobot(int expectedDocumentCount)
public WebRobot()

Method Summary
protected void addTask(RobotTask task)
protected void addTaskAtStart(RobotTask task)
protected boolean basicURLCheck(URL currURL)
    Basic URL allow check: decides if it is allowed to walk to a new URL (see the method detail for the full list of rules).
protected void cleanUp()
public void clearCookies()
public void finish()
    This method finishes HttpTool, NoRobots, HttpDocManager.
protected void finishThreads()
public String getAgentName()
public boolean getAllowCaching()
    Gets the AllowCaching value.
public boolean getAllowWholeDomain()
    Gets the AllowWholeDomain value: true if the Robot is allowed to travel to the whole domain of the start host, false otherwise.
public boolean getAllowWholeHost()
    Gets the AllowWholeHost value: true if the Robot is allowed to travel to the whole host where it started from, false otherwise.
public Vector getAllowedURLs()
public int getBandwidth()
public String getContentVisitedURL(HttpDoc doc)
    Checks if the content was visited before and retrieves the corresponding URL.
public CookieManager getCookieManager()
public HttpDocManager getDocManager()
public boolean getEnableCookies()
public RobotExceptionHandler getExceptionHandler()
    Gets the exception handler of the robot.
public long getExpirationAge()
    Gets the expiration age of documents in cache.
public boolean getFlexibleHostCheck()
    Gets the state of flexible host checking (enabled or disabled).
public Vector getFormHandlers()
public boolean getIgnoreRobotsTxt()
public int getMaxDepth()
public long getMaxDocumentAge()
public int getMaxRetries()
protected String getMimeTypeForFilename(String filename)
    Gets the Mime type for the given filename.
public NTLMAuthorization getNtlmAuthorization()
public String getProxy()
public int getSleepTime()
public String getStart()
    Gets the start URL as a String.
public String getStartReferer()
public URL getStartURL()
public int getTimeout()
public Vector getVisitMany()
public boolean getWalkToOtherHosts()
public Vector getWasteParameters()
public WebRobotCallback getWebRobotCallback()
protected void handleMemoryError(OutOfMemoryError memoryError)
    Implements OutOfMemory handling strategies.
protected boolean isAllowed(URL u)
protected boolean isProcessingAllowed(HttpDoc doc)
public boolean isSleeping()
public static void main(String[] args)
protected byte[] readFileToByteArray(File file)
    Reads a File to a byte array.
public void registerToDoList(TaskList todo)
    Sets the implementation class for the backend task list storage.
public void registerVisitedList(TaskList visited)
    Sets the implementation class for the backend task list storage.
public static String removeParametersFromString(String urlString, Vector wasteParameters)
public URL removeWasteParameters(URL url)
    Removes wasteParameters from a URL (e.g. session IDs).
public void retrieveURL(RobotTask task)
public void run()
public void setAgentName(String name)
    Sets the agent name (sent in the User-Agent header) for this robot.
public void setAllowCaching(boolean allowCaching)
    Sets the AllowCaching status; if true, the Robot is allowed to use cached documents.
public void setAllowWholeDomain(boolean allowWholeDomain)
    Sets the AllowWholeDomain status; if true, the Robot is allowed to travel to all hosts in the same domain as the starting host.
public void setAllowWholeHost(boolean allowWholeHost)
    Sets the AllowWholeHost status; if true, the Robot is allowed to travel to the whole host where it started from.
public void setAllowedURLs(Vector allowed)
    Sets the list of allowed URLs.
public void setBandwidth(int bandwidth)
public void setContentVisitedURL(HttpDoc doc, String url)
    Makes a URL retrievable by its content by entering it in content2UrlMap.
public void setCookieManager(CookieManager cm)
    Sets the CookieManager used by the HttpTool; by default a MemoryCookieManager will be used, but you can use this method to use your own CookieManager implementation.
public void setDocManager(HttpDocManager docManager)
    Sets the document manager for this robot; without a document manager, the robot will travel through the web but won't do anything with the retrieved documents.
public void setDownloadRuleSet(DownloadRuleSet rules)
public void setEnableCookies(boolean enable)
public void setExceptionHandler(RobotExceptionHandler newExceptionHandler)
    Sets the exception handler of the robot.
public void setExpirationAge(long age)
    Sets the expiration age of documents in cache; documents older than expirationAge will be removed, a negative value means no limit.
public void setFilters(FilterChain filters)
    Sets a FilterChain.
public void setFlexibleHostCheck(boolean flexibleHostCheck)
    Defines if the host test should be more flexible.
public void setFormHandlers(Vector handlers)
public void setFromAddress(String fromAddress)
    Sets the From: HTTP header; this should be a valid email address.
public void setHttpToolCallback(HttpToolCallback callback)
public void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)
public void setMaxDepth(int maxDepth)
public void setMaxDocumentAge(long maxAge)
public void setMaxRetries(int maxRetries)
public void setNtlmAuthorization(NTLMAuthorization ntlmAuthorization)
public void setProxy(String proxyDescr)
public void setSleep(boolean sleep)
    Sets the sleep status for this robot.
public void setSleepTime(int sleepTime)
    Sets the sleep time; after every retrieved document the robot will wait this time before getting the next document.
public void setStart(String startURL)
    Sets the start URL.
public void setStartReferer(String startReferer)
    Sets the Referer setting for the first HTTP request.
public void setStartURL(URL startURL)
public void setTimeout(int timeout)
    Sets the timeout for getting data.
public void setURLCheck(URLCheck check)
public void setVisitMany(Vector visitMany)
public void setWalkToOtherHosts(boolean walkToOtherHosts)
public void setWasteParameters(Vector wasteParameters)
public void setWebRobotCallback(WebRobotCallback webRobotCallback)
public void sleepNow()
    Sleeps for sleepTime seconds.
protected synchronized void spawnThread()
    Starts subthreads for spidering.
public void stopRobot()
protected boolean taskAddAllowed(RobotTask task)
public void updateProgressInfo()
    Informs about spidering progress.
public void walkTree()
public void work()

Field Detail
activatedContentHistory
protected boolean activatedContentHistory(Code)
Are visited contents collected? (may depend on memoryLevel)



activatedNewTasks
protected boolean activatedNewTasks(Code)
Can new tasks be added? (may depend on memoryLevel)



activatedUrlHistory
protected boolean activatedUrlHistory(Code)
Are visited URLs collected? (may depend on memoryLevel)



allowCaching
protected boolean allowCaching(Code)
don't retrieve pages again that are already stored in the DocManager



allowWholeDomain
protected boolean allowWholeDomain(Code)
allow travelling to all subdomains of the start host?
See Also:   WebRobot.setAllowWholeDomain(boolean)



allowWholeHost
protected boolean allowWholeHost(Code)
allow travelling the whole host?



allowedURLs
protected Vector allowedURLs(Code)
list of allowed URLs (even if walkToOtherHosts is false)



content2UrlMap
protected HashMap content2UrlMap(Code)
remember visited content here (md5, urlString)



countCache
long countCache(Code)
counter for pages that were found in cache



countNoRefresh
long countNoRefresh(Code)
counter for pages that didn't need a refresh



countRefresh
long countRefresh(Code)
counter for refreshed pages (=cache+web)



countWeb
long countWeb(Code)
counter for pages retrieved by web



docManager
protected HttpDocManager docManager(Code)
DocManager will store or process retrieved documents



duplicateCheck
protected boolean duplicateCheck(Code)
Check for documents with the same content



exceptionHandler
protected RobotExceptionHandler exceptionHandler(Code)
the robot exception handler



expectedDocumentCount
protected int expectedDocumentCount(Code)
expected count of documents



expirationAge
protected long expirationAge(Code)
expiration age of documents in cache. Documents older than expirationAge will be removed, negative value means no limit.



filters
protected FilterChain filters(Code)
FilterChain to filter the document before storing it



flexibleHostCheck
protected boolean flexibleHostCheck(Code)
do more flexible tests if the new URL is on the same host
See Also:   WebRobot.basicURLCheck(URL)



formFiller
protected FormFiller formFiller(Code)
fill out forms



hasFormHandlers
boolean hasFormHandlers(Code)
only true if form-handlers are defined



httpTool
protected HttpTool httpTool(Code)
HttpTool will be used to retrieve documents from a web server



ignoreRobotsTxt
protected boolean ignoreRobotsTxt(Code)
ignore settings in /robots.txt?



iteration
protected int iteration(Code)
counter for calls of retrieveURL



log
protected Category log(Code)
Log4J category for logging



maxDepth
protected int maxDepth(Code)
maximal search depth



maxDocumentAge
protected long maxDocumentAge(Code)
maximum document age in seconds, negative value means no limit



maxRetries
protected int maxRetries(Code)
number of allowed retries for document retrieval



robCheck
protected NoRobots robCheck(Code)
test for robots.txt



sleep
protected boolean sleep(Code)
should the robot suspend the current walk()?



sleepTime
protected int sleepTime(Code)
sleep that number of seconds after every retrieved document



startDir
protected String startDir(Code)
the host and directory where retrieval started from



startReferer
protected String startReferer(Code)
Referer used to retrieve the first document



startTime
protected long startTime(Code)
time of WebRobot start in milliseconds



startURL
protected URL startURL(Code)
the URL where the robot walk starts from



stopIt
protected boolean stopIt(Code)
should we stop robot operation?



todo
protected TaskList todo(Code)
current tasks



urlCheck
protected URLCheck urlCheck(Code)
to check if it is allowed to travel to a given URL



visitMany
protected Vector visitMany(Code)
these URLs can be visited more than once



visited
protected TaskList visited(Code)
a list of all URLs we have already retrieved



walkToOtherHosts
protected boolean walkToOtherHosts(Code)
is it allowed to walk to other hosts than the starting host?



wasteParameters
protected Vector wasteParameters(Code)
list of wasteParameters (will be removed from URLs)



webRobotCallback
protected WebRobotCallback webRobotCallback(Code)
for callback to the user interface




Constructor Detail
WebRobot
public WebRobot(int expectedDocumentCount)(Code)
initializes the robot with the default implementation of the TaskList interface
Parameters:
  expectedDocumentCount - the expected count of documents



WebRobot
public WebRobot()(Code)
initializes the robot with the default implementation of the TaskList interface




Method Detail
addTask
protected void addTask(RobotTask task)(Code)
adds a new task to the task vector, but first does some checks (see taskAddAllowed)



addTaskAtStart
protected void addTaskAtStart(RobotTask task)(Code)
adds a new tasks at the beginning of the tasks list
See Also:   WebRobot.addTask(RobotTask)



basicURLCheck
protected boolean basicURLCheck(URL currURL)(Code)
Basic URL allow check. It is allowed to walk to a new URL if one of the following holds (a simplified sketch of these rules follows this list):
  • WalkToOtherHost is true. In this case there will be no additional tests.
  • The new URL is located below the start URL, e.g. if the start URL is http://localhost/test, the URL http://localhost/test/index.html is allowed, but http://localhost/ is not allowed.
  • AllowWholeHost is true and the new URL is located on the same host as the start URL.
  • FlexibleHostCheck is true and the host part of the current URL is equal to the host part of the start URL modulo the prefix "www.".
  • The URL starts with a string in the "AllowedURLs" list.
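
A simplified, illustrative re-implementation of these rules (not the actual JoBo code; startDir stands for the host and directory where retrieval started, as in the startDir field):

    import java.net.URL;
    import java.util.Vector;

    class URLRules {
        // Sketch of the documented allow rules, in the order listed above.
        static boolean isUrlAllowed(URL url, URL startURL, String startDir,
                                    boolean walkToOtherHosts, boolean allowWholeHost,
                                    boolean flexibleHostCheck, Vector<String> allowedURLs) {
            if (walkToOtherHosts)
                return true;                                         // no additional tests
            if ((url.getHost() + url.getPath()).startsWith(startDir))
                return true;                                         // below the start URL
            if (allowWholeHost && url.getHost().equals(startURL.getHost()))
                return true;                                         // same host as start URL
            if (flexibleHostCheck
                    && stripWww(url.getHost()).equals(stripWww(startURL.getHost())))
                return true;                                         // equal modulo "www." prefix
            for (String prefix : allowedURLs)
                if (url.toString().startsWith(prefix))
                    return true;                                     // explicitly allowed prefix
            return false;
        }

        private static String stripWww(String host) {
            return host.startsWith("www.") ? host.substring(4) : host;
        }
    }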



cleanUp
protected void cleanUp()(Code)
Clean up temporary data



clearCookies
public void clearCookies()(Code)
Delete all cookies



finish
public void finish()(Code)
This method finishes HttpTool, NoRobots, HttpDocManager.



finishThreads
protected void finishThreads()(Code)
calls webRobotDone and finishes docManager if executed in mainThread



getAgentName
public String getAgentName()(Code)
Gets the name of the "User-Agent" header that the robot will use. Returns the user agent name.



getAllowCaching
public boolean getAllowCaching()(Code)
Gets the AllowCaching value. Returns true if the Robot is allowed to cache documents in the docManager.
See Also:   WebRobot.setAllowCaching(boolean)



getAllowWholeDomain
public boolean getAllowWholeDomain()(Code)
Gets the AllowWholeDomain value. Returns true if the Robot is allowed to travel to the whole domain of the start host, false otherwise.
See Also:   WebRobot.setAllowWholeDomain(boolean)



getAllowWholeHost
public boolean getAllowWholeHost()(Code)
Gets the AllowWholeHost value. Returns true if the Robot is allowed to travel to the whole host where it started from, false otherwise. If false, it is only allowed to travel to URLs below the start URL.



getAllowedURLs
public Vector getAllowedURLs()(Code)
Gets the list of allowed URLs. Returns a Vector containing Strings.
See Also:   WebRobot.setAllowedURLs(Vector)



getBandwidth
public int getBandwidth()(Code)
Gets the bandwidth value of the used HttpTool. Returns the value of bandwidth.



getContentVisitedURL
public String getContentVisitedURL(HttpDoc doc)(Code)
Checks if the content was visited before and retrieves the corresponding URL.
Parameters:
  doc - the document whose content to look up. Returns the found URL, or null if not found.



getCookieManager
public CookieManager getCookieManager()(Code)
Gets the CookieManager used by the HttpTool. Returns the CookieManager that will be used by the HttpTool.



getDocManager
public HttpDocManager getDocManager()(Code)
the document manager of this robot
See Also:   HttpDocManager



getEnableCookies
public boolean getEnableCookies()(Code)
Gets the status of the cookie engine. Returns true if HTTP cookies are enabled, false otherwise.



getExceptionHandler
public RobotExceptionHandler getExceptionHandler()(Code)
Returns the exception handler of the robot.



getExpirationAge
public long getExpirationAge()(Code)
Gets the expiration age of documents in cache.



getFlexibleHostCheck
public boolean getFlexibleHostCheck()(Code)
Gets the state of flexible host checking (enabled or disabled). To find out if a new URL is on the same host, the robot usually compares the host part of both. Some web servers have an inconsistent addressing scheme and use the hostnames www.domain.com and domain.com. With flexible host check enabled, the robot will consider both hosts as equal. Returns true if flexible host checking is enabled.



getFormHandlers
public Vector getFormHandlers()(Code)
Returns the list of form handlers.
See Also:   net.matuschek.html.FormHandler for more information about form handlers



getIgnoreRobotsTxt
public boolean getIgnoreRobotsTxt()(Code)
Gets the setting of the IgnoreRobotsTxt property. Returns true if robots.txt will be ignored, false otherwise.



getMaxDepth
public int getMaxDepth()(Code)
the maximal allowed search depth



getMaxDocumentAge
public long getMaxDocumentAge()(Code)
Gets the maximum age of documents to retrieve. Returns the maximum document age (in seconds); a negative value means no limit.



getMaxRetries
public int getMaxRetries()(Code)
Gets the allowed retries for document retrieval. Returns maxRetries.



getMimeTypeForFilename
protected String getMimeTypeForFilename(String filename)(Code)
Gets the Mime type for the given filename.
Parameters:
  filename - the filename to look up. Returns the Mime type.



getNtlmAuthorization
public NTLMAuthorization getNtlmAuthorization()(Code)
Gets the NTLM authorization of the robot. Returns the ntlmAuthorization.



getProxy
public String getProxy()(Code)
the current proxy setting in the format host:port



getSleepTime
public int getSleepTime()(Code)
the sleeptime setting



getStart
public String getStart()(Code)
Gets the start URL as a String.



getStartReferer
public String getStartReferer()(Code)
the Referer setting for the first HTTP request



getStartURL
public URL getStartURL()(Code)
the start URL for this robot



getTimeout
public int getTimeout()(Code)
Gets the timeout for getting data (in seconds) of the used HttpTool. Returns the value of socketTimeout.
See Also:   WebRobot.setTimeout(int)



getVisitMany
public Vector getVisitMany()(Code)
Gets a vector of URLs that can be visited more than once. Returns a vector containing URLs formatted as Strings.



getWalkToOtherHosts
public boolean getWalkToOtherHosts()(Code)
Gets the WalkToOtherHosts status. Returns true if the Robot is allowed to travel to other hosts than the start host, false otherwise.



getWasteParameters
public Vector getWasteParameters()(Code)
Gets the list of wasteParameters (which will be removed from URLs). Returns a Vector containing Strings.



getWebRobotCallback
public WebRobotCallback getWebRobotCallback()(Code)



handleMemoryError
protected void handleMemoryError(OutOfMemoryError memoryError) throws OutOfMemoryError(Code)
Implements OutOfMemory handling strategies. Action depends on memoryLevel
Parameters:
  memoryError -
throws:
  OutOfMemoryError -



isAllowed
protected boolean isAllowed(URL u)(Code)
Is it allowed to travel to this new URL?
Parameters:
  u - the URL to test. Returns true if traveling to this URL is allowed, false otherwise.



isProcessingAllowed
protected boolean isProcessingAllowed(HttpDoc doc)(Code)
Is it allowed to process this document?
Parameters:
  doc - the document to check. Returns true if processing of this document is allowed.



isSleeping
public boolean isSleeping()(Code)
Is the robot sleeping?



main
public static void main(String[] args)(Code)



readFileToByteArray
protected byte[] readFileToByteArray(File file) throws IOException(Code)
Reads a File to a byte array.
Parameters:
  file - the file to read. Returns the file contents as byte[].
throws:
  IOException -



registerToDoList
public void registerToDoList(TaskList todo)(Code)
Sets the implementation class for the backend task list storage. WebRobot uses the TaskList interface to store future tasks. If you want to use your own TaskList implementation, just call this method.
Parameters:
  todo - TaskList to be used for the "to do" list



registerVisitedList
public void registerVisitedList(TaskList visited)(Code)
Sets the implementation class for the backend task list storage. WebRobot uses the TaskList interface to store URLs that have been retrieved before. If you want to use your own TaskList implementation, just call this method.
Parameters:
  visited - TaskList to be used for the list of visited URLs
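
For example (DatabaseTaskList is hypothetical; any class implementing the TaskList interface will do):

    TaskList todoList = new DatabaseTaskList();        // hypothetical TaskList implementation
    robot.registerToDoList(todoList);                  // future tasks go to this list
    robot.registerVisitedList(new DatabaseTaskList()); // visited URLs go to this list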



removeParametersFromString
public static String removeParametersFromString(String urlString, Vector wasteParameters)(Code)
Removes the passed parameters from the URL string.
Parameters:
  urlString - the URL string to clean
Parameters:
  wasteParameters - parameters to remove. Returns the cleaned URL string.
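
A usage sketch (values are illustrative; per setWasteParameters, parameters beginning with a string in the vector are removed):

    Vector wasteParameters = new Vector();
    wasteParameters.add("PHPSESSID");
    String cleaned = WebRobot.removeParametersFromString(
            "http://example.com/page?id=42&PHPSESSID=a1b2c3", wasteParameters);
    // cleaned should now be "http://example.com/page?id=42"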



removeWasteParameters
public URL removeWasteParameters(URL url)(Code)
Removes wasteParameters from a URL (e.g. session IDs).
Parameters:
  url - the URL to clean. Returns the cleaned URL.



retrieveURL
public void retrieveURL(RobotTask task)(Code)
retrieve the next URL, save it, extract all included links and add those links to the tasks list
Parameters:
  task - task to retrieve, function does nothing if this is null



run
public void run()(Code)
thread run() method, simply calls work()
See Also:   WebRobot.work()



setAgentName
public void setAgentName(String name)(Code)
sets the agent name (sent in the User-Agent header) for this robot
Parameters:
  name - a name for this robot (e.g. "Mozilla 4.0 (compatible; Robot)")



setAllowCaching
public void setAllowCaching(boolean allowCaching)(Code)
Sets the AllowCaching status
Parameters:
  allowCaching - if true, the Robot is allowed to use cached documents. That means it will first try to get the document from the docManager cache and will only retrieve it if it is not found in the cache. If the cache returns a document, the robot will NEVER retrieve it again. Therefore, expiration mechanisms have to be included in the HttpDocManager method retrieveFromCache.
See Also:   net.matuschek.http.HttpDocManager.retrieveFromCache(java.net.URL)



setAllowWholeDomain
public void setAllowWholeDomain(boolean allowWholeDomain)(Code)
Sets the AllowWholeDomain status
Parameters:
  allowWholeDomain - if true, the Robot is allowed to travel to all hosts in the same domain as the starting host. E.g. if you start at www.apache.org, it is also allowed to travel to jakarta.apache.org, xml.apache.org ...



setAllowWholeHost
public void setAllowWholeHost(boolean allowWholeHost)(Code)
sets the AllowWholeHost status
Parameters:
  allowWholeHost - if true, the Robot is allowed to travel to the whole host where it started from. Otherwise it is only allowed to travel to URLs below the start URL.



setAllowedURLs
public void setAllowedURLs(Vector allowed)(Code)
Set the list of allowed URLs
Parameters:
  allowed - a Vector containing Strings. URLs will be checked if they begin with a string in this vector.
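
For example, to admit two URL trees even when walkToOtherHosts is false (prefixes are illustrative):

    Vector allowed = new Vector();
    allowed.add("http://localhost/docs/");
    allowed.add("http://localhost/api/");
    robot.setAllowedURLs(allowed); // URLs beginning with one of these strings pass basicURLCheck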



setBandwidth
public void setBandwidth(int bandwidth)(Code)
Sets the bandwidth of the used HttpTool
Parameters:
  bandwidth - value to assign to bandwidth



setContentVisitedURL
public void setContentVisitedURL(HttpDoc doc, String url)(Code)
Makes a URL retrievable by its content by entering it in content2UrlMap.
Parameters:
  doc - the document whose content is used as the key
Parameters:
  url - the URL to associate with that content



setCookieManager
public void setCookieManager(CookieManager cm)(Code)
Sets the CookieManager used by the HttpTool By default a MemoryCookieManager will be used, but you can use this method to use your own CookieManager implementation.
Parameters:
  cm - an object that implements the CookieManager interface



setDocManager
public void setDocManager(HttpDocManager docManager)(Code)
Sets the document manager for this robot
Without a document manager, the robot will travel through the web but won't do anything with the retrieved documents (it simply forgets them). A document manager can store them, extract information or whatever you like. There can be only one document manager, but you are free to combine functionalities of available document managers in a new object (e.g. to store the document and extract meta information).
Parameters:
  docManager -



setDownloadRuleSet
public void setDownloadRuleSet(DownloadRuleSet rules)(Code)
Sets the DownloadRuleSet
Parameters:
  rules - the download rule set to use



setEnableCookies
public void setEnableCookies(boolean enable)(Code)
Enable/disable cookies
Parameters:
  enable - if true, HTTP cookies will be enabled; if false, the robot will not use cookies



setExceptionHandler
public void setExceptionHandler(RobotExceptionHandler newExceptionHandler)(Code)
Sets the exception handler of the robot.
Parameters:
  newExceptionHandler - the new exception handler



setExpirationAge
public void setExpirationAge(long age)(Code)
set expiration age of documents in cache. Documents older than expirationAge will be removed, negative value means no limit.
Parameters:
  age -



setFilters
public void setFilters(FilterChain filters)(Code)
Sets a FilterChain. If the WebRobot uses a FilterChain, it will process any retrieved document with this FilterChain before storing it.
Parameters:
  filters - a FilterChain to use for filtering HttpDocs



setFlexibleHostCheck
public void setFlexibleHostCheck(boolean flexibleHostCheck)(Code)
Defines if the host test should be more flexible. To find out if a new URL is on the same host, the robot usually compares the host part of both. Some web servers have an inconsistent addressing scheme and use the hostnames www.domain.com and domain.com. With flexible host check enabled, the robot will consider both hosts as equal.
Parameters:
  flexibleHostCheck - set this to true to enable flexible host checking (disabled by default)



setFormHandlers
public void setFormHandlers(Vector handlers)(Code)
sets the list of form handlers
See Also:   net.matuschek.html.FormHandler for more information about form handlers



setFromAddress
public void setFromAddress(String fromAddress)(Code)
sets the From: HTTP header
this should be a valid email address. It is not needed for the robot, but you should use it, because the administrator of the web server can contact you if the robot is doing things that they don't want
Parameters:
  fromAddress - an RFC 822 email address



setHttpToolCallback
public void setHttpToolCallback(HttpToolCallback callback)(Code)



setIgnoreRobotsTxt
public void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)(Code)
should we ignore the robots.txt Robot Exclusion protocol?
Parameters:
  ignoreRobotsTxt - if set to true, the robot will ignore the settings of the /robots.txt file on the web server. Know what you are doing if you change this setting.



setMaxDepth
public void setMaxDepth(int maxDepth)(Code)
sets the maximal search depth
Parameters:
  maxDepth -



setMaxDocumentAge
public void setMaxDocumentAge(long maxAge)(Code)
Set the maximum age of documents to retrieve to this number of seconds
Parameters:
  maxAge - integer value of the maximum document age (in seconds), negative value means no limit.



setMaxRetries
public void setMaxRetries(int maxRetries)(Code)
Set allowed retries for document retrieval
Parameters:
  maxRetries -



setNtlmAuthorization
public void setNtlmAuthorization(NTLMAuthorization ntlmAuthorization)(Code)
sets the NTLM authorization for this robot
Parameters:
  ntlmAuthorization - the NTLM authorization for this robot



setProxy
public void setProxy(String proxyDescr) throws HttpException(Code)
sets a proxy to use
Parameters:
  proxyDescr - the Proxy definition in the format host:port
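
For example (hypothetical proxy host; the method is declared to throw HttpException):

    robot.setProxy("proxy.example.com:8080"); // format host:port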



setSleep
public void setSleep(boolean sleep)(Code)
Sets the sleep status for this robot. If a WebRobot is set to sleep after starting run(), it will pause after retrieving the current document and wait until setSleep(false) is called.
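
A pause/resume sketch using the documented methods:

    robot.setSleep(true);                // robot pauses after the document currently being retrieved
    boolean paused = robot.isSleeping(); // check the sleep state
    robot.setSleep(false);               // resume the walk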



setSleepTime
public void setSleepTime(int sleepTime)(Code)
set the sleep time
After every retrieved document the robot will wait this time before getting the next document. This allows limiting the load on the server.
Parameters:
  sleepTime - wait time in seconds



setStart
public void setStart(String startURL)(Code)
Sets the start URL.
Parameters:
  startURL - the start URL as a String



setStartReferer
public void setStartReferer(String startReferer)(Code)
sets the Referer setting for the first HTTP request
Parameters:
  startReferer - a URL (e.g. http://www.matuschek.net)



setStartURL
public void setStartURL(URL startURL)(Code)
Sets the start URL for this robot
Parameters:
  startURL - the start URL



setTimeout
public void setTimeout(int timeout)(Code)
Sets the timeout for getting data. If HttpTool can't read data from a remote web server after this number of seconds, it will stop the download of the current file.
Parameters:
  timeout - Timeout in seconds



setURLCheck
public void setURLCheck(URLCheck check)(Code)
Sets the URLCheck for this robot
Parameters:
  check -



setVisitMany
public void setVisitMany(Vector visitMany)(Code)



setWalkToOtherHosts
public void setWalkToOtherHosts(boolean walkToOtherHosts)(Code)
sets the WalkToOtherHosts status
Parameters:
  walkToOtherHosts - true if the Robot is allowed to travel to other hosts than the start host, false otherwise



setWasteParameters
public void setWasteParameters(Vector wasteParameters)(Code)
Set the list of wasteParameters (will be removed from URLs)
Parameters:
  wasteParameters - parameters will be removed from URLs if they begin with a string in this vector



setWebRobotCallback
public void setWebRobotCallback(WebRobotCallback webRobotCallback)(Code)



sleepNow
public void sleepNow()(Code)
sleep for sleepTime seconds.



spawnThread
protected synchronized void spawnThread()(Code)
Start subThreads for spidering. WARNING: Should only be implemented and used for local spidering purposes!



stopRobot
public void stopRobot()(Code)
stop the current robot run. Note that this will not abort the current download, but stop after the current download has finished.
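
Since WebRobot implements Runnable (run() simply calls work()), the robot can be driven from its own thread and stopped from another; a sketch:

    WebRobot robot = new WebRobot();
    robot.setStart("http://localhost/test/");
    Thread crawler = new Thread(robot); // run() calls work()
    crawler.start();
    // ... later, from another thread:
    robot.stopRobot();                  // stops after the current download has finished
    crawler.join();                     // wait for the robot thread to end (join may throw InterruptedException)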



taskAddAllowed
protected boolean taskAddAllowed(RobotTask task)(Code)
Checks if a task should be added to the task list
Parameters:
  task - the task to check. Returns true if this task can be added to the task list, false otherwise.



updateProgressInfo
public void updateProgressInfo()(Code)
Inform about spidering progress. May use iteration, startTime, countCache, countWeb, countRefresh, countNoRefresh



walkTree
public void walkTree()(Code)
do your job!



work
public void work()(Code)
Do the job: travel through the web using the configured parameters and retrieve documents.



Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException(Code)
public boolean equals(Object obj)(Code)
protected void finalize() throws Throwable(Code)
final native public Class getClass()(Code)
native public int hashCode()(Code)
final native public void notify()(Code)
final native public void notifyAll()(Code)
public String toString()(Code)
final native public void wait(long timeout) throws InterruptedException(Code)
final public void wait(long timeout, int nanos) throws InterruptedException(Code)
final public void wait() throws InterruptedException(Code)
