Java Documentation: WebRobot.java (JoBo Web Crawler)
Package: net.matuschek.spider


java.lang.Object
   net.matuschek.spider.WebRobot

WebRobot
public class WebRobot implements Runnable, Cloneable(Code)
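
A minimal usage sketch, based only on methods documented on this page. MyDocManager is a placeholder for any net.matuschek.http.HttpDocManager implementation; the start URL and values are illustrative assumptions:

    import net.matuschek.http.HttpDocManager;
    import net.matuschek.spider.WebRobot;

    public class CrawlExample {
        public static void main(String[] args) throws Exception {
            WebRobot robot = new WebRobot();
            robot.setStart("http://localhost/test/"); // start URL, as a String
            robot.setMaxDepth(2);                     // maximal search depth
            robot.setSleepTime(1);                    // wait 1 second between documents
            robot.setAgentName("MyRobot/1.0");        // name sent in the User-Agent header
            robot.setDocManager(new MyDocManager());  // placeholder HttpDocManager; without one, documents are simply forgotten
            robot.work();                             // travel the web and retrieve documents
        }
    }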


Field Summary
protected boolean activatedContentHistory
protected boolean activatedNewTasks
protected boolean activatedUrlHistory
protected boolean allowCaching
protected boolean allowWholeDomain
protected boolean allowWholeHost
protected Vector allowedURLs
protected HashMap content2UrlMap
long countCache
long countNoRefresh
long countRefresh
long countWeb
protected HttpDocManager docManager
protected boolean duplicateCheck
protected RobotExceptionHandler exceptionHandler
protected int expectedDocumentCount
protected long expirationAge
    Expiration age of documents in cache; documents older than expirationAge will be removed, a negative value means no limit.
protected FilterChain filters
protected boolean flexibleHostCheck
protected FormFiller formFiller
boolean hasFormHandlers
protected HttpTool httpTool
protected boolean ignoreRobotsTxt
protected int iteration
protected Category log
protected int maxDepth
protected long maxDocumentAge
protected int maxRetries
protected NoRobots robCheck
protected boolean sleep
protected int sleepTime
protected String startDir
protected String startReferer
protected long startTime
protected URL startURL
protected boolean stopIt
protected TaskList todo
protected URLCheck urlCheck
protected Vector visitMany
protected TaskList visited
protected boolean walkToOtherHosts
protected Vector wasteParameters
protected WebRobotCallback webRobotCallback

Constructor Summary
public WebRobot(int expectedDocumentCount)
public WebRobot()

Method Summary
protected void addTask(RobotTask task)
protected void addTaskAtStart(RobotTask task)
protected boolean basicURLCheck(URL currURL)
    Basic URL allow check: decides if it is allowed to walk to a new URL (see the method detail for the full list of rules).
protected void cleanUp()
public void clearCookies()
public void finish()
    This method finishes HttpTool, NoRobots, HttpDocManager.
protected void finishThreads()
public String getAgentName()
public boolean getAllowCaching()
    Gets the AllowCaching value.
public boolean getAllowWholeDomain()
    Gets the AllowWholeDomain value: true if the Robot is allowed to travel to the whole domain of the start host, false otherwise.
public boolean getAllowWholeHost()
    Gets the AllowWholeHost value: true if the Robot is allowed to travel to the whole host where it started from, false otherwise.
public Vector getAllowedURLs()
public int getBandwidth()
public String getContentVisitedURL(HttpDoc doc)
    Checks if the content was visited before and retrieves the corresponding URL.
public CookieManager getCookieManager()
public HttpDocManager getDocManager()
public boolean getEnableCookies()
public RobotExceptionHandler getExceptionHandler()
    Gets the exception handler of the robot.
public long getExpirationAge()
    Gets the expiration age of documents in cache.
public boolean getFlexibleHostCheck()
    Gets the state of flexible host checking (enabled or disabled).
public Vector getFormHandlers()
public boolean getIgnoreRobotsTxt()
public int getMaxDepth()
public long getMaxDocumentAge()
public int getMaxRetries()
protected String getMimeTypeForFilename(String filename)
    Gets the Mime type for the given filename.
public NTLMAuthorization getNtlmAuthorization()
public String getProxy()
public int getSleepTime()
public String getStart()
    Gets the start URL as a String.
public String getStartReferer()
public URL getStartURL()
public int getTimeout()
public Vector getVisitMany()
public boolean getWalkToOtherHosts()
public Vector getWasteParameters()
public WebRobotCallback getWebRobotCallback()
protected void handleMemoryError(OutOfMemoryError memoryError)
    Implements OutOfMemory handling strategies.
protected boolean isAllowed(URL u)
protected boolean isProcessingAllowed(HttpDoc doc)
public boolean isSleeping()
public static void main(String[] args)
protected byte[] readFileToByteArray(File file)
    Reads a File to a byte array.
public void registerToDoList(TaskList todo)
    Sets the implementation class for the backend task list storage.
public void registerVisitedList(TaskList visited)
    Sets the implementation class for the backend task list storage.
public static String removeParametersFromString(String urlString, Vector wasteParameters)
public URL removeWasteParameters(URL url)
    Removes wasteParameters from a URL (e.g. session IDs).
public void retrieveURL(RobotTask task)
public void run()
public void setAgentName(String name)
    Sets the agent name (sent in the User-Agent header) for this robot.
public void setAllowCaching(boolean allowCaching)
    Sets the AllowCaching status; if true, the Robot is allowed to use cached documents.
public void setAllowWholeDomain(boolean allowWholeDomain)
    Sets the AllowWholeDomain status; if true, the Robot is allowed to travel to all hosts in the same domain as the starting host.
public void setAllowWholeHost(boolean allowWholeHost)
    Sets the AllowWholeHost status; if true, the Robot is allowed to travel to the whole host where it started from.
public void setAllowedURLs(Vector allowed)
    Sets the list of allowed URLs.
public void setBandwidth(int bandwidth)
public void setContentVisitedURL(HttpDoc doc, String url)
    Makes a URL retrievable by its content by entering it in content2UrlMap.
public void setCookieManager(CookieManager cm)
    Sets the CookieManager used by the HttpTool; by default a MemoryCookieManager will be used, but you can use this method to use your own CookieManager implementation.
public void setDocManager(HttpDocManager docManager)
    Sets the document manager for this robot; without a document manager, the robot will travel through the web but won't do anything with the retrieved documents.
public void setDownloadRuleSet(DownloadRuleSet rules)
public void setEnableCookies(boolean enable)
public void setExceptionHandler(RobotExceptionHandler newExceptionHandler)
    Sets the exception handler of the robot.
public void setExpirationAge(long age)
    Sets the expiration age of documents in cache; documents older than expirationAge will be removed, a negative value means no limit.
public void setFilters(FilterChain filters)
    Sets a FilterChain.
public void setFlexibleHostCheck(boolean flexibleHostCheck)
    Defines if the host test should be more flexible.
public void setFormHandlers(Vector handlers)
public void setFromAddress(String fromAddress)
    Sets the From: HTTP header; this should be a valid email address.
public void setHttpToolCallback(HttpToolCallback callback)
public void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)
public void setMaxDepth(int maxDepth)
public void setMaxDocumentAge(long maxAge)
public void setMaxRetries(int maxRetries)
public void setNtlmAuthorization(NTLMAuthorization ntlmAuthorization)
public void setProxy(String proxyDescr)
public void setSleep(boolean sleep)
    Sets the sleep status for this robot.
public void setSleepTime(int sleepTime)
    Sets the sleep time; after every retrieved document the robot will wait this time before getting the next document.
public void setStart(String startURL)
    Sets the start URL.
public void setStartReferer(String startReferer)
    Sets the Referer setting for the first HTTP request.
public void setStartURL(URL startURL)
public void setTimeout(int timeout)
    Sets the timeout for getting data.
public void setURLCheck(URLCheck check)
public void setVisitMany(Vector visitMany)
public void setWalkToOtherHosts(boolean walkToOtherHosts)
public void setWasteParameters(Vector wasteParameters)
public void setWebRobotCallback(WebRobotCallback webRobotCallback)
public void sleepNow()
    Sleeps for sleepTime seconds.
protected synchronized void spawnThread()
    Starts subthreads for spidering.
public void stopRobot()
protected boolean taskAddAllowed(RobotTask task)
public void updateProgressInfo()
    Informs about spidering progress.
public void walkTree()
public void work()

Field Detail
activatedContentHistory
protected boolean activatedContentHistory(Code)
Are visited contents collected? (may depend on memoryLevel)



activatedNewTasks
protected boolean activatedNewTasks(Code)
Can new tasks be added? (may depend on memoryLevel)



activatedUrlHistory
protected boolean activatedUrlHistory(Code)
Are visited URLs collected? (may depend on memoryLevel)



allowCaching
protected boolean allowCaching(Code)
don't retrieve pages again that are already stored in the DocManager



allowWholeDomain
protected boolean allowWholeDomain(Code)
allow travelling to all subdomains of the start host?
See Also:   WebRobot.setAllowWholeDomain(boolean)



allowWholeHost
protected boolean allowWholeHost(Code)
allow travelling the whole host?



allowedURLs
protected Vector allowedURLs(Code)
list of allowed URLs (even if walkToOtherHosts is false)



content2UrlMap
protected HashMap content2UrlMap(Code)
remember visited content here (md5, urlString)



countCache
long countCache(Code)
counter for pages that were found in cache



countNoRefresh
long countNoRefresh(Code)
counter for pages that didn't need a refresh



countRefresh
long countRefresh(Code)
counter for refreshed pages (=cache+web)



countWeb
long countWeb(Code)
counter for pages retrieved by web



docManager
protected HttpDocManager docManager(Code)
DocManager will store or process retrieved documents



duplicateCheck
protected boolean duplicateCheck(Code)
Check for documents with the same content



exceptionHandler
protected RobotExceptionHandler exceptionHandler(Code)
the robot exception handler



expectedDocumentCount
protected int expectedDocumentCount(Code)
expected count of documents



expirationAge
protected long expirationAge(Code)
expiration age of documents in cache. Documents older than expirationAge will be removed, negative value means no limit.



filters
protected FilterChain filters(Code)
FilterChain to filter the document before storing it



flexibleHostCheck
protected boolean flexibleHostCheck(Code)
do more flexible tests if the new URL is on the same host
See Also:   WebRobot.basicURLCheck(URL)



formFiller
protected FormFiller formFiller(Code)
fill out forms



hasFormHandlers
boolean hasFormHandlers(Code)
only true if form-handlers are defined



httpTool
protected HttpTool httpTool(Code)
HttpTool will be used to retrieve documents from a web server



ignoreRobotsTxt
protected boolean ignoreRobotsTxt(Code)
ignore settings in /robots.txt?



iteration
protected int iteration(Code)
counter for calls of retrieveURL



log
protected Category log(Code)
Log4J category for logging



maxDepth
protected int maxDepth(Code)
maximal search depth



maxDocumentAge
protected long maxDocumentAge(Code)
maximum document age in seconds, negative value means no limit



maxRetries
protected int maxRetries(Code)
number of allowed retries for document retrieval



robCheck
protected NoRobots robCheck(Code)
test for robots.txt



sleep
protected boolean sleep(Code)
should the robot suspend the current walk()?



sleepTime
protected int sleepTime(Code)
sleep that number of seconds after every retrieved document



startDir
protected String startDir(Code)
the host and directory where retrieval started from



startReferer
protected String startReferer(Code)
Referer used to retrieve the first document



startTime
protected long startTime(Code)
time of WebRobot start in milliseconds



startURL
protected URL startURL(Code)
the URL where the robot walk starts from



stopIt
protected boolean stopIt(Code)
should we stop robot operation?



todo
protected TaskList todo(Code)
current tasks



urlCheck
protected URLCheck urlCheck(Code)
to check if it is allowed to travel to a given URL



visitMany
protected Vector visitMany(Code)
these URLs can be visited more than once



visited
protected TaskList visited(Code)
a list of all URLs we have already retrieved



walkToOtherHosts
protected boolean walkToOtherHosts(Code)
is it allowed to walk to other hosts than the starting host?



wasteParameters
protected Vector wasteParameters(Code)
list of wasteParameters (will be removed from URLs)



webRobotCallback
protected WebRobotCallback webRobotCallback(Code)
for callback to the user interface




Constructor Detail
WebRobot
public WebRobot(int expectedDocumentCount)(Code)
initializes the robot with the default implementation of the TaskList interface
Parameters:
  expectedDocumentCount - the expected count of documents



WebRobot
public WebRobot()(Code)
initializes the robot with the default implementation of the TaskList interface




Method Detail
addTask
protected void addTask(RobotTask task)(Code)
adds a new task to the task vector, but first does some checks (see taskAddAllowed)



addTaskAtStart
protected void addTaskAtStart(RobotTask task)(Code)
adds a new tasks at the beginning of the tasks list
See Also:   WebRobot.addTask(RobotTask)



basicURLCheck
protected boolean basicURLCheck(URL currURL)(Code)
Basic URL allow check. It is allowed to walk to a new URL if one of the following holds (a simplified sketch of these rules follows this list):
  • WalkToOtherHost is true. In this case there will be no additional tests.
  • The new URL is located below the start URL, e.g. if the start URL is http://localhost/test, the URL http://localhost/test/index.html is allowed, but http://localhost/ is not allowed.
  • AllowWholeHost is true and the new URL is located on the same host as the start URL.
  • FlexibleHostCheck is true and the host part of the current URL is equal to the host part of the start URL modulo the prefix "www.".
  • The URL starts with a string in the "AllowedURLs" list.
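
A simplified, illustrative re-implementation of these rules (not the actual JoBo code; startDir stands for the host and directory where retrieval started, as in the startDir field):

    import java.net.URL;
    import java.util.Vector;

    class URLRules {
        // Sketch of the documented allow rules, in the order listed above.
        static boolean isUrlAllowed(URL url, URL startURL, String startDir,
                                    boolean walkToOtherHosts, boolean allowWholeHost,
                                    boolean flexibleHostCheck, Vector<String> allowedURLs) {
            if (walkToOtherHosts)
                return true;                                         // no additional tests
            if ((url.getHost() + url.getPath()).startsWith(startDir))
                return true;                                         // below the start URL
            if (allowWholeHost && url.getHost().equals(startURL.getHost()))
                return true;                                         // same host as start URL
            if (flexibleHostCheck
                    && stripWww(url.getHost()).equals(stripWww(startURL.getHost())))
                return true;                                         // equal modulo "www." prefix
            for (String prefix : allowedURLs)
                if (url.toString().startsWith(prefix))
                    return true;                                     // explicitly allowed prefix
            return false;
        }

        private static String stripWww(String host) {
            return host.startsWith("www.") ? host.substring(4) : host;
        }
    }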



cleanUp
protected void cleanUp()(Code)
Clean up temporary data



clearCookies
public void clearCookies()(Code)
Delete all cookies



finish
public void finish()(Code)
This method finishes HttpTool, NoRobots, HttpDocManager.



finishThreads
protected void finishThreads()(Code)
calls webRobotDone and finishes docManager if executed in mainThread



getAgentName
public String getAgentName()(Code)
Gets the name of the "User-Agent" header that the robot will use. Returns the user agent name.



getAllowCaching
public boolean getAllowCaching()(Code)
Gets the AllowCaching value. Returns true if the Robot is allowed to cache documents in the docManager.
See Also:   WebRobot.setAllowCaching(boolean)



getAllowWholeDomain
public boolean getAllowWholeDomain()(Code)
Gets the AllowWholeDomain value. Returns true if the Robot is allowed to travel to the whole domain of the start host, false otherwise.
See Also:   WebRobot.setAllowWholeDomain(boolean)



getAllowWholeHost
public boolean getAllowWholeHost()(Code)
Gets the AllowWholeHost value. Returns true if the Robot is allowed to travel to the whole host where it started from, false otherwise. If false, it is only allowed to travel to URLs below the start URL.



getAllowedURLs
public Vector getAllowedURLs()(Code)
Gets the list of allowed URLs. Returns a Vector containing Strings.
See Also:   WebRobot.setAllowedURLs(Vector)



getBandwidth
public int getBandwidth()(Code)
Gets the bandwidth value of the used HttpTool. Returns the value of bandwidth.



getContentVisitedURL
public String getContentVisitedURL(HttpDoc doc)(Code)
Checks if the content was visited before and retrieves the corresponding URL.
Parameters:
  doc - the document whose content to look up. Returns the found URL, or null if not found.



getCookieManager
public CookieManager getCookieManager()(Code)
Gets the CookieManager used by the HttpTool. Returns the CookieManager that will be used by the HttpTool.



getDocManager
public HttpDocManager getDocManager()(Code)
the document manager of this robot
See Also:   HttpDocManager



getEnableCookies
public boolean getEnableCookies()(Code)
Gets the status of the cookie engine. Returns true if HTTP cookies are enabled, false otherwise.



getExceptionHandler
public RobotExceptionHandler getExceptionHandler()(Code)
Returns the exception handler of the robot.



getExpirationAge
public long getExpirationAge()(Code)
Gets the expiration age of documents in cache.



getFlexibleHostCheck
public boolean getFlexibleHostCheck()(Code)
Gets the state of flexible host checking (enabled or disabled). To find out if a new URL is on the same host, the robot usually compares the host part of both. Some web servers have an inconsistent addressing scheme and use the hostnames www.domain.com and domain.com. With flexible host check enabled, the robot will consider both hosts as equal. Returns true if flexible host checking is enabled.



getFormHandlers
public Vector getFormHandlers()(Code)
Returns the list of form handlers.
See Also:   net.matuschek.html.FormHandler for more information about form handlers



getIgnoreRobotsTxt
public boolean getIgnoreRobotsTxt()(Code)
Gets the setting of the IgnoreRobotsTxt property. Returns true if robots.txt will be ignored, false otherwise.



getMaxDepth
public int getMaxDepth()(Code)
the maximal allowed search depth



getMaxDocumentAge
public long getMaxDocumentAge()(Code)
Gets the maximum age of documents to retrieve. Returns the maximum document age (in seconds); a negative value means no limit.



getMaxRetries
public int getMaxRetries()(Code)
Gets the allowed retries for document retrieval. Returns maxRetries.



getMimeTypeForFilename
protected String getMimeTypeForFilename(String filename)(Code)
Gets the Mime type for the given filename.
Parameters:
  filename - the filename to look up. Returns the Mime type.



getNtlmAuthorization
public NTLMAuthorization getNtlmAuthorization()(Code)
Gets the NTLM authorization of the robot. Returns the ntlmAuthorization.



getProxy
public String getProxy()(Code)
the current proxy setting in the format host:port



getSleepTime
public int getSleepTime()(Code)
the sleeptime setting



getStart
public String getStart()(Code)
Gets the start URL as a String.



getStartReferer
public String getStartReferer()(Code)
the Referer setting for the first HTTP request



getStartURL
public URL getStartURL()(Code)
the start URL for this robot



getTimeout
public int getTimeout()(Code)
Gets the timeout for getting data (in seconds) of the used HttpTool. Returns the value of socketTimeout.
See Also:   WebRobot.setTimeout(int)



getVisitMany
public Vector getVisitMany()(Code)
Gets a vector of URLs that can be visited more than once. Returns a vector containing URLs formatted as Strings.



getWalkToOtherHosts
public boolean getWalkToOtherHosts()(Code)
Gets the WalkToOtherHosts status. Returns true if the Robot is allowed to travel to other hosts than the start host, false otherwise.



getWasteParameters
public Vector getWasteParameters()(Code)
Gets the list of wasteParameters (which will be removed from URLs). Returns a Vector containing Strings.



getWebRobotCallback
public WebRobotCallback getWebRobotCallback()(Code)



handleMemoryError
protected void handleMemoryError(OutOfMemoryError memoryError) throws OutOfMemoryError(Code)
Implements OutOfMemory handling strategies. Action depends on memoryLevel
Parameters:
  memoryError -
throws:
  OutOfMemoryError -



isAllowed
protected boolean isAllowed(URL u)(Code)
Is it allowed to travel to this new URL?
Parameters:
  u - the URL to test. Returns true if traveling to this URL is allowed, false otherwise.



isProcessingAllowed
protected boolean isProcessingAllowed(HttpDoc doc)(Code)
Is it allowed to process this document?
Parameters:
  doc - the document to check. Returns true if processing of this document is allowed.



isSleeping
public boolean isSleeping()(Code)
Is the robot sleeping?



main
public static void main(String[] args)(Code)



readFileToByteArray
protected byte[] readFileToByteArray(File file) throws IOException(Code)
Reads a File to a byte array.
Parameters:
  file - the file to read. Returns the file contents as byte[].
throws:
  IOException -



registerToDoList
public void registerToDoList(TaskList todo)(Code)
Sets the implementation class for the backend task list storage. WebRobot uses the TaskList interface to store future tasks. If you want to use your own TaskList implementation, just call this method.
Parameters:
  todo - TaskList to be used for the "to do" list



registerVisitedList
public void registerVisitedList(TaskList visited)(Code)
Sets the implementation class for the backend task list storage. WebRobot uses the TaskList interface to store URLs that have been retrieved before. If you want to use your own TaskList implementation, just call this method.
Parameters:
  visited - TaskList to be used for the list of visited URLs
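
For example (DatabaseTaskList is hypothetical; any class implementing the TaskList interface will do):

    TaskList todoList = new DatabaseTaskList();        // hypothetical TaskList implementation
    robot.registerToDoList(todoList);                  // future tasks go to this list
    robot.registerVisitedList(new DatabaseTaskList()); // visited URLs go to this list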



removeParametersFromString
public static String removeParametersFromString(String urlString, Vector wasteParameters)(Code)
Removes the passed parameters from the URL string.
Parameters:
  urlString - the URL string to clean
Parameters:
  wasteParameters - parameters to remove. Returns the cleaned URL string.
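
A usage sketch (values are illustrative; per setWasteParameters, parameters beginning with a string in the vector are removed):

    Vector wasteParameters = new Vector();
    wasteParameters.add("PHPSESSID");
    String cleaned = WebRobot.removeParametersFromString(
            "http://example.com/page?id=42&PHPSESSID=a1b2c3", wasteParameters);
    // cleaned should now be "http://example.com/page?id=42"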



removeWasteParameters
public URL removeWasteParameters(URL url)(Code)
Removes wasteParameters from a URL (e.g. session IDs).
Parameters:
  url - the URL to clean. Returns the cleaned URL.



retrieveURL
public void retrieveURL(RobotTask task)(Code)
retrieve the next URL, save it, extract all included links and add those links to the tasks list
Parameters:
  task - task to retrieve, function does nothing if this is null



run
public void run()(Code)
thread run() method, simply calls work()
See Also:   WebRobot.work()



setAgentName
public void setAgentName(String name)(Code)
sets the agent name (sent in the User-Agent header) for this robot
Parameters:
  name - a name for this robot (e.g. "Mozilla 4.0 (compatible; Robot)")



setAllowCaching
public void setAllowCaching(boolean allowCaching)(Code)
Sets the AllowCaching status
Parameters:
  allowCaching - if true, the Robot is allowed to use cached documents. That means it will first try to get the document from the docManager cache and will only retrieve it if it is not found in the cache. If the cache returns a document, the robot will NEVER retrieve it again. Therefore, expiration mechanisms have to be included in the HttpDocManager method retrieveFromCache.
See Also:   net.matuschek.http.HttpDocManager.retrieveFromCache(java.net.URL)



setAllowWholeDomain
public void setAllowWholeDomain(boolean allowWholeDomain)(Code)
Sets the AllowWholeDomain status
Parameters:
  allowWholeDomain - if true, the Robot is allowed to travel to all hosts in the same domain as the starting host. E.g. if you start at www.apache.org, it is also allowed to travel to jakarta.apache.org, xml.apache.org ...



setAllowWholeHost
public void setAllowWholeHost(boolean allowWholeHost)(Code)
sets the AllowWholeHost status
Parameters:
  allowWholeHost - if true, the Robot is allowed to travel to the whole host where it started from. Otherwise it is only allowed to travel to URLs below the start URL.



setAllowedURLs
public void setAllowedURLs(Vector allowed)(Code)
Set the list of allowed URLs
Parameters:
  allowed - a Vector containing Strings. URLs will be checked if they begin with a string in this vector.
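
For example, to admit two URL trees even when walkToOtherHosts is false (prefixes are illustrative):

    Vector allowed = new Vector();
    allowed.add("http://localhost/docs/");
    allowed.add("http://localhost/api/");
    robot.setAllowedURLs(allowed); // URLs beginning with one of these strings pass basicURLCheck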



setBandwidth
public void setBandwidth(int bandwidth)(Code)
Sets the bandwidth of the used HttpTool
Parameters:
  bandwidth - value to assign to bandwidth



setContentVisitedURL
public void setContentVisitedURL(HttpDoc doc, String url)(Code)
Makes a URL retrievable by its content by entering it in content2UrlMap.
Parameters:
  doc - the document whose content is used as the key
Parameters:
  url - the URL to associate with that content



setCookieManager
public void setCookieManager(CookieManager cm)(Code)
Sets the CookieManager used by the HttpTool By default a MemoryCookieManager will be used, but you can use this method to use your own CookieManager implementation.
Parameters:
  cm - an object that implements the CookieManager interface



setDocManager
public void setDocManager(HttpDocManager docManager)(Code)
Sets the document manager for this robot
Without a document manager, the robot will travel through the web but won't do anything with the retrieved documents (it simply forgets them). A document manager can store them, extract information or whatever you like. There can be only one document manager, but you are free to combine functionalities of available document managers in a new object (e.g. to store the document and extract meta information).
Parameters:
  docManager -



setDownloadRuleSet
public void setDownloadRuleSet(DownloadRuleSet rules)(Code)
Sets the DownloadRuleSet
Parameters:
  rules - the download rule set to use



setEnableCookies
public void setEnableCookies(boolean enable)(Code)
Enable/disable cookies
Parameters:
  enable - if true, HTTP cookies will be enabled; if false, the robot will not use cookies



setExceptionHandler
public void setExceptionHandler(RobotExceptionHandler newExceptionHandler)(Code)
Sets the exception handler of the robot.
Parameters:
  newExceptionHandler - the new exception handler



setExpirationAge
public void setExpirationAge(long age)(Code)
set expiration age of documents in cache. Documents older than expirationAge will be removed, negative value means no limit.
Parameters:
  age -



setFilters
public void setFilters(FilterChain filters)(Code)
Sets a FilterChain. If the WebRobot uses a FilterChain, it will process any retrieved document with this FilterChain before storing it.
Parameters:
  filters - a FilterChain to use for filtering HttpDocs



setFlexibleHostCheck
public void setFlexibleHostCheck(boolean flexibleHostCheck)(Code)
Defines if the host test should be more flexible. To find out if a new URL is on the same host, the robot usually compares the host part of both. Some web servers have an inconsistent addressing scheme and use the hostnames www.domain.com and domain.com. With flexible host check enabled, the robot will consider both hosts as equal.
Parameters:
  flexibleHostCheck - set this to true to enable flexible host checking (disabled by default)



setFormHandlers
public void setFormHandlers(Vector handlers)(Code)
sets the list of form handlers
See Also:   net.matuschek.html.FormHandler for more information about form handlers



setFromAddress
public void setFromAddress(String fromAddress)(Code)
sets the From: HTTP header
this should be a valid email address. It is not needed for the robot, but you should use it, because the administrator of the web server can contact you if the robot is doing things that they don't want
Parameters:
  fromAddress - an RFC 822 email address



setHttpToolCallback
public void setHttpToolCallback(HttpToolCallback callback)(Code)



setIgnoreRobotsTxt
public void setIgnoreRobotsTxt(boolean ignoreRobotsTxt)(Code)
should we ignore the robots.txt Robot Exclusion protocol?
Parameters:
  ignoreRobotsTxt - if set to true, the robot will ignore the settings of the /robots.txt file on the web server. Know what you are doing if you change this setting.



setMaxDepth
public void setMaxDepth(int maxDepth)(Code)
sets the maximal search depth
Parameters:
  maxDepth -



setMaxDocumentAge
public void setMaxDocumentAge(long maxAge)(Code)
Set the maximum age of documents to retrieve to this number of seconds
Parameters:
  maxAge - integer value of the maximum document age (in seconds), negative value means no limit.



setMaxRetries
public void setMaxRetries(int maxRetries)(Code)
Set allowed retries for document retrieval
Parameters:
  maxRetries -



setNtlmAuthorization
public void setNtlmAuthorization(NTLMAuthorization ntlmAuthorization)(Code)
sets the NTLM authorization for this robot
Parameters:
  ntlmAuthorization - the NTLM authorization for this robot



setProxy
public void setProxy(String proxyDescr) throws HttpException(Code)
sets a proxy to use
Parameters:
  proxyDescr - the Proxy definition in the format host:port
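
For example (hypothetical proxy host; the method is declared to throw HttpException):

    robot.setProxy("proxy.example.com:8080"); // format host:port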



setSleep
public void setSleep(boolean sleep)(Code)
Sets the sleep status for this robot. If a WebRobot is set to sleep after starting run(), it will pause after retrieving the current document and wait until setSleep(false) is called.
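
A pause/resume sketch using the documented methods:

    robot.setSleep(true);                // robot pauses after the document currently being retrieved
    boolean paused = robot.isSleeping(); // check the sleep state
    robot.setSleep(false);               // resume the walk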



setSleepTime
public void setSleepTime(int sleepTime)(Code)
set the sleep time
After every retrieved document the robot will wait this time before getting the next document. This allows limiting the load on the server.
Parameters:
  sleepTime - wait time in seconds



setStart
public void setStart(String startURL)(Code)
Sets the start URL.
Parameters:
  startURL - the start URL as a String



setStartReferer
public void setStartReferer(String startReferer)(Code)
sets the Referer setting for the first HTTP request
Parameters:
  startReferer - a URL (e.g. http://www.matuschek.net)



setStartURL
public void setStartURL(URL startURL)(Code)
Sets the start URL for this robot
Parameters:
  startURL - the start URL



setTimeout
public void setTimeout(int timeout)(Code)
Sets the timeout for getting data. If HttpTool can't read data from a remote web server after this number of seconds, it will stop the download of the current file.
Parameters:
  timeout - Timeout in seconds



setURLCheck
public void setURLCheck(URLCheck check)(Code)
Sets the URLCheck for this robot
Parameters:
  check -



setVisitMany
public void setVisitMany(Vector visitMany)(Code)



setWalkToOtherHosts
public void setWalkToOtherHosts(boolean walkToOtherHosts)(Code)
sets the WalkToOtherHosts status
Parameters:
  walkToOtherHosts - true if the Robot is allowed to travel to other hosts than the start host, false otherwise



setWasteParameters
public void setWasteParameters(Vector wasteParameters)(Code)
Set the list of wasteParameters (will be removed from URLs)
Parameters:
  wasteParameters - parameters will be removed from URLs if they begin with a string in this vector



setWebRobotCallback
public void setWebRobotCallback(WebRobotCallback webRobotCallback)(Code)



sleepNow
public void sleepNow()(Code)
sleep for sleepTime seconds.



spawnThread
protected synchronized void spawnThread()(Code)
Start subThreads for spidering. WARNING: Should only be implemented and used for local spidering purposes!



stopRobot
public void stopRobot()(Code)
stop the current robot run. Note that this will not abort the current download, but stop after the current download has finished.
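
Since WebRobot implements Runnable (run() simply calls work()), the robot can be driven from its own thread and stopped from another; a sketch:

    WebRobot robot = new WebRobot();
    robot.setStart("http://localhost/test/");
    Thread crawler = new Thread(robot); // run() calls work()
    crawler.start();
    // ... later, from another thread:
    robot.stopRobot();                  // stops after the current download has finished
    crawler.join();                     // wait for the robot thread to end (join may throw InterruptedException)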



taskAddAllowed
protected boolean taskAddAllowed(RobotTask task)(Code)
Checks if a task should be added to the task list
Parameters:
  task - the task to check. Returns true if this task can be added to the task list, false otherwise.



updateProgressInfo
public void updateProgressInfo()(Code)
Inform about spidering progress. May use iteration, startTime, countCache, countWeb, countRefresh, countNoRefresh



walkTree
public void walkTree()(Code)
do your job!



work
public void work()(Code)
Do the job: travel through the web using the configured parameters and retrieve documents.



Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException(Code)
public boolean equals(Object obj)(Code)
protected void finalize() throws Throwable(Code)
final native public Class getClass()(Code)
native public int hashCode()(Code)
final native public void notify()(Code)
final native public void notifyAll()(Code)
public String toString()(Code)
final native public void wait(long timeout) throws InterruptedException(Code)
final public void wait(long timeout, int nanos) throws InterruptedException(Code)
final public void wait() throws InterruptedException(Code)
