Java Doc for AdaptiveRevisitFrontier.java in heritrix, package org.archive.crawler.frontier



org.archive.crawler.settings.ModuleType
   org.archive.crawler.frontier.AdaptiveRevisitFrontier

AdaptiveRevisitFrontier
public class AdaptiveRevisitFrontier extends ModuleType implements Frontier, FetchStatusCodes, CoreAttributeConstants, AdaptiveRevisitAttributeConstants, CrawlStatusListener, HasUriReceiver
A Frontier that will repeatedly visit all encountered URIs.

Wait time between visits is configurable and varies based on observed changes of documents.

The Frontier borrows many things from HostQueuesFrontier, but implements an entirely different strategy in issuing URIs and consequently in keeping a record of discovered URIs.
author:
   Kristinn Sigurdsson
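
The idea of a wait time that varies with observed document changes can be illustrated with a small sketch. This is not the Heritrix algorithm: the class name, the halve-on-change and grow-by-half-on-no-change rules, and the millisecond units are all assumptions chosen only to show the general shape of an adaptive revisit interval.

```java
/** Illustrative only; not taken from the Heritrix source. */
final class AdaptiveWaitSketch {

    /**
     * Returns the next wait interval in milliseconds.
     * Assumed policy: halve the interval when the document changed,
     * grow it by 50% when it did not, then clamp to [minMs, maxMs].
     */
    static long nextWaitMs(long currentMs, boolean changed, long minMs, long maxMs) {
        long next = changed ? currentMs / 2 : currentMs + currentMs / 2;
        return Math.max(minMs, Math.min(maxMs, next));
    }
}
```

With these assumed rules, a document that keeps changing is revisited more and more often until the minimum is reached, while a static document drifts toward the maximum interval.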



Field Summary
final protected static String ACCEPTABLE_FORCE_QUEUE
     Acceptable characters in forced queue names.
final public static String ATTR_DELAY_FACTOR
final public static String ATTR_FORCE_QUEUE
     Queue assignment to force on CrawlURIs.
final public static String ATTR_HOST_VALENCE
final public static String ATTR_MAX_DELAY
final public static String ATTR_MAX_RETRIES
final public static String ATTR_MIN_DELAY
final public static String ATTR_PREFERENCE_EMBED_HOPS
final public static String ATTR_QUEUE_IGNORE_WWW
     Should the queue assignment ignore www in hostnames, effectively stripping them away.
final public static String ATTR_RETRY_DELAY
final public static String ATTR_USE_URI_UNIQ_FILTER
     Should the Frontier use a separate 'already included' data structure or rely on the queues'.
final protected static String DEFAULT_FORCE_QUEUE
final protected static Boolean DEFAULT_QUEUE_IGNORE_WWW
final protected static Boolean DEFAULT_USE_URI_UNIQ_FILTER
    

Constructor Summary
public  AdaptiveRevisitFrontier(String name)
    
public  AdaptiveRevisitFrontier(String name, String description)
    

Method Summary
public long averageDepth()
protected void batchFlush()
protected void batchSchedule(CandidateURI caUri)
protected long calculateSnoozeTime(CrawlURI curi)
     Calculates how long a host queue needs to be snoozed following the crawling of a URI.
protected String canonicalize(UURI uuri)
     Canonicalize passed uuri.
protected String canonicalize(CandidateURI cauri)
     Canonicalize passed CandidateURI.
public float congestionRatio()
public void considerIncluded(UURI u)
public void crawlCheckpoint(File checkpointDir)
public void crawlEnded(String sExitMessage)
public void crawlEnding(String sExitMessage)
public void crawlPaused(String statusMessage)
public void crawlPausing(String statusMessage)
public void crawlResuming(String statusMessage)
public void crawlStarted(String message)
protected UriUniqFilter createAlreadyIncluded()
     Create a UriUniqFilter that will serve as record of already seen URIs.
public long deepestUri()
public synchronized long deleteURIs(String match)
public synchronized void deleted(CrawlURI curi)
public synchronized long discoveredUriCount()
protected void disregardDisposition(CrawlURI curi)
public long disregardedUriCount()
public long failedFetchCount()
protected void failureDisposition(CrawlURI curi)
     The CrawlURI has encountered a problem, and will not be retried.
public synchronized void finished(CrawlURI curi)
public long finishedUriCount()
public String getClassKey(CandidateURI cauri)
public FrontierJournal getFrontierJournal()
public FrontierGroup getGroup(CrawlURI curi)
protected AdaptiveRevisitHostQueue getHQ(CrawlURI curi)
     Get the AdaptiveRevisitHostQueue for the given CrawlURI, creating it if necessary.
public synchronized FrontierMarker getInitialMarker(String regexpr, boolean inCacheOnly)
public String[] getReports()
protected CrawlServer getServer(CrawlURI curi)
public synchronized ArrayList getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose)
public void importRecoverLog(String pathToLog)
     This method is not supported by this Frontier implementation.
public void importRecoverLog(String pathToLog, boolean retainFailures)
public synchronized void initialize(CrawlController c)
protected synchronized void innerFinished(CrawlURI curi)
protected void innerSchedule(CandidateURI caUri)
protected boolean isDisregarded(CrawlURI curi)
public boolean isEmpty()
public void kickUpdate()
public void loadSeeds()
protected boolean needsPromptRetry(CrawlURI curi)
protected boolean needsRetrying(CrawlURI curi)
public synchronized CrawlURI next()
public synchronized void pause()
public synchronized long queuedUriCount()
public void receive(CandidateURI item)
public void reportTo(PrintWriter writer)
public synchronized void reportTo(String name, PrintWriter writer)
protected void reschedule(CrawlURI curi, boolean errorWait)
     Put near top of relevant hostQueue (but behind anything recently scheduled 'high').
public void schedule(CandidateURI caURI)
protected boolean shouldBeForgotten(CrawlURI curi)
     Some URIs, if they recur, deserve another chance at consideration: they might not be too many hops away via another path, or the scope may have been updated to allow them passage.
public String singleLineLegend()
public String singleLineReport()
public synchronized void singleLineReportTo(PrintWriter w)
public void start()
public long succeededFetchCount()
protected void successDisposition(CrawlURI curi)
     The CrawlURI has been successfully crawled.
public synchronized void terminate()
public long totalBytesWritten()
public synchronized void unpause()
    

Field Detail
ACCEPTABLE_FORCE_QUEUE
final protected static String ACCEPTABLE_FORCE_QUEUE
Acceptable characters in forced queue names. Word chars, dash, period, comma, colon
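
The character classes listed above map naturally onto a regular expression. The sketch below is a guess at such a pattern built only from that description; the class name, the method name, and the exact pattern string are assumptions rather than values copied from the Heritrix source.

```java
import java.util.regex.Pattern;

/** Illustrative only; the pattern is reconstructed from the field description. */
final class ForceQueueNameSketch {

    // Word characters, dash, period, comma, colon (assumed pattern).
    static final Pattern ACCEPTABLE = Pattern.compile("[-\\w.,:]*");

    /** Returns true when every character of the name is in the accepted set. */
    static boolean isAcceptable(String name) {
        return ACCEPTABLE.matcher(name).matches();
    }
}
```

Under this assumed pattern a name such as `news.example.org:8080` is accepted, while a name containing a space or a slash is rejected.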



ATTR_DELAY_FACTOR
final public static String ATTR_DELAY_FACTOR
How many multiples of last fetch elapsed time to wait before recontacting same server

ATTR_FORCE_QUEUE
final public static String ATTR_FORCE_QUEUE
Queue assignment to force on CrawlURIs. Intended to be used via overrides

ATTR_HOST_VALENCE
final public static String ATTR_HOST_VALENCE
Maximum simultaneous requests in process to a host (queue)

ATTR_MAX_DELAY
final public static String ATTR_MAX_DELAY
Never wait more than this long, regardless of multiple

ATTR_MAX_RETRIES
final public static String ATTR_MAX_RETRIES
Maximum times to emit a CrawlURI without final disposition

ATTR_MIN_DELAY
final public static String ATTR_MIN_DELAY
Always wait this long after one completion before recontacting same server, regardless of multiple
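
Read together, the delay factor, minimum delay, and maximum delay settings describe a scaled wait that is clamped between two bounds. A minimal sketch of that arithmetic follows; the class and method names, the parameter types, and the millisecond units are assumptions, and the sketch presumes the minimum does not exceed the maximum.

```java
/** Illustrative only; not the Heritrix snooze calculation. */
final class PolitenessDelaySketch {

    /**
     * Scales the last fetch duration by the delay factor, then clamps
     * the result to [minDelayMs, maxDelayMs]. Units are assumed.
     */
    static long snoozeMs(long lastFetchDurationMs, double delayFactor,
                         long minDelayMs, long maxDelayMs) {
        long scaled = Math.round(lastFetchDurationMs * delayFactor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }
}
```

For example, with a factor of 5, a minimum of 500 ms and a maximum of 5000 ms, a 400 ms fetch gives a 2000 ms wait, a 50 ms fetch is raised to 500 ms, and a 3000 ms fetch is capped at 5000 ms.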



ATTR_PREFERENCE_EMBED_HOPS
final public static String ATTR_PREFERENCE_EMBED_HOPS
Number of hops of embeds (ERX) to bump to front of host queue

ATTR_QUEUE_IGNORE_WWW
final public static String ATTR_QUEUE_IGNORE_WWW
Should the queue assignment ignore www in hostnames, effectively stripping them away.
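
The effect of ignoring www during queue assignment can be shown with a short sketch. The class and method names are hypothetical, and lowercasing the hostname is an extra assumption that the description does not mention.

```java
import java.util.Locale;

/** Illustrative only; not the Heritrix queue assignment code. */
final class QueueKeySketch {

    /**
     * Derives a queue key from a hostname. When ignoreWww is true, a
     * leading "www." is stripped so both host forms share one queue.
     */
    static String queueKey(String host, boolean ignoreWww) {
        String key = host.toLowerCase(Locale.ROOT); // assumption: case-insensitive keys
        if (ignoreWww && key.startsWith("www.")) {
            key = key.substring("www.".length());
        }
        return key;
    }
}
```

With the option on, `www.archive.org` and `archive.org` map to the same key, so URIs from both forms would be held to the same politeness limits.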



ATTR_RETRY_DELAY
final public static String ATTR_RETRY_DELAY
For retryable problems, seconds to wait before a retry

ATTR_USE_URI_UNIQ_FILTER
final public static String ATTR_USE_URI_UNIQ_FILTER
Should the Frontier use a separate 'already included' data structure or rely on the queues'.

DEFAULT_FORCE_QUEUE
final protected static String DEFAULT_FORCE_QUEUE

DEFAULT_QUEUE_IGNORE_WWW
final protected static Boolean DEFAULT_QUEUE_IGNORE_WWW

DEFAULT_USE_URI_UNIQ_FILTER
final protected static Boolean DEFAULT_USE_URI_UNIQ_FILTER




Constructor Detail
AdaptiveRevisitFrontier
public AdaptiveRevisitFrontier(String name)

AdaptiveRevisitFrontier
public AdaptiveRevisitFrontier(String name, String description)




Method Detail
averageDepth
public long averageDepth()

batchFlush
protected void batchFlush()

batchSchedule
protected void batchSchedule(CandidateURI caUri)

calculateSnoozeTime
protected long calculateSnoozeTime(CrawlURI curi)
Calculates how long a host queue needs to be snoozed following the crawling of a URI.
Parameters:
  curi - The CrawlURI
Returns:
  How long to snooze.

canonicalize
protected String canonicalize(UURI uuri)
Canonicalize passed uuri. It would be sweeter if this canonicalize function was encapsulated by that which it canonicalizes, but because settings change with context -- i.e. there may be overrides in operation for a particular URI -- it's not so easy; each CandidateURI would need a reference to the settings system. That's awkward to pass in.
Parameters:
  uuri - Candidate URI to canonicalize.
Returns:
  Canonicalized version of passed uuri.

canonicalize
protected String canonicalize(CandidateURI cauri)
Canonicalize passed CandidateURI. This method differs from AdaptiveRevisitFrontier.canonicalize(UURI) in that it takes a look at the CandidateURI context, possibly overriding any canonicalization effect if it could make us miss content. If canonicalization produces a URL that was 'alreadyseen', but the entry in the 'alreadyseen' database did nothing but redirect to the current URL, we won't get the current URL; we'll think we've already seen it. Examples would be archive.org redirecting to www.archive.org or the inverse, www.netarkivet.net redirecting to netarkivet.net (assuming stripWWW rule enabled).

Note, this method under certain circumstances sets the forceFetch flag.
Parameters:
  cauri - CandidateURI to examine.
Returns:
  Canonicalized cauri.
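
The redirect case described for canonicalize(CandidateURI) can be sketched as a check on whether a redirect target collapses to the same canonical form as the URL that redirected to it. Everything here is hypothetical: the class and method names, the string-based signature, and the single strip-www rule standing in for whatever canonicalization rules are configured.

```java
/** Illustrative only; not the Heritrix canonicalization logic. */
final class RedirectAwareSketch {

    // Stand-in canonicalization rule: strip a leading "www." after the scheme.
    static String canonicalize(String url) {
        return url.replaceFirst("^(https?://)www\\.", "$1");
    }

    /**
     * Returns true when the candidate was reached by a redirect from a URL
     * with the same canonical form, so an 'already seen' check keyed on the
     * canonical form would wrongly skip it.
     */
    static boolean shouldForceFetch(String candidateUrl, String referringUrl,
                                    boolean reachedByRedirect) {
        return reachedByRedirect
                && referringUrl != null
                && canonicalize(candidateUrl).equals(canonicalize(referringUrl));
    }
}
```

In the archive.org example, a redirect from `http://archive.org/` to `http://www.archive.org/` yields identical canonical forms, so the sketch reports that the target should be fetched anyway.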




congestionRatio
public float congestionRatio()

considerIncluded
public void considerIncluded(UURI u)

crawlCheckpoint
public void crawlCheckpoint(File checkpointDir) throws Exception

crawlEnded
public void crawlEnded(String sExitMessage)

crawlEnding
public void crawlEnding(String sExitMessage)

crawlPaused
public void crawlPaused(String statusMessage)

crawlPausing
public void crawlPausing(String statusMessage)

crawlResuming
public void crawlResuming(String statusMessage)

crawlStarted
public void crawlStarted(String message)

createAlreadyIncluded
protected UriUniqFilter createAlreadyIncluded() throws IOException
Create a UriUniqFilter that will serve as record of already seen URIs.
Returns:
  A UURISet that will serve as a record of already seen URIs
throws:
  IOException -

deepestUri
public long deepestUri()

deleteURIs
public synchronized long deleteURIs(String match)

deleted
public synchronized void deleted(CrawlURI curi)

discoveredUriCount
public synchronized long discoveredUriCount()

disregardDisposition
protected void disregardDisposition(CrawlURI curi)

disregardedUriCount
public long disregardedUriCount()

failedFetchCount
public long failedFetchCount()

failureDisposition
protected void failureDisposition(CrawlURI curi)
The CrawlURI has encountered a problem, and will not be retried.
Parameters:
  curi - The CrawlURI

finished
public synchronized void finished(CrawlURI curi)

finishedUriCount
public long finishedUriCount()

getClassKey
public String getClassKey(CandidateURI cauri)

getFrontierJournal
public FrontierJournal getFrontierJournal()

getGroup
public FrontierGroup getGroup(CrawlURI curi)

getHQ
protected AdaptiveRevisitHostQueue getHQ(CrawlURI curi) throws IOException
Get the AdaptiveRevisitHostQueue for the given CrawlURI, creating it if necessary.
Parameters:
  curi - CrawlURI for which to get a queue
Returns:
  AdaptiveRevisitHostQueue for given CrawlURI
throws:
  IOException -

getInitialMarker
public synchronized FrontierMarker getInitialMarker(String regexpr, boolean inCacheOnly)

getReports
public String[] getReports()

getServer
protected CrawlServer getServer(CrawlURI curi)
Parameters:
  curi -
Returns:
  the CrawlServer to be associated with this CrawlURI

getURIsList
public synchronized ArrayList getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) throws InvalidFrontierMarkerException

importRecoverLog
public void importRecoverLog(String pathToLog) throws IOException
This method is not supported by this Frontier implementation.
Parameters:
  pathToLog -
throws:
  IOException -

importRecoverLog
public void importRecoverLog(String pathToLog, boolean retainFailures) throws IOException
This method is not supported by this Frontier implementation.
Parameters:
  pathToLog -
  retainFailures -
throws:
  IOException -

initialize
public synchronized void initialize(CrawlController c) throws FatalConfigurationException, IOException

innerFinished
protected synchronized void innerFinished(CrawlURI curi)

innerSchedule
protected void innerSchedule(CandidateURI caUri)
Parameters:
  caUri - The URI to schedule.

isDisregarded
protected boolean isDisregarded(CrawlURI curi)

isEmpty
public boolean isEmpty()

kickUpdate
public void kickUpdate()

loadSeeds
public void loadSeeds()
Loads the seeds.

This method is called by initialize() and kickUpdate().

needsPromptRetry
protected boolean needsPromptRetry(CrawlURI curi) throws AttributeNotFoundException
Checks if a recently completed CrawlURI that did not finish successfully needs to be retried immediately (processed again as soon as politeness allows).
Parameters:
  curi - The CrawlURI to check
Returns:
  True if we need to retry promptly.
throws:
  AttributeNotFoundException - If problems occur trying to read the maximum number of retries from the settings framework.

needsRetrying
protected boolean needsRetrying(CrawlURI curi) throws AttributeNotFoundException
Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses).
Parameters:
  curi - The CrawlURI to check
Returns:
  True if we need to retry.
throws:
  AttributeNotFoundException - If problems occur trying to read the maximum number of retries from the settings framework.

next
public synchronized CrawlURI next() throws InterruptedException, EndedException

pause
public synchronized void pause()

queuedUriCount
public synchronized long queuedUriCount()

receive
public void receive(CandidateURI item)

reportTo
public void reportTo(PrintWriter writer) throws IOException

reportTo
public synchronized void reportTo(String name, PrintWriter writer)

reschedule
protected void reschedule(CrawlURI curi, boolean errorWait) throws AttributeNotFoundException
Put near top of relevant hostQueue (but behind anything recently scheduled 'high').
Parameters:
  curi - CrawlURI to reschedule. Its time of next processing is not modified.
  errorWait - signals if there should be a wait before retrying.
throws:
  AttributeNotFoundException -

schedule
public void schedule(CandidateURI caURI)

shouldBeForgotten
protected boolean shouldBeForgotten(CrawlURI curi)
Some URIs, if they recur, deserve another chance at consideration: they might not be too many hops away via another path, or the scope may have been updated to allow them passage.
Parameters:
  curi -
Returns:
  True if curi should be forgotten.

singleLineLegend
public String singleLineLegend()

singleLineReport
public String singleLineReport()

singleLineReportTo
public synchronized void singleLineReportTo(PrintWriter w) throws IOException

start
public void start()

succeededFetchCount
public long succeededFetchCount()

successDisposition
protected void successDisposition(CrawlURI curi)
The CrawlURI has been successfully crawled.
Parameters:
  curi - The CrawlURI

terminate
public synchronized void terminate()

totalBytesWritten
public long totalBytesWritten()

unpause
public synchronized void unpause()



Methods inherited from org.archive.crawler.settings.ModuleType
public Type addElement(CrawlerSettings settings, Type type) throws InvalidAttributeValueException
protected void listUsedFiles(List<String> list)
