Java Doc for AbstractFrontier.java (Web Crawler, heritrix, org.archive.crawler.frontier)



org.archive.crawler.settings.ModuleType
   org.archive.crawler.frontier.AbstractFrontier

All known Subclasses:   org.archive.crawler.frontier.WorkQueueFrontier
AbstractFrontier
abstract public class AbstractFrontier extends ModuleType implements CrawlStatusListener, Frontier, FetchStatusCodes, CoreAttributeConstants, Serializable
Shared facilities for Frontier implementations.
author:
   gojomo


Field Summary
final protected static String ACCEPTABLE_FORCE_QUEUE
final public static String ATTR_DELAY_FACTOR
final public static String ATTR_FORCE_QUEUE
final public static String ATTR_MAX_DELAY
final public static String ATTR_MAX_HOST_BANDWIDTH_USAGE
final public static String ATTR_MAX_OVERALL_BANDWIDTH_USAGE
final public static String ATTR_MAX_RETRIES
final public static String ATTR_MIN_DELAY
final public static String ATTR_PAUSE_AT_FINISH
final public static String ATTR_PAUSE_AT_START
final public static String ATTR_PREFERENCE_EMBED_HOPS
final public static String ATTR_QUEUE_ASSIGNMENT_POLICY
final protected static String ATTR_RECOVERY_ENABLED
     Recover log on or off attribute.
final public static String ATTR_RETRY_DELAY
final public static String ATTR_SOURCE_TAG_SEEDS
final protected static Boolean DEFAULT_ATTR_RECOVERY_ENABLED
final protected static Float DEFAULT_DELAY_FACTOR
final protected static String DEFAULT_FORCE_QUEUE
final protected static Integer DEFAULT_MAX_DELAY
final protected static Integer DEFAULT_MAX_HOST_BANDWIDTH_USAGE
final protected static Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE
final protected static Integer DEFAULT_MAX_RETRIES
final protected static Integer DEFAULT_MIN_DELAY
final protected static Boolean DEFAULT_PAUSE_AT_FINISH
final protected static Boolean DEFAULT_PAUSE_AT_START
final protected static Integer DEFAULT_PREFERENCE_EMBED_HOPS
final protected static Long DEFAULT_RETRY_DELAY
final protected static Boolean DEFAULT_SOURCE_TAG_SEEDS
final public static String IGNORED_SEEDS_FILENAME
protected transient CrawlController controller
protected long disregardedUriCount
protected long failedFetchCount
protected int lastMaxBandwidthKB
protected AtomicLong nextOrdinal
protected long processedBytesAfterLastEmittedURI
protected transient QueueAssignmentPolicy queueAssignmentPolicy
protected long queuedUriCount
protected boolean shouldPause
protected transient boolean shouldTerminate
protected long succeededFetchCount
protected long totalProcessedBytes
     Used when bandwidth constraints are in use.

Constructor Summary
public  AbstractFrontier(String name, String description)
    

Method Summary
protected void applySpecialHandling(CrawlURI curi)
     Perform any special handling of the CrawlURI, such as promoting its URI to seed-status, or preferencing it because it is an embed.
protected CrawlURI asCrawlUri(CandidateURI caUri)
protected String canonicalize(UURI uuri)
     Canonicalize passed uuri.
protected String canonicalize(CandidateURI cauri)
     Canonicalize passed CandidateURI.
public void crawlCheckpoint(File checkpointDir)
public void crawlEnded(String sExitMessage)
public void crawlEnding(String sExitMessage)
public void crawlPaused(String statusMessage)
public void crawlPausing(String statusMessage)
public void crawlResuming(String statusMessage)
public void crawlStarted(String message)
protected synchronized void decrementQueuedCount(long numberOfDeletes)
     Note that a number of queued URIs have been deleted.
public long disregardedUriCount()
protected void doJournalAdded(CrawlURI c)
protected void doJournalEmitted(CrawlURI c)
protected void doJournalFinishedFailure(CrawlURI c)
protected void doJournalFinishedSuccess(CrawlURI c)
protected void doJournalRescheduled(CrawlURI c)
public long failedFetchCount()
public long finishedUriCount()
public String getClassKey(CandidateURI cauri)
     Parameters: cauri - CrawlURI we're to get a key for.
public FrontierJournal getFrontierJournal()
     RecoveryJournal instance.
protected CrawlServer getServer(CrawlURI curi)
public void importRecoverLog(String pathToLog, boolean retainFailures)
protected synchronized void incrementDisregardedUriCount()
     Increment the running count of disregarded URIs.
protected synchronized void incrementFailedFetchCount()
     Increment the running count of failed URIs.
protected synchronized void incrementQueuedUriCount()
     Increment the running count of queued URIs.
protected synchronized void incrementQueuedUriCount(long increment)
     Increment the running count of queued URIs.
protected synchronized void incrementSucceededFetchCount()
     Increment the running count of successfully fetched URIs.
public void initialize(CrawlController c)
protected boolean isDisregarded(CrawlURI curi)
public synchronized boolean isEmpty()
public void kickUpdate()
public void loadSeeds()
     Load up the seeds.
protected void log(CrawlURI curi)
protected void logLocalizedErrors(CrawlURI curi)
     Take note of any processor-local errors that have been entered into the CrawlURI.
protected boolean needsRetrying(CrawlURI curi)
protected void noteAboutToEmit(CrawlURI curi, WorkQueue q)
     Perform fixups on a CrawlURI about to be returned via next().
protected boolean overMaxRetries(CrawlURI curi)
public synchronized void pause()
protected long politenessDelayFor(CrawlURI curi)
     Update any scheduling structures with the new information in this CrawlURI.
protected synchronized void preNext(long now)
public long queuedUriCount()
public void reportTo(PrintWriter writer)
protected long retryDelayFor(CrawlURI curi)
     Return a suitable value to wait before retrying the given URI.
public static void saveIgnoredItems(String ignoredItems, File dir)
     Dump ignored seed items (if any) to disk; delete file otherwise.
protected File scratchDirFor(String key)
     Utility method to return a scratch dir for the given key's temp files. Every key gets its own subdir.
public String singleLineReport()
public void start()
public long succeededFetchCount()
public synchronized void terminate()
public long totalBytesWritten()
public synchronized void unpause()

Field Detail
ACCEPTABLE_FORCE_QUEUE
final protected static String ACCEPTABLE_FORCE_QUEUE

ATTR_DELAY_FACTOR
final public static String ATTR_DELAY_FACTOR
how many multiples of last fetch elapsed time to wait before recontacting same server

ATTR_FORCE_QUEUE
final public static String ATTR_FORCE_QUEUE
queue assignment to force onto CrawlURIs; intended to be overridden

ATTR_MAX_DELAY
final public static String ATTR_MAX_DELAY
never wait more than this long, regardless of multiple

ATTR_MAX_HOST_BANDWIDTH_USAGE
final public static String ATTR_MAX_HOST_BANDWIDTH_USAGE
maximum per-host bandwidth usage

ATTR_MAX_OVERALL_BANDWIDTH_USAGE
final public static String ATTR_MAX_OVERALL_BANDWIDTH_USAGE
maximum overall bandwidth usage

ATTR_MAX_RETRIES
final public static String ATTR_MAX_RETRIES
maximum times to emit a CrawlURI without final disposition

ATTR_MIN_DELAY
final public static String ATTR_MIN_DELAY
always wait this long after one completion before recontacting same server, regardless of multiple

ATTR_PAUSE_AT_FINISH
final public static String ATTR_PAUSE_AT_FINISH
whether to pause, rather than finish, when crawl appears done

ATTR_PAUSE_AT_START
final public static String ATTR_PAUSE_AT_START
whether to pause at crawl start

ATTR_PREFERENCE_EMBED_HOPS
final public static String ATTR_PREFERENCE_EMBED_HOPS
number of hops of embeds (ERX) to bump to front of host queue

ATTR_QUEUE_ASSIGNMENT_POLICY
final public static String ATTR_QUEUE_ASSIGNMENT_POLICY

ATTR_RECOVERY_ENABLED
final protected static String ATTR_RECOVERY_ENABLED
Recover log on or off attribute.

ATTR_RETRY_DELAY
final public static String ATTR_RETRY_DELAY
for retryable problems, seconds to wait before a retry



ATTR_SOURCE_TAG_SEEDS
final public static String ATTR_SOURCE_TAG_SEEDS
whether to tag seeds with their own URI as a heritable 'source' String



DEFAULT_ATTR_RECOVERY_ENABLED
final protected static Boolean DEFAULT_ATTR_RECOVERY_ENABLED

DEFAULT_DELAY_FACTOR
final protected static Float DEFAULT_DELAY_FACTOR

DEFAULT_FORCE_QUEUE
final protected static String DEFAULT_FORCE_QUEUE

DEFAULT_MAX_DELAY
final protected static Integer DEFAULT_MAX_DELAY

DEFAULT_MAX_HOST_BANDWIDTH_USAGE
final protected static Integer DEFAULT_MAX_HOST_BANDWIDTH_USAGE

DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE
final protected static Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE

DEFAULT_MAX_RETRIES
final protected static Integer DEFAULT_MAX_RETRIES

DEFAULT_MIN_DELAY
final protected static Integer DEFAULT_MIN_DELAY

DEFAULT_PAUSE_AT_FINISH
final protected static Boolean DEFAULT_PAUSE_AT_FINISH

DEFAULT_PAUSE_AT_START
final protected static Boolean DEFAULT_PAUSE_AT_START

DEFAULT_PREFERENCE_EMBED_HOPS
final protected static Integer DEFAULT_PREFERENCE_EMBED_HOPS

DEFAULT_RETRY_DELAY
final protected static Long DEFAULT_RETRY_DELAY

DEFAULT_SOURCE_TAG_SEEDS
final protected static Boolean DEFAULT_SOURCE_TAG_SEEDS

IGNORED_SEEDS_FILENAME
final public static String IGNORED_SEEDS_FILENAME
file collecting report of ignored seed-file entries (if any)

controller
protected transient CrawlController controller

disregardedUriCount
protected long disregardedUriCount

failedFetchCount
protected long failedFetchCount

lastMaxBandwidthKB
protected int lastMaxBandwidthKB

nextOrdinal
protected AtomicLong nextOrdinal
ordinal numbers to assign to created CrawlURIs

processedBytesAfterLastEmittedURI
protected long processedBytesAfterLastEmittedURI

queueAssignmentPolicy
protected transient QueueAssignmentPolicy queueAssignmentPolicy
Policy for assigning CrawlURIs to named queues

queuedUriCount
protected long queuedUriCount

shouldPause
protected boolean shouldPause
should the frontier hold any threads asking for URIs?

shouldTerminate
protected transient boolean shouldTerminate
should the frontier send an EndedException to any threads asking for URIs?

succeededFetchCount
protected long succeededFetchCount

totalProcessedBytes
protected long totalProcessedBytes
Used when bandwidth constraints are in use.




Constructor Detail
AbstractFrontier
public AbstractFrontier(String name, String description)
Parameters:
  name - Name of this frontier.
  description - Description for this frontier.




Method Detail
applySpecialHandling
protected void applySpecialHandling(CrawlURI curi)
Perform any special handling of the CrawlURI, such as promoting its URI to seed-status, or preferencing it because it is an embed.
Parameters:
  curi -
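
As a rough illustration of the "preferencing it because it is an embed" case, the sketch below counts trailing embed-type hops and compares the count with a threshold such as the one ATTR_PREFERENCE_EMBED_HOPS names. The hop-path notation (one letter per hop, with E, R and X treated as embed-like), the class EmbedPreference and both method names are assumptions made for this example; none of them is taken from the AbstractFrontier source.

```java
/** Hypothetical helper; not part of Heritrix. */
class EmbedPreference {
    /** Counts how many hops at the end of the path are embed-like (E, R or X). */
    static int trailingEmbedHops(String hopPath) {
        int count = 0;
        for (int i = hopPath.length() - 1; i >= 0; i--) {
            char hop = hopPath.charAt(i);
            if (hop == 'E' || hop == 'R' || hop == 'X') {
                count++;
            } else {
                break; // first non-embed hop ends the run
            }
        }
        return count;
    }

    /** True when the URI was reached by 1..preferenceEmbedHops embed-like hops. */
    static boolean shouldPrefer(String hopPath, int preferenceEmbedHops) {
        int hops = trailingEmbedHops(hopPath);
        return preferenceEmbedHops > 0 && hops > 0 && hops <= preferenceEmbedHops;
    }
}
```

With a threshold of 1, a path such as "LLE" would be preferred while "LEE" and "LL" would not; a threshold of 0 disables the preference in this sketch.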



asCrawlUri
protected CrawlURI asCrawlUri(CandidateURI caUri)

canonicalize
protected String canonicalize(UURI uuri)
Canonicalize passed uuri. It would be sweeter if this canonicalize function were encapsulated by that which it canonicalizes, but because settings change with context (i.e. there may be overrides in operation for a particular URI) it is not so easy: each CandidateURI would need a reference to the settings system, and that is awkward to pass in.
Parameters:
  uuri - Candidate URI to canonicalize.
Returns:
  Canonicalized version of passed uuri.

canonicalize
protected String canonicalize(CandidateURI cauri)
Canonicalize passed CandidateURI. This method differs from AbstractFrontier.canonicalize(UURI) in that it looks at the CandidateURI context, possibly overriding any canonicalization effect if it could make us miss content. If canonicalization produces a URL that was 'alreadyseen', but the entry in the 'alreadyseen' database did nothing but redirect to the current URL, we won't get the current URL; we'll think we've already seen it. Examples would be archive.org redirecting to www.archive.org or the inverse, www.netarkivet.net redirecting to netarkivet.net (assuming stripWWW rule enabled).

Note: under some circumstances this method sets the forceFetch flag.
Parameters:
  cauri - CandidateURI to examine.
Returns:
  Canonicalized cauri.
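
The redirect case described above can be sketched as follows: when a URI was reached by a redirect and its canonical form equals the canonical form of the URI it was redirected from, the candidate is flagged so that an already-seen check does not discard it. This is a minimal sketch under stated assumptions. The Candidate class, its fields, the stripWww rule and the exact condition are hypothetical stand-ins and are not the Heritrix CandidateURI API.

```java
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a candidate URI; not the Heritrix CandidateURI class. */
class Candidate {
    final String uri;          // URI about to be scheduled
    final String via;          // URI that led to this one, or null
    final boolean viaRedirect; // true when 'via' redirected to 'uri'
    boolean forceFetch;        // set when the already-seen check must be bypassed

    Candidate(String uri, String via, boolean viaRedirect) {
        this.uri = uri;
        this.via = via;
        this.viaRedirect = viaRedirect;
    }
}

class RedirectAwareCanonicalizer {
    /** Toy rule standing in for a stripWWW rule: drops a leading "www." from the host. */
    static String stripWww(String uri) {
        return uri.replaceFirst("^(https?://)www\\.", "$1");
    }

    /** Canonicalizes c.uri; sets forceFetch when a redirect target collapses onto its source. */
    static String canonicalize(Candidate c, UnaryOperator<String> rule) {
        String canon = rule.apply(c.uri);
        if (c.viaRedirect && c.via != null
                && !c.uri.equals(c.via)                 // not a redirect to itself
                && canon.equals(rule.apply(c.via))) {   // both collapse to one canonical form
            c.forceFetch = true;
        }
        return canon;
    }
}
```

In this sketch, http://archive.org/ redirecting to http://www.archive.org/ yields the same canonical string for both, so the redirect target is marked forceFetch rather than being treated as already seen.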




crawlCheckpoint
public void crawlCheckpoint(File checkpointDir) throws Exception

crawlEnded
public void crawlEnded(String sExitMessage)

crawlEnding
public void crawlEnding(String sExitMessage)

crawlPaused
public void crawlPaused(String statusMessage)

crawlPausing
public void crawlPausing(String statusMessage)

crawlResuming
public void crawlResuming(String statusMessage)

crawlStarted
public void crawlStarted(String message)

decrementQueuedCount
protected synchronized void decrementQueuedCount(long numberOfDeletes)
Note that a number of queued URIs have been deleted.
Parameters:
  numberOfDeletes -

disregardedUriCount
public long disregardedUriCount()

doJournalAdded
protected void doJournalAdded(CrawlURI c)

doJournalEmitted
protected void doJournalEmitted(CrawlURI c)

doJournalFinishedFailure
protected void doJournalFinishedFailure(CrawlURI c)

doJournalFinishedSuccess
protected void doJournalFinishedSuccess(CrawlURI c)

doJournalRescheduled
protected void doJournalRescheduled(CrawlURI c)

failedFetchCount
public long failedFetchCount()
See Also:   org.archive.crawler.framework.Frontier.failedFetchCount

finishedUriCount
public long finishedUriCount()
See Also:   org.archive.crawler.framework.Frontier.finishedUriCount

getClassKey
public String getClassKey(CandidateURI cauri)
Parameters:
  cauri - CrawlURI we're to get a key for.
Returns:
  a String token representing a queue

getFrontierJournal
public FrontierJournal getFrontierJournal()
Returns:
  RecoveryJournal instance. May be null.

getServer
protected CrawlServer getServer(CrawlURI curi)
Parameters:
  curi -
Returns:
  the CrawlServer to be associated with this CrawlURI

importRecoverLog
public void importRecoverLog(String pathToLog, boolean retainFailures) throws IOException



incrementDisregardedUriCount
protected synchronized void incrementDisregardedUriCount()
Increment the running count of disregarded URIs. Synchronized because operations on longs are not atomic.

incrementFailedFetchCount
protected synchronized void incrementFailedFetchCount()
Increment the running count of failed URIs. Synchronized because operations on longs are not atomic.

incrementQueuedUriCount
protected synchronized void incrementQueuedUriCount()
Increment the running count of queued URIs. Synchronized because operations on longs are not atomic.

incrementQueuedUriCount
protected synchronized void incrementQueuedUriCount(long increment)
Increment the running count of queued URIs. Synchronized because operations on longs are not atomic.
Parameters:
  increment - amount to increment the queued count

incrementSucceededFetchCount
protected synchronized void incrementSucceededFetchCount()
Increment the running count of successfully fetched URIs. Synchronized because operations on longs are not atomic.
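
The synchronization note on the increment methods can be illustrated with a small sketch. Incrementing a long is a read-modify-write sequence, and the Java Language Specification additionally permits a write to a non-volatile long to be performed as two 32-bit halves, so unsynchronized updates from several threads can lose counts. The class below is hypothetical (FetchCounters, takeOrdinal and hammer are names invented for this example); it shows a synchronized counter next to an AtomicLong, the type this class uses for nextOrdinal.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical counters holder; mirrors the pattern, not the Heritrix class. */
class FetchCounters {
    private long succeededFetchCount = 0;                     // guarded by 'this'
    private final AtomicLong nextOrdinal = new AtomicLong(0); // lock-free alternative

    synchronized void incrementSucceededFetchCount() { succeededFetchCount++; }

    synchronized long succeededFetchCount() { return succeededFetchCount; }

    long takeOrdinal() { return nextOrdinal.getAndIncrement(); }

    /** Runs 'threads' workers that each increment both counters 'perThread' times. */
    static FetchCounters hammer(int threads, int perThread) {
        FetchCounters counters = new FetchCounters();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < perThread; n++) {
                    counters.incrementSucceededFetchCount();
                    counters.takeOrdinal();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return counters;
    }
}
```

Because every increment either holds the object's monitor or goes through AtomicLong, no updates are lost regardless of how the worker threads interleave.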



initialize
public void initialize(CrawlController c) throws FatalConfigurationException, IOException

isDisregarded
protected boolean isDisregarded(CrawlURI curi)

isEmpty
public synchronized boolean isEmpty()
Frontier is empty only if all queues are empty and no URIs are in-process.
Returns:
  True if queues are empty.

kickUpdate
public void kickUpdate()

loadSeeds
public void loadSeeds()
Load up the seeds. This method is called on initialize and from within the CrawlController when it wants to force reloading of configuration.
See Also:   org.archive.crawler.framework.CrawlController.kickUpdate

log
protected void log(CrawlURI curi)
Log to the main crawl.log.
Parameters:
  curi -

logLocalizedErrors
protected void logLocalizedErrors(CrawlURI curi)
Take note of any processor-local errors that have been entered into the CrawlURI.
Parameters:
  curi -

needsRetrying
protected boolean needsRetrying(CrawlURI curi)
Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses).
Parameters:
  curi - The CrawlURI to check.
Returns:
  True if we need to retry.

noteAboutToEmit
protected void noteAboutToEmit(CrawlURI curi, WorkQueue q)
Perform fixups on a CrawlURI about to be returned via next().
Parameters:
  curi - CrawlURI about to be returned by next()
  q - the queue from which the CrawlURI came

overMaxRetries
protected boolean overMaxRetries(CrawlURI curi)

pause
public synchronized void pause()

politenessDelayFor
protected long politenessDelayFor(CrawlURI curi)
Update any scheduling structures with the new information in this CrawlURI. Chiefly this means making the necessary arrangements so that no other URIs at the same host are visited within the appropriate politeness window.
Parameters:
  curi - The CrawlURI
Returns:
  millisecond politeness delay
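
Reading the descriptions of ATTR_DELAY_FACTOR, ATTR_MIN_DELAY and ATTR_MAX_DELAY together suggests a delay of the form "multiple of the last fetch duration, clamped between a minimum and a maximum". The sketch below shows that arithmetic only. It assumes all three values are expressed in milliseconds, leaves out any bandwidth-limit handling, and uses a hypothetical class and method name; it should not be read as the actual body of politenessDelayFor.

```java
/** Hypothetical helper showing the clamp arithmetic; not the Heritrix implementation. */
class PolitenessDelay {
    /**
     * @param fetchDurationMs elapsed time of the last fetch from this server
     * @param delayFactor     multiples of the fetch duration to wait
     * @param minDelayMs      always wait at least this long
     * @param maxDelayMs      never wait longer than this
     */
    static long delayMillis(long fetchDurationMs, float delayFactor,
                            long minDelayMs, long maxDelayMs) {
        long wait = (long) (delayFactor * fetchDurationMs);
        if (wait < minDelayMs) {
            wait = minDelayMs; // wait at least the minimum
        }
        if (wait > maxDelayMs) {
            wait = maxDelayMs; // wait no more than the maximum
        }
        return wait;
    }
}
```

For example, with an illustrative factor of 5, a minimum of 3000 ms and a maximum of 30000 ms, a 1000 ms fetch gives a 5000 ms delay, a 100 ms fetch is raised to the 3000 ms minimum, and a 10000 ms fetch is capped at 30000 ms.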



preNext
protected synchronized void preNext(long now) throws InterruptedException, EndedException
Parameters:
  now -
throws:
  InterruptedException -
  EndedException -
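
The shouldPause and shouldTerminate fields, together with the exceptions in this signature, suggest a monitor-style gate: threads asking for URIs are held while the frontier is paused and receive an EndedException once it is terminated. The following is one way such a gate could look, written as a sketch under assumptions. FrontierGate, its nested EndedException and the probe helper are hypothetical, and the real preNext may do more (the now parameter is not modelled here).

```java
/** Hypothetical pause/terminate gate; not the Heritrix implementation. */
class FrontierGate {
    /** Stand-in for the EndedException named in the signature above. */
    static class EndedException extends Exception {
        EndedException(String message) { super(message); }
    }

    private boolean shouldPause = false;
    private boolean shouldTerminate = false;

    synchronized void pause() { shouldPause = true; }

    synchronized void unpause() { shouldPause = false; notifyAll(); }

    synchronized void terminate() { shouldTerminate = true; notifyAll(); }

    /** Blocks while paused; throws once the gate has been terminated. */
    synchronized void preNext() throws InterruptedException, EndedException {
        while (shouldPause && !shouldTerminate) {
            wait(); // released by unpause() or terminate()
        }
        if (shouldTerminate) {
            throw new EndedException("terminated");
        }
    }

    /** Convenience for demos: reports how preNext() ended. */
    static String probe(FrontierGate gate) {
        try {
            gate.preNext();
            return "ok";
        } catch (EndedException e) {
            return "ended";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }
}
```

The wait sits inside a loop so that a woken thread re-checks both flags before proceeding, which is the usual guard against spurious wakeups.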



queuedUriCount
public long queuedUriCount()
See Also:   org.archive.crawler.framework.Frontier.queuedUriCount

reportTo
public void reportTo(PrintWriter writer)

retryDelayFor
protected long retryDelayFor(CrawlURI curi)
Return a suitable value to wait before retrying the given URI.
Parameters:
  curi - CrawlURI to be retried
Returns:
  millisecond delay before retry
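
The retry-related members (ATTR_MAX_RETRIES, ATTR_RETRY_DELAY, overMaxRetries and retryDelayFor) can be pictured with a small policy object. This is a sketch under assumptions: it treats the retry delay setting as seconds and converts it to the millisecond value this method documents, applies the delay only to problems flagged as retryable, and compares attempts with a simple greater-or-equal test. RetryPolicy and its members are hypothetical names, not Heritrix API.

```java
/** Hypothetical retry policy; illustrates the documented settings only. */
class RetryPolicy {
    private final int maxRetries;         // cf. ATTR_MAX_RETRIES
    private final long retryDelaySeconds; // cf. ATTR_RETRY_DELAY

    RetryPolicy(int maxRetries, long retryDelaySeconds) {
        this.maxRetries = maxRetries;
        this.retryDelaySeconds = retryDelaySeconds;
    }

    /** True once the URI has been emitted as many times as the limit allows. */
    boolean overMaxRetries(int fetchAttempts) {
        return fetchAttempts >= maxRetries;
    }

    /** Milliseconds to wait before a retry; zero for non-retryable problems. */
    long retryDelayMillis(boolean retryableProblem) {
        return retryableProblem ? retryDelaySeconds * 1000L : 0L;
    }
}
```

A caller would typically check overMaxRetries first and only then ask for the delay, so a URI that has exhausted its attempts is finished rather than rescheduled.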



saveIgnoredItems
public static void saveIgnoredItems(String ignoredItems, File dir)
Dump ignored seed items (if any) to disk; delete file otherwise. Static to allow non-derived sibling classes (frontiers not yet subclassed here) to reuse.
Parameters:
  ignoredItems -
  dir -
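
The "write if there is something to report, delete otherwise" behaviour described above might be sketched like this. The file name used for IGNORED_SEEDS_FILENAME is a placeholder chosen for the example (this page does not state the constant's value), and the class IgnoredSeeds with its newTempDir helper is hypothetical; the real method's encoding and error handling are not documented here.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical re-creation of the documented behaviour; not the Heritrix code. */
class IgnoredSeeds {
    // Placeholder value: the actual constant's value is not given on this page.
    static final String IGNORED_SEEDS_FILENAME = "seeds.ignored";

    /** Writes the report when non-empty; otherwise removes any stale report file. */
    static void saveIgnoredItems(String ignoredItems, File dir) {
        File target = new File(dir, IGNORED_SEEDS_FILENAME);
        try {
            if (ignoredItems == null || ignoredItems.isEmpty()) {
                Files.deleteIfExists(target.toPath());
            } else {
                Files.write(target.toPath(), ignoredItems.getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: creates a fresh temporary directory to write into. */
    static File newTempDir() {
        try {
            return Files.createTempDirectory("ignored-seeds-demo").toFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Deleting the file in the empty case keeps a report from an earlier seed load from lingering after the seeds have been corrected.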



scratchDirFor
protected File scratchDirFor(String key)
Utility method to return a scratch dir for the given key's temp files. Every key gets its own subdir. To avoid having any one directory with thousands of files, there are also two levels of enclosing directory named by the least-significant hex digits of the key string's Java hashcode.
Parameters:
  key -
Returns:
  File representing scratch directory
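
The directory layout described above can be sketched as follows. The description does not say how many hex digits name each level, so this example assumes two digits per level (the last two for the outer directory, the two before them for the inner one) and takes the base directory as a parameter. ScratchDirs is a hypothetical class written for illustration.

```java
import java.io.File;

/** Hypothetical layout helper; digit counts per level are an assumption. */
class ScratchDirs {
    /** Returns base/<last two hex digits>/<previous two hex digits>/<key>. */
    static File scratchDirFor(File base, String key) {
        String hex = Integer.toHexString(key.hashCode());
        while (hex.length() < 4) {
            hex = "0" + hex; // left-pad so four digits are always available
        }
        int len = hex.length();
        String outer = hex.substring(len - 2);          // least-significant pair
        String inner = hex.substring(len - 4, len - 2); // next pair
        return new File(new File(new File(base, outer), inner), key);
    }
}
```

For the key "a", whose hash code is 97 (hex 61, padded to 0061), this sketch yields the relative path 61/00/a under the base directory. Spreading keys over up to 256 × 256 enclosing directories keeps any single directory small.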



singleLineReport
public String singleLineReport()

start
public void start()

succeededFetchCount
public long succeededFetchCount()
See Also:   org.archive.crawler.framework.Frontier.succeededFetchCount

terminate
public synchronized void terminate()

totalBytesWritten
public long totalBytesWritten()

unpause
public synchronized void unpause()

Methods inherited from org.archive.crawler.settings.ModuleType
public Type addElement(CrawlerSettings settings, Type type) throws InvalidAttributeValueException
protected void listUsedFiles(List<String> list)
