

org.archive.crawler.framework.Frontier

All known Subclasses:   org.archive.crawler.frontier.AdaptiveRevisitFrontier, org.archive.crawler.frontier.WorkQueue, org.archive.crawler.frontier.AbstractFrontier
Frontier
public interface Frontier extends Reporter
An interface for URI Frontiers.

A URI Frontier is a pluggable module in Heritrix that maintains the internal state of the crawl. This includes (but is not limited to):

  • What URIs have been discovered
  • What URIs are being processed (fetched)
  • What URIs have been processed
  • In what order unprocessed URIs will be processed

The Frontier is also responsible for enforcing any politeness restrictions that apply to the crawl, such as limiting simultaneous connections to the same host, server or IP address to 1 (or any other fixed number), enforcing delays between connections, and so on.

A URI Frontier is created by the CrawlController (org.archive.crawler.framework.CrawlController), which is in turn responsible for providing access to it. Most significant among the modules interested in the Frontier are the ToeThreads (org.archive.crawler.framework.ToeThread), which perform the actual work of processing a URI.

The methods defined in this interface are those required to get URIs for processing, to report the results of processing back (used by the ToeThreads), and to get access to various statistical data along the way. The statistical data is of interest to StatisticsTracking (org.archive.crawler.framework.StatisticsTracking) modules. A couple of additional methods are provided to inspect and manipulate the Frontier at runtime.
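
To make that flow concrete, the following is a minimal, hypothetical worker loop written only against the methods declared here. It is a sketch, not Heritrix's actual ToeThread code; the import paths and the process() placeholder are assumptions.

    import org.archive.crawler.datamodel.CrawlURI;
    import org.archive.crawler.framework.Frontier;
    import org.archive.crawler.framework.exceptions.EndedException;

    /** Hypothetical worker sketch: pull a URI, process it, report it back. */
    public class FrontierWorkerSketch implements Runnable {
        private final Frontier frontier; // handed over by the CrawlController in a real crawl

        public FrontierWorkerSketch(Frontier frontier) {
            this.frontier = frontier;
        }

        public void run() {
            while (true) {
                try {
                    CrawlURI curi = frontier.next(); // blocks; may return null on timeout
                    if (curi == null) {
                        continue;
                    }
                    process(curi);           // placeholder for the processor chain
                    frontier.finished(curi); // report the disposition back to the Frontier
                } catch (EndedException e) {
                    return;                  // terminate() was called; stop working
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        private void process(CrawlURI curi) {
            // Fetching, link extraction and writing of results would happen here.
        }
    }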

The statistical data exposed by this interface includes the counts of discovered, queued, finished, succeeded, failed and disregarded URIs, as well as the total number of bytes written (see discoveredUriCount(), queuedUriCount(), finishedUriCount(), succeededFetchCount(), failedFetchCount(), disregardedUriCount() and totalBytesWritten() below).

In addition the frontier may optionally implement an interface that exposes information about hosts (see org.archive.crawler.framework.FrontierHostStatistics).

Furthermore, any implementation of the URI Frontier should trigger CrawlURIDisposition events (org.archive.crawler.event.CrawlURIDispositionListener) by invoking the proper methods on the CrawlController (org.archive.crawler.framework.CrawlController). Doing this allows a custom-built StatisticsTracking module to gather any additional data it might be interested in by examining the completed URIs.

All URI Frontiers inherit from ModuleType (org.archive.crawler.settings.ModuleType), and creating settings therefore follows the usual pattern of pluggable modules in Heritrix.
author:
   Gordon Mohr
author:
   Kristinn Sigurdsson
See Also:   org.archive.crawler.framework.CrawlController
See Also:   org.archive.crawler.framework.CrawlController.fireCrawledURIDisregardEvent(CrawlURI)
See Also:   org.archive.crawler.framework.CrawlController.fireCrawledURIFailureEvent(CrawlURI)
See Also:   org.archive.crawler.framework.CrawlController.fireCrawledURINeedRetryEvent(CrawlURI)
See Also:   org.archive.crawler.framework.CrawlController.fireCrawledURISuccessfulEvent(CrawlURI)
See Also:   org.archive.crawler.framework.StatisticsTracking
See Also:   org.archive.crawler.framework.ToeThread
See Also:   org.archive.crawler.framework.FrontierHostStatistics
See Also:   org.archive.crawler.settings.ModuleType


Inner Class: public interface FrontierGroup extends CrawlSubstats.HasCrawlSubstats

Field Summary
final public static String ATTR_NAME
     All URI Frontiers should have the same 'name' attribute.


Method Summary
public long averageDepth()

public float congestionRatio()

public void considerIncluded(UURI u)
     Notify Frontier that it should consider the given UURI as if already scheduled.
public long deepestUri()

public long deleteURIs(String match)
     Delete any URI that matches the given regular expression from the list of discovered and pending URIs.
public void deleted(CrawlURI curi)
     Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.
public long discoveredUriCount()
     Number of discovered URIs, that is, any URI that has been confirmed to be within 'scope'.
public long disregardedUriCount()
     Number of URIs that were scheduled at one point but have been disregarded because they were determined to lie outside the scope of the crawl.
public long failedFetchCount()
     Number of URIs that failed to process because of some error or failure in the processing chain.
public void finished(CrawlURI cURI)
     Report a URI being processed as having finished processing.
public long finishedUriCount()
     Number of URIs that have finished processing, both successfully and unsuccessfully (excluding those that failed but will be retried).
public String getClassKey(CandidateURI cauri)
     Calculate and set the class key for the given CandidateURI.
public FrontierJournal getFrontierJournal()
     Return the instance of FrontierJournal that this Frontier is using.
public FrontierGroup getGroup(CrawlURI curi)
     Get the 'frontier group' (usually queue) for the given CrawlURI.
public FrontierMarker getInitialMarker(String regexpr, boolean inCacheOnly)
     Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.
public ArrayList getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose)
     Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
public void importRecoverLog(String pathToLog, boolean retainFailures)
     Recover earlier state by reading a recovery log.
public void initialize(CrawlController c)
     Initialize the Frontier; invoked by the CrawlController once it has created the Frontier.
boolean isEmpty()
     Returns true if the frontier contains no more URIs to crawl.
public void kickUpdate()
     Notify Frontier that it should consider updating configuration info that may have changed in external files.
public void loadSeeds()
     Request that the Frontier load (or reload) crawl seeds, typically by contacting the Scope.
CrawlURI next()
     Get the next URI that should be processed.
public void pause()
     Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.
public long queuedUriCount()
     Number of URIs queued up and waiting for processing, including any URIs that failed but will be retried.
public void schedule(CandidateURI caURI)
     Schedules a CandidateURI, placing it in its respective queue at once.
public void start()
     Request that Frontier allow crawling to begin.
public long succeededFetchCount()
     Number of successfully processed URIs, including those that returned 404s and other error codes that do not originate within the crawler.
public void terminate()
     Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.
public long totalBytesWritten()
     Total number of bytes contained in all URIs that have been processed.
public void unpause()
     Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.

Field Detail
ATTR_NAME
final public static String ATTR_NAME
All URI Frontiers should have the same 'name' attribute. This constant defines that name. It is used to reference the Frontier in a given crawl order, and since there can only be one Frontier per crawl order, a fixed, unique name for Frontiers is optimal.
See Also:   org.archive.crawler.settings.ModuleType.ModuleType(String)





Method Detail
averageDepth
public long averageDepth()



congestionRatio
public float congestionRatio()



considerIncluded
public void considerIncluded(UURI u)
Notify Frontier that it should consider the given UURI as if already scheduled.
Parameters:
  u - UURI instance to add to the Already Included set.



deepestUri
public long deepestUri()



deleteURIs
public long deleteURIs(String match)
Delete any URI that matches the given regular expression from the list of discovered and pending URIs. This does not prevent them from being rediscovered.

Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is considered to be a pending URI.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
Parameters:
  match - A regular expression; any URI that matches it will be deleted.
Returns: The number of URIs deleted.
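
A hedged usage sketch (class, method and regex are illustrative): following the warning above, URIs are only pruned while the Frontier is not releasing work.

    import org.archive.crawler.framework.Frontier;

    public class FrontierPruneSketch {
        /** Illustrative only: remove every pending URI matching the given regex. */
        public static long prune(Frontier frontier, String regex) {
            frontier.pause();   // ask the Frontier to stop releasing URIs; in practice,
                                // wait until the crawl has actually reached a paused
                                // state (see the warning above) before continuing.
            try {
                return frontier.deleteURIs(regex);  // returns the number of URIs deleted
            } finally {
                frontier.unpause();                 // resume releasing URIs to ToeThreads
            }
        }
    }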




deleted
public void deleted(CrawlURI curi)
Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.
Parameters:
  curi - Deleted CrawlURI.



discoveredUriCount
public long discoveredUriCount()
Number of discovered URIs.

That is, any URI that has been confirmed to be within 'scope' (i.e. the Frontier decides that it should be processed). This includes those that have been processed, are being processed and have finished processing. Does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to the operator changing the scope definition).

Note: This only counts discovered URIs. Since the same URI can (at least in most frontiers) be fetched multiple times, this number may be somewhat lower than the queued, in-process and finished items combined, due to duplicate URIs being queued and processed. This variance is likely to be especially high in Frontiers implementing 'revisit' strategies.
Returns: Number of discovered URIs.




disregardedUriCount
public long disregardedUriCount()
Number of URIs that were scheduled at one point but have been disregarded.

Counts any URI that is scheduled only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.
Returns: The number of URIs that have been disregarded.




failedFetchCount
public long failedFetchCount()
Number of URIs that failed to process.

URIs that could not be processed because of some error or failure in the processing chain. Can include failure to acquire prerequisites, to establish a connection with the host, and any number of other problems. Does not count those that will be retried, only those that have permanently failed.
Returns: Number of URIs that failed to process.




finished
public void finished(CrawlURI cURI)
Report a URI being processed as having finished processing.

ToeThreads will invoke this method once they have completed work on their assigned URI.

This method is synchronized.
Parameters:
  cURI - The URI that has finished processing.




finishedUriCount
public long finishedUriCount()
Number of URIs that have finished processing.

Includes both those that were processed successfully and those that failed to be processed (excluding those that failed but will be retried). Does not include those URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely due to the operator changing the scope definition).
Returns: Number of finished URIs.




getClassKey
public String getClassKey(CandidateURI cauri)

Parameters:
  cauri - CandidateURI for which we're to calculate and set the class key.
Returns: Class key for cauri.



getFrontierJournal
public FrontierJournal getFrontierJournal()
Returns: The instance of FrontierJournal that this Frontier is using; may be null if no journaling is in place.



getGroup
public FrontierGroup getGroup(CrawlURI curi)
Get the 'frontier group' (usually queue) for the given CrawlURI.
Parameters:
  curi - CrawlURI to find the matching group for.
Returns: FrontierGroup for the CrawlURI.



getInitialMarker
public FrontierMarker getInitialMarker(String regexpr, boolean inCacheOnly)
Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.
Parameters:
  regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker.
Parameters:
  inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they do not exist. This is useful for quick peeks at the top of the URI list.
Returns: A URIFrontierMarker that is set for the 'start' of the frontier's URI list.



getURIsList
public ArrayList getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose) throws InvalidFrontierMarkerException
Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.

Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is included. As there may be duplicates in the frontier, there may also be duplicates in the report. Thus this includes both discovered and pending URIs.

The list is a set of strings containing the URI strings. If verbose is true the strings will include some additional information (path to URI and parent).

The URIFrontierMarker will be advanced to the position at which its maximum number of matches is reached. Reusing it for subsequent calls will thus effectively get the 'next' batch. Making any changes to the frontier can invalidate the marker.

While the order returned is consistent, it does not have any explicit relation to the likely order in which the URIs may be processed.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
Parameters:
  marker - A marker specifying from what position in the Frontier the list should begin.
Parameters:
  numberOfMatches - How many URIs to add at most to the list before returning it.
Parameters:
  verbose - If set to true the strings returned will contain additional information about each URI beyond their names.
Returns: A list of all pending URIs falling within the specification of the marker.
throws:
  InvalidFrontierMarkerException - when the URIFrontierMarker does not match the internal state of the frontier. Tolerance for this can vary considerably from one URIFrontier implementation to the next.
See Also:   FrontierMarker
See Also:   Frontier.getInitialMarker(String,boolean)
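
A minimal paging sketch under the same paused-crawl assumption. The class name, batch size and end-of-list heuristic are illustrative, and the marker and exception import paths are assumed from the package names used on this page.

    import java.util.ArrayList;
    import org.archive.crawler.framework.Frontier;
    import org.archive.crawler.framework.FrontierMarker;
    import org.archive.crawler.framework.exceptions.InvalidFrontierMarkerException;

    public class FrontierListingSketch {
        /**
         * Illustrative only: print pending URIs matching a regex, 100 at a time,
         * reusing the marker so each call picks up where the previous one stopped.
         */
        public static void listPending(Frontier frontier, String regex)
                throws InvalidFrontierMarkerException {
            FrontierMarker marker = frontier.getInitialMarker(regex, false);
            ArrayList batch;
            do {
                batch = frontier.getURIsList(marker, 100, false); // non-verbose listing
                for (Object uri : batch) {
                    System.out.println(uri);  // each entry is a URI string
                }
            } while (batch.size() == 100);    // heuristic: a short batch means the end was reached
        }
    }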




importRecoverLog
public void importRecoverLog(String pathToLog, boolean retainFailures) throws IOException
Recover earlier state by reading a recovery log.

Some Frontiers are able to write detailed logs that can be loaded after a system crash to recover the state of the Frontier prior to the crash. This method is the one used to achieve this.
Parameters:
  pathToLog - The name (with full path) of the recover log.
Parameters:
  retainFailures - If true, failures in the log should count as having been included. (If false, failures will be ignored, meaning the corresponding URIs will be retried in the recovered crawl.)
throws:
  IOException - If problems occur reading the recover log.
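
A hedged sketch of using this when restarting after a crash; the class name and path handling are illustrative.

    import java.io.IOException;
    import org.archive.crawler.framework.Frontier;

    public class FrontierRecoverySketch {
        /** Illustrative only: reload frontier state written before a crash. */
        public static void recover(Frontier frontier, String recoverLogPath) {
            try {
                // retainFailures=true: URIs logged as failures count as already included,
                // so the recovered crawl will not retry them.
                frontier.importRecoverLog(recoverLogPath, true);
            } catch (IOException e) {
                System.err.println("Could not read recover log " + recoverLogPath + ": " + e);
            }
        }
    }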




initialize
public void initialize(CrawlController c) throws FatalConfigurationException, IOException
Initialize the Frontier.

This method is invoked by the CrawlController once it has created the Frontier. The constructor of the Frontier should only contain code for setting up its settings framework. This method should contain all other 'startup' code.
Parameters:
  c - The CrawlController that created the Frontier.
throws:
  FatalConfigurationException - If provided settings are illegal or otherwise unusable.
throws:
  IOException - If there is a problem reading settings or the seeds file from disk.
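
A skeletal, hypothetical subclass illustrating the split described above: the constructor only registers with the settings framework (using the ModuleType(String) constructor referenced under ATTR_NAME), while initialize() carries the startup work. The remaining Frontier methods are left abstract, so this is a pattern sketch rather than a working Frontier, and the exception import path is an assumption.

    import java.io.IOException;
    import org.archive.crawler.framework.CrawlController;
    import org.archive.crawler.framework.Frontier;
    import org.archive.crawler.framework.exceptions.FatalConfigurationException;
    import org.archive.crawler.settings.ModuleType;

    /** Pattern sketch only: all other Frontier methods are left abstract. */
    public abstract class SketchFrontier extends ModuleType implements Frontier {

        private CrawlController controller; // kept for use by methods not shown here

        public SketchFrontier() {
            // Constructor: settings-framework setup only, registering under the
            // shared Frontier name (ATTR_NAME), as described for that constant.
            super(Frontier.ATTR_NAME);
        }

        public void initialize(CrawlController c)
                throws FatalConfigurationException, IOException {
            // All other startup work belongs here, invoked by the CrawlController
            // after construction: keep a reference, open state stores, load seeds.
            this.controller = c;
            loadSeeds();
        }
    }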




isEmpty
boolean isEmpty()
Returns true if the frontier contains no more URIs to crawl.

That is to say that there are no more URIs either currently available (ready to be emitted), belonging to deferred hosts, or pending in the Frontier. Thus this method may return false even if there is no currently available URI.
Returns: true if the frontier contains no more URIs to crawl.




kickUpdate
public void kickUpdate()
Notify Frontier that it should consider updating configuration info that may have changed in external files.



loadSeeds
public void loadSeeds()
Request that the Frontier load (or reload) crawl seeds, typically by contacting the Scope.



next
CrawlURI next() throws InterruptedException, EndedException
Get the next URI that should be processed. If no URI becomes available during the time specified, null will be returned.
Returns: The next URI that should be processed.
throws:
  InterruptedException -
throws:
  EndedException -



pause
public void pause()
Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.



queuedUriCount
public long queuedUriCount()
Number of URIs queued up and waiting for processing.

This includes any URIs that failed but will be retried. Basically this is any discovered URI that has not been processed and is not currently being processed. The same discovered URI can be queued multiple times.
Returns: Number of queued URIs.




schedule
public void schedule(CandidateURI caURI)
Schedules a CandidateURI.

This method accepts one URI and schedules it immediately. This has nothing to do with the priority of the URI being scheduled, only that it will be placed in its respective queue at once. For priority scheduling see CandidateURI.setSchedulingDirective(int).

This method should be synchronized in all implementing classes.
Parameters:
  caURI - The URI to schedule.
See Also:   CandidateURI.setSchedulingDirective(int)
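
A hedged sketch of handing a single discovered URI to the Frontier. Only schedule() and setSchedulingDirective(int) are documented on this page; the CandidateURI constructor, the UURIFactory helper and the URIException handling are assumptions about the surrounding Heritrix classes.

    import org.apache.commons.httpclient.URIException;
    import org.archive.crawler.datamodel.CandidateURI;
    import org.archive.crawler.framework.Frontier;
    import org.archive.net.UURIFactory;

    public class FrontierScheduleSketch {
        /** Illustrative only: hand one newly discovered URI to the Frontier. */
        public static void scheduleOne(Frontier frontier, String uri) {
            try {
                // Wrap the raw string in a CandidateURI; this constructor and factory
                // are assumptions about the surrounding Heritrix classes.
                CandidateURI caUri = new CandidateURI(UURIFactory.getInstance(uri));
                // schedule() does not affect priority; to influence it, call
                // caUri.setSchedulingDirective(int) before scheduling.
                frontier.schedule(caUri);
            } catch (URIException e) {
                System.err.println("Unschedulable URI " + uri + ": " + e.getMessage());
            }
        }
    }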




start
public void start()
Request that Frontier allow crawling to begin. Usually just unpauses Frontier, if paused.



succeededFetchCount
public long succeededFetchCount()
Number of successfully processed URIs.

Any URI that was processed successfully. This includes URIs that returned 404s and other error codes that do not originate within the crawler.
Returns: Number of successfully processed URIs.




terminate
public void terminate()
Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.



totalBytesWritten
public long totalBytesWritten()
Total number of bytes contained in all URIs that have been processed.
Returns: The total amount of bytes in all processed URIs.
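
Because this counter and the other counts above are simple accessors, a monitoring hook can combine them into a progress line. A minimal sketch using only methods from this interface (class name illustrative):

    import org.archive.crawler.framework.Frontier;

    public class FrontierStatsSketch {
        /** Illustrative only: build a one-line progress summary from the Frontier counters. */
        public static String summarize(Frontier frontier) {
            return "discovered=" + frontier.discoveredUriCount()
                    + " queued=" + frontier.queuedUriCount()
                    + " finished=" + frontier.finishedUriCount()
                    + " (ok=" + frontier.succeededFetchCount()
                    + ", failed=" + frontier.failedFetchCount()
                    + ", disregarded=" + frontier.disregardedUriCount() + ")"
                    + " bytes=" + frontier.totalBytesWritten();
        }
    }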



unpause
public void unpause()
Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.


