Java Doc for StatisticsTracker.java (Web Crawler » heritrix » org.archive.crawler.admin)



org.archive.crawler.framework.AbstractTracker
   org.archive.crawler.admin.StatisticsTracker

StatisticsTracker
public class StatisticsTracker extends AbstractTracker implements CrawlURIDispositionListener, Serializable
This is an implementation of AbstractTracker. It is designed to function with the WUI as well as to perform various logging activities.

At the end of each snapshot a line is written to the 'progress-statistics.log' file.

The header of that file is as follows:

 [timestamp] [discovered]    [queued] [downloaded] [doc/s(avg)]  [KB/s(avg)] [dl-failures] [busy-thread] [mem-use-KB]
First there is a timestamp, accurate down to 1 second.

discovered, queued, downloaded and dl-failures are (respectively) the discovered URI count, pending URI count, successfully fetched count and failed fetch count from the frontier at the time of the snapshot.

KB/s(avg) is the bandwidth usage. The total bytes downloaded is used to calculate the average bandwidth usage (KB/sec) since the start of the crawl. Since the value is also noted each time a snapshot is made, the bandwidth usage during the last snapshot period can be calculated to give a "current" rate. The first number is the current rate and the average is in parentheses.

doc/s(avg) works the same way as KB/s(avg), except it shows the number of documents (URIs) rather than KB downloaded.

busy-threads is the total number of ToeThreads that are not available (and thus presumably busy processing a URI). This information is extracted from the crawl controller.

Finally, mem-use-KB is extracted from the runtime environment (Runtime.getRuntime().totalMemory()).
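The current-versus-average bandwidth calculation described above can be sketched as follows. This is an illustrative sketch with made-up names and simplified arithmetic, not the actual Heritrix code: the average is computed over the whole elapsed time, while the "current" rate covers only the bytes seen since the previous snapshot.

```java
// Sketch of the snapshot rate arithmetic described above
// (illustrative names, not the actual Heritrix fields).
public class RateSketch {
    long lastBytes = 0;          // bytes noted at the previous snapshot
    long lastSnapshotMillis = 0; // time of the previous snapshot

    /** Average KB/sec over the whole crawl so far. */
    static long averageKBPerSec(long totalBytes, long elapsedMillis) {
        if (elapsedMillis <= 0) return 0;
        return (totalBytes / 1024) * 1000 / elapsedMillis;
    }

    /** "Current" KB/sec over the last snapshot period only. */
    long currentKBPerSec(long totalBytes, long nowMillis) {
        long periodMillis = nowMillis - lastSnapshotMillis;
        long kb = periodMillis <= 0 ? 0
                : ((totalBytes - lastBytes) / 1024) * 1000 / periodMillis;
        lastBytes = totalBytes;          // note the value for the next snapshot
        lastSnapshotMillis = nowMillis;
        return kb;
    }

    public static void main(String[] args) {
        // 2 MiB downloaded over 10 seconds -> 204 KB/sec average
        System.out.println(averageKBPerSec(2 * 1024 * 1024, 10_000));
    }
}
```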

In addition to the data collected for the above logs, various other data is gathered and stored by this tracker.

  • Successfully downloaded documents per fetch status code
  • Successfully downloaded documents per document mime type
  • Amount of data per mime type
  • Successfully downloaded documents per host
  • Amount of data per host
  • Disposition of all seeds (this is written to 'reports.log' at end of crawl)
  • Successfully downloaded documents per host per source

author:
   Parker Thompson
author:
   Kristinn Sigurdsson
See Also:   org.archive.crawler.framework.StatisticsTracking
See Also:   org.archive.crawler.framework.AbstractTracker


Field Summary
protected long averageDepth
protected int busyThreads
protected float congestionRatio
protected CrawledBytesHistotable crawledBytes
protected double currentDocsPerSecond
protected int currentKBPerSec
protected long deepestUri
protected long discoveredUriCount
protected double docsPerSecond
protected long downloadDisregards
protected long downloadFailures
protected long downloadedUriCount
protected long finishedUriCount
protected transient Map<String, LongWrapper> hostsBytes
protected transient Map<String, LongWrapper> hostsDistribution
    Keep track of hosts.
protected transient Map<String, Long> hostsLastFinished
protected long lastPagesFetchedCount
protected long lastProcessedBytesCount
protected Hashtable<String, LongWrapper> mimeTypeBytes
protected Hashtable<String, LongWrapper> mimeTypeDistribution
protected transient Map<String, SeedRecord> processedSeedsRecords
    Record of seeds' latest actions.
protected long queuedUriCount
protected transient Map<String, HashMap<String, LongWrapper>> sourceHostDistribution
protected Hashtable<String, LongWrapper> statusCodeDistribution
protected long totalKBPerSec
protected long totalProcessedBytes

Constructor Summary
public StatisticsTracker(String name)

Method Summary
public int activeThreadCount()
public long averageDepth()
    Average depth of the last URI in all eligible queues.
public float congestionRatio()
    Ratio of the number of threads that would theoretically allow maximum crawl progress (if each were as productive as current threads) to the current number of threads.
public void crawlCheckpoint(File cpDir)
public void crawlEnded(String message)
public String crawledBytesSummary()
public void crawledURIDisregard(CrawlURI curi)
public void crawledURIFailure(CrawlURI curi)
public void crawledURINeedRetry(CrawlURI curi)
public void crawledURISuccessful(CrawlURI curi)
public double currentProcessedDocsPerSec()
public int currentProcessedKBPerSec()
public long deepestUri()
    Ordinal position of the 'deepest' URI eligible for crawling.
public long discoveredUriCount()
    Number of discovered URIs.
public long disregardedFetchAttempts()
public void dumpReports()
    Run the reports.
public long failedFetchAttempts()
protected void finalCleanup()
public long finishedUriCount()
    Number of URIs that have finished processing.
public long getBytesPerFileType(String filetype)
    Returns the accumulated number of bytes from files of a given file type.
public long getBytesPerHost(String host)
    Returns the accumulated number of bytes downloaded from a given host.
public Hashtable<String, LongWrapper> getFileDistribution()
    Returns a Hashtable with information about the distribution of encountered mime types.
public long getHostLastFinished(String host)
    Returns the time (in milliseconds) when a URI belonging to a given host last finished processing.
public Map<String, Number> getProgressStatistics()
public String getProgressStatisticsLine(Date now)
public String getProgressStatisticsLine()
public TreeMap<String, LongWrapper> getReverseSortedCopy(Map<String, LongWrapper> mapOfLongWrapperValues)
    Sort the entries of the given map in descending order by their values, which must be longs wrapped with LongWrapper.
public SortedMap getReverseSortedHostCounts(Map<String, LongWrapper> hostCounts)
    Return a copy of the given host counts in reverse-sorted (largest first) order.
public SortedMap getReverseSortedHostsDistribution()
    Return a copy of the hosts distribution in reverse-sorted (largest first) order.
public Iterator getSeedRecordsSortedByStatusCode()
protected Iterator<SeedRecord> getSeedRecordsSortedByStatusCode(Iterator<String> i)
public Iterator<String> getSeeds()
    Get a seed iterator for the job being monitored.
public Hashtable<String, LongWrapper> getStatusCodeDistribution()
    Return a Hashtable representing the distribution of status codes for successfully fetched URIs, where each key -> value pair represents (String)code -> count.
protected static void incrementMapCount(Map<String, LongWrapper> map, String key)
    Increment a counter for a key in a given map.
protected static void incrementMapCount(Map<String, LongWrapper> map, String key, long increment)
    Increment a counter for a key in a given map by an arbitrary amount. Used for various aggregate data.
public void initialize(CrawlController c)
public int percentOfDiscoveredUrisCompleted()
public double processedDocsPerSec()
public long processedKBPerSec()
protected synchronized void progressStatisticsEvent(EventObject e)
public long queuedUriCount()
    Number of URIs queued up and waiting for processing.
protected void saveHostStats(String hostname, long size)
protected void saveSourceStats(String source, String hostname)
public long successfullyFetchedCount()
public int threadCount()
public long totalBytesCrawled()
public long totalBytesWritten()
public long totalCount()
protected void writeCrawlReportTo(PrintWriter writer)
protected void writeFrontierReportTo(PrintWriter writer)
protected void writeHostsReportTo(PrintWriter writer)
protected void writeManifestReportTo(PrintWriter writer)
protected void writeMimetypesReportTo(PrintWriter writer)
protected void writeProcessorsReportTo(PrintWriter writer)
protected void writeReportFile(String reportName, String filename)
protected void writeResponseCodeReportTo(PrintWriter writer)
protected void writeSeedsReportTo(PrintWriter writer)
protected void writeSourceReportTo(PrintWriter writer)

Field Detail
averageDepth
protected long averageDepth

busyThreads
protected int busyThreads

congestionRatio
protected float congestionRatio

crawledBytes
protected CrawledBytesHistotable crawledBytes
    Tally of sizes: novel, verified (same hash), vouched (not-modified).

currentDocsPerSecond
protected double currentDocsPerSecond

currentKBPerSec
protected int currentKBPerSec

deepestUri
protected long deepestUri

discoveredUriCount
protected long discoveredUriCount

docsPerSecond
protected double docsPerSecond

downloadDisregards
protected long downloadDisregards

downloadFailures
protected long downloadFailures

downloadedUriCount
protected long downloadedUriCount

finishedUriCount
protected long finishedUriCount

hostsBytes
protected transient Map<String, LongWrapper> hostsBytes

hostsDistribution
protected transient Map<String, LongWrapper> hostsDistribution
    Keep track of hosts. Each of these maps is individually unsynchronized and cannot be trivially synchronized with the Collections wrapper; their synchronized access is therefore enforced by this class. They're transient because they are usually big maps that get reconstituted on recovery from a checkpoint.
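The externally-enforced synchronization described above, where the tracker itself rather than the map is the lock, can be sketched roughly as follows. Class and method names here are illustrative, not Heritrix's:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of external synchronization over an unsynchronized map:
// all access goes through methods synchronized on the enclosing object,
// so the plain HashMap never needs its own locking.
public class HostTallySketch {
    private final Map<String, Long> hostsDistribution = new HashMap<>();

    /** Record one more finished URI for the given host. */
    public synchronized void tally(String host) {
        hostsDistribution.merge(host, 1L, Long::sum);
    }

    /** Read a host's count under the same lock that guards writes. */
    public synchronized long count(String host) {
        return hostsDistribution.getOrDefault(host, 0L);
    }

    public static void main(String[] args) {
        HostTallySketch t = new HostTallySketch();
        t.tally("example.org");
        t.tally("example.org");
        System.out.println(t.count("example.org")); // prints 2
    }
}
```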




hostsLastFinished
protected transient Map<String, Long> hostsLastFinished

lastPagesFetchedCount
protected long lastPagesFetchedCount

lastProcessedBytesCount
protected long lastProcessedBytesCount

mimeTypeBytes
protected Hashtable<String, LongWrapper> mimeTypeBytes

mimeTypeDistribution
protected Hashtable<String, LongWrapper> mimeTypeDistribution
    Keep track of the file types we see (mime type -> count).

processedSeedsRecords
protected transient Map<String, SeedRecord> processedSeedsRecords
    Record of seeds' latest actions.

queuedUriCount
protected long queuedUriCount

sourceHostDistribution
protected transient Map<String, HashMap<String, LongWrapper>> sourceHostDistribution
    Keep track of URL counts per host per seed.

statusCodeDistribution
protected Hashtable<String, LongWrapper> statusCodeDistribution
    Keep track of fetch status codes.

totalKBPerSec
protected long totalKBPerSec

totalProcessedBytes
protected long totalProcessedBytes

Constructor Detail
StatisticsTracker
public StatisticsTracker(String name)

Method Detail
activeThreadCount
public int activeThreadCount()
    Returns: current thread count (or zero if it cannot be determined).

averageDepth
public long averageDepth()
    Average depth of the last URI in all eligible queues; that is, the average length of all eligible queues.
    Returns: average depth of the last URIs in the queues.

congestionRatio
public float congestionRatio()
    Ratio of the number of threads that would theoretically allow maximum crawl progress (if each were as productive as current threads) to the current number of threads.
    Returns: the congestion ratio.



crawlCheckpoint
public void crawlCheckpoint(File cpDir) throws Exception

crawlEnded
public void crawlEnded(String message)

crawledBytesSummary
public String crawledBytesSummary()



crawledURIDisregard
public void crawledURIDisregard(CrawlURI curi)

crawledURIFailure
public void crawledURIFailure(CrawlURI curi)

crawledURINeedRetry
public void crawledURINeedRetry(CrawlURI curi)

crawledURISuccessful
public void crawledURISuccessful(CrawlURI curi)



currentProcessedDocsPerSec
public double currentProcessedDocsPerSec()

currentProcessedKBPerSec
public int currentProcessedKBPerSec()

deepestUri
public long deepestUri()
    Ordinal position of the 'deepest' URI eligible for crawling; essentially, the length of the longest frontier internal queue.
    Returns: URI count to the deepest URI.



discoveredUriCount
public long discoveredUriCount()
    Number of discovered URIs. If the crawl is not running (paused or stopped) this will return the value of the last snapshot.
    Returns: a count of all URIs encountered.
    See Also: org.archive.crawler.framework.Frontier.discoveredUriCount

disregardedFetchAttempts
public long disregardedFetchAttempts()
    Get the total number of disregarded fetch attempts.
    Returns: the total number of disregarded fetch attempts.



dumpReports
public void dumpReports()
    Run the reports.

failedFetchAttempts
public long failedFetchAttempts()
    Get the total number of failed fetch attempts (connection failures, give-ups, etc.).
    Returns: the total number of failed fetch attempts.



finalCleanup
protected void finalCleanup()

finishedUriCount
public long finishedUriCount()
    Number of URIs that have finished processing.
    Returns: the number of URIs that have finished processing.
    See Also: org.archive.crawler.framework.Frontier.finishedUriCount

getBytesPerFileType
public long getBytesPerFileType(String filetype)
    Returns the accumulated number of bytes from files of a given file type.
    Parameters:
      filetype - Filetype to check.
    Returns: the accumulated number of bytes from files of the given mime type.



getBytesPerHost
public long getBytesPerHost(String host)
    Returns the accumulated number of bytes downloaded from a given host.
    Parameters:
      host - name of the host.
    Returns: the accumulated number of bytes downloaded from the given host.

getFileDistribution
public Hashtable<String, LongWrapper> getFileDistribution()
    Returns a Hashtable with information about the distribution of encountered mime types. Key/value pairs represent mime type -> count. Note: all values are wrapped with a LongWrapper.
    Returns: mimeTypeDistribution




getHostLastFinished
public long getHostLastFinished(String host)
    Returns the time (in milliseconds) when a URI belonging to a given host last finished processing.
    Parameters:
      host - The host to look up the time of the last completed URI for.
    Returns: the time (in milliseconds) when a URI belonging to the given host last finished processing; if no URI has been completed for the host, -1 is returned.



getProgressStatistics
public Map<String, Number> getProgressStatistics()

getProgressStatisticsLine
public String getProgressStatisticsLine(Date now)
    Return one line of current progress statistics.
    Parameters:
      now -
    Returns: a string of stats.

getProgressStatisticsLine
public String getProgressStatisticsLine()
    Return one line of current progress statistics.
    Returns: a string of stats.



getReverseSortedCopy
public TreeMap<String, LongWrapper> getReverseSortedCopy(Map<String, LongWrapper> mapOfLongWrapperValues)
    Sort the entries of the given map in descending order by their values, which must be longs wrapped with LongWrapper. Elements are sorted by value from largest to smallest. Equal values are sorted in an arbitrary but consistent manner by their keys; only items with identical value and key are considered equal. If the passed-in map requires synchronized access, the caller should ensure that synchronization.
    Parameters:
      mapOfLongWrapperValues - Assumes values are wrapped with LongWrapper.
    Returns: a sorted map containing the same elements as the input map.
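The value-descending, key-tie-breaking ordering described above could be implemented roughly like this sketch, which uses plain Long values in place of Heritrix's LongWrapper and is not the actual implementation:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ReverseSortSketch {
    /** Copy a map into a TreeMap ordered by value descending, ties broken by key. */
    static TreeMap<String, Long> reverseSortedCopy(Map<String, Long> in) {
        Comparator<String> byValueDesc = (a, b) -> {
            int c = Long.compare(in.get(b), in.get(a)); // largest value first
            return c != 0 ? c : a.compareTo(b);         // consistent tie-break on key
        };
        TreeMap<String, Long> out = new TreeMap<>(byValueDesc);
        out.putAll(in);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> hosts = new HashMap<>();
        hosts.put("a.example", 1L);
        hosts.put("b.example", 3L);
        System.out.println(reverseSortedCopy(hosts).firstKey()); // prints b.example
    }
}
```

Note the key tie-break: without it, two hosts with equal counts would compare as "equal" and one would silently vanish from the TreeMap, which is why the contract above insists only items with identical value and key are considered equal.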




getReverseSortedHostCounts
public SortedMap getReverseSortedHostCounts(Map<String, LongWrapper> hostCounts)
    Return a copy of the given host counts in reverse-sorted (largest first) order.
    Returns: a SortedMap of the host counts.

getReverseSortedHostsDistribution
public SortedMap getReverseSortedHostsDistribution()
    Return a copy of the hosts distribution in reverse-sorted (largest first) order.
    Returns: a SortedMap of the hosts distribution.

getSeedRecordsSortedByStatusCode
public Iterator getSeedRecordsSortedByStatusCode()

getSeedRecordsSortedByStatusCode
protected Iterator<SeedRecord> getSeedRecordsSortedByStatusCode(Iterator<String> i)

getSeeds
public Iterator<String> getSeeds()
    Get a seed iterator for the job being monitored. Note: this iterator iterates over a list of Strings, not UURIs like the Scope seed iterator; the strings are equal to the URIs' getURIString() values.
    Returns: the seed iterator.
    FIXME: Consider using TransformingIterator here.

getStatusCodeDistribution
public Hashtable<String, LongWrapper> getStatusCodeDistribution()
    Return a Hashtable representing the distribution of status codes for successfully fetched URIs, where each key -> value pair represents (String)code -> count. Note: all values are wrapped with a LongWrapper.
    Returns: statusCodeDistribution



incrementMapCount
protected static void incrementMapCount(Map<String, LongWrapper> map, String key)
    Increment a counter for a key in a given map. Used for various aggregate data. As this is used to change maps that depend on StatisticsTracker for their synchronization, this method should only be invoked from a block synchronized on 'this'.
    Parameters:
      map - The map.
      key - The key for the counter to be incremented; if it does not exist it will be added (set to 1). If null, the counter "unknown" is incremented.

incrementMapCount
protected static void incrementMapCount(Map<String, LongWrapper> map, String key, long increment)
    Increment a counter for a key in a given map by an arbitrary amount (the increment may be negative). Used for various aggregate data. As this is used to change maps that depend on StatisticsTracker for their synchronization, this method should only be invoked from a block synchronized on 'this'.
    Parameters:
      map - The map.
      key - The key for the counter to be incremented; if it does not exist it will be added (set equal to increment). If null, the counter "unknown" is incremented.
      increment - The amount to increment the counter for the key.
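A plausible sketch of these two overloads, using plain Long values in place of LongWrapper (illustrative only; per the contract above, synchronization on the tracker is left to the caller):

```java
import java.util.HashMap;
import java.util.Map;

public class MapCountSketch {
    /** Increment the counter for key by increment, creating it if absent;
     *  a null key is tallied under "unknown". Caller must hold the tracker lock. */
    static void incrementMapCount(Map<String, Long> map, String key, long increment) {
        if (key == null) {
            key = "unknown";
        }
        map.merge(key, increment, Long::sum); // insert-or-add in one step
    }

    /** Convenience overload: increment by 1. */
    static void incrementMapCount(Map<String, Long> map, String key) {
        incrementMapCount(map, key, 1);
    }

    public static void main(String[] args) {
        Map<String, Long> mimeTypes = new HashMap<>();
        incrementMapCount(mimeTypes, "text/html");
        incrementMapCount(mimeTypes, "text/html", 5);
        System.out.println(mimeTypes.get("text/html")); // prints 6
    }
}
```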



initialize
public void initialize(CrawlController c) throws FatalConfigurationException

percentOfDiscoveredUrisCompleted
public int percentOfDiscoveredUrisCompleted()
    Returns the number of completed URIs as a percentage of the total number of URIs encountered (should be inverse to the discovery curve).
    Returns: the number of completed URIs as a percentage of the total number of URIs encountered.
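The percentage arithmetic is straightforward; as a sketch (the zero-discovered guard is an assumption added here, not stated above):

```java
public class PercentSketch {
    /** Completed URIs as an integer percentage of all URIs encountered. */
    static int percentCompleted(long finished, long discovered) {
        if (discovered == 0) {
            return 0; // assumed guard: nothing discovered yet
        }
        return (int) (finished * 100 / discovered);
    }

    public static void main(String[] args) {
        System.out.println(percentCompleted(50, 200)); // prints 25
    }
}
```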



processedDocsPerSec
public double processedDocsPerSec()

processedKBPerSec
public long processedKBPerSec()

progressStatisticsEvent
protected synchronized void progressStatisticsEvent(EventObject e)

queuedUriCount
public long queuedUriCount()
    Number of URIs queued up and waiting for processing. If the crawl is not running (paused or stopped) this will return the value of the last snapshot.
    Returns: the number of URIs queued up and waiting for processing.
    See Also: org.archive.crawler.framework.Frontier.queuedUriCount




saveHostStats
protected void saveHostStats(String hostname, long size)

saveSourceStats
protected void saveSourceStats(String source, String hostname)

successfullyFetchedCount
public long successfullyFetchedCount()

threadCount
public int threadCount()
    Get the total number of ToeThreads (sleeping and active).
    Returns: the total number of ToeThreads.



totalBytesCrawled
public long totalBytesCrawled()

totalBytesWritten
public long totalBytesWritten()

totalCount
public long totalCount()

writeCrawlReportTo
protected void writeCrawlReportTo(PrintWriter writer)

writeFrontierReportTo
protected void writeFrontierReportTo(PrintWriter writer)
    Write the Frontier's 'nonempty' report (if available).
    Parameters:
      writer - to report to.



writeHostsReportTo
protected void writeHostsReportTo(PrintWriter writer)

writeManifestReportTo
protected void writeManifestReportTo(PrintWriter writer)
    Parameters:
      writer - Where to write.

writeMimetypesReportTo
protected void writeMimetypesReportTo(PrintWriter writer)

writeProcessorsReportTo
protected void writeProcessorsReportTo(PrintWriter writer)

writeReportFile
protected void writeReportFile(String reportName, String filename)

writeResponseCodeReportTo
protected void writeResponseCodeReportTo(PrintWriter writer)

writeSeedsReportTo
protected void writeSeedsReportTo(PrintWriter writer)
    Parameters:
      writer - Where to write.



writeSourceReportTo
protected void writeSourceReportTo(PrintWriter writer)

Fields inherited from org.archive.crawler.framework.AbstractTracker
final public static String ATTR_STATS_INTERVAL
final public static Integer DEFAULT_STATISTICS_REPORT_INTERVAL
protected transient CrawlController controller
protected long crawlerEndTime
protected long crawlerPauseStarted
protected long crawlerStartTime
protected long crawlerTotalPausedTime
protected long lastLogPointTime
protected boolean shouldrun

Methods inherited from org.archive.crawler.framework.AbstractTracker
public long crawlDuration()
public void crawlEnded(String sExitMessage)
public void crawlEnding(String sExitMessage)
public void crawlPaused(String statusMessage)
public void crawlPausing(String statusMessage)
public void crawlResuming(String statusMessage)
public void crawlStarted(String message)
protected void dumpReports()
protected void finalCleanup()
public long getCrawlEndTime()
public long getCrawlPauseStartedTime()
public long getCrawlStartTime()
public long getCrawlTotalPauseTime()
public long getCrawlerTotalElapsedTime()
protected int getLogWriteInterval()
public void initialize(CrawlController c) throws FatalConfigurationException
protected void logNote(String note)
public void noteStart()
protected synchronized void progressStatisticsEvent(EventObject e)
public String progressStatisticsLegend()
public void run()
protected void tallyCurrentPause()

www.java2java.com
Copyright 2009 - 12 Demo Source and Support. All rights reserved.
All other trademarks are property of their respective owners.