org.archive.crawler.framework.AbstractTracker
    org.archive.crawler.admin.StatisticsTracker
StatisticsTracker | public class StatisticsTracker extends AbstractTracker implements CrawlURIDispositionListener, Serializable(Code) | | This is an implementation of the AbstractTracker. It is designed to work
with the WUI as well as to perform various logging activities.
At the end of each snapshot a line is written to the
'progress-statistics.log' file.
The header of that file is as follows:
[timestamp] [discovered] [queued] [downloaded] [doc/s(avg)] [KB/s(avg)] [dl-failures] [busy-thread] [mem-use-KB]
First there is a timestamp, accurate down to 1 second.
discovered, queued, downloaded and dl-failures
are (respectively) the discovered URI count, pending URI count, successfully
fetched count and failed fetch count from the frontier at the time of the
snapshot.
KB/s(avg) is the bandwidth usage. The total number of bytes downloaded is used
to calculate the average bandwidth usage (KB/sec). Since the value is also noted
each time a snapshot is made, the bandwidth usage during the last snapshot period
can be calculated to give a "current" rate. The first number is
the current rate and the average is in parentheses.
doc/s(avg) works the same way, except it shows the number of
documents (URIs) rather than KB downloaded.
busy-threads is the total number of ToeThreads that are not available
(and thus presumably busy processing a URI). This information is extracted
from the crawl controller.
Finally, mem-use-KB is extracted from the runtime environment
(Runtime.getRuntime().totalMemory()).
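The split between the "current" number and the parenthesised average can be sketched in a few lines. This is only an illustration: the protected fields totalProcessedBytes and lastProcessedBytesCount listed further down hold the running byte totals, while the method and time parameters below are hypothetical.

    // Hedged sketch: how a "current" KB/s and the parenthesised average could be
    // derived from the running totals. Parameter names are illustrative only.
    static long[] kbPerSec(long totalProcessedBytes, long lastProcessedBytesCount,
                           long crawlStartMs, long lastSnapshotMs, long nowMs) {
        long periodSec = Math.max(1, (nowMs - lastSnapshotMs) / 1000);
        long crawlSec = Math.max(1, (nowMs - crawlStartMs) / 1000);
        // "Current" rate: only the bytes downloaded during the last snapshot period.
        long current = ((totalProcessedBytes - lastProcessedBytesCount) / 1024) / periodSec;
        // Average rate (the number in parentheses): all bytes over the whole crawl time.
        long average = (totalProcessedBytes / 1024) / crawlSec;
        return new long[] { current, average };
    }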
In addition to the data collected for the above logs, various other data
is gathered and stored by this tracker.
- Successfully downloaded documents per fetch status code
- Successfully downloaded documents per document mime type
- Amount of data per mime type
- Successfully downloaded documents per host
- Amount of data per host
- Disposition of all seeds (this is written to 'reports.log' at end of
crawl)
- Successfully downloaded documents per host per source
author: Parker Thompson
author: Kristinn Sigurdsson
See Also: org.archive.crawler.framework.StatisticsTracking
See Also: org.archive.crawler.framework.AbstractTracker
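As a rough usage illustration, the hedged sketch below prints a one-line progress summary from a tracker instance using only the getters documented in the summary that follows. The class and method names here are hypothetical, and obtaining the tracker from a running crawl is left out.

    import org.archive.crawler.admin.StatisticsTracker;

    // Hedged sketch: print crawl progress using getters from this page.
    public class ProgressPrinter {
        public static void printProgress(StatisticsTracker tracker) {
            // The pre-formatted line matches the progress-statistics.log header above.
            System.out.println(tracker.getProgressStatisticsLine());
            // Or assemble the raw counters directly.
            System.out.println("discovered=" + tracker.discoveredUriCount()
                    + " queued=" + tracker.queuedUriCount()
                    + " finished=" + tracker.finishedUriCount()
                    + " busy-threads=" + tracker.activeThreadCount()
                    + " KB/s(avg)=" + tracker.currentProcessedKBPerSec()
                    + "(" + tracker.processedKBPerSec() + ")");
        }
    }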
Method Summary |
public int | activeThreadCount() |
public long | averageDepth() Average depth of the last URI in all eligible queues. |
public float | congestionRatio() Ratio of number of threads that would theoretically allow maximum crawl progress (if each was as productive as current threads), to current number of threads. |
public void | crawlCheckpoint(File cpDir) |
public void | crawlEnded(String message) |
public String | crawledBytesSummary() |
public void | crawledURIDisregard(CrawlURI curi) |
public void | crawledURIFailure(CrawlURI curi) |
public void | crawledURINeedRetry(CrawlURI curi) |
public void | crawledURISuccessful(CrawlURI curi) |
public double | currentProcessedDocsPerSec() |
public int | currentProcessedKBPerSec() |
public long | deepestUri() Ordinal position of the 'deepest' URI eligible for crawling. |
public long | discoveredUriCount() Number of discovered URIs. |
public long | disregardedFetchAttempts() |
public void | dumpReports() Run the reports. |
public long | failedFetchAttempts() |
protected void | finalCleanup() |
public long | finishedUriCount() Number of URIs that have finished processing. |
public long | getBytesPerFileType(String filetype) Returns the accumulated number of bytes from files of a given file type. |
public long | getBytesPerHost(String host) Returns the accumulated number of bytes downloaded from a given host. |
public Hashtable<String, LongWrapper> | getFileDistribution() Returns a HashMap that contains information about distributions of encountered mime types. |
public long | getHostLastFinished(String host) Returns the time (in millisec) when a URI belonging to a given host was last finished processing. |
public Map<String, Number> | getProgressStatistics() |
public String | getProgressStatisticsLine(Date now) |
public String | getProgressStatisticsLine() |
public TreeMap<String, LongWrapper> | getReverseSortedCopy(Map<String, LongWrapper> mapOfLongWrapperValues) Sort the entries of the given HashMap in descending order by their values, which must be longs wrapped with LongWrapper. |
public SortedMap | getReverseSortedHostCounts(Map<String, LongWrapper> hostCounts) Return a copy of the hosts distribution in reverse-sorted (largest first) order. |
public SortedMap | getReverseSortedHostsDistribution() Return a copy of the hosts distribution in reverse-sorted (largest first) order. |
public Iterator | getSeedRecordsSortedByStatusCode() |
protected Iterator<SeedRecord> | getSeedRecordsSortedByStatusCode(Iterator<String> i) |
public Iterator<String> | getSeeds() Get a seed iterator for the job being monitored. |
public Hashtable<String, LongWrapper> | getStatusCodeDistribution() Return a HashMap representing the distribution of status codes for successfully fetched curis. |
protected static void | incrementMapCount(Map<String, LongWrapper> map, String key) Increment a counter for a key in a given HashMap. |
protected static void | incrementMapCount(Map<String, LongWrapper> map, String key, long increment) Increment a counter for a key in a given HashMap by an arbitrary amount. |
public void | initialize(CrawlController c) |
public int | percentOfDiscoveredUrisCompleted() |
public double | processedDocsPerSec() |
public long | processedKBPerSec() |
protected synchronized void | progressStatisticsEvent(EventObject e) |
public long | queuedUriCount() Number of URIs queued up and waiting for processing. |
protected void | saveHostStats(String hostname, long size) |
protected void | saveSourceStats(String source, String hostname) |
public long | successfullyFetchedCount() |
public int | threadCount() |
public long | totalBytesCrawled() |
public long | totalBytesWritten() |
public long | totalCount() |
protected void | writeCrawlReportTo(PrintWriter writer) |
protected void | writeFrontierReportTo(PrintWriter writer) |
protected void | writeHostsReportTo(PrintWriter writer) |
protected void | writeManifestReportTo(PrintWriter writer) |
protected void | writeMimetypesReportTo(PrintWriter writer) |
protected void | writeProcessorsReportTo(PrintWriter writer) |
protected void | writeReportFile(String reportName, String filename) |
protected void | writeResponseCodeReportTo(PrintWriter writer) |
protected void | writeSeedsReportTo(PrintWriter writer) |
protected void | writeSourceReportTo(PrintWriter writer) |
averageDepth | protected long averageDepth(Code) | | |
busyThreads | protected int busyThreads(Code) | | |
congestionRatio | protected float congestionRatio(Code) | | |
currentDocsPerSecond | protected double currentDocsPerSecond(Code) | | |
currentKBPerSec | protected int currentKBPerSec(Code) | | |
deepestUri | protected long deepestUri(Code) | | |
discoveredUriCount | protected long discoveredUriCount(Code) | | |
docsPerSecond | protected double docsPerSecond(Code) | | |
downloadDisregards | protected long downloadDisregards(Code) | | |
downloadFailures | protected long downloadFailures(Code) | | |
downloadedUriCount | protected long downloadedUriCount(Code) | | |
finishedUriCount | protected long finishedUriCount(Code) | | |
hostsDistribution | protected transient Map<String, LongWrapper> hostsDistribution(Code) | | Keep track of hosts.
Each of these Maps is individually unsynchronized and cannot
be trivially synchronized with the Collections wrapper. Thus
their synchronized access is enforced by this class.
They're transient because they are usually bigmaps that get
reconstituted on recovery from a checkpoint.
|
lastPagesFetchedCount | protected long lastPagesFetchedCount(Code) | | |
lastProcessedBytesCount | protected long lastProcessedBytesCount(Code) | | |
processedSeedsRecords | protected transient Map<String, SeedRecord> processedSeedsRecords(Code) | | Record of seeds' latest actions.
|
queuedUriCount | protected long queuedUriCount(Code) | | |
totalKBPerSec | protected long totalKBPerSec(Code) | | |
totalProcessedBytes | protected long totalProcessedBytes(Code) | | |
StatisticsTracker | public StatisticsTracker(String name)(Code) | | |
activeThreadCount | public int activeThreadCount()(Code) | | Returns: the current thread count (or zero if it cannot be determined). |
averageDepth | public long averageDepth()(Code) | | Average depth of the last URI in all eligible queues.
That is, the average length of all eligible queues.
Returns: the average depth of the last URIs in the queues |
congestionRatio | public float congestionRatio()(Code) | | Ratio of number of threads that would theoretically allow
maximum crawl progress (if each was as productive as current
threads), to current number of threads.
Returns: the congestion ratio (float) |
crawledBytesSummary | public String crawledBytesSummary()(Code) | | |
crawledURIDisregard | public void crawledURIDisregard(CrawlURI curi)(Code) | | |
crawledURINeedRetry | public void crawledURINeedRetry(CrawlURI curi)(Code) | | |
crawledURISuccessful | public void crawledURISuccessful(CrawlURI curi)(Code) | | |
currentProcessedDocsPerSec | public double currentProcessedDocsPerSec()(Code) | | |
currentProcessedKBPerSec | public int currentProcessedKBPerSec()(Code) | | |
deepestUri | public long deepestUri()(Code) | | Ordinal position of the 'deepest' URI eligible
for crawling. Essentially, the length of the longest
frontier internal queue.
Returns: the URI count to the deepest URI |
disregardedFetchAttempts | public long disregardedFetchAttempts()(Code) | | Get the total number of disregarded fetch attempts.
Returns: the total number of disregarded fetch attempts |
dumpReports | public void dumpReports()(Code) | | Run the reports.
|
failedFetchAttempts | public long failedFetchAttempts()(Code) | | Get the total number of failed fetch attempts (connection failures -> give up, etc.).
Returns: the total number of failed fetch attempts |
finalCleanup | protected void finalCleanup()(Code) | | |
getBytesPerFileType | public long getBytesPerFileType(String filetype)(Code) | | Returns the accumulated number of bytes from files of a given file type.
Parameters: filetype - Filetype to check. Returns: the accumulated number of bytes from files of the given mime type |
getBytesPerHost | public long getBytesPerHost(String host)(Code) | | Returns the accumulated number of bytes downloaded from a given host.
Parameters: host - name of the host. Returns: the accumulated number of bytes downloaded from the given host |
getFileDistribution | public Hashtable<String, LongWrapper> getFileDistribution()(Code) | | Returns a HashMap that contains information about distributions of
encountered mime types. Key/value pairs represent
mime type -> count.
Note: All the values are wrapped with a LongWrapper.
Returns: mimeTypeDistribution |
getHostLastFinished | public long getHostLastFinished(String host)(Code) | | Returns the time (in millisec) when a URI belonging to a given host was
last finished processing.
Parameters: host - The host to look up the time of the last completed URI for. Returns: the time (in millisec) when a URI belonging to the given host last finished processing, or -1 if no URI has been completed for that host. |
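A small hedged sketch of handling the -1 sentinel described above; the helper name is illustrative:

    // Hedged sketch: turn the -1 sentinel into a readable message.
    static String lastFinishedFor(org.archive.crawler.admin.StatisticsTracker tracker, String host) {
        long t = tracker.getHostLastFinished(host);
        return (t == -1) ? host + ": no URI completed yet"
                         : host + ": last URI finished at " + new java.util.Date(t);
    }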
getProgressStatisticsLine | public String getProgressStatisticsLine(Date now)(Code) | | Return one line of current progress-statistics
Parameters: now - the current time. Returns: String of stats |
getProgressStatisticsLine | public String getProgressStatisticsLine()(Code) | | Return one line of current progress-statistics
Returns: String of stats |
getReverseSortedCopy | public TreeMap<String, LongWrapper> getReverseSortedCopy(Map<String, LongWrapper> mapOfLongWrapperValues)(Code) | | Sort the entries of the given HashMap in descending order by their
values, which must be longs wrapped with LongWrapper .
Elements are sorted by value from largest to smallest. Equal values are
sorted in an arbitrary, but consistent manner by their keys. Only items
with identical value and key are considered equal.
If the passed-in map requires access to be synchronized, the caller
should ensure this synchronization.
Parameters: mapOfLongWrapperValues - Assumes values are wrapped with LongWrapper. Returns: a sorted map containing the same elements as the given map. |
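A hedged example of combining this method with getFileDistribution() to list encountered mime types by count. It assumes LongWrapper (taken here to live in org.archive.util) exposes its value through a public longValue field; the class name is illustrative. Per the note above, synchronize on the tracker if the source map requires it.

    import java.util.Map;
    import java.util.TreeMap;
    import org.archive.crawler.admin.StatisticsTracker;
    import org.archive.util.LongWrapper;

    // Hedged sketch: print mime types in descending order of occurrence.
    public class MimeTypeReport {
        public static void print(StatisticsTracker tracker) {
            TreeMap<String, LongWrapper> sorted =
                    tracker.getReverseSortedCopy(tracker.getFileDistribution());
            for (Map.Entry<String, LongWrapper> entry : sorted.entrySet()) {
                // Assumption: LongWrapper stores its value in a public 'longValue' field.
                System.out.println(entry.getValue().longValue + "  " + entry.getKey());
            }
        }
    }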
getReverseSortedHostCounts | public SortedMap getReverseSortedHostCounts(Map<String, LongWrapper> hostCounts)(Code) | | Return a copy of the hosts distribution in reverse-sorted (largest first)
order.
Returns: SortedMap of the hosts distribution |
getReverseSortedHostsDistribution | public SortedMap getReverseSortedHostsDistribution()(Code) | | Return a copy of the hosts distribution in reverse-sorted
(largest first) order.
Returns: SortedMap of the hosts distribution |
getSeedRecordsSortedByStatusCode | public Iterator getSeedRecordsSortedByStatusCode()(Code) | | |
getSeeds | public Iterator<String> getSeeds()(Code) | | Get a seed iterator for the job being monitored.
Note: This iterator will iterate over a list of strings not
UURIs like the Scope seed iterator. The strings are equal to the URIs'
getURIString() values.
Returns: the seed iterator. FIXME: Consider using TransformingIterator here |
getStatusCodeDistribution | public Hashtable<String, LongWrapper> getStatusCodeDistribution()(Code) | | Return a HashMap representing the distribution of status codes for
successfully fetched curis, where key -> val represents (string)code -> (integer)count.
Note: All the values are wrapped with a LongWrapper.
Returns: statusCodeDistribution |
incrementMapCount | protected static void incrementMapCount(Map<String, LongWrapper> map, String key)(Code) | | Increment a counter for a key in a given HashMap. Used for various
aggregate data.
As this is used to change Maps which depend on StatisticsTracker
for their synchronization, this method should only be invoked
from a block synchronized on 'this'.
Parameters: map - The HashMap. Parameters: key - The key for the counter to be incremented; if it does not exist it will be added (set to 1). If null, the counter "unknown" will be incremented. |
incrementMapCount | protected static void incrementMapCount(Map<String, LongWrapper> map, String key, long increment)(Code) | | Increment a counter for a key in a given HashMap by an arbitrary amount.
Used for various aggregate data. The increment amount can be negative.
As this is used to change Maps which depend on StatisticsTracker
for their synchronization, this method should only be invoked
from a block synchronized on 'this'.
Parameters: map - The HashMap. Parameters: key - The key for the counter to be incremented; if it does not exist it will be added (set equal to increment). If null, the counter "unknown" will be incremented. Parameters: increment - The amount by which to increment the counter for the key. |
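The synchronized-on-'this' requirement can be illustrated with a hedged sketch of a hypothetical subclass that keeps one extra counter map. The subclass, field, and hook method are assumptions (as is LongWrapper's org.archive.util package); only incrementMapCount and the locking rule come from this page.

    import java.util.Hashtable;
    import java.util.Map;
    import org.archive.crawler.admin.StatisticsTracker;
    import org.archive.util.LongWrapper;

    // Hedged sketch: update an extra map only while holding the tracker's own lock.
    public class MyStatisticsTracker extends StatisticsTracker {
        protected Map<String, LongWrapper> redirectsPerHost =
                new Hashtable<String, LongWrapper>();

        public MyStatisticsTracker(String name) {
            super(name);
        }

        // Hypothetical hook; a real subclass would call this from its own bookkeeping.
        protected void noteRedirect(String hostname) {
            synchronized (this) {
                incrementMapCount(redirectsPerHost, hostname);
            }
        }
    }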
percentOfDiscoveredUrisCompleted | public int percentOfDiscoveredUrisCompleted()(Code) | | This returns the number of completed URIs as a percentage of the total
number of URIs encountered (should be inverse to the discovery curve).
Returns: the number of completed URIs as a percentage of the total number of URIs encountered |
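The same figure can be derived from the raw counters documented on this page; a minimal hedged sketch (rounding may differ slightly from the method's own):

    // Hedged sketch: completed URIs as a percentage of discovered URIs.
    static int completedPercent(org.archive.crawler.admin.StatisticsTracker tracker) {
        long discovered = tracker.discoveredUriCount();
        if (discovered == 0) {
            return 0; // nothing discovered yet, avoid division by zero
        }
        return (int) ((tracker.finishedUriCount() * 100) / discovered);
    }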
processedDocsPerSec | public double processedDocsPerSec()(Code) | | |
processedKBPerSec | public long processedKBPerSec()(Code) | | |
progressStatisticsEvent | protected synchronized void progressStatisticsEvent(EventObject e)(Code) | | |
queuedUriCount | public long queuedUriCount()(Code) | | Number of URIs queued up and waiting for processing.
If the crawl is not running (paused or stopped), this will return the value
from the last snapshot.
Returns: the number of URIs queued up and waiting for processing. See Also: org.archive.crawler.framework.Frontier.queuedUriCount |
saveHostStats | protected void saveHostStats(String hostname, long size)(Code) | | |
successfullyFetchedCount | public long successfullyFetchedCount()(Code) | | |
threadCount | public int threadCount()(Code) | | Get the total number of ToeThreads (sleeping and active).
Returns: the total number of ToeThreads |
totalBytesCrawled | public long totalBytesCrawled()(Code) | | |
totalBytesWritten | public long totalBytesWritten()(Code) | | |
totalCount | public long totalCount()(Code) | | |
writeFrontierReportTo | protected void writeFrontierReportTo(PrintWriter writer)(Code) | | Write the Frontier's 'nonempty' report (if available).
Parameters: writer - where to write the report |
writeManifestReportTo | protected void writeManifestReportTo(PrintWriter writer)(Code) | | Parameters: writer - Where to write. |
writeMimetypesReportTo | protected void writeMimetypesReportTo(PrintWriter writer)(Code) | | |
writeProcessorsReportTo | protected void writeProcessorsReportTo(PrintWriter writer)(Code) | | |
writeResponseCodeReportTo | protected void writeResponseCodeReportTo(PrintWriter writer)(Code) | | |
writeSeedsReportTo | protected void writeSeedsReportTo(PrintWriter writer)(Code) | | Parameters: writer - Where to write. |