| java.lang.Object org.archive.crawler.admin.StatisticsSummary
StatisticsSummary | public class StatisticsSummary (Code) | | This class provides descriptive statistics of a finished crawl job by
using the crawl report files generated by StatisticsTracker. Any formatting
changes to the way StatisticsTracker writes to the summary crawl reports will
require changes to this class.
The following statistics are accessible from this class:
- Successfully downloaded documents per fetch status code
- Successfully downloaded documents per document mime type
- Amount of data per mime type
- Successfully downloaded documents per host
- Amount of data per host
- Successfully downloaded documents per top-level domain name (TLD)
- Disposition of all seeds
- Successfully downloaded documents per host per source
TODO: Avoid doing all of the summarizing in RAM, so that large crawls do not
trigger an OutOfMemoryError (OOME).
author: Frank McCown See Also: org.archive.crawler.admin.StatisticsTracker |
Field Summary | |
protected String | bandwidthKbytesPerSec |
protected Hashtable<String, LongWrapper> | dnsStatusCodeDistribution |
protected String | durationTime |
protected Hashtable<String, LongWrapper> | hostsBytes |
protected Hashtable<String, LongWrapper> | hostsDistribution |
protected Hashtable<String, LongWrapper> | hostsDnsBytes |
protected Hashtable<String, LongWrapper> | hostsDnsDistribution |
protected Hashtable<String, LongWrapper> | mimeTypeBytes |
protected Hashtable<String, LongWrapper> | mimeTypeDistribution |
protected Hashtable<String, LongWrapper> | mimeTypeDnsBytes |
protected Hashtable<String, LongWrapper> | mimeTypeDnsDistribution |
protected String | processedDocsPerSec |
protected transient Map<String, SeedRecord> | processedSeedsRecords |
protected Hashtable<String, LongWrapper> | statusCodeDistribution |
protected Hashtable<String, LongWrapper> | tldBytes |
protected Hashtable<String, LongWrapper> | tldDistribution |
protected Hashtable<String, LongWrapper> | tldHostDistribution |
protected String | totalDataWritten |
protected long | totalDnsHostDocuments |
protected long | totalDnsHostSize |
protected long | totalDnsMimeSize |
protected long | totalDnsMimeTypeDocuments |
protected long | totalDnsStatusCodeDocuments |
protected long | totalFileTypeDocuments |
protected long | totalHostDocuments |
protected long | totalHostSize |
protected long | totalHosts |
protected long | totalMimeSize |
protected long | totalMimeTypeDocuments |
protected long | totalStatusCodeDocuments |
protected long | totalTldDocuments |
protected long | totalTldSize |
Method Summary | |
public String | getBandwidthKbytesPerSec() |
public long | getBytesPerHost(String host) Returns the accumulated number of bytes downloaded from a given host. |
public long | getBytesPerMimeType(String filetype) Returns the accumulated number of bytes from files of a given file type. Parameters: filetype - Filetype to check. |
public long | getBytesPerTld(String tld) Returns the total number of bytes downloaded for a given TLD. |
public Hashtable | getDnsMimeDistribution() |
public Hashtable | getDnsStatusCodeDistribution() Returns a Hashtable representing the distribution of DNS status codes for successfully fetched URIs (curis), where each key -> value pair represents (String) code -> (LongWrapper) count. |
public String | getDurationTime() |
public Hashtable | getHostsDnsDistribution() |
public long | getHostsPerTld(String tld) Gets the number of hosts with a particular TLD. |
public Hashtable | getMimeDistribution() Returns a Hashtable that contains information about the distribution of encountered MIME types. |
public String | getProcessedDocsPerSec() |
public TreeMap<String, LongWrapper> | getReverseSortedCopy(Map<String, LongWrapper> mapOfLongWrapperValues) Sorts the entries of the given map in descending order by their values, which must be longs wrapped with LongWrapper. Elements are sorted by value from largest to smallest. |
public SortedMap | getReverseSortedHostsDistribution() Returns a copy of the hosts distribution in reverse-sorted (largest first) order. |
public Iterator<SeedRecord> | getSeedRecordsSortedByStatusCode() Returns a sorted Iterator of seed records based on status code. |
public Hashtable | getStatusCodeDistribution() Returns a Hashtable representing the distribution of HTTP status codes for successfully fetched URIs (curis), where each key -> value pair represents (String) code -> (LongWrapper) count. |
public Hashtable | getTldBytes() |
public Hashtable | getTldDistribution() |
public Hashtable | getTldHostDistribution() |
public String | getTotalDataWritten() |
public long | getTotalDnsHostDocuments() |
public long | getTotalDnsHostSize() |
public long | getTotalDnsMimeSize() |
public long | getTotalDnsMimeTypeDocuments() |
public long | getTotalDnsStatusCodeDocuments() |
public long | getTotalHostDnsDocuments() |
public long | getTotalHostDocuments() |
public long | getTotalHostSize() |
public long | getTotalHosts() |
public long | getTotalMimeSize() |
public long | getTotalMimeTypeDocuments() |
public long | getTotalStatusCodeDocuments() |
public long | getTotalTldDocuments() |
public long | getTotalTldSize() |
protected static void | incrementMapCount(Map<String, LongWrapper> map, String key) Increments a counter for a key in a given map. |
protected static void | incrementMapCount(Map<String, LongWrapper> map, String key, long increment) Increments a counter for a key in a given map by an arbitrary amount. Used for various aggregate data. |
public boolean | isStats() |
public boolean | readCrawlReport() Reads duration time, processed docs/sec, bandwidth, and total size of crawl from crawl-report.txt. |
bandwidthKbytesPerSec | protected String bandwidthKbytesPerSec(Code) | | |
processedDocsPerSec | protected String processedDocsPerSec(Code) | | |
totalDnsHostDocuments | protected long totalDnsHostDocuments(Code) | | |
totalDnsHostSize | protected long totalDnsHostSize(Code) | | |
totalDnsMimeSize | protected long totalDnsMimeSize(Code) | | |
totalDnsMimeTypeDocuments | protected long totalDnsMimeTypeDocuments(Code) | | |
totalDnsStatusCodeDocuments | protected long totalDnsStatusCodeDocuments(Code) | | |
totalFileTypeDocuments | protected long totalFileTypeDocuments(Code) | | |
totalHostDocuments | protected long totalHostDocuments(Code) | | |
totalHostSize | protected long totalHostSize(Code) | | |
totalHosts | protected long totalHosts(Code) | | |
totalMimeSize | protected long totalMimeSize(Code) | | |
totalMimeTypeDocuments | protected long totalMimeTypeDocuments(Code) | | |
totalStatusCodeDocuments | protected long totalStatusCodeDocuments(Code) | | |
totalTldDocuments | protected long totalTldDocuments(Code) | | |
totalTldSize | protected long totalTldSize(Code) | | |
StatisticsSummary | public StatisticsSummary(CrawlJob cjob)(Code) | | Constructor
Parameters: cjob - Completed crawl job |
getBandwidthKbytesPerSec | public String getBandwidthKbytesPerSec()(Code) | | |
getBytesPerHost | public long getBytesPerHost(String host)(Code) | | Returns the accumulated number of bytes downloaded from a given host.
Parameters: host - name of the host
Returns: the accumulated number of bytes downloaded from a given host |
getBytesPerMimeType | public long getBytesPerMimeType(String filetype)(Code) | | Returns the accumulated number of bytes from files of a given file type.
Parameters: filetype - Filetype to check.
Returns: the accumulated number of bytes from files of a given mime type |
getBytesPerTld | public long getBytesPerTld(String tld)(Code) | | Returns the total number of bytes downloaded for a given TLD.
Parameters: tld - TLD
Returns: the total number of bytes downloaded for a given TLD |
getDnsStatusCodeDistribution | public Hashtable getDnsStatusCodeDistribution()(Code) | | Returns a Hashtable representing the distribution of DNS status codes for
successfully fetched URIs (curis), where each key -> value pair represents
(String) code -> (LongWrapper) count.
Note: All the values are wrapped with a LongWrapper.
Returns: dnsStatusCodeDistribution |
getHostsPerTld | public long getHostsPerTld(String tld)(Code) | | Gets the number of hosts with a particular TLD.
Parameters: tld - top-level domain name
Returns: Total crawled hosts |
getMimeDistribution | public Hashtable getMimeDistribution()(Code) | | Returns a Hashtable that contains information about the distribution of
encountered MIME types. Key/value pairs represent
MIME type -> count.
Note: All the values are wrapped with a LongWrapper.
Returns: mimeTypeDistribution |
getProcessedDocsPerSec | public String getProcessedDocsPerSec()(Code) | | |
getReverseSortedCopy | public TreeMap<String, LongWrapper> getReverseSortedCopy(Map<String, LongWrapper> mapOfLongWrapperValues)(Code) | | Sorts the entries of the given map in descending order by their
values, which must be longs wrapped with LongWrapper.
Elements are sorted by value from largest to smallest. Equal values are
sorted in an arbitrary but consistent manner by their keys. Only items
with identical value and key are considered equal.
If the passed-in map requires synchronized access, the caller
should ensure this synchronization.
Parameters: mapOfLongWrapperValues - Assumes values are wrapped with LongWrapper.
Returns: a sorted map containing the same entries as the given map. |
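The ordering described for getReverseSortedCopy (largest value first, ties broken consistently by key) can be illustrated with a minimal sketch. This is not the Heritrix implementation: the class name ReverseSortSketch and the method reverseSortedCopy are hypothetical, and plain Long values stand in for LongWrapper so that the example is self-contained.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ReverseSortSketch {

    /**
     * Returns a copy of the source map whose keys are ordered by value,
     * largest first. Keys with equal values are ordered alphabetically,
     * so only an identical key and value compare as equal.
     */
    static TreeMap<String, Long> reverseSortedCopy(final Map<String, Long> source) {
        // The comparator looks values up in the source map, so it is only
        // valid for keys that are present there (an assumption of this sketch).
        Comparator<String> byValueDescending = (a, b) -> {
            int cmp = Long.compare(source.get(b), source.get(a));
            return cmp != 0 ? cmp : a.compareTo(b);
        };
        TreeMap<String, Long> sorted = new TreeMap<>(byValueDescending);
        sorted.putAll(source);
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Long> hosts = new HashMap<>();
        hosts.put("a.example", 5L);
        hosts.put("b.example", 12L);
        hosts.put("c.example", 5L);
        // prints [b.example, a.example, c.example]
        System.out.println(reverseSortedCopy(hosts).keySet());
    }
}
```

Because the comparator reads from the source map, the caller should not modify that map while the sorted copy is in use, which mirrors the synchronization note above.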
getReverseSortedHostsDistribution | public SortedMap getReverseSortedHostsDistribution()(Code) | | Returns a copy of the hosts distribution in reverse-sorted
(largest first) order.
Returns: SortedMap of hosts distribution |
getSeedRecordsSortedByStatusCode | public Iterator<SeedRecord> getSeedRecordsSortedByStatusCode()(Code) | | Returns a sorted Iterator of seed records based on status code.
Returns: sorted Iterator of seed records |
getStatusCodeDistribution | public Hashtable getStatusCodeDistribution()(Code) | | Returns a Hashtable representing the distribution of HTTP status codes for
successfully fetched URIs (curis), where each key -> value pair represents
(String) code -> (LongWrapper) count.
Note: All the values are wrapped with a LongWrapper.
Returns: statusCodeDistribution |
getTotalDataWritten | public String getTotalDataWritten()(Code) | | |
getTotalDnsHostDocuments | public long getTotalDnsHostDocuments()(Code) | | |
getTotalDnsHostSize | public long getTotalDnsHostSize()(Code) | | |
getTotalDnsMimeSize | public long getTotalDnsMimeSize()(Code) | | |
getTotalDnsMimeTypeDocuments | public long getTotalDnsMimeTypeDocuments()(Code) | | |
getTotalDnsStatusCodeDocuments | public long getTotalDnsStatusCodeDocuments()(Code) | | |
getTotalHostDnsDocuments | public long getTotalHostDnsDocuments()(Code) | | |
getTotalHostDocuments | public long getTotalHostDocuments()(Code) | | |
getTotalHostSize | public long getTotalHostSize()(Code) | | |
getTotalHosts | public long getTotalHosts()(Code) | | |
getTotalMimeSize | public long getTotalMimeSize()(Code) | | |
getTotalMimeTypeDocuments | public long getTotalMimeTypeDocuments()(Code) | | |
getTotalStatusCodeDocuments | public long getTotalStatusCodeDocuments()(Code) | | |
getTotalTldDocuments | public long getTotalTldDocuments()(Code) | | |
getTotalTldSize | public long getTotalTldSize()(Code) | | |
incrementMapCount | protected static void incrementMapCount(Map<String, LongWrapper> map, String key)(Code) | | Increments a counter for a key in a given map. Used for various
aggregate data.
Parameters: map - The map
Parameters: key - The key for the counter to be incremented. If it does not exist, it will be added (set to 1). If null, the counter "unknown" will be incremented. |
incrementMapCount | protected static void incrementMapCount(Map<String, LongWrapper> map, String key, long increment)(Code) | | Increments a counter for a key in a given map by an arbitrary amount.
Used for various aggregate data. The increment amount can be negative.
Parameters: map - The map
Parameters: key - The key for the counter to be incremented. If it does not exist, it will be added (set equal to increment). If null, the counter "unknown" will be incremented.
Parameters: increment - The amount by which to increment the counter related to the key. |
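The counting rules described for both incrementMapCount overloads (create the entry on first use, fall back to the "unknown" key for null, allow negative increments) can be sketched as follows. This is an illustration under stated assumptions, not the Heritrix source: IncrementSketch and its nested Counter class are hypothetical, with Counter standing in for LongWrapper as a simple mutable long holder.

```java
import java.util.HashMap;
import java.util.Map;

public class IncrementSketch {

    /** Hypothetical stand-in for LongWrapper: a mutable holder for a long. */
    static final class Counter {
        long value;
        Counter(long value) { this.value = value; }
    }

    /** Adds increment to the counter for key, creating the counter if needed. */
    static void incrementMapCount(Map<String, Counter> map, String key, long increment) {
        if (key == null) {
            key = "unknown"; // null keys are counted under "unknown"
        }
        Counter counter = map.get(key);
        if (counter == null) {
            // First occurrence: the counter starts at the increment itself.
            map.put(key, new Counter(increment));
        } else {
            counter.value += increment;
        }
    }

    /** Convenience overload: increments the counter for key by 1. */
    static void incrementMapCount(Map<String, Counter> map, String key) {
        incrementMapCount(map, key, 1);
    }

    public static void main(String[] args) {
        Map<String, Counter> byMimeType = new HashMap<>();
        incrementMapCount(byMimeType, "text/html");     // created, set to 1
        incrementMapCount(byMimeType, "text/html", 4);  // now 5
        incrementMapCount(byMimeType, "text/html", -2); // negative increment, now 3
        incrementMapCount(byMimeType, null);            // counted as "unknown"
        System.out.println(byMimeType.get("text/html").value); // prints 3
        System.out.println(byMimeType.get("unknown").value);   // prints 1
    }
}
```

Using a mutable holder means an existing counter is updated in place, without replacing the map entry on every increment.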
isStats | public boolean isStats()(Code) | | Returns: true if stats were compiled, false if there were none to compile (e.g. there are no report files on disk). |
readCrawlReport | public boolean readCrawlReport()(Code) | | Reads duration time, processed docs/sec, bandwidth, and total size
of crawl from crawl-report.txt.
Returns: true if stats found. |