Heritrix Crawlers
License: GNU Library or Lesser General Public License (LGPL)
URL: http://crawler.archive.org/
Description: Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Introduction to Heritrix.
Heritrix is designed to be easily extensible via 3rd party modules.
Architecture
The software is divided into several packages of varying importance.
The relationship between them will be covered in some greater depth
after their introductions.
The root package (this one) contains the executable class
{@link org.archive.crawler.Heritrix Heritrix}.
That class loads the crawler and parses the command line arguments.
If a WUI (web user interface) is to be launched, it launches it. It can also
start jobs (with or without the WUI) that are specified in the command line
options.
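As a minimal sketch, the same entry point can also be invoked programmatically by
delegating to the class's main method; no particular command line options are
assumed here.

public class LaunchExample {
    public static void main(String[] args) throws Exception {
        // Delegate to the crawler's entry point with whatever arguments were passed in.
        org.archive.crawler.Heritrix.main(args);
    }
}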
framework
{@link org.archive.crawler.framework org.archive.crawler.framework}
The framework package contains most of the core classes
for running a crawl. It also contains a number of Interfaces for
extensible items, the implementations of which can be found in
other classes.
Heritrix is in effect divided into two types of classes.
- Core classes - these can often be configured but not
replaced.
- Pluggable classes - these must implement a given interface
or extend a specific class, but 3rd parties can introduce their own
implementations.
The framework thus contains a selection of the core classes and a number
of the interfaces and base classes for the pluggable classes.
datamodel
{@link org.archive.crawler.datamodel org.archive.crawler.datamodel}
Contains various classes that make up the crawler's data model, including
such essentials as the CandidateURI and CrawlURI classes that wrap
discovered URIs for processing.
admin
{@link org.archive.crawler.admin org.archive.crawler.admin}
The admin package contains classes that are used by the Web UI.
This includes some core classes and a specific implementation of the
Statistics Tracking interface found in the framework
package that is designed to provide the UI with information about
ongoing crawls.
Pluggable modules
The following is a listing of the types of pluggable modules found in
Heritrix, with brief explanations of each and links to their respective
API documentation.
Frontier
A Frontier maintains the internal state of a crawl while it is
in progress: what URIs have been discovered, which should be crawled next,
etc.
Needless to say this is one of the most important modules in any crawl and
the provided implementation should generally be appropriate unless a very
different strategy for ordering URIs for crawling is desired.
{@link org.archive.crawler.framework.Frontier Frontier} is the interface
that all Frontiers must implement.
The {@link org.archive.crawler.frontier org.archive.crawler.frontier} package
contains the provided implementation of a Frontier along with its
supporting classes.
Processor
When a URI is crawled, a {@link org.archive.crawler.framework.ToeThread
ToeThread} will execute a series of processors on it.
The processors are split into five distinct chains that are executed in sequence:
- Pre-fetch processing chain
- Fetch processing chain
- Extractor processing chain
- Write/Index processing chain
- Post-processing chain
Each of these chains contains any number of processors. The processors
all inherit from the generic {@link org.archive.crawler.framework.Processor
Processor}. While the processors are divided into the five categories above, that
is strictly a high-level configuration, and any processor can be placed in any chain
(although doing link extraction before fetching a document is clearly of no
use).
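As an illustration of the pluggable model, below is a hedged sketch of a custom
processor. It assumes the Processor base class takes a name and description in its
constructor and exposes an innerProcess(CrawlURI) hook as its per-URI extension
point; treat the exact signatures as assumptions and check the
{@link org.archive.crawler.framework.Processor Processor} javadoc for your version.

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.framework.Processor;

public class NoteContentTypeProcessor extends Processor {

    public NoteContentTypeProcessor(String name) {
        super(name, "Logs the content type of each processed URI.");
    }

    // Called once per CrawlURI as a ToeThread walks the processor chains.
    protected void innerProcess(CrawlURI curi) {
        String contentType = curi.getContentType();
        if (contentType != null) {
            System.out.println(curi + " -> " + contentType);
        }
    }
}

Such a processor would be added to one of the chains (most naturally the
write/index or post-processing chain) through the crawl configuration.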
Numerous processors are provided with Heritrix in the following packages:
{@link org.archive.crawler.prefetch org.archive.crawler.prefetch} package
contains processors run before the URI is fetched from the Internet.
{@link org.archive.crawler.fetcher org.archive.crawler.fetcher} package
contains processors that fetch URIs from the Internet. Typically each
processor handles a different protocol.
{@link org.archive.crawler.extractor org.archive.crawler.extractor} package
contains processors that perform link extractions on various document types.
{@link org.archive.crawler.writer org.archive.crawler.writer} package contains
a processor that writes an ARC file with the fetched document.
{@link org.archive.crawler.postprocessor org.archive.crawler.postprocessor}
package contains processors that wrap up the processing, reporting discovered
links back to the Frontier, etc.
Filter
Filters accept or reject a given object (typically a URI) based on some
criterion. They are used both by some processors and to modify scopes (see
below). The provided filters can be found in the org.archive.crawler.filter
package.
Scope
Scopes are special filters that are applied to the crawl as a whole to
define its scope. Any given crawl will employ exactly one scope
object to define what URIs are considered 'within scope'.
Several implementations covering the most commonly
desired scopes are provided (broad, domain, host, etc.), but custom
implementations can be made to define any arbitrary scope.
It should be noted, though, that most limitations on the scope
of a crawl can be more easily achieved by using one of the existing scopes and
modifying it with appropriate filters.
{@link org.archive.crawler.framework.CrawlScope CrawlScope} - Base class for
scopes.
{@link org.archive.crawler.scope org.archive.crawler.scope} package. Contains
provided scopes.
Statistics Tracking
Any number of statistics tracking modules can be added to a crawl to gather
run-time information about its progress.
These modules can both interrogate the Frontier for the sparse
data it exposes and subscribe to
{@link org.archive.crawler.event.CrawlURIDispositionListener Crawled URI
Disposition} events to monitor the completion of each URI that is processed.
An interface for {@link org.archive.crawler.framework.StatisticsTracking
statistics tracking} is provided as well as a partial implementation
({@link org.archive.crawler.framework.AbstractTracker AbstractTracker})
that does much of the work common to most statistics tracking modules.
Furthermore, the admin package implements a statistics tracking
module ({@link org.archive.crawler.admin.StatisticsTracker StatisticsTracker})
that generates a log of the crawler's progress as well as providing information
that the UI uses. It also compiles end-of-crawl reports that contain all of the
information it has gathered in the course of the crawl.
It is highly recommended that it always be used when running crawls via the UI.
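A hedged sketch of such a subscription follows. The callback names used here are
assumptions about the CrawlURIDispositionListener interface, chosen to illustrate
the event model; verify them against the interface's javadoc before use.

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.event.CrawlURIDispositionListener;

public class DispositionCounter implements CrawlURIDispositionListener {

    private long successes = 0;
    private long failures = 0;

    // Assumed callback: invoked when a URI completes successfully.
    public void crawledURISuccessful(CrawlURI curi) {
        successes++;
    }

    // Assumed callback: invoked when a URI ultimately fails.
    public void crawledURIFailure(CrawlURI curi) {
        failures++;
    }

    // Assumed callback: the URI will be retried; not a final disposition.
    public void crawledURINeedRetry(CrawlURI curi) {
    }

    // Assumed callback: the URI was disregarded (e.g. ruled out of scope).
    public void crawledURIDisregard(CrawlURI curi) {
    }

    public String report() {
        return "successful=" + successes + " failed=" + failures;
    }
}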
org.archive.crawler.admin package
Contains classes that the web UI uses to monitor and control crawls. Some
utility classes used exclusively or primarily by the UI are also
included.
Most of the heavy-duty work is done by the CrawlJobHandler,
which manages most of the interaction between the UI and the
CrawlController. The CrawlJob class serves to
encapsulate the settings needed to launch one crawl.
This package also provides an implementation of the Statistics Tracking
interface that contains useful methods for accessing progress data and is
used for monitoring crawls. While it is technically possible to launch
jobs without this statistics tracker, doing so would render the UI unable
to monitor the progress of those crawls.
org.archive.crawler.datamodel.credential package
Contains HTML form login and basic and digest credentials
used by Heritrix to log into sites.
To watch credentials in operation, enable logging by setting the following
logging level for the FetchHTTP class:
org.archive.crawler.fetcher.FetchHTTP.level = FINE
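The line above is in java.util.logging properties format. As a minimal sketch, the
same effect can be achieved programmatically with the standard logging API:

import java.util.logging.Level;
import java.util.logging.Logger;

public class EnableCredentialLogging {
    public static void main(String[] args) {
        // Raise the FetchHTTP logger to FINE so credential handling is logged.
        Logger.getLogger("org.archive.crawler.fetcher.FetchHTTP").setLevel(Level.FINE);
    }
}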
org.archive.crawler.deciderules package
Provides classes for a simple decision rules framework.
Each 'step' in a decision rule set that can
affect an object's ultimate fate is called a DecideRule.
Each DecideRule renders a decision (possibly neutral) on the
passed object's fate.
Possible decisions are:
- ACCEPT means the object is ruled-in for further processing
- REJECT means the object is ruled-out for further processing
- PASS means this particular DecideRule has no opinion
As previously outlined, each DecideRule is applied in turn;
the last one to express a non-PASS preference wins.
For example, if the rules are:
- AcceptDecideRule -- ACCEPTs all (establishing a default)
- TooManyHopsDecideRule(max-hops=3) -- REJECTS all with
hopsPath.length()>3, PASSes otherwise
- PrerequisiteAcceptDecideRule -- ACCEPTs any with 'P' as
last hop, PASSes otherwise (this allows 'LLL's which
need a 'LLLP' prerequisite a chance to complete)
Then you have a crawl that will go 3 hops (of any type)
from the seeds, with a special affordance to get prerequisites
of 3-hop items (which may be 4 "hops" out).
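The following standalone sketch (deliberately not the Heritrix API) illustrates the
evaluation order described above: rules are consulted in sequence and the last
non-PASS decision wins.

import java.util.List;

class RuleChainSketch {

    enum Decision { ACCEPT, REJECT, PASS }

    interface Rule {
        // A rule renders a decision (possibly PASS) on the candidate URI.
        Decision decide(String uri, int hops);
    }

    static Decision evaluate(List<Rule> rules, String uri, int hops) {
        Decision result = Decision.PASS;
        for (Rule rule : rules) {
            Decision d = rule.decide(uri, hops);
            if (d != Decision.PASS) {
                result = d; // the last non-PASS preference wins
            }
        }
        return result; // PASS here means no rule had an opinion
    }
}

With rules equivalent to the three in the example (accept all, reject anything more
than three hops out, accept prerequisites), evaluate() reproduces the behaviour
described: within three hops everything is accepted, beyond that only prerequisites
survive.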
To allow this style of decision processing to be plugged into the
existing Filter and Scope slots:
- There's a DecidingFilter which takes an (ordered) map of
DecideRules
- There's a DecidingScope which takes the same
See NewScopingModel
for background.
org.archive.crawler.selftest package
Provides the client-side aspect of the Heritrix integration self test.
The selftest webapp is the repository for the server side of the
integration test. The integration self test is run from the command
line. Invocation makes the crawler go up against itself, crawling the
selftest webapp. When done, the products -- ARC and log files -- are
analyzed by code herein to verify test pass or fail. The integration
self test is the aggregation of multiple individual tests, each testing a
particular crawler aspect. For example, the Robots test validates
the crawler's parsing of robots.txt. Each test comprises a directory
under the selftest webapp, named for the test, into which we put the
server pages that express the scenario to test, and a class from this
package named for the test webapp directory with a SelfTest suffix.
The selftest class verifies test success. Each selftest class subclasses
org.archive.crawler.selftest.SelfTestCase, which is itself
a subclass of org.junit.TestCase. All tests need to be
registered with the {@link org.archive.crawler.selftest.AllSelfTestCases}
class and must live in the org.archive.crawler.selftest package. The class
{@link org.archive.crawler.selftest.SelfTestCrawlJobHandler}
manages the running of the selftest.
Run one test only by passing its name as the option value to the
selftest argument.
The first crop of self tests is
derived from tests developed by Parker Thompson <pt at archive dot org>.
See Tests.
These tests in turn look to have been derived from Testing Search Indexing
Systems.
Adding a Self Test
TODO
Related Documentation
TODO
org.archive.crawler.settings package
Provides classes for the settings framework.
The settings framework is designed to be a flexible way to configure a crawl
with special treatment for subparts of the web without adding too much
performance overhead.
At its core, the settings framework is a way to keep persistent, context-sensitive
configuration settings for any class in the crawler.
All classes in the crawler that have configurable settings subclass
{@link org.archive.crawler.settings.ComplexType} or one of its descendants. The {@link org.archive.crawler.settings.ComplexType} implements the
{@link javax.management.DynamicMBean} interface. This gives you a way to ask an object
which attributes it supports, along with standard methods for getting and setting
these attributes.
The entry point into the settings framework is the {@link org.archive.crawler.settings.SettingsHandler}. This class
is responsible for loading from and saving to persistent storage and for
interconnecting the different parts of the framework.
Figure 1. Schematic view of the Settings Framework
Settings hierarchy
The settings framework supports a hierarchy of settings. This hierarchy is
built by {@link org.archive.crawler.settings.CrawlerSettings} objects. At the top there is a settings object
representing the global settings. This consists of all the settings that a crawl
job needs for running. Beneath this global object there is one "per" settings
object for each host/domain that has settings which should override the global
order for that particular host or domain.
When the settings framework is asked for an attribute for a specific host, it
will first try to see if this attribute is set for this particular host. If it
is, the value will be returned. If not, it will go up one level recursively
until it eventually reaches the order object and returns the global value. If no
value is set there either (normally it would be), a hard-coded default value is
returned.
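The lookup rule can be illustrated with a small standalone sketch (not the Heritrix
API): ask the most specific settings object first, then walk up toward the global
object, and finally fall back to a hard-coded default.

import java.util.HashMap;
import java.util.Map;

class SettingsNodeSketch {

    private final SettingsNodeSketch parent;             // null for the global object
    private final Map<String, Object> values = new HashMap<String, Object>();

    SettingsNodeSketch(SettingsNodeSketch parent) {
        this.parent = parent;
    }

    void set(String name, Object value) {
        values.put(name, value);
    }

    Object lookup(String name, Object hardCodedDefault) {
        if (values.containsKey(name)) {
            return values.get(name);                      // set at this level
        }
        if (parent != null) {
            return parent.lookup(name, hardCodedDefault); // recurse toward the global object
        }
        return hardCodedDefault;                          // not set anywhere in the hierarchy
    }
}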
All per domain/host settings objects only contain those settings which are to
be overridden for that particular domain/host. The convention is to name the
top-level object "global settings" and the objects beneath "per settings" or
"overrides" (although the refinements described next also do overriding).
To further complicate the picture, there are also settings objects called
refinements. An object of this type belongs to a global or per settings object
and overrides the settings in its owner object if certain criteria are met. Such
criteria could be that the URI in question matches a regular expression or
that the settings are consulted within a specific time span of the day.
ComplexType hierarchy
All the configurable modules in the crawler subclass {@link org.archive.crawler.settings.ComplexType} or one of
its descendants. The {@link org.archive.crawler.settings.ComplexType} is responsible for keeping the definition of
the configurable attributes of the module. The actual values are stored in an
instance of {@link org.archive.crawler.settings.DataContainer}. The {@link org.archive.crawler.settings.DataContainer} is never accessed directly from
user code. Instead, the user accesses the attributes through methods in the
{@link org.archive.crawler.settings.ComplexType}. The attributes are accessed in different ways depending on whether
the access is from the user interface or from inside a running crawl.
When an attribute is accessed from the UI (either for reading or writing), you want
to make sure that you are editing the attribute in the right context. When
trying to override an attribute, you don't want the settings framework to
traverse up to the effective value for the attribute, but instead want to know that
the attribute is not set on this level. To achieve this, the
{@link org.archive.crawler.settings.ComplexType#getLocalAttribute(CrawlerSettings settings, String name)} and
{@link org.archive.crawler.settings.ComplexType#setAttribute(CrawlerSettings settings, Attribute attribute)} methods take a
settings object as a parameter. These methods work only on the supplied
settings object. In addition, the methods {@link org.archive.crawler.settings.ComplexType#getAttribute(String)} and
{@link org.archive.crawler.settings.ComplexType#setAttribute(Attribute attribute)} are there for conformance with the Java JMX
specification. The latter two always work on the global settings object.
Getting an attribute within a crawl is different in that you always want to get
a value even if it is not set in its context. That means that the settings
framework should work its way up the settings hierarchy to find the value in
effect for the context. The method {@link org.archive.crawler.settings.ComplexType#getAttribute(String name, CrawlURI uri)}
should be used to make sure that the right context is used. Figure 2 shows
how the settings framework finds the effective value given a context.
Figure 2. Flow of getting an attribute
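A hedged sketch of the two access styles follows, using the context-taking methods
linked above. The attribute name "timeout-seconds" is hypothetical, and the module
is simply typed as ComplexType; real modules would use their own attribute
constants.

import javax.management.Attribute;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.settings.ComplexType;
import org.archive.crawler.settings.CrawlerSettings;

public class SettingsAccessSketch {

    // Inside a running crawl: resolve the value in effect for this URI's context,
    // letting the framework walk up the settings hierarchy as needed.
    static int timeoutFor(ComplexType module, CrawlURI curi) throws Exception {
        return ((Integer) module.getAttribute("timeout-seconds", curi)).intValue();
    }

    // From the UI: write an override onto one specific settings object only,
    // without consulting or modifying the levels above it.
    static void overrideTimeout(ComplexType module, CrawlerSettings perHostSettings)
            throws Exception {
        module.setAttribute(perHostSettings, new Attribute("timeout-seconds", new Integer(30)));
    }
}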
Each attribute has a type. All allowed types subclass the {@link org.archive.crawler.settings.Type}
class. There are three main Types:
- {@link org.archive.crawler.settings.SimpleType}
- {@link org.archive.crawler.settings.ListType}
- {@link org.archive.crawler.settings.ComplexType}
Except for the {@link org.archive.crawler.settings.SimpleType}, the actual type used will be a subclass of one of
these main types.
SimpleType
The {@link org.archive.crawler.settings.SimpleType} is mainly for representing Java™ wrappers for the Java™
primitive types. In addition, it handles the {@link java.util.Date} type and a
special Heritrix {@link org.archive.crawler.settings.TextField} type. Overrides of a {@link org.archive.crawler.settings.SimpleType} must be of the same
type as the initial default value for the {@link org.archive.crawler.settings.SimpleType}.
ListType
The {@link org.archive.crawler.settings.ListType} is further subclassed into versions for some of the wrapped Java™
primitive types ({@link org.archive.crawler.settings.DoubleList}, {@link org.archive.crawler.settings.FloatList}, {@link org.archive.crawler.settings.IntegerList}, {@link org.archive.crawler.settings.LongList}, {@link org.archive.crawler.settings.StringList}). A
List holds values in the same order as they were added. If an attribute of type
{@link org.archive.crawler.settings.ListType} is overridden, then the complete list of values is replaced at the
override level.
ComplexType
The {@link org.archive.crawler.settings.ComplexType} is a map of name/value pairs. The values can be any {@link org.archive.crawler.settings.Type}
including new {@link org.archive.crawler.settings.ComplexType MapTypes}. The {@link org.archive.crawler.settings.ComplexType} is defined abstract and you should
use one of the subclasses {@link org.archive.crawler.settings.MapType} or {@link org.archive.crawler.settings.ModuleType}. The {@link org.archive.crawler.settings.MapType} allows adding of
new name/value pairs at runtime, while the {@link org.archive.crawler.settings.ModuleType} only allows the
name/value pairs that it defines at construction time. When overriding the
{@link org.archive.crawler.settings.MapType}, the options are either to override the value of an already existing
attribute or to add a new one. It is not possible in an override to remove an
existing attribute. The {@link org.archive.crawler.settings.ModuleType} doesn't allow additions in overrides, but
the predefined attributes' values might be overridden. Since the {@link org.archive.crawler.settings.ModuleType} is
defined at construction time, it is possible to set more restrictions on each
attribute than in the {@link org.archive.crawler.settings.MapType}. Another consequence of definition at
construction time is that you would normally subclass the {@link org.archive.crawler.settings.ModuleType}, while the
{@link org.archive.crawler.settings.MapType} is usable as it is. It is possible to restrict the {@link org.archive.crawler.settings.MapType} to only
allow attributes of a certain type. There is also a restriction that {@link org.archive.crawler.settings.MapType MapTypes}
can not contain nested {@link org.archive.crawler.settings.MapType MapTypes}.
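As a hedged sketch, a module typically declares its attributes in its constructor.
The addElementToDefinition call and the SimpleType(name, description, default)
constructor follow the pattern used by existing Heritrix modules; treat the exact
signatures, and the attribute shown, as assumptions.

import org.archive.crawler.settings.ModuleType;
import org.archive.crawler.settings.SimpleType;

public class ExampleModule extends ModuleType {

    public static final String ATTR_TIMEOUT = "timeout-seconds"; // hypothetical attribute

    public ExampleModule(String name) {
        super(name, "An example module with one overridable setting.");
        // Register the attribute definition; per-host overrides and refinements of
        // its value are handled by the settings framework, not by this class.
        addElementToDefinition(new SimpleType(ATTR_TIMEOUT,
            "Seconds to wait before giving up.", new Integer(30)));
    }
}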
org.archive.httpclient package
Provides specializations on the Apache Jakarta
Commons HttpClient library.
HttpRecorderGetMethod
Class that marks, in the passed HttpRecorder, the boundary between the
HTTP header and the content. It also forces a close on the response upon a
call to releaseConnection.
ConfigurableTrustManagerProtocolSocketFactory
A protocol socket factory that allows setting of trust level on
construction.
References
Java™ Secure Socket Extension (JSSE): Reference Guide
org.archive.io.arc package
ARC file reading and writing.
org.archive.io.warc package
Experimental WARC Writer and Readers. Code and specification subject to change
with no guarantees of backward compatibility: i.e. newer readers
may not be able to parse WARCs written with older writers. This package
contains prototyping code for revision 0.12 of the WARC specification.
See latest revision
for the current state (Version 0.10 code and its documentation have been moved into the
v10 subpackage).
Implementation Notes
Tools
Initial implementations of Arc2Warc and Warc2Arc
tools can be found in the package above this one, at
{@link org.archive.io.Arc2Warc} and {@link org.archive.io.Warc2Arc}
respectively. Pass --help to learn how to use each tool.
TODO
- Is MIME-Version header needed? MIME Parsers seem fine without (python email
lib and java mail).
- Should we write out a Content-Transfer-Encoding
header? (Currently we do not.) The spec. needs a section explicit about our
interpretation of MIME and deviations (e.g. content-transfer-encoding should
be assumed binary in the case of WARCs, multipart is not disallowed but not
encouraged, etc.).
- Minor: Do WARC-Version: 0.12 like MIME-Version: 1.0 rather than
WARC/0.12 for the lead-in to an ARCRecord?
org.archive.io.warc.v10 package
Experimental WARC Writer and Readers. Code and specification subject to change
with no guarantees of backward compatibility: i.e. newer readers
may not be able to parse WARCs written with older writers.
This code, with noted exceptions, is a loose implementation of parts of the
(unreleased and unfinished)
WARC
File Format (Version 0.9). Deviations from 0.9, outlined below in the
section Deviations from Spec., are to be proposed as amendments to the
specification to make a new revision. Since the new spec. revision will likely
be named version 0.10, code in this package writes WARCs of version 0.10 -- not
0.9.
Implementation Notes
Tools
Initial implementations of Arc2Warc and Warc2Arc
tools can be found in the package above this one, at
{@link org.archive.io.Arc2Warc} and {@link org.archive.io.Warc2Arc}
respectively. Pass --help to learn how to use each tool.
Unique ID Generator
WARC requires a GUID for each record written. A configurable unique-ID
generator factory, {@link org.archive.uid.GeneratorFactory}, which can be configured to use alternate
unique-ID generators, was added with a default of
{@link org.archive.uid.UUIDGenerator}. The default implementation generates
UUIDs (using the java5
java.util.UUID class) with a urn scheme using the uuid
namespace [See RFC4122].
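As a minimal sketch of the kind of identifier produced, the standard java.util.UUID
class renders a random UUID that can be placed under the urn:uuid scheme; the exact
formatting applied by UUIDGenerator is not reproduced here.

import java.util.UUID;

public class UrnUuidSketch {
    public static void main(String[] args) {
        // e.g. urn:uuid:f9472055-fbb6-4810-90e8-68fd39e145a6
        System.out.println("urn:uuid:" + UUID.randomUUID());
    }
}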
{@link org.archive.util.anvl ANVL}
The ANVL RFC822-like format is used for writing Named Fields in
WARCs and occasionally for metadata. An implementation was added at
{@link org.archive.util.anvl}.
Miscellaneous
When writing WARCs, the response record type is chosen as the core
record that all others associate to: i.e. all others
have a Related-Record-ID that points back to the
response record.
The below deviations from spec. 0.9 have been realized in code and are to
be proposed as spec. amendments with new
revision likely to be 0.10 (Vocal assent was given by John, Gordon, and Stack
to the below at La Honda Meeting, August 8th, 2006).
mimetype in header line
Allow full mimetypes in the header line as per RFC2045 rather than the
current, shriveled mimetype that allows only type and subtype. This means
mimetypes are allowed parameters: e.g.
text/plain; charset=UTF-8 or
application/http; msgtype=request .
Allowing full mimetypes, we can support the following scenarios without
further amendment to the specification and without parsers having to resort to
metadata records or to custom
Named Fields to figure out how to interpret the payload:
- Consider the case where an archiving organization would store everything
related to a capture as one record with a mimetype of
multipart/mixed; boundary=RECORD-ID . An example record
might comprise the parts
Content-Type: application/http; msgtype=request ,
Content-Type: application/http; msgtype=response , and
Content-Type: text/xml+rdf (For metadata).
- Or, an archiving institution would store a capture with
multipart/alternatives ranging from
most basic (or 'desiccated' in Kunze-speak)
-- perhaps a text/plain rendition of a PDF capture -- through to
best , the actual PDF binary itself.
To support full mimetypes, we must allow for whitespace between parameters
and allow that parameter values themselves might include whitespace
('quoted-string'). The WARC Writer converts any embedded carriage-returns and
newlines to a single space.
Swap position of recordid and mimetype in the header line
Because of the above amendment allowing full mimetypes on the header line,
and since the mimetype may now include whitespace, we move the
mimetype to the last position on the header line and the recordid to
second-from-last to ease parsing.
Use application/http instead of message/http
The message type has a line-length maximum of 1000 characters absent a
Content-Transfer-Encoding header set to BINARY.
(See the definition of message/http for talk of adherence to MIME
message line limits: see
19.1 Internet Media Type message/http and application/http in
RFC2616).
Suggested Spec. Amendments
Apart from the above-listed deviations, the below
changes are also suggested for inclusion in the 0.10 spec. revision.
These are mostly suggested edits; the changes are not substantive.
Allow multiple instances of a single Named Parameter
Allow that there may be multiple instances of the same Named Parameter
in any one Named Parameter block.
E.g. multiple Related-Record-IDs could prove of use.
The spec. mentions this in the 8.1 HTTP and HTTPS section, but it better
belongs in the 5.2 Named Parameters preamble.
Related: add to the Named Field section a note on bidirectional
Related-Record-ID.
Miscellaneous
'LaHonda' below is a reference to the meeting of John, Gordon, and Stack at the
LaHonda Cafe on 16th St. on August 8th, 2006.
- Leave off 9.2 GZIP extra fields. Big section on implementing an option
that has little to do with WARCing. AGREED at LaHonda.
- But we need to mark gzipped files as being WARC: i.e. that the
GZIP is a member per resource. It's useful so readers know how to invoke
GZIP (whether it has to be done once to get at any record, or needs to be
done per record). Suggest adding a GZIP extra field in the HEAD of the
GZIP member that says 'WARC' (ARC has such a thing currently). NOT NECESSARY per LaHonda meeting.
- The IP-Address for a dns resource is the DNS server's. Add a note to this effect in
8.2 DNS.
- Section 6. is truncated -- missing text. What was intended here? SEE
ISO DOC.
- In-line the ANVL definition (from Kunze). Related: can labels have
CTLs such as CRLF (they shouldn't)? When it says 'control-chars', does this include
UNICODE control characters (it should)? CHAR is described as ASCII/UTF-8 but the two
are not the same (should be UTF-8). ANVL OR NOT STILL UP IN AIR AFTER LaHonda.
Postpone to the 0.11 revision.
- Fix examples. Use output of experimental ARC Writer.
- Fix the ambiguity in the spec. pertaining to 'smallest possible anvl-fields' noted by Mads Alhof Kristiansen in Digital Preservation
using the WARC File Format.
Open Issues
Drop response record type
resource is sufficient. Let the mimetype distinguish whether the capture includes
response headers or not (as per the comment at the end of 8.1 HTTP and HTTPS,
which allows that if there are no response headers, the resource record type and
the page mimetype be used rather than the response type plus a mimetype of
message/http: the difference in record types is not needed for distinguishing
between the two types of capture).
Are there other capture methods that would require a response record,
that don't have a mimetype that includes response headers and content?
SMTP has a rich MIME set to describe responses. Its request is
pretty much unrecordable. NNTP and FTP are similar. Because of the rich MIME,
there is no need for a special response type here.
Related: do we need the request record?
Does it only make sense for HTTP?
This proposal is contentious. Gordon drew a scenario where response
would be needed for distinguishing local from remote capture if an archiving
institution purposefully archived without recording headers, or
if the payload itself was an archived record. In opposition, it was suggested that
should an institution choose to capture in this 'unusual' mode, crawl metadata
could be consulted to disambiguate how the capture was done (to
be further investigated; in general, the definition of record types is still in
need of work).
subject-url
The ISO revision suggests that the positional parameter
subject-uri be renamed. Suggest record-url.
Other issues
- Should we allow freeform creation of custom Named Fields if they
have a MIME-like 'X-' or somesuch prefix?
- Nothing on header-line encoding (Section 11 says UTF-8).
For completeness it should be US-ASCII or UTF-8, no control-chars (especially
CR or LF), etc.
warcinfo
- What to use for a scheme? Using UUID as per G's suggestion.
- Also, how to populate the description of the crawl into the warcinfo?
A 'Documentation'
Named Field with a list of URLs that can be assumed
to exist somewhere in the current WARC set (we'd have to make the crawler go
get them at the start of a crawl).
- I don't want to repeat the crawl description for every WARC. How can this
warcinfo point at an original?
related-record-id seems
insufficient.
- If the crawler config. changes, can I just write a warcinfo with the
differences? How to express that? Or is it better as metadata about a warcinfo?
- In the past
we used to get the filename from this URL header field when we were unsure of the
filename or it was unavailable (we're reading a stream). That won't be possible
with a UUID for the URL. So, introduce a new optional warcinfo Named Field
'Filename' that will be used when a warcinfo is put at the start of a file.
Allow warcinfo to have a named parameter 'Filename'?
revisit
- What to write? Use a description field or just expect this info
to be present in the warcinfo? The example has the request header
(inside XML). Is it better to use an associated
request record for this
kind of info?
The Related-Record-ID (RRID) of the original is likely
an onerous requirement. Envisioning an implementation where we'd write
revisit records, we'd write such a record where the content was
judged the same or where the date since the last fetch had not changed. If we're to
write the RRID, then we'd have to maintain a table keyed by URL with the value of
the page hash or of the last-modified date plus the associated RRID (the actual RRID
URL, not a hash).
- Should we allow a
Description Named Field?
E.g. I add an order file as a metadata record and associate it with a
warcinfo record. The Description field could say "This is the Heritrix
order file". Same for seeds. The alternative is custom XML packaging (a scheme
could describe fields such as the 'order' file) or ANVL packaging using ANVL
'comments'.
- Section 11: why was it we said we don't need a parameter or explicit
subtype for the special gzip WARC format? I don't remember. A reader needs to
know when it's reading such a stream. A client would like to know so it can write
the stream to disk with the right suffix. Recap. (Perhaps it was looking at
the MAGIC bytes -- if it starts with the GZIP MAGIC and includes extra fields
that denote it WARC, that's sufficient?)
- Section 7, on truncation, in 7.1, suggests values -- 'time', 'length' --
but allows a free-form description?
Leave off the 'superior method of indicating truncation' paragraph. This qualifier
could be added to all sections of the doc. -- that a subsequent revision of any
aspect of the doc. will be superior.
Rather than
End-Length, like MIME, the last record could have
Segment-Number-Total, a count of all segments that make up the
complete record.
From LaHonda, a discussion of the revisit type. The definition was
tightened somewhat by saying revisit is used when you choose not to store the capture.
It was thought possible that it
NOT require a pointer back to an original. It was suggested it might have a
similarity judgment header -- similarity-value -- with values
between 0 and 1. It might also have analysis-method and
description. Possible methods discussed included: URI same,
length same, hash of content same, judgment based off the content of an HTTP HEAD
request, etc. Possible payloads might be: nothing, a diff, the hash obtained,
etc.
Unimplemented
- Record Segmentation (the 4.8
continuation record type
and the 5.2 Segment-* Named Parameters). Future TODO.
- 4.7
conversion type. Future TODO.
TODOs
- Unit tests using
multipart/* (JavaMail) for reading and
writing records? Try the record-id as the part boundary.
- Performance: need to add record-based buffering. GZIP'd streams
have some buffering because of the deflater but could probably do
with more.
org.archive.uid package
A unique ID generator.
The default is {@link org.archive.uid.UUIDGenerator}.
To use another ID generator, set the system property
org.archive.uid.GeneratorFactory.generator to point
at an alternate implementation of {@link org.archive.uid.Generator}.
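A minimal sketch of selecting an alternate generator via that system property
follows; "com.example.MyGenerator" is hypothetical and would have to implement
{@link org.archive.uid.Generator}.

public class UidConfigSketch {
    public static void main(String[] args) {
        // GeneratorFactory consults this property when producing record IDs.
        System.setProperty("org.archive.uid.GeneratorFactory.generator",
            "com.example.MyGenerator");
    }
}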
TODO
- MIME boundaries have an upper bound of 70 characters total, including the
'blank line' (CRLFCRLF) and two leading hyphens. Add to the
{@link org.archive.uid.Generator}
interface an upper bound on generated ID length.
- Add example of an actionable uid generator:
e.g.
http://archive.org/UID-SCHEME/ID
where scheme might be UUID and an ID might be
f9472055-fbb6-4810-90e8-68fd39e145a6;type=metadata or,
using ARK:
http://archive.org/ark:/13030/f9472055-fbb6-4810-90e8-68fd39e145a6;type=metadata .
org.archive.util.anvl package
Parsers and Writers for the (expired) Internet-Draft A Name-Value
Language (ANVL). Use {@link org.archive.util.anvl.ANVLRecord}
to create new instances of ANVL Records and for parsing.
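A hedged sketch of building a small record follows; addLabelValue is used here on
the assumption that ANVLRecord exposes a label/value appender (verify against the
class's javadoc).

import org.archive.util.anvl.ANVLRecord;

public class AnvlSketch {
    public static void main(String[] args) {
        ANVLRecord record = new ANVLRecord();
        record.addLabelValue("software", "Heritrix");
        record.addLabelValue("hostname", "crawler.example.org");
        // Prints the record in its serialized ANVL form.
        System.out.println(record.toString());
    }
}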
Implementation Details
The ANVL Internet-Draft of 14 February, 2005 is not specific as to the
definition of 'blank line' and 'newline'. This parser implementation
assumes CRLF.
Says "An element consists of a label, a colon, and an optional value".
Should that be: "An element consists of a label and an optional value, or a
comment."
Specification is unclear regards CR or NL in label or
comment (This implementation disallows CR or NL in labels but lets
them pass in comments).
A grammar would help. Here is RFC822:
field = field-name ":" [ field-body ] CRLF
field-name = 1*<any CHAR, excluding CTLs, SPACE, and ":">
field-body = field-body-contents
[CRLF LWSP-char field-body]
field-body-contents =
<the ASCII characters making up the field-body, as
defined in the following sections, and consisting
of combinations of atom, quoted-string, and
specials tokens, or else consisting of texts>
org.archive.util.ms package
Memory-efficient reading of .doc files. To extract the text from a .doc
file, use {@link org.archive.util.ms.Doc#getText(SeekInputStream)}. That's
basically the whole API. The other classes are necessary to make that
method work, and you can probably ignore them.
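A hedged sketch of that usage follows. Doc.getText is the entry point named above
and is assumed to return a Reader (as implied later in this description); obtaining
a SeekInputStream over a local file via org.archive.io.RandomAccessInputStream is an
assumption, so substitute whatever SeekInputStream implementation your version
provides.

import java.io.File;
import java.io.Reader;
import org.archive.io.RandomAccessInputStream;
import org.archive.io.SeekInputStream;
import org.archive.util.ms.Doc;

public class DocTextSketch {
    public static void main(String[] args) throws Exception {
        // Assumption: RandomAccessInputStream provides the SeekInputStream view of a file.
        SeekInputStream in = new RandomAccessInputStream(new File(args[0]));
        Reader text = Doc.getText(in);
        int c;
        while ((c = text.read()) != -1) {
            System.out.print((char) c);
        }
        text.close();
    }
}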
Implementation/Format Details
These APIs differ from the POI API provided by Apache in that POI wants to
load complete documents into memory. Though POI does provide an "event-driven"
API that is memory efficient, that API cannot be used to scan text across block
or piece boundaries.
This package provides a stream-based API for extracting the text of
a .doc file. At this time, the package does not provide a way to extract
style attributes, embedded images, subdocuments, change tracking information,
and so on.
There are two layers of abstraction between the contents of a .doc
file and reality. The first layer is the Block File System, and
the second layer is the piece table.
The Block File System
All .doc files are secretly file systems, like a .iso file, but insane.
A good overview of how this file system is arranged inside the file is
available at
the Jakarta POIFS system.
Subfiles and directories in a block file system are represented via the
{@link org.archive.util.ms.Entry} interface. The root directory can be
obtained via the {@link org.archive.util.ms.BlockFileSystem#getRoot()}
method. From there, the child entries can be discovered.
The file system divides its subfiles into 512-byte blocks. Those blocks
are not necessarily stored in a linear order; blocks from different subfiles
may be interspersed with each other. The
{@link org.archive.util.ms.Entry#open()} method returns an input stream that
provides a continuous view of a subfile's contents. It does so by moving
the file pointer of the .doc file behind the scenes.
It's important to keep in mind that any given read on a stream produced
by a BlockFileSystem may involve:
- Moving the file pointer to the start of the file to look up the main
block allocation table.
- Navigating the file pointer through various allocation structures located
throughout the file.
- Finally repositioning the file pointer at the start of the next block
to be read.
So, this package lowers memory consumption at the expense of greater IO
activity. A future version of this package will use internal caches to
minimize IO activity, providing tunable trade-offs between memory and IO.
The Piece Table
The second layer of abstraction between you and the contents of a .doc file
is the piece table. Some .doc files are produced using a "fast-save" feature
that only writes recent changes to the end of the file. In this case, the
text of the document may be fragmented within the document stream itself.
Note that this fragmentation is in addition to the block fragmentation
described above.
A .doc file contains several subfiles within its filesystem. The two
that are important for extracting text are named WordDocument
and 0Table . The WordDocument subfile contains the
text of the document. The 0Table subfile contains supporting
information, including the piece table.
The piece table is a simple map from logical character position to actual
subfile stream position. Additionally, each piece table entry describes whether
or not the piece stores text using 16-bit Unicode, or using 8-bit ANSI
codes. One .doc file can contain both Unicode and ANSI text. A consequence
of this is that every .doc file has a piece table, even those that
were not "fast-saved".
The reader returned by
{@link org.archive.util.ms.Doc#getText(SeekInputStream)} consults the piece
table to determine where in the WordDocument subfile the next piece of text
is located. It also uses the piece table to determine how bytes should be
converted to Unicode characters.
Note, however, that any read from such a reader may involve:
- Moving the file pointer to the piece table.
- Searching the piece table index for the next piece, which may
involve moving the file pointer many times.
- Moving the file pointer to that piece's description in the piece table.
- Moving the file pointer to the start of the piece indicated by the
description.
Since the "file pointer" in this context is the file pointer of the
subfile, each move described above may additionally involve:
- Moving the file pointer to the start of the file to look up the main
block allocation table.
- Navigating the file pointer through various allocation structures located
throughout the file.
- Finally repositioning the file pointer at the start of the next block
to be read.
A future implementation will provide an intelligent cache of the piece table,
which will hopefully reduce the IO activity required.