org.archive.io.warc package
Experimental WARC Writer and Readers. Code and specification subject to change
with no guarantees of backward compatibility: i.e. newer readers
may not be able to parse WARCs written with older writers.
This code, with noted exceptions, is a loose implementation of parts of the
(unreleased and unfinished)
WARC
File Format (Version 0.9). Deviations from 0.9, outlined below in the
section Deviations from Spec., are to be proposed as amendments to the
specification to make a new revision. Since the new spec. revision will likely
be named version 0.10, code in this package writes WARCs of version 0.10 -- not
0.9.
Implementation Notes
Tools
Initial implementations of Arc2Warc and Warc2Arc
tools can be found in the package above this one, at
{@link org.archive.io.Arc2Warc} and {@link org.archive.io.Warc2Arc}
respectively. Pass --help to learn how to use each tool.
Unique ID Generator
WARC requires a GUID for each record written. A configurable unique ID
generator factory, {@link org.archive.uid.GeneratorFactory}, was added; it can
be configured to use alternate unique ID generators and defaults to
{@link org.archive.uid.UUIDGenerator}. The default implementation generates
UUIDs (using the Java 5 java.util.UUID class) as URNs in the uuid
namespace [See RFC4122].
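As an illustration of the record ID form described here, the following is a
minimal sketch that builds a urn:uuid: identifier with java.util.UUID. The class
and method names (UrnUuidSketch, newRecordId, parse) are hypothetical and are not
the org.archive.uid API.

```java
import java.util.UUID;

/** Hypothetical helper for illustration only; not part of org.archive.uid. */
final class UrnUuidSketch {
    private static final String PREFIX = "urn:uuid:";

    /** Returns a fresh random UUID rendered as a URN in the uuid namespace. */
    static String newRecordId() {
        return PREFIX + UUID.randomUUID();
    }

    /** Parses a urn:uuid: record ID back into a UUID, rejecting other schemes. */
    static UUID parse(String recordId) {
        if (!recordId.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a urn:uuid: ID: " + recordId);
        }
        return UUID.fromString(recordId.substring(PREFIX.length()));
    }
}
```

A writer would call newRecordId() once per record; parse() is only there to show
that the URN round-trips to a plain UUID.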
{@link org.archive.util.anvl ANVL}
The ANVL RFC822-like format is used for writing Named Fields in
WARCs and occasionally for metadata. An implementation was added at
{@link org.archive.util.anvl}.
Miscellaneous
When writing WARCs, the response record type is chosen as the core
record that all others associate to: i.e. all others
have a Related-Record-ID that points back to the
response.
Deviations from Spec.
The below deviations from spec. 0.9 have been realized in code and are to
be proposed as spec. amendments, with the new
revision likely to be 0.10 (Vocal assent was given by John, Gordon, and Stack
to the below at the La Honda meeting, August 8th, 2006).
mimetype in header line
Allow full mimetypes in the header line as per RFC2045 rather than the
current, shriveled mimetype that allows only type and subtype. This means
mimetypes may carry parameters: e.g.
text/plain; charset=UTF-8 or
application/http; msgtype=request.
Allowing full mimetypes, we can support the following scenarios without
further amendment to the specification and without parsers having to resort to
metadata records or to custom
Named Fields to figure out how to interpret the payload:
- Consider the case where an archiving organization would store everything
related to a capture as one record with a mimetype of
multipart/mixed; boundary=RECORD-ID. An example record
might comprise the parts
Content-Type: application/http; msgtype=request,
Content-Type: application/http; msgtype=response, and
Content-Type: text/xml+rdf (for metadata).
- Or, an archiving institution would store a capture as
multipart/alternative, with alternatives ranging from
most basic (or 'desiccated' in Kunze-speak)
-- perhaps a text/plain rendition of a PDF capture -- through to
best, the actual PDF binary itself.
To support full mimetypes, we must allow for whitespace between parameters
and allow that parameter values themselves might include whitespace
('quoted-string'). The WARC Writer converts any embedded carriage returns and
newlines to a single space.
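The line-break handling can be sketched as below. This is not the WARC Writer's
actual code: the class and method names are hypothetical, and treating a run of
consecutive CR/LF characters (such as a CRLF pair) as one break is an assumption.

```java
/** Hypothetical sketch; not the WARC Writer's own implementation. */
final class MimetypeSketch {
    /**
     * Replaces each run of carriage-return and/or newline characters with one
     * space so the mimetype can sit on a single header line.
     * Assumption: a CRLF pair counts as one break, not two.
     */
    static String flattenLineBreaks(String mimetype) {
        return mimetype.replaceAll("[\\r\\n]+", " ");
    }
}
```

Other whitespace is left alone, so a folded value that already had a space after
the line break keeps that space as well.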
Swap position of recordid and mimetype in the header line
Because the above amendment allows full mimetypes on the header line, and the
mimetype may now include whitespace, we ease the parse by moving the
mimetype to last position on the header line and recordid to second-from-last.
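Why the swap eases the parse can be seen in a small sketch: with the mimetype
last, a reader can split on the first six spaces and take the remainder whole.
The seven-field layout assumed here (id, length, type, URL, date, recordid,
mimetype) and all names in the code are assumptions for illustration, not the
package's reader.

```java
/** Hypothetical sketch of a header-line split; field layout is assumed. */
final class HeaderLineSketch {
    /** Assumed number of header-line fields; the mimetype is the last one. */
    static final int FIELD_COUNT = 7;

    /**
     * Splits on single spaces for the leading fields only, so whatever remains
     * (spaces included) is returned whole as the final, mimetype field.
     */
    static String[] parse(String headerLine) {
        String[] fields = headerLine.split(" ", FIELD_COUNT);
        if (fields.length != FIELD_COUNT) {
            throw new IllegalArgumentException(
                "Expected " + FIELD_COUNT + " fields: " + headerLine);
        }
        return fields;
    }

    /** The record ID sits second-from-last. */
    static String recordId(String[] fields) {
        return fields[FIELD_COUNT - 2];
    }

    /** The mimetype sits last and may contain spaces. */
    static String mimetype(String[] fields) {
        return fields[FIELD_COUNT - 1];
    }
}
```

Had the mimetype stayed in the middle of the line, the reader would need to
understand mimetype parameter syntax just to find where the next field begins.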
Use application/http instead of message/http
The message type has a line length maximum of 1000 characters absent a
Content-Transfer-Encoding header set to BINARY
(See the definition of message/http for talk of adherence to MIME
message line limits:
19.1 Internet Media Type message/http and application/http in
RFC2616).
Suggested Spec. Amendments
Apart from the above listed deviations, the below
changes are also suggested for inclusion in the 0.10 spec. revision.
They are mostly suggested edits; the changes are not substantive.
Allow multiple instances of a single Named Parameter
Allow that there may be multiple instances of the same Named Parameter
in any one Named Parameter block.
E.g. multiple Related-Record-IDs could prove of use.
The spec. mentions this in the 8.1 HTTP and HTTPS section but it better
belongs in the 5.2 Named Parameters preamble.
Related, add to the Named Field section a note on bidirectional
Related-Record-ID.
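For a reader or writer, allowing repeats means a Named Parameter block behaves
like a multimap rather than a plain map. A minimal sketch, assuming exact-match
names and insertion order; the class is hypothetical and is not the
org.archive.util.anvl implementation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical container allowing repeated Named Parameters. */
final class NamedFieldsSketch {
    // Name -> all values seen for that name, in the order they were added.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    /** Adds a value without overwriting earlier values of the same name. */
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns every value recorded for the name, or an empty list if none. */
    List<String> get(String name) {
        List<String> values = fields.get(name);
        if (values == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(values);
    }
}
```

With this shape, a metadata record could carry one Related-Record-ID per record
it describes.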
Miscellaneous
LaHonda below refers to the meeting of John, Gordon and Stack at
LaHonda Cafe on 16th St. on August 8th, 2006.
- Leave off 9.2 GZIP extra fields. Big section on implementing an option
that has little to do with WARCing. AGREED at LaHonda.
- But, we need to mark gzipped files as being WARC: i.e. that there is one
GZIP member per resource. It's useful so readers know how to invoke
GZIP (whether it has to be done once to get at any record or only needs
doing per record). Suggest adding a GZIP extra field in the HEAD of the
GZIP member that says 'WARC' (ARC has such a thing currently). NOT NECESSARY per LaHonda meeting.
- IP-Address for dns resource is DNS Server. Add note to this effect in
8.2 DNS.
- Section 6. is truncated -- missing text. What was intended here? SEE
ISO DOC.
- In-line ANVL definition (From Kunze). Related, can labels have
CTLs such as CRLF (Shouldn't)? When it says 'control-chars', does this include
UNICODE control characters (Should)? CHAR is described as ASCII/UTF-8 but they
are not the same (Should be UTF-8). ANVL OR NOT STILL UP IN AIR AFTER LaHonda.
Postpone to 0.11 revision.
- Fix examples. Use output of experimental ARC Writer.
- Fix ambiguity in spec. pertaining to 'smallest possible anvl-fields' noted by
Mads Alhof Kristiansen in Digital Preservation
using the WARC File Format.
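The 'GZIP member per resource' layout mentioned in the list above can be
sketched with java.util.zip: each record is compressed as its own complete GZIP
member, the members are concatenated, and a reader that knows a member's offset
can start decompressing there. The class and method names are hypothetical, and
nothing here writes the GZIP extra field that was discussed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch of one GZIP member per record. */
final class GzipMemberSketch {
    /** Compresses one record as a complete, standalone GZIP member. */
    static byte[] gzipMember(byte[] record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decompresses from the given member offset through to the end of the data. */
    static byte[] gunzipFrom(byte[] file, int offset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(
                new ByteArrayInputStream(file, offset, file.length - offset))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

In current JDKs, GZIPInputStream carries on across concatenated members, so
gunzipFrom(file, 0) yields all records in sequence, while starting at a later
member's offset yields only the records from that point on.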
Open Issues
Drop response record type
resource is sufficient. Let mimetype distinguish whether a capture has
response headers or not (As per the comment at the end of 8.1 HTTP and HTTPS,
where it allows that if there are no response headers, use the resource record
type and page mimetype rather than the response type plus a mimetype of
message/http: the difference in record types is not needed to distinguish
between the two types of capture).
Are there other capture methods that would require a response record,
that don't have a mimetype that includes response headers and content?
SMTP has rich MIME set to describe responses. Its request is
pretty much unrecordable. NNTP and FTP similar. Because of rich MIME, no
need of a special response type here.
Related, do we need the request record?
Only makes sense for HTTP?
This proposal is contentious. Gordon drew a scenario where response
would be needed to distinguish local from remote capture if an archiving
institution purposefully archived without recording headers or
if the payload itself was an archived record. In opposition, it was suggested
that should an institution choose to capture in this 'unusual' mode, crawl
metadata could be consulted to disambiguate how the capture was done (To
be further investigated. In general, definition of record types is still in
need of work).
subject-url
The ISO revision suggests that the positional parameter
subject-uri be renamed. Suggest record-url.
Other issues
- Should we allow freeform creation of custom Named Fields if they
have a MIME-like 'X-' or somesuch prefix?
- Nothing on header-line encoding (Section 11 says UTF-8).
For completeness should be US-ASCII or UTF-8, no control-chars (especially
CR or LF), etc.
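The 'no control-chars' rule suggested in the list above could be checked along
these lines. This is a sketch of one possible reading only: it assumes the line
is passed without its terminator and uses Java's notion of ISO control
characters (U+0000-U+001F and U+007F-U+009F). The class name is hypothetical.

```java
/** Hypothetical validity check for a header line. */
final class HeaderLineCheckSketch {
    /**
     * Returns true if the line contains no ISO control characters, which
     * covers CR, LF and TAB among others.
     * Assumption: the trailing line terminator has already been stripped.
     */
    static boolean isClean(String headerLine) {
        return headerLine.codePoints().noneMatch(Character::isISOControl);
    }
}
```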
warcinfo
- What to use for a scheme? Using UUID as per G suggestion.
- Also, how to populate a description of the crawl into warcinfo?
Perhaps a 'Documentation'
Named Field with a list of URLs that can be assumed
to exist somewhere in the current WARC set (We'd have to make the crawler go
get them at the start of a crawl).
- I don't want to repeat the crawl description for every WARC. How to have this
warcinfo point at an original?
related-record-id seems
insufficient.
- If the crawler config. changes, can I just write a warcinfo with
differences? How to express? Or better as metadata about a warcinfo?
- In the past
we used to get the filename from this URL header field when we were unsure of
the filename or it was unavailable (e.g. we're reading a stream). We won't be
able to do that with a UUID for the URL. So, introducing a new, optional
warcinfo Named Field 'Filename' that will be used when warcinfo is put at the
start of a file. Allow warcinfo to have a named parameter 'Filename'?
revisit
- What to write? Use a description field or just expect this info
to be present in the warcinfo? Example has request header
(inside XML). Better to use associated
request record for this
kind of info?
Requiring the Related-Record-ID (RRID) of the original is likely
an onerous requirement. Envisioning an implementation where we'd write
revisit records, we'd write such a record where the content was
judged the same or where the last-modified date had not changed since the last
fetch. If we're to write the RRID, then we'd have to maintain a table keyed by
URL with a value of page hash or of last-modified date plus the associated RRID
(the actual RRID URL, not a hash).
- Should we allow a
Description Named Field?
E.g. I add an order file as a metadata record and associate it with a
warcinfo record. The Description field could say "This is Heritrix
Order file". Same for seeds. Alternative is custom XML packaging (Scheme
could describe fields such as 'order' file) or ANVL packaging using ANVL
'comments'.
- Section 11, why was it we said we don't need a parameter or explicit
subtype for the special gzip WARC format? I don't remember. A reader needs to
know when it's reading a stream. A client would like to know so that it can
write the stream to disk with the right suffix. Recap. (Perhaps it was looking
at the MAGIC bytes -- if it starts with GZIP MAGIC and includes extra fields
that denote it as WARC, that's sufficient?).
- Section 7, on truncation, on 7.1, suggest values -- 'time', 'length' --
but allow free form description?
Leave off 'superior method of indicating truncation' paragraph. This qualifier
could be added to all sections of doc -- that a subsequent revision of any
aspect of the doc. will be superior.
Rather than
End-Length, the last record could, like MIME, have
Segment-Number-Total, a count of all segments that make up the
complete record.
From LaHonda, discussion of the revisit type. The definition was
tightened some by saying revisit is used when you choose not to store the
capture. It was thought possible that it
NOT require a pointer back to an original. Suggested it might have a
similarity judgment header -- similarity-value -- with values
between 0 and 1. Might also have analysis-method and
description. Possible methods discussed included: URI same,
length same, hash of content same, judgement based off content of HTTP HEAD
request, etc. Possible payloads might be: nothing, a diff, the hash obtained,
etc.
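The table keyed by URL that the RRID requirement would force on a writer
(discussed earlier in this section) might look like the sketch below. It covers
only the 'hash of content same' method; the class, its fields and the check()
method are hypothetical and nothing like this exists in the package.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory table for deciding when to write a revisit record. */
final class RevisitTableSketch {
    /** What was stored the last time a URL was captured. */
    static final class Seen {
        final String contentHash;
        final String recordId;

        Seen(String contentHash, String recordId) {
            this.contentHash = contentHash;
            this.recordId = recordId;
        }
    }

    // URL -> hash and record ID of the most recent stored capture.
    private final Map<String, Seen> byUrl = new HashMap<>();

    /**
     * If the URL was stored before with the same content hash, returns the
     * record ID of that earlier capture (a revisit record could point at it).
     * Otherwise remembers the new capture and returns empty.
     */
    Optional<String> check(String url, String contentHash, String newRecordId) {
        Seen previous = byUrl.get(url);
        if (previous != null && previous.contentHash.equals(contentHash)) {
            return Optional.of(previous.recordId);
        }
        byUrl.put(url, new Seen(contentHash, newRecordId));
        return Optional.empty();
    }
}
```

The table grows with the number of distinct URLs and would need to persist
across crawls, which is the cost the 'onerous requirement' remark points at.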
Unimplemented
- Record Segmentation (the 4.8
continuation record type
and the 5.2 Segment-* Named Parameters). Future TODO.
- The 4.7
conversion type. Future TODO.
TODOs
- unit tests using
multipart/* (JavaMail) reading and
writing records? Try record-id as part boundary.
- Performance: Need to add record-based buffering. GZIP'd streams
have some buffering because of the deflater but could probably do
with more.