org.archive.io.warc.v10

org.archive.io.warc package

Experimental WARC Writer and Readers. Code and specification are subject to change with no guarantees of backward compatibility: i.e. newer readers may not be able to parse WARCs written with older writers.

This code, with noted exceptions, is a loose implementation of parts of the (unreleased and unfinished) WARC File Format (Version 0.9). Deviations from 0.9, outlined below in the section Deviations from Spec., are to be proposed as amendments to the specification to make a new revision. Since the new spec. revision will likely be named version 0.10, code in this package writes WARCs of version 0.10 -- not 0.9.

Implementation Notes

Tools

Initial implementations of `Arc2Warc` and `Warc2Arc` tools can be found in the package above this one, at {@link org.archive.io.Arc2Warc} and {@link org.archive.io.Warc2Arc} respectively. Pass `--help` to learn how to use each tool.

Unique ID Generator

WARC requires a GUID for each record written. A configurable unique ID {@link org.archive.uid.GeneratorFactory} was added; it can be configured to use alternate unique ID generators and defaults to {@link org.archive.uid.UUIDGenerator}. The default implementation generates UUIDs (using java5 `java.util.UUID`) with a `urn` scheme using the uuid namespace [See RFC4122].

{@link org.archive.util.anvl ANVL}

The ANVL RFC822-like format is used for writing `Named Fields` in WARCs and occasionally for metadata. An implementation was added at {@link org.archive.util.anvl}.

Miscellaneous

When writing WARCs, the `response` record type is chosen as the core record that all others associate to: i.e. all others have a `Related-Record-ID` that points back to the `response`.

Deviations from Spec.

The deviations below from spec. 0.9 have been realized in code and are to be proposed as spec. amendments, with the new revision likely to be 0.10 (vocal assent was given by John, Gordon, and Stack to the below at the La Honda meeting, August 8th, 2006).

mimetype in header line

Allow full mimetypes in the header line as per RFC2045 rather than the current, shriveled mimetype that allows only type and subtype. This will mean mimetypes are allowed parameters: e.g. `text/plain; charset=UTF-8` or `application/http; msgtype=request`.

Allowing full mimetypes, we can support the following scenarios without further amendment to the specification and without parsers having to resort to `metadata` records or to custom `Named Fields` to figure out how to interpret the payload:

Consider the case where an archiving organization would store everything related to a capture as one record with a mimetype of `multipart/mixed; boundary=RECORD-ID`. An example record might comprise the parts `Content-Type: application/http; msgtype=request`, `Content-Type: application/http; msgtype=response`, and `Content-Type: text/xml+rdf` (for metadata).

Or, an archiving institution would store a capture as `multipart/alternative` parts ranging from most basic (or 'desiccated' in Kunze-speak) -- perhaps a `text/plain` rendition of a PDF capture -- through to `best`, the actual PDF binary itself.

To support full mimetypes, we must allow for whitespace between parameters and allow that parameter values themselves might include whitespace ('quoted-string'). The WARC Writer converts any embedded carriage-returns and newlines to a single space.

Swap position of recordid and mimetype in the header line

Because of the above amendment where we allow full mimetypes on the header line, and since the mimetype may now include whitespace, we ease the parse by moving the mimetype to last position on the header line and the recordid to second-from-last.

Use application/http instead of message/http

The `message` type has a line length maximum of 1000 characters absent a `Content-Transfer-Encoding` header set to `BINARY`. (See the definition of message/http for talk of adherence to MIME `message` line limits: see 19.1 Internet Media Type message/http and application/http in RFC2616.)

Suggested Spec. Amendments

Apart from the deviations listed above, the changes below are also suggested for inclusion in the 0.10 spec. revision. They are mostly suggested edits; the changes are not substantive.

Allow multiple instances of a single Named Parameter

Allow that there may be multiple instances of the same Named Parameter in any one Named Parameter block. E.g. multiple `Related-Record-ID`s could prove of use. The spec. mentions this in the 8.1 HTTP and HTTPS section but it better belongs in the 5.2 Named Parameters preamble.

Related, add to the `Named Field` section a note on bidirectional `Related-Record-ID`.

Miscellaneous

LaHonda in the below is a reference to the meeting of John, Gordon and Stack at LaHonda Cafe on 16th St., on August 8th, 2006.

Leave off 9.2 GZIP extra fields. It is a big section on implementing an option that has little to do with WARCing. AGREED at LaHonda.

But, we need to mark gzipped files as being WARC: i.e. that the GZIP is a member per resource. It's useful so readers know how to invoke GZIP (that it has to be done once to get at any record, or just needs to be done per record). Suggest adding a GZIP extra field in the HEAD of the GZIP member that says 'WARC' (ARC has such a thing currently). NOT NECESSARY per LaHonda meeting.

IP-Address for a dns resource is the DNS Server. Add a note to this effect in 8.2 DNS.

Section 6. is truncated -- missing text. What was intended here? SEE ISO DOC.

In-line ANVL definition (from Kunze). Related, can labels have CTLs such as CRLF (shouldn't)? When it says 'control-chars', does this include UNICODE control characters (should)? CHAR is described as ASCII/UTF-8 but they are not the same (should be UTF-8). ANVL OR NOT STILL UP IN AIR AFTER LaHonda. Postpone to 0.11 revision.

Fix examples. Use output of experimental ARC Writer.

Fix ambiguity in spec. pertaining to 'smallest possible anvl-fields' noted by Mads Alhof Kristiansen in Digital Preservation using the WARC File Format.

Open Issues

Drop response record type

`resource` is sufficient.
Let mimetype distinguish whether a capture has response headers or not (as per the comment at the end of 8.1 HTTP and HTTPS, which allows that if there are no response headers, the resource record type and page mimetype are used rather than the response type plus a mimetype of message/http: the difference in record types is not needed to distinguish between the two types of capture).

Are there other capture methods that would require a response record, that don't have a mimetype that includes response headers and content? SMTP has a rich MIME set to describe responses. Its request is pretty much unrecordable. NNTP and FTP are similar. Because of rich MIME, there is no need of a special response type here.

Related, do we need the `request` record? Does it only make sense for HTTP?

This proposal is contentious. Gordon drew a scenario where response would be needed to distinguish local from remote capture if an archiving institution purposefully archived without recording headers or if the payload itself was an archived record. In opposition, it was suggested that should an institution choose to capture in this 'unusual' mode, crawl metadata could be consulted to disambiguate confusion on how the capture was done (to be further investigated; in general, definition of record types is still in need of work).

subject-url

The ISO revision suggests that the positional parameter `subject-uri` be renamed. Suggest `record-url`.

Other issues

Should we allow freeform creation of custom Named Fields if they have a MIME-like 'X-' or some such prefix?

Nothing on header-line encoding (Section 11 says UTF-8). For completeness it should be US-ASCII or UTF-8, no control-chars (especially CR or LF), etc.

warcinfo

What for a scheme? Using UUID as per G suggestion.

Also, how to populate the description of the crawl into warcinfo? A 'Documentation' Named Field with a list of URLs that can be assumed to exist somewhere in the current WARC set (we'd have to make the crawler go get them at the start of a crawl)? I don't want to repeat the crawl description for every WARC.

How to have this warcinfo point at an original? related-record-id seems insufficient.

If the crawler config. changes, can I just write a warcinfo with the differences? How to express this? Or is it better as metadata about a warcinfo?

In the past we used to get the filename from this URL header field when we were unsure of the filename or it was unavailable (we're reading a stream). We won't be able to do that with a UUID for URL. So, introducing a new (optional) warcinfo Named Field 'Filename' that will be used when warcinfo is put at the start of a file. Allow warcinfo to have a named parameter 'Filename'?

revisit

What to write? Use a description field or just expect this info to be present in the warcinfo? The example has a request header (inside XML). Better to use the associated request record for this kind of info?

Related-Record-ID (RRID) of the original is likely an onerous requirement. Envisioning an implementation where we'd write revisit records, we'd write such a record where content was judged the same or where the date had not changed since the last fetch. If we're to write the RRID, then we'd have to maintain a table keyed by URL with a value of page hash or of last modified-date plus associated RRID (actual RRID URL, not a hash).

Should we allow a Description Named Field? E.g. I add an order file as a metadata record and associate it with a warcinfo record. The Description field could say "This is Heritrix Order file". Same for seeds. An alternative is custom XML packaging (a scheme could describe fields such as the 'order' file or ANVL packaging using ANVL 'comments').

Section 11: why was it we said we don't need a parameter or explicit subtype for the special gzip WARC format? I don't remember. A reader needs to know when it's reading a stream. A client would like to know so it can write the stream to disk with the right suffix? Recap. (Perhaps it was looking at the MAGIC bytes -- if it starts with GZIP MAGIC and includes extra fields that denote it WARC, that's sufficient?)
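
The MAGIC-byte idea floated in the Section 11 question might look something like the following. This is only a sketch and not code from this package: the class `GzipSniffer` and its method names are hypothetical. The magic bytes and the FEXTRA flag bit are taken from the GZIP file format (RFC 1952). Deciding whether an extra field actually denotes WARC would require parsing the extra subfields, which the notes above leave undefined, so the sketch stops at reporting that an extra field is present.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of org.archive.io.warc.
final class GzipSniffer {
    private static final int ID1 = 0x1f;    // first GZIP magic byte (RFC 1952)
    private static final int ID2 = 0x8b;    // second GZIP magic byte
    private static final int FEXTRA = 0x04; // FLG bit: an extra field is present

    /** True if the leading bytes carry the GZIP magic. */
    static boolean looksLikeGzip(byte[] head) {
        return head.length >= 2
            && (head[0] & 0xff) == ID1
            && (head[1] & 0xff) == ID2;
    }

    /** True if the GZIP header flags (fourth byte) announce an extra field. */
    static boolean hasExtraField(byte[] head) {
        return looksLikeGzip(head) && head.length >= 4 && (head[3] & FEXTRA) != 0;
    }

    /** Reads up to n bytes and pushes them back so the caller can still read the stream from its start. */
    static byte[] peek(PushbackInputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int read = 0;
        while (read < n) {
            int r = in.read(buf, read, n - read);
            if (r < 0) {
                break; // stream shorter than n bytes
            }
            read += r;
        }
        in.unread(buf, 0, read);
        return Arrays.copyOf(buf, read);
    }

    public static void main(String[] args) throws IOException {
        // Build a small gzipped stream to sniff.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write("example payload".getBytes(StandardCharsets.UTF_8));
        }
        // The pushback buffer must be at least as large as the peeked length.
        PushbackInputStream in =
            new PushbackInputStream(new ByteArrayInputStream(bytes.toByteArray()), 4);
        byte[] head = peek(in, 4);
        System.out.println("gzip magic: " + looksLikeGzip(head));
        System.out.println("extra field flag: " + hasExtraField(head));
    }
}
```

Pushing the peeked bytes back means the same stream can afterwards be handed to a GZIP or plain reader unchanged, which is the situation a reader of a stream (rather than a named file) is in.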

Section 7, on truncation: in 7.1, suggest values -- 'time', 'length' -- but allow free-form description? Leave off the 'superior method of indicating truncation' paragraph. This qualifier could be added to all sections of the doc -- that a subsequent revision of any aspect of the doc. will be superior.

Rather than End-Length, like MIME, the last record could have Segment-Number-Total, a count of all segments that make up the complete record.

From LaHonda, discussion of the revisit type. The definition was tightened some by saying revisit is used when you choose not to store the capture. It was thought possible that it NOT require a pointer back to an original. Suggested it might have a similarity judgment header -- similarity-value -- with values between 0 and 1. It might also have analysis-method and description. Possible methods discussed included: URI same, length same, hash of content same, judgement based off content of HTTP HEAD request, etc. Possible payloads might be: nothing, a diff, the hash obtained, etc.

Unimplemented

Record Segmentation (4.8 continuation record type and the 5.2 Segment-* Named Parameters). Future TODO.

4.7 conversion type. Future TODO.

TODOs

Unit tests using multipart/* (JavaMail) reading and writing records? Try record-id as part boundary.

Performance: need to add record-based buffering. GZIP'd streams have some buffering because of the deflater but could probably do with more.
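
One way the record-based buffering in the Performance TODO might be approached is sketched below, combined with the 'GZIP member per resource' layout mentioned earlier. This is not the package's writer: `BufferedMemberWriter`, `NonClosing`, and the buffer size are hypothetical names and values chosen for illustration. Each record is written through a `BufferedOutputStream` into its own `GZIPOutputStream`, and the underlying stream is shielded so that closing one member does not close the file.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not the ExperimentalWARCWriter implementation.
final class BufferedMemberWriter {
    /** Passes writes through but turns close() into flush(), keeping the file open. */
    private static final class NonClosing extends FilterOutputStream {
        NonClosing(OutputStream out) { super(out); }
        @Override public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len); // avoid FilterOutputStream's byte-at-a-time default
        }
        @Override public void close() throws IOException { flush(); }
    }

    private final OutputStream file;
    private final int bufferSize; // assumed tunable; no value is given in the notes

    BufferedMemberWriter(OutputStream file, int bufferSize) {
        this.file = file;
        this.bufferSize = bufferSize;
    }

    /** Writes one record as its own GZIP member, coalescing small writes before the deflater. */
    void writeRecord(byte[] record) throws IOException {
        try (OutputStream member = new BufferedOutputStream(
                new GZIPOutputStream(new NonClosing(file)), bufferSize)) {
            member.write(record);
        } // closing finishes this member and releases its deflater; `file` stays open
    }

    /** Writes each string as a separate member, then reads all members back as one stream. */
    static String roundTrip(String... records) {
        try {
            ByteArrayOutputStream file = new ByteArrayOutputStream();
            BufferedMemberWriter writer = new BufferedMemberWriter(file, 8192);
            for (String r : records) {
                writer.writeRecord(r.getBytes(StandardCharsets.UTF_8));
            }
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            try (GZIPInputStream in =
                     new GZIPInputStream(new ByteArrayInputStream(file.toByteArray()))) {
                byte[] buf = new byte[4096];
                for (int n; (n = in.read(buf)) != -1; ) {
                    plain.write(buf, 0, n);
                }
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("record-one\n", "record-two\n"));
    }
}
```

Reading back through a single `GZIPInputStream` walks the concatenated members in order. Getting at one record without decompressing the ones before it would additionally need the byte offset of each member, which this sketch does not track.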
Java Source File Name	Type	Comment
ExperimentalWARCWriter.java	Class	Experimental WARC implementation. Based on unreleased version 0.9 of WARC File Format document.
ExperimentalWARCWriterTest.java	Class	Test Writer and Reader.
WARCReader.java	Class	WARCReader.
WARCReaderFactory.java	Class	Factory for WARC Readers.
WARCRecord.java	Class	A WARC file Record.
WARCRecordTest.java	Class
WARCWriterPool.java	Class	A pool of WARCWriters.
