java.lang.Object
  org.archive.io.WriterPoolMember
    org.archive.io.arc.ARCWriter
ARCWriter

public class ARCWriter extends WriterPoolMember implements ARCConstants

Write ARC files. The assumption is that the caller manages access to this ARCWriter,
ensuring that only one thread of control accesses this ARC file instance at any one time.
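For example, a caller sharing one instance across threads might serialize access externally. This is only a sketch of the caller's responsibility (the writer, uri, and related variables are assumed to exist), not code from this class:

    synchronized (writer) {
        // ARCWriter does no locking of its own, so the caller must make sure
        // only one thread touches this instance at a time.
        writer.write(uri, contentType, hostIP, fetchBeginTimeStamp, recordLength, in);
    }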
ARC files are described in the Arc File Format document. This class writes version 1 of
the ARC file format. It also writes version 1.1, which is version 1 with data stuffed
into the body of the first ARC record in the file, the ARC file meta record itself.
An ARC file is three lines of meta data, followed by an optional 'body', a couple of
'\n' characters, and then: record, '\n', record, '\n', record, etc.
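As a rough, hypothetical illustration (all field values invented for the example), the head of an uncompressed version-1 ARC might look like:

    filedesc://hx20040109230030-0.arc 0.0.0.0 20040109230030 text/plain 77
    1 0 InternetArchive
    URL IP-address Archive-date Content-type Archive-length

    http://example.com/ 192.0.2.10 20040109230031 text/html 1234
    <1234 bytes of captured HTTP response>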
If we are writing compressed ARC files, then each ARC record is individually gzipped and
the results are concatenated to make up a single ARC file. In GZIP terms, each ARC record
is a GZIP member of the total gzipped file.
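The member-per-record layout can be reproduced with plain java.util.zip; the following is only an illustrative sketch (file name and record contents invented), not the ARCWriter implementation:

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class GzipMemberDemo {
        public static void main(String[] args) throws Exception {
            try (OutputStream arc = new FileOutputStream("demo.arc.gz")) {
                for (String record : new String[] {"record one\n", "record two\n"}) {
                    // A fresh GZIPOutputStream per record; finish() completes the
                    // member (header, deflated data, trailer) without closing 'arc'.
                    GZIPOutputStream member = new GZIPOutputStream(arc);
                    member.write(record.getBytes(StandardCharsets.UTF_8));
                    member.finish();
                }
            }
        }
    }

gzip -d treats such a concatenation of members as a single compressed stream, which is what makes the per-member layout a valid .arc.gz.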
The GZIPping of the ARC file meta data is exceptional. It is GZIPped with an extra GZIP
header, a special Internet Archive (IA) extra header field (i.e. FEXTRA is set in the
GZIP header FLG field and an extra field is appended to the GZIP header). The extra field
has little in it, but its presence denotes this GZIP as an Internet Archive gzipped ARC.
See RFC1952 to learn about the GZIP header structure.
This class then does its GZIPping in the following fashion. Each GZIP member is written
with a new instance of GZIPOutputStream -- actually ARCWriterGZIPOutputStream, so we can
get access to the underlying stream. The underlying stream stays open across
GZIPOutputStream instantiations. For the 'special' GZIPping of the ARC file meta data, we
cheat by catching the GZIPOutputStream output into a byte array and manipulating it to
add the IA GZIP header before writing to the stream.
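A minimal sketch of that splice, assuming the gzip member has already been captured into a byte array; the extra-field bytes themselves are passed in (the real values presumably live in ARCConstants) and the offsets follow RFC 1952, so this is an illustration rather than the actual ARCWriter code:

    import java.io.ByteArrayOutputStream;

    public final class GzipExtraFieldSplicer {
        private static final int FLG_OFFSET = 3;      // FLG byte within the gzip header
        private static final int FEXTRA = 0x04;       // FLG bit meaning "extra field present"
        private static final int HEADER_LENGTH = 10;  // fixed header written by GZIPOutputStream

        // gzipped: a complete member as produced by GZIPOutputStream.
        // extra:   extra-field payload (subfield id, length, data per RFC 1952).
        static byte[] addExtraField(byte[] gzipped, byte[] extra) {
            ByteArrayOutputStream out =
                new ByteArrayOutputStream(gzipped.length + extra.length + 2);
            out.write(gzipped, 0, FLG_OFFSET);                  // ID1, ID2, CM unchanged
            out.write(gzipped[FLG_OFFSET] | FEXTRA);            // turn on the FEXTRA flag
            out.write(gzipped, FLG_OFFSET + 1,
                HEADER_LENGTH - FLG_OFFSET - 1);                // MTIME, XFL, OS unchanged
            out.write(extra.length & 0xff);                     // XLEN, little-endian
            out.write((extra.length >> 8) & 0xff);
            out.write(extra, 0, extra.length);                  // the IA extra field itself
            out.write(gzipped, HEADER_LENGTH,
                gzipped.length - HEADER_LENGTH);                // deflated data, CRC32, ISIZE
            return out.toByteArray();
        }
    }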
I tried writing a resettable GZIPOutputStream and could make it work with the Sun JDK,
but the IBM JDK threw an NPE inside deflate.reset -- its zlib native call doesn't seem to
like the notion of resetting -- so I gave up on it.
Because of issues such as the above, and troubles with GZIPInputStream, we should write
our own GZIP*Streams, ones that are resettable and conscious of gzip members.
This class will write until we hit >= maxSize. The check is done at a record boundary;
records do not span ARC files. We then close the current file, open another, and continue
writing.
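In outline (a sketch only; currentFile, maxSize, and the helper names are invented, and the real bookkeeping lives in WriterPoolMember):

    void writeRecord(byte[] record) throws IOException {
        // The size test happens only between records, so a record never spans
        // two ARC files; a single record may push a file past maxSize.
        if (currentFile.length() >= maxSize) {
            closeCurrentFile();   // hypothetical helper: finish the current ARC
            openNewFile();        // hypothetical helper: start the next ARC in the series
        }
        currentStream.write(record);
    }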
TESTING: Here is how to test that produced ARC files are good using the alexa ARC c-tools:
% av_procarc hx20040109230030-0.arc.gz | av_ziparc > \
/tmp/hx20040109230030-0.dat.gz
% av_ripdat /tmp/hx20040109230030-0.dat.gz > /tmp/hx20040109230030-0.cdx
Examine the produced cdx file to make sure it makes sense. Search for 'no-type 0'. If
found, then we're opening a gzip record without data to write, which is bad.
You can also run gzip -t FILENAME; it will tell you whether the ARC makes sense to GZIP.
While being written, ARCs have a '.open' suffix appended.
author: stack
Constructor Summary

public ARCWriter(AtomicInteger serialNo, PrintStream out, File arc, boolean cmprs, String a14DigitDate, List metadata)
    Constructor. Takes a stream.

public ARCWriter(AtomicInteger serialNo, List<File> dirs, String prefix, boolean cmprs, long maxSize)
    Constructor.
    Parameters: serialNo - used to generate unique file name sequences; dirs - where to drop the ARC files; prefix - ARC file prefix to use.

public ARCWriter(AtomicInteger serialNo, List<File> dirs, String prefix, String suffix, boolean cmprs, long maxSize, List meta)
    Constructor.
    Parameters: serialNo - used to generate unique file name sequences; dirs - where to drop files; prefix - file prefix to use; cmprs - compress the records written.
Method Summary

protected String createFile()
public String createMetaline(String uri, String hostIP, String timeStamp, String mimetype, String recordLength)
protected String getMetaLine(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength)
public String getMetadataHeaderLinesTwoAndThree(String version)
protected String validateMetaLine(String metaLineStr)
    Test that the metadata line is valid before writing.
public void write(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength, ByteArrayOutputStream baos)
public void write(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength, InputStream in)
public void write(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength, ReplayInputStream ris)
ARCWriter

public ARCWriter(AtomicInteger serialNo, PrintStream out, File arc, boolean cmprs, String a14DigitDate, List metadata) throws IOException

Constructor. Takes a stream. Use with caution: there is no upper-bound check on size; it will just keep writing.
    Parameters:
        serialNo - used to generate unique file name sequences
        out - where to write
        arc - the File that out is connected to
        cmprs - compress the content written
        metadata - file meta data; can be null; a list of File and/or String objects
        a14DigitDate - if null, we'll write the current time
    Throws: IOException
ARCWriter

public ARCWriter(AtomicInteger serialNo, List<File> dirs, String prefix, boolean cmprs, long maxSize)

Constructor.
    Parameters:
        serialNo - used to generate unique file name sequences
        dirs - where to drop the ARC files
        prefix - ARC file prefix to use; if null, we use DEFAULT_ARC_FILE_PREFIX
        cmprs - compress the ARC files written; the compression is done by individually gzipping each record added to the ARC file, i.e. the ARC file is a bunch of gzipped records concatenated together
        maxSize - maximum size for ARC files written
ARCWriter

public ARCWriter(AtomicInteger serialNo, List<File> dirs, String prefix, String suffix, boolean cmprs, long maxSize, List meta)

Constructor.
    Parameters:
        serialNo - used to generate unique file name sequences
        dirs - where to drop files
        prefix - file prefix to use
        cmprs - compress the records written
        maxSize - maximum size for ARC files written
        suffix - file tail to use; if null, unused
        meta - file meta data; can be null; a list of File and/or String objects
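As a usage sketch (directory, prefix, size, and record values are all invented; close() is assumed to be inherited from WriterPoolMember), writing one record with the five-argument constructor and the InputStream write variant might look like:

    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.concurrent.atomic.AtomicInteger;

    static void writeOneRecord() throws IOException {
        byte[] body = "HTTP/1.1 200 OK\r\n\r\nhello".getBytes(StandardCharsets.ISO_8859_1);
        ARCWriter writer = new ARCWriter(new AtomicInteger(0),
            Arrays.asList(new File("/tmp/arcs")),  // dirs: where to drop the ARC files
            "test",                                // prefix
            true,                                  // cmprs: gzip each record individually
            10 * 1024 * 1024);                     // maxSize: roll to a new file past ~10 MB
        writer.write("http://example.com/", "text/html", "192.0.2.10",
            System.currentTimeMillis(), body.length,
            new ByteArrayInputStream(body));
        writer.close();                            // assumed WriterPoolMember method
    }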
getMetaLine

protected String getMetaLine(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength) throws IOException

Returns the metadata line for an ARCRecord made of the passed components.
    Parameters: uri, contentType, hostIP, fetchBeginTimeStamp, recordLength
    Throws: IOException
getMetadataHeaderLinesTwoAndThree

public String getMetadataHeaderLinesTwoAndThree(String version)
validateMetaLine

protected String validateMetaLine(String metaLineStr) throws IOException

Test that the metadata line is valid before writing.
    Parameters: metaLineStr
    Returns: the passed-in metaline
    Throws: IOException