Java Doc for ARCWriter.java (Web Crawler » heritrix » org.archive.io.arc)



java.lang.Object
   org.archive.io.WriterPoolMember
      org.archive.io.arc.ARCWriter

ARCWriter
public class ARCWriter extends WriterPoolMember implements ARCConstants
Writes ARC files. The caller is assumed to manage access to this ARCWriter, ensuring that only one thread of control accesses this ARC file instance at any one time.

ARC files are described here: Arc File Format. This class does version 1 of the ARC file format. It also writes version 1.1 which is version 1 with data stuffed into the body of the first arc record in the file, the arc file meta record itself.

An ARC file is three lines of meta data followed by an optional 'body' and then a couple of '\n' and then: record, '\n', record, '\n', record, etc. If we are writing compressed ARC files, then each of the ARC file records is individually gzipped and concatenated together to make up a single ARC file. In GZIP terms, each ARC record is a GZIP member of a total gzip'd file.
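The "one GZIP member per record" layout can be illustrated with the JDK classes alone. The following is a minimal sketch, not the ARCWriter implementation: the class and method names are hypothetical, the "file" is an in-memory buffer, and it assumes a JDK whose GZIPInputStream reads concatenated members as one stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch class; not part of Heritrix.
class GzipMembersSketch {

    /** Gzips each record as its own GZIP member and concatenates the members. */
    static byte[] writeMembers(String... records) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        for (String record : records) {
            // A fresh GZIPOutputStream per record. Closing it also "closes" the
            // ByteArrayOutputStream, which is a no-op, so the buffer stays usable.
            // With a real file stream you would call finish() instead of close().
            try (GZIPOutputStream gz = new GZIPOutputStream(file)) {
                gz.write(record.getBytes(StandardCharsets.UTF_8));
            }
        }
        return file.toByteArray();
    }

    /** Reads every member back as one continuous stream of bytes. */
    static String readAll(byte[] gzipped) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] arcLike = writeMembers("record one\n", "record two\n", "record three\n");
        System.out.print(readAll(arcLike));
    }
}
```

Because every member is a complete gzip stream, a reader that knows a record's offset can start decompressing there without touching earlier records.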

The GZIPping of the ARC file meta data is exceptional. It is GZIPped with an extra GZIP header field, a special Internet Archive (IA) extra header field (i.e. FEXTRA is set in the GZIP header FLG field and an extra field is appended to the GZIP header). The extra field has little in it, but its presence marks this GZIP as an Internet Archive gzipped ARC. See RFC 1952 to learn about the GZIP header structure.

This class does its GZIPping in the following fashion. Each GZIP member is written with a new instance of GZIPOutputStream (actually ARCWriterGZIPOutputStream, so we can get access to the underlying stream). The underlying stream stays open across GZIPOutputStream instantiations. For the 'special' GZIPping of the ARC file meta data, we cheat by capturing the GZIPOutputStream output in a byte array and adding the IA GZIP header to it before writing to the stream.
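The byte-array trick for the meta data record can be sketched as follows, using only the RFC 1952 header layout: capture one gzipped member, splice an extra field in after the fixed 10-byte header, and set FEXTRA in FLG. This is an illustration under stated assumptions, not the ARCWriter code. The class name, the method names and the subfield bytes ('Z', 'Z' with a 4-byte payload) are placeholders, since the actual IA field contents are not given on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch class; not part of Heritrix.
class GzipExtraFieldSketch {
    private static final int HEADER_LEN = 10; // fixed part of an RFC 1952 header
    private static final int FLG_OFFSET = 3;  // position of the FLG byte
    private static final int FEXTRA = 0x04;   // FLG bit 2

    /** Gzips data into a single member held in a byte array. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    /** Returns a copy of member with one extra subfield (si1, si2, payload) added. */
    static byte[] addExtraField(byte[] member, byte si1, byte si2, byte[] payload) {
        if (member.length < HEADER_LEN
                || (member[0] & 0xff) != 0x1f || (member[1] & 0xff) != 0x8b) {
            throw new IllegalArgumentException("not a GZIP member");
        }
        if (member[FLG_OFFSET] != 0) {
            // Keep the sketch simple: only handle headers without optional fields.
            throw new IllegalArgumentException("header already has optional fields");
        }
        int xlen = 4 + payload.length; // subfield header (SI1, SI2, LEN) plus data
        if (xlen > 0xffff) {
            throw new IllegalArgumentException("extra field too long");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(member.length + 2 + xlen);
        out.write(member, 0, HEADER_LEN);
        out.write(xlen & 0xff);                 // XLEN, little-endian
        out.write((xlen >>> 8) & 0xff);
        out.write(si1);
        out.write(si2);
        out.write(payload.length & 0xff);       // subfield LEN, little-endian
        out.write((payload.length >>> 8) & 0xff);
        out.write(payload, 0, payload.length);
        out.write(member, HEADER_LEN, member.length - HEADER_LEN);
        byte[] result = out.toByteArray();
        result[FLG_OFFSET] |= FEXTRA;
        return result;
    }

    /** Decompresses a member; GZIPInputStream skips the extra field. */
    static byte[] gunzip(byte[] member) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(member))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }
}
```

The compressed body and trailer are copied unchanged, so ordinary gzip readers still decompress the member and simply ignore the added field.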

I tried writing a resettable GZIPOutputStream and could make it work with the Sun JDK, but the IBM JDK threw an NPE inside deflate.reset (its zlib native call doesn't seem to like the notion of resetting), so I gave up on it.

Because of the above and troubles with GZIPInputStream, we should write our own GZIP*Streams, ones that are resettable and conscious of gzip members.

This class will write until we hit >= maxSize. The check is done at record boundaries, so records do not span ARC files. We then close the current file, open another, and continue writing.
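The rollover rule can be sketched with in-memory buffers standing in for ARC files. Everything here (class name, methods, the use of ByteArrayOutputStream as a "file") is hypothetical and only illustrates the boundary check; it is not how ARCWriter or WriterPoolMember is implemented.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch class; not part of Heritrix.
class RollingWriterSketch {
    private final long maxSize;
    private final List<ByteArrayOutputStream> files = new ArrayList<>();
    private ByteArrayOutputStream current;

    RollingWriterSketch(long maxSize) {
        this.maxSize = maxSize;
    }

    /** Writes one whole record, rolling to a new "file" first if the current one is full. */
    void writeRecord(byte[] record) {
        // The size check happens only here, at a record boundary, so a record is
        // never split and a file may exceed maxSize by up to one record.
        if (current == null || current.size() >= maxSize) {
            current = new ByteArrayOutputStream();
            files.add(current);
        }
        current.write(record, 0, record.length);
    }

    /** Sizes of the files produced so far, in creation order. */
    List<Integer> fileSizes() {
        List<Integer> sizes = new ArrayList<>();
        for (ByteArrayOutputStream f : files) {
            sizes.add(f.size());
        }
        return sizes;
    }

    public static void main(String[] args) {
        RollingWriterSketch w = new RollingWriterSketch(10);
        for (int i = 0; i < 3; i++) {
            w.writeRecord(new byte[6]);
        }
        System.out.println(w.fileSizes()); // prints [12, 6]
    }
}
```

With maxSize 10 and 6-byte records, the second record is still written to the first file (6 < 10), and only the third record triggers the rollover.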

TESTING: Here is how to test that produced ARC files are good using the alexa ARC c-tools:

 % av_procarc hx20040109230030-0.arc.gz | av_ziparc > \
 /tmp/hx20040109230030-0.dat.gz
 % av_ripdat /tmp/hx20040109230030-0.dat.gz > /tmp/hx20040109230030-0.cdx
 
Examine the produced cdx file to make sure it makes sense. Search for 'no-type 0'. If found, then we're opening a gzip record w/o data to write. This is bad.

You can also do gzip -t FILENAME and it will tell you if the ARC makes sense to GZIP.
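A rough Java counterpart to gzip -t is to decompress the whole file and discard the output, since GZIPInputStream checks each member's CRC in the trailer. This is a sketch with hypothetical class and method names, and it assumes a JDK whose GZIPInputStream continues across concatenated members.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch class; not part of Heritrix.
class GzipCheckSketch {

    /** Returns true if the stream decompresses to the end without an error. */
    static boolean isReadableGzip(InputStream raw) {
        try (GZIPInputStream in = new GZIPInputStream(raw)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // Discard the data; we only care whether reading succeeds.
            }
            return true;
        } catch (IOException e) {
            // Bad magic number, truncated input and CRC mismatches all end up here.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: GzipCheckSketch FILENAME");
            return;
        }
        try (InputStream raw = new FileInputStream(args[0])) {
            System.out.println(isReadableGzip(raw) ? "OK" : "corrupt");
        }
    }
}
```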

While being written, ARCs have a '.open' suffix appended.
author:
   stack




Constructor Summary
public  ARCWriter(AtomicInteger serialNo, PrintStream out, File arc, boolean cmprs, String a14DigitDate, List metadata)
     Constructor. Takes a stream.
public  ARCWriter(AtomicInteger serialNo, List<File> dirs, String prefix, boolean cmprs, long maxSize)
     Constructor.
Parameters:
  serialNo - used to generate unique file name sequences
  dirs - Where to drop the ARC files.
  prefix - ARC file prefix to use.
public  ARCWriter(AtomicInteger serialNo, List<File> dirs, String prefix, String suffix, boolean cmprs, long maxSize, List meta)
     Constructor.
Parameters:
  serialNo - used to generate unique file name sequences
  dirs - Where to drop files.
  prefix - File prefix to use.
  cmprs - Compress the records written.

Method Summary
protected  String createFile()

public  String createMetaline(String uri, String hostIP, String timeStamp, String mimetype, String recordLength)

protected  String getMetaLine(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength)

public  String getMetadataHeaderLinesTwoAndThree(String version)

protected  String validateMetaLine(String metaLineStr)
     Test that the metadata line is valid before writing.
public  void write(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength, ByteArrayOutputStream baos)

public  void write(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength, InputStream in)

public  void write(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength, ReplayInputStream ris)



Constructor Detail
ARCWriter
public ARCWriter(AtomicInteger serialNo, PrintStream out, File arc, boolean cmprs, String a14DigitDate, List metadata) throws IOException
Constructor. Takes a stream. Use with caution. There is no upper-bound check on size; it will just keep writing.
Parameters:
  serialNo - used to generate unique file name sequences
  out - Where to write.
  arc - File that out is connected to.
  cmprs - Compress the content written.
  metadata - File meta data. Can be null. Is a list of File and/or String objects.
  a14DigitDate - If null, we'll write the current time.
throws:
  IOException -



ARCWriter
public ARCWriter(AtomicInteger serialNo, List<File> dirs, String prefix, boolean cmprs, long maxSize)
Constructor.
Parameters:
  serialNo - used to generate unique file name sequences
  dirs - Where to drop the ARC files.
  prefix - ARC file prefix to use. If null, we use DEFAULT_ARC_FILE_PREFIX.
  cmprs - Compress the ARC files written. The compression is done by individually gzipping each record added to the ARC file, i.e. the ARC file is a bunch of gzipped records concatenated together.
  maxSize - Maximum size for ARC files written.



ARCWriter
public ARCWriter(AtomicInteger serialNo, List<File> dirs, String prefix, String suffix, boolean cmprs, long maxSize, List meta)
Constructor.
Parameters:
  serialNo - used to generate unique file name sequences
  dirs - Where to drop files.
  prefix - File prefix to use.
  cmprs - Compress the records written.
  maxSize - Maximum size for ARC files written.
  suffix - File tail to use. If null, unused.
  meta - File meta data. Can be null. Is a list of File and/or String objects.




Method Detail
createFile
protected String createFile() throws IOException



createMetaline
public String createMetaline(String uri, String hostIP, String timeStamp, String mimetype, String recordLength)



getMetaLine
protected String getMetaLine(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength) throws IOException

Parameters:
  uri -
  contentType -
  hostIP -
  fetchBeginTimeStamp -
  recordLength -
return:
  Metadata line for an ARCRecord made of passed components.
exception:
  IOException -
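As an illustration of what such a metadata line looks like, the ARC version 1 URL-record line is five space-separated fields (URL, IP address, 14-digit archive date, content type, length) ending in a newline. The sketch below builds one from the same kinds of arguments. It is not the ARCWriter code: the class and method names are hypothetical, and it assumes the timestamp is epoch milliseconds rendered in UTC.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch class; not part of Heritrix.
class MetaLineSketch {
    // 14-digit date, e.g. 20040109230030 (assumed to be UTC).
    private static final DateTimeFormatter ARC_DATE =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    /** Builds "URL IP date content-type length\n" from the passed components. */
    static String metaLine(String uri, String contentType, String hostIP,
                           long fetchBeginMillis, long recordLength) {
        String date = ARC_DATE.format(Instant.ofEpochMilli(fetchBeginMillis));
        return uri + ' ' + hostIP + ' ' + date + ' ' + contentType + ' '
                + recordLength + '\n';
    }

    public static void main(String[] args) {
        System.out.print(metaLine("http://example.com/", "text/html",
                "192.0.2.1", 0L, 42L));
        // prints: http://example.com/ 192.0.2.1 19700101000000 text/html 42
    }
}
```

Since the fields are space-delimited, none of them may contain a space, which is one reason a line is worth validating before it is written.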



getMetadataHeaderLinesTwoAndThree
public String getMetadataHeaderLinesTwoAndThree(String version)



validateMetaLine
protected String validateMetaLine(String metaLineStr) throws IOException
Test that the metadata line is valid before writing.
Parameters:
  metaLineStr -
return:
  The passed in metaline.
throws:
  IOException -



write
public void write(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength, ByteArrayOutputStream baos) throws IOException



write
public void write(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength, InputStream in) throws IOException



write
public void write(String uri, String contentType, String hostIP, long fetchBeginTimeStamp, long recordLength, ReplayInputStream ris) throws IOException



Fields inherited from org.archive.io.WriterPoolMember
final public static String DEFAULT_PREFIX
final public static String DEFAULT_SUFFIX
final public static String HOSTNAME_VARIABLE
final public static String UTF8

Methods inherited from org.archive.io.WriterPoolMember
public void checkSize() throws IOException
protected File checkWriteable(File d)
public void close() throws IOException
protected String createFile() throws IOException
protected String createFile(File file) throws IOException
protected void flush() throws IOException
protected String getBaseFilename()
protected String getCreateTimestamp()
public File getFile()
protected File getNextDirectory(List<File> dirs) throws IOException
protected OutputStream getOutputStream()
public long getPosition() throws IOException
protected synchronized TimestampSerialno getTimestampSerialNo()
protected synchronized TimestampSerialno getTimestampSerialNo(String timestamp)
public boolean isCompressed()
protected void postWriteRecordTasks() throws IOException
protected void preWriteRecordTasks() throws IOException
protected void readFullyFrom(InputStream is, long recordLength, byte[] b) throws IOException
protected void readToLimitFrom(InputStream is, long limit, byte[] b) throws IOException
protected void write(byte[] b) throws IOException
protected void write(byte[] b, int off, int len) throws IOException
protected void write(int b) throws IOException

Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException
