Java Doc for ARCReader.java (heritrix, package org.archive.io.arc)



java.lang.Object
   org.archive.io.ArchiveReader
      org.archive.io.arc.ARCReader

ARCReader
public abstract class ARCReader extends ArchiveReader implements ARCConstants
Get an iterator on an ARC file or get a record by absolute position. ARC files are described in the Arc File Format document.

This class knows how to parse an ARC file. Pass it a file path or a URL to an ARC. It can parse ARC versions 1 and 2.

The iterator returns ARCRecord instances even though Iterator.next is declared to return java.lang.Object, so cast the returned value.

Profiling java.io against a memory-mapped ByteBufferInputStream shows the latter to be slightly slower, but not by much. TODO: Test more. To switch implementations, just change ARCReader.getInputStream(File,long).
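The two ways of opening a stream at an absolute offset that the profiling note compares can be sketched as follows. This is a minimal illustration under stated assumptions, not the Heritrix implementation: the class `OffsetStreamSketch` and both method names are hypothetical, and the memory-mapped variant uses a small anonymous `InputStream` rather than Heritrix's ByteBufferInputStream.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical helper for illustration; not part of Heritrix. */
final class OffsetStreamSketch {

    /** Plain java.io: open the file and position its channel at the offset. */
    static InputStream openWithJavaIo(File f, long offset) throws IOException {
        FileInputStream in = new FileInputStream(f);
        in.getChannel().position(offset);
        return in;
    }

    /** Memory-mapped: map the region from the offset to end of file and read from the buffer. */
    static InputStream openMemoryMapped(File f, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r");
             FileChannel channel = raf.getChannel()) {
            long size = channel.size() - offset;
            // The mapping remains valid after the channel is closed.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, offset, size);
            return new InputStream() {
                @Override
                public int read() {
                    return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
                }

                @Override
                public int read(byte[] b, int off, int len) {
                    if (len == 0) {
                        return 0;
                    }
                    if (!buf.hasRemaining()) {
                        return -1;
                    }
                    int n = Math.min(len, buf.remaining());
                    buf.get(b, off, n);
                    return n;
                }
            };
        }
    }
}
```

Both methods return a stream whose first byte is the byte at `offset`, so either could sit behind a method shaped like getInputStream(File,long) for a profiling comparison.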
author:
   stack
version:
   $Date: 2007-04-06 00:29:39 +0000 (Fri, 06 Apr 2007) $ $Revision: 5039 $



Field Summary
 Logger logger
    

Constructor Summary
 ARCReader()
    

Method Summary
protected ARCRecord createArchiveRecord(InputStream is, long offset)
     Create new arc record. Encapsulates the housekeeping involved in creating a new record.

     Call this method at end of constructor to read in the arcfile header.

public static void createCDXIndexFile(String urlOrPath)
     Generate a CDX index file for an ARC file.
public void dump(boolean compress)
    
protected List<String> fixSpaceInURL(List<String> values, int requiredSize)
     Fix spaces in URLs. The ARCWriter used to write URLs containing spaces into the ARC. See [ 1010966 ] crawl.log has URIs with spaces in them. This method fixes up such headers, converting all spaces found to '%20'.
Parameters:
  values - List of metadata values.
  requiredSize - Expected size of resultant values list.
public ARCReader getDeleteFileOnCloseReader(File f)
     Returns an ArchiveReader that will delete a local file on close.
public String getDotFileExtension()
    
public String getFileExtension()
    
public String getVersion()
     Returns version of this ARC file.
protected void gotoEOR(ArchiveRecord record)
     Skip over any trailing new lines at end of the record so we're lined up ready to read the next.
protected boolean isAlignedOnFirstRecord()
    
protected boolean isDate(String date)
    
protected boolean isLegitimateIPValue(String ip)
    
protected boolean isNumber(String n)
    
public boolean isParseHttpHeaders()
    
public static void main(String[] args)
     Command-line interface to ARCReader.
protected boolean output(String format)
    
protected static void output(ARCReader reader, String format)
     Write out the arcfile.
public boolean outputRecord(String format)
    
protected void setAlignedOnFirstRecord(boolean alignedOnFirstRecord)
    
public void setParseHttpHeaders(boolean parse)
    

Field Detail
logger
Logger logger(Code)




Constructor Detail
ARCReader
ARCReader()(Code)




Method Detail
createArchiveRecord
protected ARCRecord createArchiveRecord(InputStream is, long offset) throws IOException
Create new arc record. Encapsulates the housekeeping involved in creating a new record.

Call this method at end of constructor to read in the arcfile header. There will be problems reading subsequent arc records if you don't, since the arcfile header has the list of metadata fields for all records that follow.

When parsing through ARCs writing out CDX info, we spend about 38% of CPU in here, about 30% of which is in getTokenizedHeaderLine, of which 16% is reading.
Parameters:
  is - InputStream to use.
  offset - Absolute offset into arc file.
Returns:
  An arc record.
throws:
  IOException -
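Why the arcfile header must be read first can be sketched as follows: the header supplies the field names, and each later record header line only supplies values in the same order. This is a minimal illustration, not the Heritrix code. The class `ArcHeaderFieldsSketch` and its methods are hypothetical, and the field line in the usage note is assumed to be a version-1 style list.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Heritrix. */
final class ArcHeaderFieldsSketch {

    /** Split the field-name line from the arcfile header into an ordered list of names. */
    static List<String> parseFieldNames(String fieldLine) {
        return Arrays.asList(fieldLine.trim().split("\\s+"));
    }

    /** Pair each value of a record header line with the field name at the same position. */
    static Map<String, String> toMetadata(List<String> fieldNames, String recordHeaderLine) {
        String[] values = recordHeaderLine.trim().split("\\s+");
        if (values.length != fieldNames.size()) {
            throw new IllegalArgumentException(
                "Expected " + fieldNames.size() + " values but found " + values.length);
        }
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            metadata.put(fieldNames.get(i), values[i]);
        }
        return metadata;
    }
}
```

For example, with the assumed field line `URL IP-address Archive-date Content-type Archive-length`, the record header line `http://www.example.com/ 192.0.2.1 20070406002939 text/html 1024` maps `Archive-date` to `20070406002939`. Without the field names there is no way to label the values, which is the problem the description warns about.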




createCDXIndexFile
public static void createCDXIndexFile(String urlOrPath) throws IOException, java.text.ParseException
Generate a CDX index file for an ARC file.
Parameters:
  urlOrPath - The ARC file to generate a CDX index for
throws:
  IOException -
  java.text.ParseException -



dump
public void dump(boolean compress) throws IOException, java.text.ParseException



fixSpaceInURL
protected List<String> fixSpaceInURL(List<String> values, int requiredSize)
Fix spaces in URLs. The ARCWriter used to write URLs containing spaces into the ARC. See [ 1010966 ] crawl.log has URIs with spaces in them. This method fixes up such headers, converting all spaces found to '%20'.
Parameters:
  values - List of metadata values.
  requiredSize - Expected size of resultant values list.
Returns:
  New list if we successfully fixed up values, or the original list if fixup failed.
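The fix-up described above can be sketched as follows. This is a minimal illustration, not the Heritrix implementation. It assumes the URL is the first field of the header line, so that any surplus tokens at the front of the list are pieces of a URL that was split on spaces. The class and method names are hypothetical, and the sketch omits any validation of the remaining fields.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Heritrix. */
final class FixSpaceInUrlSketch {

    /**
     * Rejoin URL fragments with "%20" so the list has requiredSize entries.
     * Returns the original list unchanged when there is nothing to fix.
     */
    static List<String> fixSpaceInUrl(List<String> values, int requiredSize) {
        int extra = values.size() - requiredSize;
        if (requiredSize < 1 || extra <= 0) {
            return values;
        }
        // Assumption: tokens 0..extra are all fragments of the URL field.
        StringBuilder url = new StringBuilder(values.get(0));
        for (int i = 1; i <= extra; i++) {
            url.append("%20").append(values.get(i));
        }
        List<String> fixed = new ArrayList<>(requiredSize);
        fixed.add(url.toString());
        fixed.addAll(values.subList(extra + 1, values.size()));
        return fixed;
    }
}
```

Given six tokens where five are required, the first two tokens are merged into one URL and the other four fields keep their order.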



getDeleteFileOnCloseReader
public ARCReader getDeleteFileOnCloseReader(File f)
Returns an ArchiveReader that will delete a local file on close. Used when we bring Archive files local and need to clean up afterward.



getDotFileExtension
public String getDotFileExtension()



getFileExtension
public String getFileExtension()



getVersion
public String getVersion()
Returns version of this ARC file. Usually read from first record of ARC. If we're reading without having first read the first record (e.g. random access into the middle of an ARC), then the version will not have been set. For now, we return a default, version 1.1. Later, if there is more than one version of ARC, we could look at something such as the meta line to see what version of ARC this is.
Returns:
  Version of this ARC file.
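The fallback behaviour described above can be sketched as follows. This is a minimal illustration with hypothetical names (`VersionDefaultSketch`, `DEFAULT_VERSION`), not the Heritrix source. It only shows the idea that an unset version yields the default "1.1".

```java
/** Hypothetical holder for illustration; not part of Heritrix. */
final class VersionDefaultSketch {

    /** Default reported when the first record has not been read. */
    private static final String DEFAULT_VERSION = "1.1";

    /** Stays null until the first record of the ARC has been parsed. */
    private String version;

    void setVersion(String version) {
        this.version = version;
    }

    String getVersion() {
        return version != null ? version : DEFAULT_VERSION;
    }
}
```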



gotoEOR
protected void gotoEOR(ArchiveRecord record) throws IOException
Skip over any trailing new lines at end of the record so we're lined up ready to read the next.
Parameters:
  record -
throws:
  IOException -
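Skipping trailing new lines so that the stream is lined up on the next record can be sketched as follows. This is a minimal illustration, not the Heritrix implementation. It assumes a `PushbackInputStream` so that the first byte that is not a newline can be returned to the stream, and it treats only '\n' as a newline. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

/** Hypothetical helper for illustration; not part of Heritrix. */
final class SkipTrailingNewlinesSketch {

    /**
     * Consume consecutive '\n' bytes and leave the stream positioned on the
     * first byte that is not a newline. Returns the number of bytes skipped.
     */
    static int skipNewlines(PushbackInputStream in) throws IOException {
        int skipped = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (c != '\n') {
                // Not part of the record trailer: put it back for the next reader.
                in.unread(c);
                break;
            }
            skipped++;
        }
        return skipped;
    }
}
```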



isAlignedOnFirstRecord
protected boolean isAlignedOnFirstRecord()



isDate
protected boolean isDate(String date)



isLegitimateIPValue
protected boolean isLegitimateIPValue(String ip)



isNumber
protected boolean isNumber(String n)



isParseHttpHeaders
public boolean isParseHttpHeaders()
Returns the parseHttpHeaders value.



main
public static void main(String[] args) throws ParseException, IOException, java.text.ParseException
Command-line interface to ARCReader. Here is the command-line interface:
 usage: java org.archive.io.arc.ARCReader [--offset=#] ARCFILE
 -h,--help      Prints this message and exits.
 -o,--offset    Outputs record at this offset into arc file.

See $HERITRIX_HOME/bin/arcreader for a script that takes care of classpaths and the calling of ARCReader.

Outputs using a pseudo-CDX format as described in the CDX Legend and Example documents. The legend used is: 'CDX b e a m s c V (or v if uncompressed) n g'. Hash is hard-coded straight SHA-1 hash of content.
Parameters:
  args - Command-line arguments.
throws:
  ParseException - Failed parse of the command line.
  IOException -
  java.text.ParseException -
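A "straight SHA-1 hash of content", as mentioned above, can be computed with the JDK's `MessageDigest`. The sketch below is an illustration only: the class `ContentSha1Sketch` is hypothetical, and the lowercase hex encoding is an assumption chosen for readability, since the text does not say how the hash is encoded in the CDX output.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration; not part of Heritrix. */
final class ContentSha1Sketch {

    /** SHA-1 digest of the content bytes, rendered as lowercase hex (encoding is an assumption). */
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```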




output
protected boolean output(String format) throws IOException, java.text.ParseException



output
protected static void output(ARCReader reader, String format) throws IOException, java.text.ParseException
Write out the arcfile.
Parameters:
  reader -
  format - Format to use outputting.
throws:
  IOException -
  java.text.ParseException -



outputRecord
public boolean outputRecord(String format) throws IOException



setAlignedOnFirstRecord
protected void setAlignedOnFirstRecord(boolean alignedOnFirstRecord)



setParseHttpHeaders
public void setParseHttpHeaders(boolean parse)
Parameters:
  parse - The parseHttpHeaders to set.



Fields inherited from org.archive.io.ArchiveReader
final public static int MAX_ALLOWED_RECOVERABLES

Methods inherited from org.archive.io.ArchiveReader
protected void cdxOutput(boolean toFile) throws IOException
protected void cleanupCurrentRecord() throws IOException
public void close() throws IOException
abstract protected ArchiveRecord createArchiveRecord(InputStream is, long offset) throws IOException
protected ArchiveRecord currentRecord(ArchiveRecord currentRecord)
abstract public void dump(boolean compress) throws IOException, java.text.ParseException
public ArchiveRecord get(long offset) throws IOException
public ArchiveRecord get() throws IOException
protected ArchiveRecord getCurrentRecord()
abstract public ArchiveReader getDeleteFileOnCloseReader(File f)
abstract public String getDotFileExtension()
abstract public String getFileExtension()
public String getFileName()
protected InputStream getIn()
protected InputStream getInputStream(File f, long offset) throws IOException
protected InputStream getInputStream()
protected Logger getLogger()
protected static Options getOptions()
public String getReaderIdentifier()
public String getStrippedFileName()
public static String getStrippedFileName(String name, String dotFileExtension)
protected static boolean getTrueOrFalse(String value)
public String getVersion()
abstract protected void gotoEOR(ArchiveRecord record) throws IOException
protected void initialize(String i)
public boolean isCompressed()
public boolean isDigest()
public boolean isStrict()
public boolean isValid()
public Iterator<ArchiveRecord> iterator()
public void logStdErr(Level level, String message)
protected boolean output(String format) throws IOException, java.text.ParseException
public boolean outputRecord(String format) throws IOException
protected static void outputRecord(ArchiveReader r, String format) throws IOException
protected void rewind() throws IOException
protected void setCompressed(boolean compressed)
public void setDigest(boolean d)
protected void setIn(InputStream in)
protected void setReaderIdentifier(String i)
public void setStrict(boolean s)
protected void setVersion(String version)
protected static String stripExtension(String name, String ext)
public List validate() throws IOException
public List validate(int noRecords) throws IOException

Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException
