| java.lang.Object org.archive.io.ArchiveReader org.archive.io.arc.ARCReader
ARCReader | abstract public class ARCReader extends ArchiveReader implements ARCConstants(Code) | | Get an iterator on an ARC file or get a record by absolute position.
ARC files are described here: Arc File Format.
This class knows how to parse an ARC file. Pass it a file path
or a URL to an ARC. It can parse ARC Version 1 and 2.
The iterator returns ARCRecord instances, but Iterator.next is declared
to return java.lang.Object, so cast the return value.
Profiling java.io vs. memory-mapped ByteBufferInputStream shows the
latter slightly slower -- but not by much. TODO: Test more. To switch
between the two, just change ARCReader.getInputStream(File,long).
author: stack version: $Date: 2007-04-06 00:29:39 +0000 (Fri, 06 Apr 2007) $ $Revision: 5039 $ |
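The note above compares plain java.io reads with a memory-mapped stream when opening an ARC at an absolute offset. The following is a minimal sketch of both approaches, assuming nothing about how ARCReader.getInputStream(File,long) is actually written. The class OffsetStreams and all of its methods are hypothetical names introduced for illustration, and the sketch makes no claim about relative performance.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical helper for illustration; not part of org.archive.io. */
class OffsetStreams {

    /** java.io variant: position the underlying channel, then read as a stream. */
    static InputStream openWithJavaIo(File f, long offset) throws IOException {
        FileInputStream in = new FileInputStream(f);
        in.getChannel().position(offset);
        return in;
    }

    /** Memory-mapped variant: map the file from the offset to its end. */
    static InputStream openMemoryMapped(File f, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r");
             FileChannel channel = raf.getChannel()) {
            // A single mapping is limited to Integer.MAX_VALUE bytes.
            final ByteBuffer buf = channel.map(
                FileChannel.MapMode.READ_ONLY, offset, channel.size() - offset);
            // The mapping stays valid after the channel is closed.
            return new InputStream() {
                @Override
                public int read() {
                    return buf.hasRemaining() ? (buf.get() & 0xff) : -1;
                }
            };
        }
    }

    /** Reads up to n bytes from the stream as ASCII text, then closes it. */
    static String readAscii(InputStream in, int n) throws IOException {
        try (InputStream stream = in) {
            byte[] bytes = new byte[n];
            int total = 0;
            int read;
            while (total < n && (read = stream.read(bytes, total, n - total)) > 0) {
                total += read;
            }
            return new String(bytes, 0, total, StandardCharsets.US_ASCII);
        }
    }

    /** Writes a small temp file and reads 3 bytes at offset 4 both ways. */
    static String[] demo() {
        try {
            File f = File.createTempFile("offset-demo", ".bin");
            f.deleteOnExit();
            Files.write(f.toPath(), "0123456789".getBytes(StandardCharsets.US_ASCII));
            return new String[] {
                readAscii(openWithJavaIo(f, 4), 3),
                readAscii(openMemoryMapped(f, 4), 3)
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String s : demo()) {
            System.out.println(s);
        }
    }
}
```

Both variants hand back an InputStream, so a caller that reads records from a stream would not need to know which one is in use.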
createArchiveRecord | protected ARCRecord createArchiveRecord(InputStream is, long offset) throws IOException(Code) | | Create new arc record.
Encapsulates the housekeeping involved in creating a new record.
Call this method at the end of the constructor to read in the
arcfile header. There will be problems reading subsequent arc records
if you don't, since the arcfile header has the list of metadata fields
for all records that follow.
When parsing through ARCs writing out CDX info, we spend about
38% of CPU in here -- about 30% of which is in getTokenizedHeaderLine
-- of which 16% is reading.
Parameters: is - InputStream to use. Parameters: offset - Absolute offset into arc file. Returns: An arc record. throws: IOException - |
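The description above says the arcfile header carries the list of metadata fields for every record that follows. As a rough illustration of why that list must be read first, the sketch below pairs field names with the tokens of one record header line. HeaderFieldsSketch, tokenize and toMetadata are hypothetical names, not ARCReader methods, and the field names and record values used in main are sample data chosen for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not the ARCReader implementation. */
class HeaderFieldsSketch {

    /** Splits a single header line on runs of spaces. */
    static List<String> tokenize(String line) {
        return Arrays.asList(line.trim().split(" +"));
    }

    /** Pairs field names (from the file header) with one record's values. */
    static Map<String, String> toMetadata(List<String> fieldNames, List<String> values) {
        if (fieldNames.size() != values.size()) {
            throw new IllegalArgumentException(
                "Expected " + fieldNames.size() + " values but got " + values.size());
        }
        // LinkedHashMap keeps the fields in header order.
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size(); i++) {
            metadata.put(fieldNames.get(i), values.get(i));
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Sample data only: a field list and one matching record header line.
        List<String> names = tokenize("URL IP-address Archive-date Content-type Archive-length");
        List<String> values = tokenize("http://example.com/ 192.0.2.1 20070406002939 text/html 1234");
        System.out.println(toMetadata(names, values));
    }
}
```

Without the field list there is nothing to tell the reader which token is the URL and which is the length, which is the failure mode the method description warns about.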
fixSpaceInURL | protected List<String> fixSpaceInURL(List<String> values, int requiredSize)(Code) | | Fix space in URLs.
The ARCWriter used to write URLs containing spaces into the ARC.
See [ 1010966 ] "crawl.log has URIs with spaces in them".
This method fixes up such headers, converting all spaces found
to '%20'.
Parameters: values - List of metadata values. Parameters: requiredSize - Expected size of resultant values list. Returns: New list if we successfully fixed up values or original if fixup failed. |
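One way to picture the fix-up described above: when a header line was split on spaces and the URL itself contained spaces, the list has more tokens than expected, and the surplus leading tokens can be rejoined with '%20'. The sketch below is an illustration under two assumptions that the source does not state: the URL is the first field, and the remaining requiredSize - 1 fields contain no spaces. SpaceInUrlSketch is a hypothetical class, not the ARCReader code, and it skips any validation of the trailing fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not the ARCReader implementation. */
class SpaceInUrlSketch {

    /**
     * Rejoins surplus leading tokens into a single URL using "%20".
     * Returns the original list unchanged when there is nothing to fix.
     */
    static List<String> fixSpaceInUrl(List<String> values, int requiredSize) {
        if (requiredSize < 1 || values.size() <= requiredSize) {
            return values;
        }
        // Assumption: every field after the URL is a single token.
        int trailingFields = requiredSize - 1;
        int urlTokens = values.size() - trailingFields;
        List<String> fixed = new ArrayList<>(requiredSize);
        fixed.add(String.join("%20", values.subList(0, urlTokens)));
        fixed.addAll(values.subList(urlTokens, values.size()));
        return fixed;
    }

    public static void main(String[] args) {
        // Sample data only: a URL that was split in two by a space.
        List<String> raw = Arrays.asList(
            "http://example.com/a", "b.html",
            "192.0.2.1", "20070406002939", "text/html", "1234");
        System.out.println(fixSpaceInUrl(raw, 5));
    }
}
```

A more careful version would probably check that the trailing fields look plausible before trusting the merge, and return the original list when they do not, which would match the "original if fixup failed" behaviour described above.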
getDeleteFileOnCloseReader | public ARCReader getDeleteFileOnCloseReader(File f)(Code) | | Returns an ArchiveReader that will delete a local file on close. Used when we bring Archive files local and need to clean up afterward. |
getDotFileExtension | public String getDotFileExtension()(Code) | | |
getVersion | public String getVersion()(Code) | | Returns version of this ARC file. Usually read from first record of ARC.
If we're reading without having first read the first record -- e.g.
random access into middle of an ARC -- then version will not have been
set. For now, we return a default, version 1.1. Later, if there is more
than one version of ARC, we could look at something such as the meta line
to see what version of ARC this is.
Returns: Version of this ARC file. |
gotoEOR | protected void gotoEOR(ArchiveRecord record) throws IOException(Code) | | Skip over any trailing new lines at end of the record so we're lined up
ready to read the next.
Parameters: record - throws: IOException - |
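Skipping trailing newlines so the stream is lined up on the next record can be sketched with a PushbackInputStream: consume newline bytes, then push back the first byte that is not one. This is an illustration only. SkipNewlinesSketch and its methods are hypothetical, and the sketch does not show how ARCReader treats unexpected bytes between records.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper for illustration; not the ARCReader implementation. */
class SkipNewlinesSketch {

    /** Consumes consecutive '\n' bytes and returns how many were skipped. */
    static int skipTrailingNewlines(PushbackInputStream in) throws IOException {
        int skipped = 0;
        int c;
        while ((c = in.read()) == '\n') {
            skipped++;
        }
        if (c != -1) {
            // Put back the first byte of whatever follows the newlines.
            in.unread(c);
        }
        return skipped;
    }

    /** Returns the first byte after any leading newlines, or -1 at end of data. */
    static int firstByteAfterNewlines(byte[] data) {
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(data))) {
            skipTrailingNewlines(in);
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {'\n', '\n', 'h', 't', 't', 'p'};
        System.out.println((char) firstByteAfterNewlines(data));
    }
}
```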
isAlignedOnFirstRecord | protected boolean isAlignedOnFirstRecord()(Code) | | |
isLegitimateIPValue | protected boolean isLegitimateIPValue(String ip)(Code) | | |
isParseHttpHeaders | public boolean isParseHttpHeaders()(Code) | | Returns the parseHttpHeaders. |
main | public static void main(String[] args) throws ParseException, IOException, java.text.ParseException(Code) | | Command-line interface to ARCReader.
Here is the command-line interface:
usage: java org.archive.io.arc.ARCReader [--offset=#] ARCFILE
-h,--help Prints this message and exits.
-o,--offset Outputs record at this offset into arc file.
See $HERITRIX_HOME/bin/arcreader for a script that takes care of
classpaths and of calling ARCReader.
Outputs using a pseudo-CDX format as described here: CDX Legend
and here: Example.
The legend used below is: 'CDX b e a m s c V (or v if uncompressed) n g'.
Hash is hard-coded straight SHA-1 hash of content.
Parameters: args - Command-line arguments. throws: ParseException - Failed parse of the command line. throws: IOException - throws: java.text.ParseException - |
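The description of main states that the hash in the CDX output is a straight SHA-1 of the record content. A minimal sketch of computing such a digest with the JDK's MessageDigest follows. Sha1Sketch is a hypothetical class, and the lowercase hexadecimal rendering is an assumption made for readability, since the source does not say how ARCReader encodes the digest in its output.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration; not the ARCReader implementation. */
class Sha1Sketch {

    /** Returns the SHA-1 digest of the content as lowercase hex (an assumed encoding). */
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha1Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```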
setAlignedOnFirstRecord | protected void setAlignedOnFirstRecord(boolean alignedOnFirstRecord)(Code) | | |
setParseHttpHeaders | public void setParseHttpHeaders(boolean parse)(Code) | | Parameters: parse - The parseHttpHeaders to set. |
Fields inherited from org.archive.io.ArchiveReader | final public static int MAX_ALLOWED_RECOVERABLES(Code)(Java Doc)
|