org.archive.crawler.extractor |
|
Java Source File Name | Type | Comment |
AggressiveExtractorHTML.java | Class | Extended version of ExtractorHTML with more aggressive JavaScript link
extraction: JavaScript code is parsed first with the general HTML tags
regexp, and then with the speculative JavaScript link regexp. |
ChangeEvaluator.java | Class | This processor compares the CrawlURI's current content digest
(see org.archive.crawler.datamodel.CrawlURI.getContentDigest) with the one from a previous crawl. |
CrawlUriSWFAction.java | Class | SWF action that handles discovered URIs. |
CustomSWFTags.java | Class | Overrides the action tags that may hold URIs so that they use the
CrawlUriSWFAction action. |
Extractor.java | Class | Convenience shared superclass for Extractor Processors.
Currently only wraps Extractor-specific extract() action with
a StackOverflowError catch/log/proceed handler, so that any
extractors that recurse too deep on problematic input will
only suffer a local error, and other normal CrawlURI processing
can continue. |
ExtractorCSS.java | Class | This extractor parses URIs from CSS files.
The format of a CSS URL value is 'url(' followed by optional white space
followed by an optional single quote (') or double quote (") character
followed by the URL itself followed by an optional single quote (') or
double quote (") character followed by optional white space followed by ')'.
Parentheses, commas, white space characters, single quotes (') and double
quotes (") appearing in a URL must be escaped with a backslash:
'\(', '\)', '\,'. |
ExtractorDOC.java | Class | This class allows the caller to extract href-style links from Word 97-format Word documents. |
ExtractorHTML.java | Class | Basic link-extraction, from an HTML content-body,
using regular expressions. |
ExtractorHTMLTest.java | Class | Test html extractor. |
ExtractorHTTP.java | Class | Extracts URIs from HTTP response headers. |
ExtractorImpliedURI.java | Class | An extractor for finding 'implied' URIs inside other URIs. |
ExtractorImpliedURITest.java | Class | |
ExtractorJS.java | Class | Processes Javascript files for strings that are likely to be
crawlable URIs. |
ExtractorPDF.java | Class | |
ExtractorSWF.java | Class | Extracts URIs from SWF (flash/shockwave) files. |
ExtractorTool.java | Class | Run named extractors against passed ARC file.
This extractor tool runs suboptimally. |
ExtractorUniversal.java | Class | A last ditch extractor that will look at the raw byte code and try to extract
anything that looks like a link.
If used, it should always be specified as the last link extractor in the
order file.
To accomplish this it will scan through the bytecode and try to build up
strings of consecutive bytes that all represent characters that are valid
in a URL (see #isURLableChar(int) for details).
Once it hits the end of such a string (i.e. |
ExtractorURI.java | Class | An extractor for finding URIs inside other URIs. |
ExtractorURITest.java | Class | |
ExtractorXML.java | Class | |
HTTPContentDigest.java | Class | A processor for calculating custom HTTP content digests in place of the
default (if any) computed by the HTTP fetcher processors.
This processor allows the user to specify a regular expression called
strip-reg-expr. |
JerichoExtractorHTML.java | Class | Improved link-extraction from an HTML content-body using jericho-html parser.
This extractor extends ExtractorHTML and mimics its workflow - but has some
substantial differences when it comes to internal implementation. |
JerichoExtractorHTMLTest.java | Class | Test html extractor. |
Link.java | Class | Link represents one discovered "edge" of the web graph: the source
URI, the destination URI, and the type of reference (represented by the
context in which it was found). |
PDFParser.java | Class | Supports PDF parsing operations. |
TrapSuppressExtractor.java | Class | Pseudo-extractor that suppresses link-extraction of likely trap pages,
by noticing when content's digest is identical to that of its 'via'. |
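
The CSS URL value format described in the ExtractorCSS entry above can be illustrated with a small regular-expression sketch. This is not the pattern ExtractorCSS itself uses; the class name `CssUrlSketch`, the method `extract`, and the pattern are all hypothetical, written only to mirror the prose description ('url(' + optional white space + optional quote + URL + optional quote + optional white space + ')', with backslash escapes inside the URL).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of org.archive.crawler.extractor.
public class CssUrlSketch {
    // 'url(' + optional whitespace + optional quote + URL + optional quote
    // + optional whitespace + ')'. The URL body is a run of backslash-escaped
    // characters, or characters that are not whitespace, quotes, parentheses
    // or backslashes. Unescaped commas are tolerated here for leniency.
    private static final Pattern CSS_URL = Pattern.compile(
        "url\\(\\s*[\"']?((?:\\\\.|[^\\s\"'()\\\\])+)[\"']?\\s*\\)",
        Pattern.CASE_INSENSITIVE);

    static List<String> extract(CharSequence css) {
        List<String> found = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            // Drop the backslash from escapes such as '\(' or '\,'.
            found.add(m.group(1).replaceAll("\\\\(.)", "$1"));
        }
        return found;
    }

    public static void main(String[] args) {
        String css = "body { background: url( 'img/bg.png' ) } "
                   + "a { cursor: url(\"c.cur\") } "
                   + "p { background: url(a\\(1\\).gif) }";
        System.out.println(extract(css)); // prints [img/bg.png, c.cur, a(1).gif]
    }
}
```

The sketch returns the raw URL strings only; resolving them against the stylesheet's base URI and attaching them to a CrawlURI is left out.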