org.archive.extractor |
|
Java Source File Name | Type | Comment |
CharSequenceLinkExtractor.java | Class | Abstract superclass providing utility methods for LinkExtractors that
prefer to work on a CharSequence rather than a stream.
ROUGH DRAFT IN PROGRESS / incomplete... |
CharSequenceProvider.java | Interface | Interface indicating an object can efficiently provide a
(perhaps cached or simulated) CharSequence version of itself. |
ExtractErrorListener.java | Interface | ExtractErrorListener receives exceptions that may need to be logged
from inside a LinkExtractor, allowing the extraction to continue
without raising an exception through hasNext()/next()/nextLink(). |
LinkExtractor.java | Interface | LinkExtractor is a general interface for classes which, when given an
InputStream and Charset, can scan for Links and return them via
an Iterator interface.
Implementors may in fact complete all extraction on the first
hasNext(), then trickle Links out from an internal collection,
depending on whether the link-extraction technique used is amenable
to incremental scanning.
ROUGH DRAFT IN PROGRESS / incomplete... |
RegexpCSSLinkExtractor.java | Class | This extractor parses URIs from CSS files.
The format of a CSS URL value is 'url(' followed by optional white space
followed by an optional single quote (') or double quote (") character
followed by the URL itself followed by an optional single quote (') or
double quote (") character followed by optional white space followed by ')'.
Parentheses, commas, white space characters, single quotes (') and double
quotes (") appearing in a URL must be escaped with a backslash:
'\(', '\)', '\,'. |
RegexpHTMLLinkExtractor.java | Class | Basic link extraction from an HTML content body,
using regular expressions.
ROUGH DRAFT IN PROGRESS / incomplete... |
RegexpJSLinkExtractor.java | Class | Uses regular expressions to find likely URIs inside JavaScript.
ROUGH DRAFT IN PROGRESS / incomplete... |
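
The CSS 'url(...)' format described for RegexpCSSLinkExtractor, together with the "complete all extraction on the first hasNext(), then trickle Links out" behavior noted for LinkExtractor, can be illustrated with a minimal sketch. This is not the code in org.archive.extractor: the class name CssUrlSketch, the pattern, and the use of plain String results instead of Link objects are all assumptions made for illustration. The pattern is also more lenient than the description, since it accepts unescaped commas.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not the actual RegexpCSSLinkExtractor. */
public class CssUrlSketch implements Iterator<String> {
    // 'url(' + optional whitespace + optional quote + URL + optional quote
    // + optional whitespace + ')'. The URL body is a run of backslash-escaped
    // characters or characters that are not whitespace, quotes, parentheses,
    // or a backslash.
    private static final Pattern CSS_URL = Pattern.compile(
        "url\\(\\s*[\"']?((?:\\\\.|[^\\s\"'()\\\\])+)[\"']?\\s*\\)",
        Pattern.CASE_INSENSITIVE);

    private final CharSequence css;
    private Queue<String> pending; // stays null until the first hasNext()

    public CssUrlSketch(CharSequence css) {
        this.css = css;
    }

    @Override
    public boolean hasNext() {
        if (pending == null) {
            // Do all extraction up front, then hand results out one by one.
            pending = new ArrayDeque<>();
            Matcher m = CSS_URL.matcher(css);
            while (m.find()) {
                // Drop the escaping backslash: '\(' becomes '('.
                pending.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            }
        }
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return pending.remove();
    }
}
```

A caller would construct it over the stylesheet text, for example new CssUrlSketch(cssText), and loop with hasNext()/next(). Working on a CharSequence rather than an InputStream mirrors the approach that CharSequenceLinkExtractor is described as supporting.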