| org.archive.crawler.extractor.Extractor org.archive.crawler.extractor.ExtractorImpliedURI
ExtractorImpliedURI | public class ExtractorImpliedURI extends Extractor implements CoreAttributeConstants | | An extractor for finding 'implied' URIs inside other URIs. If the
'trigger' regex is matched, a new URI will be constructed from the
'build' replacement pattern.
Unlike most other extractors, this works on URIs discovered by
previous extractors. Thus it should appear near the end of any
set of extractors.
Initially, it finds only absolute HTTP(S) URIs in the query string or its
parameters.
TODO: extend to find URIs in path-info
author: Gordon Mohr |
ATTR_BUILD_PATTERN | final public static String ATTR_BUILD_PATTERN | | replacement pattern used to build 'implied' URI
|
ATTR_REMOVE_TRIGGER_URIS | final public static String ATTR_REMOVE_TRIGGER_URIS | | whether to remove URIs that trigger addition of 'implied' URI;
default false
|
ATTR_TRIGGER_REGEXP | final public static String ATTR_TRIGGER_REGEXP | | regex which when matched triggers addition of 'implied' URI
|
ExtractorImpliedURI | public ExtractorImpliedURI(String name) | | Constructor
Parameters: name - |
extract | public void extract(CrawlURI curi) | | Perform usual extraction on a CrawlURI
Parameters: curi - Crawl URI to process. |
extractImplied | protected static String extractImplied(CharSequence uri, String trigger, String build) | | Utility method for extracting an 'implied' URI given a source URI,
a trigger pattern, and a build pattern.
Parameters: uri - source to check for an implied URI
Parameters: trigger - regex pattern which, if matched, implies another URI
Parameters: build - replacement pattern used to build the implied URI
Returns: the implied URI, or null if none |
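The trigger/build contract described for extractImplied can be illustrated with plain java.util.regex. The following is a minimal sketch, not the actual Heritrix source: the class name ImpliedUriSketch, the method name impliedUri, and the example URIs and patterns are hypothetical, and the sketch assumes the trigger must match the whole source URI, which this page does not state.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the documented extractImplied contract.
final class ImpliedUriSketch {

    // Returns the URI built from 'build' when 'trigger' matches 'uri', else null.
    // Assumption: the trigger has to match the entire source URI.
    static String impliedUri(CharSequence uri, String trigger, String build) {
        Matcher m = Pattern.compile(trigger).matcher(uri);
        if (!m.matches()) {
            return null; // no implied URI
        }
        // Expand group references such as $1 in the build pattern
        // against the match that just succeeded.
        StringBuffer out = new StringBuffer();
        m.appendReplacement(out, build);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical trigger: capture an absolute HTTP(S) URI in a 'url' parameter.
        String trigger = "^.*[?&]url=(https?://[^&]+).*$";
        String source = "http://redirect.example.org/go?url=http://example.com/page&ref=1";
        System.out.println(impliedUri(source, trigger, "$1"));
        System.out.println(impliedUri("http://example.com/plain", trigger, "$1"));
    }
}
```

With these example inputs, the first call yields http://example.com/page and the second yields null, mirroring the documented "implied URI, or null if none" behavior.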