| org.archive.crawler.extractor.Extractor org.archive.crawler.extractor.ExtractorURI
ExtractorURI | public class ExtractorURI extends Extractor implements CoreAttributeConstants | | An extractor for finding URIs inside other URIs. Unlike most other
extractors, this one works on URIs discovered by previous extractors, so
it should appear near the end of any set of extractors.
In its initial form, it finds only absolute HTTP(S) URIs in the query string
or in its individual parameters.
TODO: extend to find URIs in path-info
author: Gordon Mohr |
ABS_HTTP_URI_PATTERN | final static String ABS_HTTP_URI_PATTERN | | |
ExtractorURI | public ExtractorURI(String name) | | Constructor
Parameters: name - |
extract | public void extract(CrawlURI curi) | | Perform usual extraction on a CrawlURI
Parameters: curi - Crawl URI to process. |
extractLink | protected void extractLink(CrawlURI curi, Link wref) | | Consider a single Link for internal URIs
Parameters: curi - CrawlURI to add discoveries to; wref - Link to examine for internal URIs |
extractQueryStringLinks | protected static List<String> extractQueryStringLinks(UURI source) | | Look for URIs inside the supplied UURI.
Static for ease of testing or outside use.
Parameters: source - UURI to examine Returns: List of discovered String URIs. |
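
The idea behind extractQueryStringLinks can be sketched in plain Java. This is an illustrative sketch, not the class's actual code: the value of ABS_HTTP_URI_PATTERN is not shown on this page, so the regular expression below is an assumption, `java.net.URI` stands in for UURI, and the names `QueryStringUriSketch`, `findQueryStringUris`, and `consider` are hypothetical. The sketch checks the whole query string and then each parameter value, trying the raw text first and the percent-decoded text second.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in for the query-string scan; not the Heritrix source.
class QueryStringUriSketch {

    // Assumed pattern: the real ABS_HTTP_URI_PATTERN value is not documented above.
    static final Pattern ABS_HTTP_URI =
            Pattern.compile("https?://[^\\s<>]+", Pattern.CASE_INSENSITIVE);

    // Returns absolute HTTP(S) URIs found in the query string or its parameter values.
    static List<String> findQueryStringUris(URI source) {
        Set<String> found = new LinkedHashSet<>(); // keeps order, drops duplicates
        String query = source.getRawQuery();
        if (query == null || query.isEmpty()) {
            return new ArrayList<>(found);
        }
        // First candidate: the query string as a whole.
        consider(query, found);
        // Then each parameter value (text after the first '=' in each '&' piece).
        for (String param : query.split("&")) {
            int eq = param.indexOf('=');
            consider(eq >= 0 ? param.substring(eq + 1) : param, found);
        }
        return new ArrayList<>(found);
    }

    // Adds the candidate if it is an absolute HTTP(S) URI, raw or after decoding.
    private static void consider(String candidate, Set<String> found) {
        if (ABS_HTTP_URI.matcher(candidate).matches()) {
            found.add(candidate);
            return;
        }
        try {
            // URLDecoder.decode(String, Charset) requires Java 10 or later.
            String decoded = URLDecoder.decode(candidate, StandardCharsets.UTF_8);
            if (ABS_HTTP_URI.matcher(decoded).matches()) {
                found.add(decoded);
            }
        } catch (IllegalArgumentException e) {
            // Malformed percent-escape: skip this candidate.
        }
    }

    public static void main(String[] args) {
        URI source = URI.create("http://crawl.example/jump"
                + "?redirect=http://example.com/a"
                + "&next=https%3A%2F%2Fexample.org%2Fb&id=7");
        System.out.println(findQueryStringUris(source));
    }
}
```

Keeping the method static and free of crawler state, as the description above notes for the real method, means it can be exercised directly with a URI and no CrawlURI.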