org.archive.crawler.extractor.Extractor
   org.archive.crawler.extractor.ExtractorHTML
All known subclasses: org.archive.crawler.extractor.AggressiveExtractorHTML, org.archive.crawler.extractor.JerichoExtractorHTML
ExtractorHTML | public class ExtractorHTML extends Extractor implements CoreAttributeConstants | Basic link extraction from an HTML content body, using regular expressions.
author: gojomo
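The regular-expression approach described above can be illustrated with a minimal, standalone sketch. It does not use the Heritrix classes; the class name `RegexLinkSketch`, the method `extractLinks`, and the pattern are hypothetical and far simpler than whatever RELEVANT_TAG_EXTRACTOR and EACH_ATTRIBUTE_EXTRACTOR actually contain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLinkSketch {
    // Hypothetical, simplified pattern (not the class's RELEVANT_TAG_EXTRACTOR):
    // an href or src attribute whose value is wrapped in matching quotes.
    private static final Pattern LINK_ATTR = Pattern.compile(
            "(?i)\\b(href|src)\\s*=\\s*([\"'])(.*?)\\2", Pattern.DOTALL);

    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            links.add(m.group(3)); // group 3 is the attribute value without quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a> <img src='logo.gif'>";
        System.out.println(extractLinks(html));
    }
}
```

A regex scan like this tolerates malformed markup that a strict parser might reject, at the cost of occasionally matching text that is not a real attribute.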
Method Summary | |
public void | extract(CrawlURI curi) |
void | extract(CrawlURI curi, CharSequence cs) | Run extractor. This method is package visible to ease testing. Parameters: curi - CrawlURI we're processing. cs - Sequence from underlying ReplayCharSequence.
protected boolean | isHtmlExpectedHere(CrawlURI curi) | Test whether this HTML is so unexpected (e.g. in place of a GIF URI) that it shouldn't be scanned for links. Parameters: curi - CrawlURI to examine.
final protected void | processEmbed(CrawlURI curi, CharSequence value, CharSequence context) |
protected void | processEmbed(CrawlURI curi, CharSequence value, CharSequence context, char hopType) |
protected void | processGeneralTag(CrawlURI curi, CharSequence element, CharSequence cs) |
protected void | processLink(CrawlURI curi, CharSequence value, CharSequence context) | Handle generic HREF cases.
protected boolean | processMeta(CrawlURI curi, CharSequence cs) | Process metadata tags. Parameters: curi - CrawlURI we're processing. cs - Sequence from underlying ReplayCharSequence.
protected void | processScript(CrawlURI curi, CharSequence sequence, int endOfOpenTag) |
protected void | processScriptCode(CrawlURI curi, CharSequence cs) | Extract the (java)script source in the given CharSequence.
protected void | processStyle(CrawlURI curi, CharSequence sequence, int endOfOpenTag) | Process style text. Parameters: curi - CrawlURI we're processing. sequence - Sequence from underlying ReplayCharSequence.
public String | report() |
ATTR_EXTRACT_JAVASCRIPT | final public static String ATTR_EXTRACT_JAVASCRIPT | Whether to try finding links in JavaScript; default true.
ATTR_IGNORE_FORM_ACTION_URLS | final public static String ATTR_IGNORE_FORM_ACTION_URLS |
ATTR_IGNORE_UNEXPECTED_HTML | final public static String ATTR_IGNORE_UNEXPECTED_HTML |
ATTR_OVERLY_EAGER_LINK_DETECTION | final public static String ATTR_OVERLY_EAGER_LINK_DETECTION |
ATTR_TREAT_FRAMES_AS_EMBED_LINKS | final public static String ATTR_TREAT_FRAMES_AS_EMBED_LINKS |
EACH_ATTRIBUTE_EXTRACTOR | final static String EACH_ATTRIBUTE_EXTRACTOR |
MAX_ATTR_VAL_LENGTH | final static int MAX_ATTR_VAL_LENGTH |
NON_HTML_PATH_EXTENSION | final static String NON_HTML_PATH_EXTENSION |
RELEVANT_TAG_EXTRACTOR | final static String RELEVANT_TAG_EXTRACTOR |
numberOfCURIsHandled | protected long numberOfCURIsHandled |
numberOfLinksExtracted | protected long numberOfLinksExtracted |
extract | void extract(CrawlURI curi, CharSequence cs) | Run extractor.
This method is package visible to ease testing.
Parameters: curi - CrawlURI we're processing.
Parameters: cs - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.
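The warning about transient data can be made concrete with a small sketch. It assumes the sequence behaves like a reusable buffer; a `StringBuilder` stands in for the ReplayCharSequence, and the helper name `snapshot` is hypothetical. Calling `toString()` on a `StringBuilder` yields an independent `String`, so later changes to the buffer do not affect the copy.

```java
class TransientCopySketch {
    // Returns a String holding the characters currently in cs.
    // For a StringBuilder this is an independent copy.
    static String snapshot(CharSequence cs) {
        return cs.toString();
    }

    public static void main(String[] args) {
        // StringBuilder stands in for a reusable replay buffer (an assumption).
        StringBuilder buffer = new StringBuilder("<html>first</html>");
        String kept = snapshot(buffer);
        buffer.setLength(0);                  // buffer is recycled
        buffer.append("<html>second</html>"); // and refilled with other content
        System.out.println(kept);             // still the first document
    }
}
```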
isHtmlExpectedHere | protected boolean isHtmlExpectedHere(CrawlURI curi) throws URIException | Test whether this HTML is so unexpected (e.g. in place of a GIF URI) that it shouldn't be scanned for links.
Parameters: curi - CrawlURI to examine.
Returns: true if HTML is acceptable/expected here.
Throws: URIException
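One plausible way to perform such a check is to look at the path extension of the URI, which is what the NON_HTML_PATH_EXTENSION constant name suggests. The sketch below is an assumption, not the actual implementation: the extension list is hypothetical (the real constant's value is not shown here), and a plain `String` replaces the CrawlURI.

```java
import java.net.URI;
import java.util.regex.Pattern;

class UnexpectedHtmlSketch {
    // Hypothetical extension list; the real NON_HTML_PATH_EXTENSION may differ.
    private static final Pattern NON_HTML_EXT =
            Pattern.compile("(?i)gif|jpe?g|png|bmp|mp3|swf");

    static boolean isHtmlExpectedHere(String uri) {
        String path = URI.create(uri).getPath();
        if (path == null) {
            return true; // opaque URI: no path evidence against HTML
        }
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot < path.lastIndexOf('/')) {
            return true; // no extension on the last path segment
        }
        String ext = path.substring(dot + 1);
        return !NON_HTML_EXT.matcher(ext).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHtmlExpectedHere("http://example.com/img/logo.gif"));
        System.out.println(isHtmlExpectedHere("http://example.com/index.html"));
    }
}
```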
processMeta | protected boolean processMeta(CrawlURI curi, CharSequence cs) | Process metadata tags.
Parameters: curi - CrawlURI we're processing.
Parameters: cs - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.
Returns: true if robots exclusion metatag.
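As a rough illustration of detecting a robots exclusion metatag, the sketch below parses the attributes of a single meta tag and reports whether it forbids following links. Which directives the real method treats as exclusion is not stated in this page; treating `nofollow` and `none` that way is an assumption, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRobotsSketch {
    // Simplified attribute pattern: name = quoted value.
    private static final Pattern ATTR =
            Pattern.compile("(?i)([a-z-]+)\\s*=\\s*([\"'])(.*?)\\2");

    static boolean isRobotsExclusion(CharSequence metaTag) {
        Map<String, String> attrs = new HashMap<>();
        Matcher m = ATTR.matcher(metaTag);
        while (m.find()) {
            attrs.put(m.group(1).toLowerCase(Locale.ROOT), m.group(3));
        }
        if (!"robots".equalsIgnoreCase(attrs.get("name"))) {
            return false; // some other kind of meta tag
        }
        String content = attrs.getOrDefault("content", "").toLowerCase(Locale.ROOT);
        for (String part : content.split(",")) {
            String directive = part.trim();
            // Assumption: these two directives count as exclusion.
            if (directive.equals("nofollow") || directive.equals("none")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isRobotsExclusion(
                "<meta name=\"robots\" content=\"noindex, nofollow\">"));
    }
}
```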
processScriptCode | protected void processScriptCode(CrawlURI curi, CharSequence cs) | Extract the (java)script source in the given CharSequence.
Parameters: curi - source CrawlURI
Parameters: cs - CharSequence of javascript code
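Finding links in script code is necessarily heuristic. One simple approach, sketched here under stated assumptions, is to scan for quoted string literals and keep those that look like URIs. The heuristic (absolute http/https URLs or root-relative paths), the class name, and the method name are all hypothetical and are not taken from the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptStringSketch {
    // A quoted literal containing no whitespace and no quote characters.
    private static final Pattern JS_STRING =
            Pattern.compile("([\"'])([^\"'\\s]+)\\1");

    static List<String> likelyUris(CharSequence code) {
        List<String> found = new ArrayList<>();
        Matcher m = JS_STRING.matcher(code);
        while (m.find()) {
            String literal = m.group(2);
            // Hypothetical heuristic for "looks like a URI".
            if (literal.startsWith("http://") || literal.startsWith("https://")
                    || literal.startsWith("/")) {
                found.add(literal);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String js = "var a = 'http://example.com/x.js'; var b = \"hello\"; "
                + "location = \"/next.html\";";
        System.out.println(likelyUris(js));
    }
}
```

Heuristics of this kind trade precision for recall: a tighter test misses links, a looser one produces speculative URIs that may not exist.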
processStyle | protected void processStyle(CrawlURI curi, CharSequence sequence, int endOfOpenTag) | Process style text.
Parameters: curi - CrawlURI we're processing.
Parameters: sequence - Sequence from underlying ReplayCharSequence. This is TRANSIENT data. Make a copy if you want the data to live outside of this extractor's lifetime.
Parameters: endOfOpenTag -
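Style text typically carries links in CSS `url(...)` tokens. The sketch below shows one way to pull them out with a regular expression. It assumes that `endOfOpenTag` is the offset just past the opening style tag, which the page does not document, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StyleUrlSketch {
    // url( ... ) with an optional pair of matching quotes around the value.
    private static final Pattern CSS_URL =
            Pattern.compile("(?i)url\\(\\s*([\"']?)(.*?)\\1\\s*\\)");

    static List<String> extractStyleUrls(CharSequence sequence, int endOfOpenTag) {
        List<String> urls = new ArrayList<>();
        // Assumption: style text starts at endOfOpenTag.
        CharSequence styleText = sequence.subSequence(endOfOpenTag, sequence.length());
        Matcher m = CSS_URL.matcher(styleText);
        while (m.find()) {
            urls.add(m.group(2)); // value without the surrounding quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        String s = "<style>body{background:url(\"bg.png\")} @import url(a.css);</style>";
        System.out.println(extractStyleUrls(s, s.indexOf('>') + 1));
    }
}
```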