| org.archive.crawler.extractor.Extractor org.archive.crawler.extractor.ExtractorHTML org.archive.crawler.extractor.JerichoExtractorHTML
JerichoExtractorHTML | public class JerichoExtractorHTML extends ExtractorHTML implements CoreAttributeConstants | | Improved link extraction from an HTML content body using the jericho-html parser.
This extractor extends ExtractorHTML and mimics its workflow, but its internal
implementation differs substantially. Instead of relying heavily on Java
regular expressions, it uses a real HTML parser library, namely the
Jericho HTML Parser (http://jerichohtml.sourceforge.net).
With this parser it can better handle broken HTML (e.g. missing quotes)
and offers improved extraction of HTML form URLs (it extracts not only
the action of a form, but also its default values).
Unfortunately this parser also has one major drawback: it has to read the
whole document into memory for parsing, and thus carries an inherent risk of
an OutOfMemoryError (OOME).
This OOME risk can be reduced or eliminated by limiting the size of documents
to be parsed (e.g. using NotExceedsDocumentLengthTresholdDecideRule).
Also note that this extractor seems to have a lower overall memory
consumption than ExtractorHTML (still to be confirmed on a larger-scale
crawl).
author: Olaf Freyer version: $Date: 2006-11-15 17:57:11 +0000 (Wed, 15 Nov 2006) $ $Revision: 4726 $ |
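The size limit mentioned above can be pictured as a simple guard in front of an in-memory parser. The following is only a minimal sketch under stated assumptions: `DocumentSizeGuard`, `maxChars` and `accepts` are hypothetical names invented for illustration and are not part of Heritrix. In a real crawl the limit would more likely be configured through a decide rule such as NotExceedsDocumentLengthTresholdDecideRule.

```java
/**
 * Hypothetical helper (not part of Heritrix): decides whether a document
 * is small enough to hand to a parser that loads everything into memory.
 */
class DocumentSizeGuard {
    // Assumed limit, counted in characters of the decoded content.
    private final long maxChars;

    DocumentSizeGuard(long maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be non-negative");
        }
        this.maxChars = maxChars;
    }

    /** Returns true only for non-null content no longer than the limit. */
    boolean accepts(CharSequence cs) {
        return cs != null && cs.length() <= maxChars;
    }
}
```

A caller would check `accepts(cs)` first and skip parsing (or fall back to a streaming extractor) when it returns false.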
numberOfFormsProcessed | protected long numberOfFormsProcessed | | |
JerichoExtractorHTML | public JerichoExtractorHTML(String name) | | |
extract | void extract(CrawlURI curi, CharSequence cs) | | Run extractor. This method is package visible to ease testing.
Parameters: curi - CrawlURI we're processing; cs - Sequence from underlying ReplayCharSequence. |
processForm | protected void processForm(CrawlURI curi, Element element) | | |
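The class description says form extraction covers both the form's action and its default values. One way to picture this is to combine the action with the default field values into a single GET-style URL. The sketch below is an assumption about the general idea, not the code of `processForm`: `FormUrlSketch` and `buildFormUrl` are hypothetical names, and the Jericho parsing step that would supply the action and field values is left out so that the example needs only the standard library.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

/** Hypothetical illustration (not Heritrix code) of turning a form into a URL. */
class FormUrlSketch {

    /**
     * Appends each default field value to the form action as a query
     * parameter, in the iteration order of the given map.
     */
    static String buildFormUrl(String action, Map<String, String> defaults) {
        StringBuilder sb = new StringBuilder(action);
        // Continue an existing query string if the action already has one.
        char sep = action.indexOf('?') >= 0 ? '&' : '?';
        for (Map.Entry<String, String> e : defaults.entrySet()) {
            sb.append(sep)
              .append(encode(e.getKey()))
              .append('=')
              .append(encode(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(ex);
        }
    }
}
```

Passing a `LinkedHashMap` keeps the parameters in document order, which makes the resulting URL deterministic.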
processGeneralTag | protected void processGeneralTag(CrawlURI curi, Element element, Attributes attributes) | | |
processMeta | protected boolean processMeta(CrawlURI curi, Element element) | | |
processScript | protected void processScript(CrawlURI curi, Element element) | | |
processStyle | protected void processStyle(CrawlURI curi, Element element) | | |
Methods inherited from org.archive.crawler.extractor.ExtractorHTML |
public void extract(CrawlURI curi)
void extract(CrawlURI curi, CharSequence cs)
protected boolean isHtmlExpectedHere(CrawlURI curi) throws URIException
final protected void processEmbed(CrawlURI curi, CharSequence value, CharSequence context)
protected void processEmbed(CrawlURI curi, CharSequence value, CharSequence context, char hopType)
protected void processGeneralTag(CrawlURI curi, CharSequence element, CharSequence cs)
protected void processLink(CrawlURI curi, CharSequence value, CharSequence context)
protected boolean processMeta(CrawlURI curi, CharSequence cs)
protected void processScript(CrawlURI curi, CharSequence sequence, int endOfOpenTag)
protected void processScriptCode(CrawlURI curi, CharSequence cs)
protected void processStyle(CrawlURI curi, CharSequence sequence, int endOfOpenTag)
public String report()
|