Class hierarchy: java.lang.Object > org.archive.extractor.CharSequenceLinkExtractor > org.archive.extractor.RegexpHTMLLinkExtractor
RegexpHTMLLinkExtractor | public class RegexpHTMLLinkExtractor extends CharSequenceLinkExtractor | Basic link extraction from an HTML content body, using regular expressions.
ROUGH DRAFT IN PROGRESS: incomplete and untested.
Author: gojomo
EACH_ATTRIBUTE_EXTRACTOR | final static String EACH_ATTRIBUTE_EXTRACTOR
NON_HTML_PATH_EXTENSION | final static String NON_HTML_PATH_EXTENSION
RELEVANT_TAG_EXTRACTOR | final static String RELEVANT_TAG_EXTRACTOR | Regular expression for the relevant tag extractor.
This pattern extracts either:
(1) a whole <script>...</script> element, or
(2) a whole <style>...</style> element, or
(3) a <meta ...> tag, or
(4) any other open tag with at least one attribute
(e.g. it matches "<a href='boo'>" but not "</a>" or "<br>").
Capture groups:
1: SCRIPT SRC=foo>boo</SCRIPT
2: just the script open tag
3: STYLE TYPE=moo>zoo</STYLE
4: just the style open tag
5: the entire other tag, without '<' '>'
6: the element name
7: META
8: !-- comment -- (comments are captured as well, although they are not listed among the cases above)
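The group layout described for RELEVANT_TAG_EXTRACTOR can be illustrated with a small, self-contained sketch. The value of the real constant is not shown on this page, so the pattern below is an assumption written only to reproduce the documented group numbering; the class name `RelevantTagSketch` and the method `classify` are hypothetical and are not part of org.archive.extractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a stand-in pattern, not the actual RELEVANT_TAG_EXTRACTOR value. */
final class RelevantTagSketch {
    // Assumed pattern, built to mirror the documented groups:
    //  1: whole script element   2: script open tag
    //  3: whole style element    4: style open tag
    //  5: entire other tag       6: element name   7: "meta" (only for meta tags)
    //  8: comment
    static final Pattern RELEVANT_TAG = Pattern.compile(
        "(?is)<(?:"
        + "((script[^>]*+)>.*?</script)"
        + "|((style[^>]*+)>.*?</style)"
        + "|(((meta)|(?:\\w+))\\s+[^>]*+)"
        + "|(!--.*?--)"
        + ")>");

    /** Returns one human-readable label per match, in document order. */
    static List<String> classify(CharSequence html) {
        List<String> out = new ArrayList<>();
        Matcher m = RELEVANT_TAG.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                out.add("script: " + m.group(2));
            } else if (m.group(3) != null) {
                out.add("style: " + m.group(4));
            } else if (m.group(7) != null) {
                out.add("meta: " + m.group(5));
            } else if (m.group(5) != null) {
                out.add("tag " + m.group(6) + ": " + m.group(5));
            } else if (m.group(8) != null) {
                out.add("comment: " + m.group(8));
            }
        }
        return out;
    }
}
```

In this sketch, a tag without attributes such as "<br>" and a close tag such as "</a>" produce no match, because the "other tag" alternative requires whitespace after the element name.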
extractInlineCss | boolean extractInlineCss
extractInlineJs | boolean extractInlineJs
honorRobots | boolean honorRobots
findNextLink | protected boolean findNextLink()
processStyle | protected void processStyle(CharSequence sequence, int endOfOpenTag) | Parameters: sequence, endOfOpenTag (no descriptions given)
reset | public void reset() | Discards all state. Another setup() call is required before the extractor can be used again.
Methods inherited from org.archive.extractor.CharSequenceLinkExtractor
protected CharSequence charSequenceFrom(InputStream content, Charset charset)
protected CharSequence createCharSequenceFrom(InputStream content, Charset charset)
public static void extract(CharSequence content, UURI source, UURI base, List<Link> collector, ExtractErrorListener extractErrorListener)
abstract protected boolean findNextLink()
public boolean hasNext()
protected static CharSequenceLinkExtractor newDefaultInstance()
public Object next()
public Link nextLink()
public void remove()
public void reset()
public void setup(UURI source, UURI base, InputStream content, Charset charset, ExtractErrorListener listener)
public void setup(UURI source, UURI base, CharSequence content, ExtractErrorListener listener)
public void setup(UURI sourceandbase, CharSequence content, ExtractErrorListener listener)
public void setup(UURI sourceandbase, InputStream content, Charset charset, ExtractErrorListener listener)
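The inherited hasNext(), next(), remove() and abstract findNextLink() signatures suggest an iterator whose subclasses supply only the "find the next link" step. The sketch below shows one common way to structure such a lookahead iterator. It is an assumption about the general design, not the org.archive implementation: the classes `LookaheadExtractorSketch` and `HrefExtractorSketch`, the `buffered` field, and the href pattern are all hypothetical, and plain strings stand in for Link objects.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for an iterator-style base class; not the org.archive code. */
abstract class LookaheadExtractorSketch implements Iterator<String> {
    /** Result found by findNextLink() but not yet handed out by next(). */
    protected String buffered;

    /** Subclasses set {@code buffered} and return true, or return false when exhausted. */
    protected abstract boolean findNextLink();

    @Override
    public boolean hasNext() {
        // Only search when nothing is waiting in the buffer.
        return buffered != null || findNextLink();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = buffered;
        buffered = null;
        return result;
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}

/** Toy subclass: yields quoted href values found by a simple regular expression. */
final class HrefExtractorSketch extends LookaheadExtractorSketch {
    private static final Pattern HREF =
        Pattern.compile("(?i)href\\s*=\\s*['\"]([^'\"]*)['\"]");
    private final Matcher matcher;

    HrefExtractorSketch(CharSequence content) {
        this.matcher = HREF.matcher(content);
    }

    @Override
    protected boolean findNextLink() {
        if (!matcher.find()) {
            return false;
        }
        buffered = matcher.group(1);
        return true;
    }
}
```

With this structure, calling hasNext() repeatedly does not skip results, because a found link stays buffered until next() consumes it.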