Java Doc for RegexpHTMLLinkExtractor.java in Web-Crawler » heritrix » org » archive » extractor



java.lang.Object
   org.archive.extractor.CharSequenceLinkExtractor
      org.archive.extractor.RegexpHTMLLinkExtractor

RegexpHTMLLinkExtractor
public class RegexpHTMLLinkExtractor extends CharSequenceLinkExtractor
Basic link extraction from an HTML content body, using regular expressions. Rough draft in progress: incomplete and untested.
author:
   gojomo


Field Summary
final static String AMP
final static String APPLET
final static String BASE
final static String CLASSEXT
final static String EACH_ATTRIBUTE_EXTRACTOR
final static String ESCAPED_AMP
final static String JAVASCRIPT
final static String LIKELY_URI_PATH
final static String LINK
final static String NON_HTML_PATH_EXTENSION
final static String RELEVANT_TAG_EXTRACTOR
     Compiled relevant tag extractor.
final static String WHITESPACE
boolean extractInlineCss
boolean extractInlineJs
boolean honorRobots
protected LinkedList<Link> next
protected Matcher tags


Method Summary
protected boolean findNextLink()
protected static CharSequenceLinkExtractor newDefaultInstance()
protected long processEmbed(CharSequence value, CharSequence context)
protected boolean processGeneralTag(CharSequence element, CharSequence cs)
protected void processLink(CharSequence value, CharSequence context)
protected void processMeta(CharSequence cs)
protected void processScript(CharSequence sequence, int endOfOpenTag)
protected void processScriptCode(CharSequence cs)
protected void processStyle(CharSequence sequence, int endOfOpenTag)
public void reset()
     Discard all state.

Field Detail
AMP
final static String AMP

APPLET
final static String APPLET

BASE
final static String BASE

CLASSEXT
final static String CLASSEXT

EACH_ATTRIBUTE_EXTRACTOR
final static String EACH_ATTRIBUTE_EXTRACTOR

ESCAPED_AMP
final static String ESCAPED_AMP

JAVASCRIPT
final static String JAVASCRIPT

LIKELY_URI_PATH
final static String LIKELY_URI_PATH

LINK
final static String LINK

NON_HTML_PATH_EXTENSION
final static String NON_HTML_PATH_EXTENSION

RELEVANT_TAG_EXTRACTOR
final static String RELEVANT_TAG_EXTRACTOR
Compiled relevant tag extractor.

This pattern extracts either:

  • (1) a whole <script>...</script> element, or
  • (2) a whole <style>...</style> element, or
  • (3) a <meta ...> tag, or
  • (4) any other open tag with at least one attribute (e.g. it matches "<a href='boo'>" but not "</a>" or "<br>")

Capture groups:

  • 1: SCRIPT SRC=foo>boo</SCRIPT
  • 2: just script open tag
  • 3: STYLE TYPE=moo>zoo</STYLE
  • 4: just style open tag
  • 5: entire other tag, without '<' '>'
  • 6: element
  • 7: META
  • 8: !-- comment --
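
The group layout above can be illustrated with a small, simplified pattern. This is only a sketch: the regular expression below is a stand-in written to reproduce the eight groups as described, not the actual value of RELEVANT_TAG_EXTRACTOR, and the class name TagScanSketch and the method scan are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagScanSketch {
    // Simplified stand-in pattern (an assumption, not the Heritrix constant).
    // (?is) = case-insensitive, and '.' also matches line terminators.
    static final Pattern TAGS = Pattern.compile(
        "(?is)<(?:"
        + "((script[^>]*+)>.*?</script)"    // 1: whole script element, 2: script open tag
        + "|((style[^>]*+)>.*?</style)"     // 3: whole style element, 4: style open tag
        + "|(((meta)|(?:\\w+))\\s+[^>]*+)"  // 5: entire other tag, 6: element name, 7: meta
        + "|(!--.*?--)"                     // 8: comment
        + ")>");

    // Returns one short label per matched construct, in document order.
    static List<String> scan(CharSequence html) {
        List<String> out = new ArrayList<>();
        Matcher m = TAGS.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                out.add("script:" + m.group(2));
            } else if (m.group(3) != null) {
                out.add("style:" + m.group(4));
            } else if (m.group(8) != null) {
                out.add("comment");
            } else if (m.group(7) != null) {
                out.add("meta:" + m.group(5));
            } else {
                out.add("tag:" + m.group(6));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "</a>" and "<br>" are skipped: neither is an open tag with an attribute.
        System.out.println(scan("<a href='boo'>x</a><br><meta name=robots content=none>"));
    }
}
```

Checking which group is non-null tells the caller which alternative matched, which is presumably how a tag would be routed to methods such as processScript, processStyle, processMeta or processGeneralTag; that routing is an inference from the method names, not something this page states.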



WHITESPACE
final static String WHITESPACE

extractInlineCss
boolean extractInlineCss

extractInlineJs
boolean extractInlineJs

honorRobots
boolean honorRobots

next
protected LinkedList<Link> next

tags
protected Matcher tags





Method Detail
findNextLink
protected boolean findNextLink()

newDefaultInstance
protected static CharSequenceLinkExtractor newDefaultInstance()

processEmbed
protected long processEmbed(CharSequence value, CharSequence context)

processGeneralTag
protected boolean processGeneralTag(CharSequence element, CharSequence cs)

processLink
protected void processLink(CharSequence value, CharSequence context)

Parameters:
  value
  context

processMeta
protected void processMeta(CharSequence cs)

processScript
protected void processScript(CharSequence sequence, int endOfOpenTag)

processScriptCode
protected void processScriptCode(CharSequence cs)

Parameters:
  cs

processStyle
protected void processStyle(CharSequence sequence, int endOfOpenTag)

Parameters:
  sequence
  endOfOpenTag

reset
public void reset()
Discard all state. Another setup() is required to use again.



Fields inherited from org.archive.extractor.CharSequenceLinkExtractor
protected UURI base
protected ExtractErrorListener extractErrorListener
protected LinkedList<Link> next
protected UURI source
protected CharSequence sourceContent

Methods inherited from org.archive.extractor.CharSequenceLinkExtractor
protected CharSequence charSequenceFrom(InputStream content, Charset charset)
protected CharSequence createCharSequenceFrom(InputStream content, Charset charset)
public static void extract(CharSequence content, UURI source, UURI base, List<Link> collector, ExtractErrorListener extractErrorListener)
abstract protected boolean findNextLink()
public boolean hasNext()
protected static CharSequenceLinkExtractor newDefaultInstance()
public Object next()
public Link nextLink()
public void remove()
public void reset()
public void setup(UURI source, UURI base, InputStream content, Charset charset, ExtractErrorListener listener)
public void setup(UURI source, UURI base, CharSequence content, ExtractErrorListener listener)
public void setup(UURI sourceandbase, CharSequence content, ExtractErrorListener listener)
public void setup(UURI sourceandbase, InputStream content, Charset charset, ExtractErrorListener listener)
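
The combination of setup(), hasNext(), findNextLink(), reset() and the `next` queue suggests a lazy, queue-backed iterator: hasNext() answers from the queue and otherwise asks findNextLink() to scan further. The sketch below illustrates that pattern under this assumption. It does not use the Heritrix classes: QueuedExtractorSketch, its href-only pattern, and its use of String in place of Link and UURI are all hypothetical.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueuedExtractorSketch implements Iterator<String> {
    // Hypothetical pattern: captures quoted href values only.
    private static final Pattern HREF =
        Pattern.compile("(?i)href\\s*=\\s*['\"]([^'\"]*)['\"]");

    protected final LinkedList<String> next = new LinkedList<>(); // pending links
    protected Matcher tags;                                       // null until setup()

    // Bind the extractor to new content and drop any pending links.
    public void setup(CharSequence content) {
        next.clear();
        tags = HREF.matcher(content);
    }

    // Scan forward until one link is queued; false when the content is exhausted.
    protected boolean findNextLink() {
        if (tags == null) {
            return false;
        }
        while (tags.find()) {
            String value = tags.group(1);
            if (!value.isEmpty()) {
                next.addLast(value);
                return true;
            }
        }
        return false;
    }

    @Override
    public boolean hasNext() {
        return !next.isEmpty() || findNextLink();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return next.removeFirst();
    }

    // Discard all state; another setup() is required before reuse.
    public void reset() {
        next.clear();
        tags = null;
    }
}
```

A queue rather than a single pending element lets one scanning step enqueue several links at once, for example when a single tag carries more than one URI-bearing attribute.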

Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException
