Java Doc for TextExtractor.java (HTML Parser, jericho-html, package au.id.jericho.lib.html)



java.lang.Object
   au.id.jericho.lib.html.TextExtractor

TextExtractor
public class TextExtractor implements CharStreamSource
Extracts the textual content from HTML markup.

The output is ideal for feeding into a text search engine such as Apache Lucene, especially when the IncludeAttributes property (see setIncludeAttributes(boolean)) has been set to true.

Use one of the following methods to obtain the output:
  writeTo(Writer)
  toString()

The process removes all of the tags and decodes the result, collapsing all white space. A space character is included in the output where a normal tag is present in the source, unless the tag belongs to an inline-level element. An exception to this is the BR element, which is also converted to a space despite being an inline-level element.

Text inside SCRIPT and STYLE elements contained within this segment is ignored.
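The spacing rule described above can be modelled with a few lines of standard-library Java. This is only a simplified sketch, not how TextExtractor is implemented: the class name SpacingRuleSketch, the regular expressions, and the short list of inline-level element names are all assumptions made for illustration, and character references are not decoded.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, regex-based model of the documented spacing rule (not the Jericho parser).
class SpacingRuleSketch {
    // Illustrative subset of inline-level element names. BR is deliberately absent,
    // so it is treated like any other tag that produces a space.
    private static final Set<String> INLINE =
        new HashSet<>(Arrays.asList("a", "b", "i", "em", "strong", "span"));

    // Matches a whole SCRIPT or STYLE element, including its content.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Matches a start or end tag and captures the element name.
    private static final Pattern TAG = Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    static String extract(String html) {
        // SCRIPT and STYLE content is ignored entirely.
        String source = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        Matcher tag = TAG.matcher(source);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (tag.find()) {
            out.append(source, last, tag.start());
            // Tags of inline-level elements vanish; every other tag becomes a space.
            if (!INLINE.contains(tag.group(1).toLowerCase(Locale.ROOT))) {
                out.append(' ');
            }
            last = tag.end();
        }
        out.append(source.substring(last));
        // Collapse runs of white space and trim the ends.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Calling SpacingRuleSketch.extract("line one<br>line two") yields "line one line two", because BR is treated as a space-producing tag, while the B tags in "<b>O</b>ne" disappear without splitting the word.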

Setting the ExcludeNonHTMLElements property (see setExcludeNonHTMLElements(boolean)) to true results in the exclusion of any content within a non-HTML element.

See the excludeElement(StartTag) method for details on how to implement a more complex mechanism to determine whether the content of each Element is to be excluded from the output.

All tags that are not normal tags, such as server tags and comments, are removed from the output without adding white space to the output.

Note that segments on which the Segment.ignoreWhenParsing method has been called are treated as text rather than markup, resulting in their inclusion in the output. To remove specific segments before extracting the text, create an OutputDocument and call its remove(Segment) or replaceWithSpaces(int begin, int end) method for each segment to be removed. Then create a new source document using new Source(outputDocument.toString()) and perform the text extraction on this new source object.

Extracting the text from an entire Source object performs a full sequential parse automatically.

To perform a simple rendering of HTML markup into text, which is more readable than the output of this class, use the Renderer class instead.

Example:
Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".



Constructor Summary
public TextExtractor(Segment segment)
     Constructs a new TextExtractor based on the specified Segment.

Method Summary
public boolean excludeElement(StartTag startTag)
     Indicates whether the text inside the Element of the specified start tag should be excluded from the output.
public boolean getConvertNonBreakingSpaces()
     Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces.
public long getEstimatedMaximumOutputLength()
    
public boolean getExcludeNonHTMLElements()
     Indicates whether the content of non-HTML elements is excluded from the output.
public boolean getIncludeAttributes()
     Indicates whether the values of title, alt, label, summary, and content attributes of tags are to be included in the output.
public TextExtractor setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
     Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces. The default value is true.
public TextExtractor setExcludeNonHTMLElements(boolean excludeNonHTMLElements)
     Sets whether the content of non-HTML elements is excluded from the output. The default value is false.
public TextExtractor setIncludeAttributes(boolean includeAttributes)
     Sets whether the values of title, alt, label, summary, and content attributes of tags are to be included in the output. The default value is false.
public String toString()
    
public void writeTo(Writer writer)
    


Constructor Detail
TextExtractor
public TextExtractor(Segment segment)
Constructs a new TextExtractor based on the specified Segment.
Parameters:
  segment - the segment from which the text will be extracted.
See Also:   Segment.getTextExtractor




Method Detail
excludeElement
public boolean excludeElement(StartTag startTag)
Indicates whether the text inside the Element of the specified start tag should be excluded from the output.

During the text extraction process, every start tag encountered in the segment is checked using this method to determine whether the text inside its element should be excluded from the output.

The default implementation of this method is to always return false, so that every element is included, but the method can be overridden in a subclass to perform a check of arbitrary complexity on each start tag.

All elements nested inside an excluded element are also implicitly excluded, as are all SCRIPT and STYLE elements. Such elements are skipped over without calling this method, so there is no way to include them by overriding the method.

Example:
To extract the text from a segment, excluding any text inside elements with the attribute class="NotIndexed":

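TextExtractor textExtractor = new TextExtractor(segment) {
    // Exclude the text of any element whose class attribute equals "NotIndexed" (ignoring case).
    public boolean excludeElement(StartTag startTag) {
        // The literal-first comparison stays null-safe when the element has no class attribute.
        return "NotIndexed".equalsIgnoreCase(startTag.getAttributeValue("class"));
    }
};
String extractedText = textExtractor.toString();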

Parameters:
  startTag - the start tag of the element to check for inclusion.
Returns:
  true if the text inside the Element of the specified start tag should be excluded from the output, otherwise false.
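The skipping behaviour described for excludeElement (descendants of an excluded element, and all SCRIPT and STYLE elements, are never passed to the method) can be illustrated without the library. The following is a minimal sketch under stated assumptions: Node, ExclusionWalkSketch, and the checked list are hypothetical names invented here, and the tree walk is a stand-in for the real parser, not a description of its internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of the documented exclusion rules (not Jericho code).
class ExclusionWalkSketch {
    // Stand-in for an element: a name, an optional class attribute, its own text, and children.
    static final class Node {
        final String name;
        final String cssClass;
        final String text;
        final List<Node> children;

        Node(String name, String cssClass, String text, Node... children) {
            this.name = name;
            this.cssClass = cssClass;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    // Names of the elements the predicate was actually asked about, in document order.
    final List<String> checked = new ArrayList<>();

    String extract(Node root, Predicate<Node> exclude) {
        StringBuilder out = new StringBuilder();
        walk(root, exclude, out);
        return out.toString().trim();
    }

    private void walk(Node node, Predicate<Node> exclude, StringBuilder out) {
        // SCRIPT and STYLE are skipped before the predicate is consulted.
        if (node.name.equals("script") || node.name.equals("style")) {
            return;
        }
        checked.add(node.name);
        // An excluded element drops its whole subtree; descendants are never checked.
        if (exclude.test(node)) {
            return;
        }
        out.append(node.text).append(' ');
        for (Node child : node.children) {
            walk(child, exclude, out);
        }
    }
}
```

With a predicate such as n -> "NotIndexed".equalsIgnoreCase(n.cssClass), a paragraph nested inside an excluded DIV never appears in checked, which mirrors the statement that overriding the method cannot bring such elements back.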



getConvertNonBreakingSpaces
public boolean getConvertNonBreakingSpaces()
Indicates whether non-breaking space (&nbsp;) character entity references are converted to spaces.

See the setConvertNonBreakingSpaces(boolean) method for a full description of this property.
Returns:
  true if non-breaking space (&nbsp;) character entity references are converted to spaces, otherwise false.




getEstimatedMaximumOutputLength
public long getEstimatedMaximumOutputLength()



getExcludeNonHTMLElements
public boolean getExcludeNonHTMLElements()
Indicates whether the content of non-HTML elements is excluded from the output.

See the setExcludeNonHTMLElements(boolean) method for a full description of this property.
Returns:
  true if the content of non-HTML elements is excluded from the output, otherwise false.




getIncludeAttributes
public boolean getIncludeAttributes()
Indicates whether the values of title, alt, label, summary, and content attributes of tags are to be included in the output.

See the setIncludeAttributes(boolean) method for a full description of this property.
Returns:
  true if the attribute values are to be included in the output, otherwise false.




setConvertNonBreakingSpaces
public TextExtractor setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
Sets whether non-breaking space (&nbsp;) character entity references are converted to spaces.

The default value is true.
Parameters:
  convertNonBreakingSpaces - specifies whether non-breaking space (&nbsp;) character entity references are converted to spaces.
Returns:
  this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
See Also:   TextExtractor.getConvertNonBreakingSpaces()
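The effect of this property can be pictured with a one-method sketch. It is an illustration only: NbspSketch and decodeNbsp are hypothetical names, and the sketch assumes that when the property is false the reference decodes to the non-breaking space character U+00A0, which this page does not state explicitly.

```java
// Hypothetical model of the ConvertNonBreakingSpaces property (not Jericho code).
class NbspSketch {
    static String decodeNbsp(String text, boolean convertNonBreakingSpaces) {
        // Assumption: with conversion off, &nbsp; decodes to U+00A0 instead of a plain space.
        String replacement = convertNonBreakingSpaces ? " " : "\u00A0";
        return text.replace("&nbsp;", replacement);
    }
}
```

For a search index, the default of true is usually the convenient choice, since a tokenizer that splits on ordinary spaces then treats "10&nbsp;km" as two words.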




setExcludeNonHTMLElements
public TextExtractor setExcludeNonHTMLElements(boolean excludeNonHTMLElements)
Sets whether the content of non-HTML elements is excluded from the output.

The default value is false, meaning that content from all elements meeting the other criteria is included.
Parameters:
  excludeNonHTMLElements - specifies whether the content of non-HTML elements is excluded from the output.
Returns:
  this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
See Also:   TextExtractor.getExcludeNonHTMLElements()
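Because each property setter returns the instance it was called on, several properties can be set in one statement. The sketch below shows the pattern in isolation; ChainedSettersSketch is a hypothetical stand-in class, and only the property names and default values are taken from this page.

```java
// Hypothetical stand-in showing the chained-setter pattern (not the real TextExtractor).
class ChainedSettersSketch {
    // Defaults mirror the ones stated in this documentation.
    private boolean convertNonBreakingSpaces = true;
    private boolean excludeNonHTMLElements = false;
    private boolean includeAttributes = false;

    ChainedSettersSketch setConvertNonBreakingSpaces(boolean value) {
        convertNonBreakingSpaces = value;
        return this; // returning this is what makes chaining possible
    }

    ChainedSettersSketch setExcludeNonHTMLElements(boolean value) {
        excludeNonHTMLElements = value;
        return this;
    }

    ChainedSettersSketch setIncludeAttributes(boolean value) {
        includeAttributes = value;
        return this;
    }

    boolean getConvertNonBreakingSpaces() { return convertNonBreakingSpaces; }
    boolean getExcludeNonHTMLElements() { return excludeNonHTMLElements; }
    boolean getIncludeAttributes() { return includeAttributes; }
}
```

With the real class, the equivalent chain would presumably read new TextExtractor(segment).setIncludeAttributes(true).setExcludeNonHTMLElements(true).toString(), using only the constructor and methods documented on this page.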




setIncludeAttributes
public TextExtractor setIncludeAttributes(boolean includeAttributes)
Sets whether the values of title, alt, label, summary, and content attributes of tags are to be included in the output.

The value of a content attribute is only included if a name attribute is also present, as the content attribute of a META tag only contains human-readable text if the name attribute is used as opposed to an http-equiv attribute.

The default value is false.
Parameters:
  includeAttributes - specifies whether the attribute values are included in the output.
Returns:
  this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
See Also:   TextExtractor.getIncludeAttributes()
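The rule for the content attribute is easy to misread, so here is a small sketch of the decision as documented above. It is a model under assumptions, not the library's code: AttributeInclusionSketch and includedValues are hypothetical names, and attribute names are assumed to be lower-case keys in a plain Map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of which attribute values IncludeAttributes would emit (not Jericho code).
class AttributeInclusionSketch {
    private static final String[] ALWAYS_INCLUDED = {"title", "alt", "label", "summary"};

    static List<String> includedValues(Map<String, String> attributes) {
        List<String> values = new ArrayList<>();
        for (String name : ALWAYS_INCLUDED) {
            String value = attributes.get(name);
            if (value != null) {
                values.add(value);
            }
        }
        // content only counts when a name attribute is present on the same tag,
        // which rules out META tags that use http-equiv.
        String content = attributes.get("content");
        if (content != null && attributes.containsKey("name")) {
            values.add(content);
        }
        return values;
    }
}
```

A tag with name="description" and content="A page about parsing" contributes its content value, whereas a tag with http-equiv="refresh" and content="5" contributes nothing.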




toString
public String toString()



writeTo
public void writeTo(Writer writer) throws IOException



Methods inherited from java.lang.Object
protected native Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
public final native Class getClass()
public native int hashCode()
public final native void notify()
public final native void notifyAll()
public String toString()
public final native void wait(long timeout) throws InterruptedException
public final void wait(long timeout, int nanos) throws InterruptedException
public final void wait() throws InterruptedException
