Java Doc for TextExtractor.java (Jericho HTML Parser, package au.id.jericho.lib.html)


java.lang.Object

au.id.jericho.lib.html.TextExtractor

TextExtractor

public class TextExtractor implements CharStreamSource

Extracts the textual content from HTML markup.

The output is well suited for feeding into a text search engine such as Apache Lucene, especially when the IncludeAttributes property has been set to true (see TextExtractor.setIncludeAttributes(boolean)).

Use one of the following methods to obtain the output:

TextExtractor.writeTo(Writer)
TextExtractor.toString()
CharStreamSourceUtil.getReader(this)

The process removes all of the tags and decodes the result, collapsing all white space. A space character is included in the output where a normal tag is present in the source, unless the tag belongs to an inline-level element. An exception to this is the BR element (HTMLElementName.BR), which is also converted to a space despite being an inline-level element.
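The spacing rule above can be modelled in a few lines. The following is only a simplified sketch, not how the library parses HTML: the class name TagSpacingSketch, the regular expression, and the short list of inline-level element names are all assumptions made for illustration, and the white space collapsing at the end is part of the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified model of the spacing rule only; NOT the Jericho implementation.
class TagSpacingSketch {
    // Assumption: a small, incomplete set of inline-level element names.
    private static final Set<String> INLINE =
            new HashSet<>(Arrays.asList("a", "b", "br", "em", "i", "span", "strong"));

    // Matches a simple start or end tag and captures the element name.
    private static final Pattern TAG = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    static String extract(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start()); // copy the text before this tag
            String name = m.group(1).toLowerCase();
            // Inline-level tags leave no trace; BR and all other tags become a space.
            if (name.equals("br") || !INLINE.contains(name)) {
                out.append(' ');
            }
            last = m.end();
        }
        out.append(html, last, html.length()); // copy the text after the last tag
        // Collapse runs of white space and trim the ends.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

With this sketch, the B tags in "<div><b>O</b>ne</div>" do not split the word "One", while a BR between two words still separates them.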

Text inside SCRIPT (HTMLElementName.SCRIPT) and STYLE (HTMLElementName.STYLE) elements contained within this segment is ignored.

Setting the ExcludeNonHTMLElements property to true (see TextExtractor.setExcludeNonHTMLElements(boolean)) results in the exclusion of any content within a non-HTML element.

See the TextExtractor.excludeElement(StartTag) method for details on how to implement a more complex mechanism to determine whether the content of each Element is to be excluded from the output.

All tags that are not normal tags, such as server tags and comments, are removed from the output without adding white space to the output.

Note that segments on which the Segment.ignoreWhenParsing method has been called are treated as text rather than markup, resulting in their inclusion in the output. To remove specific segments before extracting the text, create an OutputDocument and call its remove(Segment) or replaceWithSpaces(int begin, int end) method for each segment to be removed. Then create a new source document using new Source(outputDocument.toString()) and perform the text extraction on this new source object.
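The first step of that workflow, blanking out a range before re-parsing, can be pictured with plain strings. This is a hypothetical stand-in, not the OutputDocument class: the name BlankOutSketch is invented, and the sketch assumes that replacing with spaces means overwriting the range [begin, end) with the same number of space characters, so that the positions of the remaining text do not shift.

```java
// Hypothetical stand-in for blanking a range; NOT the Jericho OutputDocument class.
class BlankOutSketch {
    // Returns a copy of text in which the characters in [begin, end) are spaces.
    static String replaceWithSpaces(String text, int begin, int end) {
        StringBuilder sb = new StringBuilder(text);
        for (int i = begin; i < end; i++) {
            sb.setCharAt(i, ' '); // overwrite in place so the length is unchanged
        }
        return sb.toString();
    }
}
```

The blanked string would then play the role of outputDocument.toString() when constructing the new source.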

Extracting the text from an entire Source object performs a full sequential parse automatically.

To perform a simple rendering of HTML markup into text, which is more readable than the output of this class, use the Renderer class instead.

Example: Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".

Constructor Summary
public	TextExtractor(Segment segment) Constructs a new `TextExtractor` based on the specified Segment.

Method Summary
public boolean	excludeElement(StartTag startTag) Indicates whether the text inside the Element of the specified start tag should be excluded from the output.
public boolean	getConvertNonBreakingSpaces() Indicates whether non-breaking space (&nbsp;, CharacterEntityReference._nbsp) character entity references are converted to spaces.
public long	getEstimatedMaximumOutputLength()
public boolean	getExcludeNonHTMLElements() Indicates whether the content of non-HTML elements is excluded from the output.
public boolean	getIncludeAttributes() Indicates whether the values of title, alt, label, summary, and content attributes of tags are to be included in the output.
public TextExtractor	setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces) Sets whether non-breaking space (&nbsp;, CharacterEntityReference._nbsp) character entity references are converted to spaces.
public TextExtractor	setExcludeNonHTMLElements(boolean excludeNonHTMLElements) Sets whether the content of non-HTML elements is excluded from the output.
public TextExtractor	setIncludeAttributes(boolean includeAttributes) Sets whether the values of title, alt, label, summary, and content attributes of tags are to be included in the output.
public String	toString()
public void	writeTo(Writer writer)

Constructor Detail

TextExtractor
public TextExtractor(Segment segment)
	Constructs a new `TextExtractor` based on the specified Segment.
	Parameters: segment - the segment from which the text will be extracted.
	See Also: Segment.getTextExtractor

Method Detail

excludeElement

public boolean excludeElement(StartTag startTag)

Indicates whether the text inside the Element of the specified start tag should be excluded from the output.

During the text extraction process, every start tag encountered in the segment is checked using this method to determine whether the text inside its element should be excluded from the output.

The default implementation of this method is to always return false, so that every element is included, but the method can be overridden in a subclass to perform a check of arbitrary complexity on each start tag.

All elements nested inside an excluded element are also implicitly excluded, as are all SCRIPT (HTMLElementName.SCRIPT) and STYLE (HTMLElementName.STYLE) elements. Such elements are skipped over without calling this method, so there is no way to include them by overriding the method.

Example: To extract the text from a segment, excluding any text inside elements with the attribute class="NotIndexed":

    // Anonymous subclass that overrides the exclusion check.
    TextExtractor textExtractor = new TextExtractor(segment) {
        public boolean excludeElement(StartTag startTag) {
            // Exclude the element when its class attribute is "NotIndexed" (case-insensitive).
            return "NotIndexed".equalsIgnoreCase(startTag.getAttributeValue("class"));
        }
    };
    String extractedText = textExtractor.toString();

Parameters:
startTag - the start tag of the element to check for inclusion.
Returns:
true if the text inside the Element of the specified start tag should be excluded from the output, otherwise false.
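The implicit exclusion of nested elements, and the fact that SCRIPT and STYLE are skipped before the check is ever consulted, can be illustrated with a small tree walk. This is a sketch under stated assumptions: Node and ExclusionSketch are hypothetical types invented here, not Jericho classes, and the predicate stands in for an overridden excludeElement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of the exclusion walk; NOT the Jericho implementation.
class ExclusionSketch {
    // Minimal element tree invented for this sketch.
    static final class Node {
        final String name;
        final String cssClass; // may be null
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String name, String cssClass, String text) {
            this.name = name;
            this.cssClass = cssClass;
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    static void collect(Node node, Predicate<Node> exclude, List<String> out) {
        // SCRIPT and STYLE are skipped without consulting the predicate.
        if (node.name.equals("script") || node.name.equals("style")) {
            return;
        }
        // An excluded element takes its whole subtree with it.
        if (exclude.test(node)) {
            return;
        }
        if (!node.text.isEmpty()) {
            out.add(node.text);
        }
        for (Node child : node.children) {
            collect(child, exclude, out);
        }
    }
}
```

Passing a predicate that always returns false mirrors the default behaviour: every element is included except SCRIPT and STYLE.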

getConvertNonBreakingSpaces
public boolean getConvertNonBreakingSpaces()
	Indicates whether non-breaking space (&nbsp;, CharacterEntityReference._nbsp) character entity references are converted to spaces. See the TextExtractor.setConvertNonBreakingSpaces(boolean) method for a full description of this property.
	Returns: `true` if non-breaking space (&nbsp;) character entity references are converted to spaces, otherwise `false`.

getEstimatedMaximumOutputLength
public long getEstimatedMaximumOutputLength()

getExcludeNonHTMLElements
public boolean getExcludeNonHTMLElements()
	Indicates whether the content of non-HTML elements is excluded from the output. See the TextExtractor.setExcludeNonHTMLElements(boolean) method for a full description of this property.
	Returns: `true` if the content of non-HTML elements is excluded from the output, otherwise `false`.

getIncludeAttributes
public boolean getIncludeAttributes()
	Indicates whether the values of title, alt, label, summary, and content attributes of tags are to be included in the output. See the TextExtractor.setIncludeAttributes(boolean) method for a full description of this property.
	Returns: `true` if the attribute values are to be included in the output, otherwise `false`.

setConvertNonBreakingSpaces
public TextExtractor setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces)
	Sets whether non-breaking space (&nbsp;, CharacterEntityReference._nbsp) character entity references are converted to spaces. The default value is `true`.
	Parameters: convertNonBreakingSpaces - specifies whether non-breaking space (&nbsp;) character entity references are converted to spaces.
	Returns: this `TextExtractor` instance, allowing multiple property setting methods to be chained in a single statement.
	See Also: TextExtractor.getConvertNonBreakingSpaces()
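The effect of this property can be pictured with a one-method sketch. It is a simplified model that handles only the literal &nbsp; reference; the class and method names are hypothetical, and real entity decoding covers far more than this.

```java
// Simplified model of the ConvertNonBreakingSpaces property; NOT the Jericho decoder.
class NbspSketch {
    static String decodeNbsp(String text, boolean convertNonBreakingSpaces) {
        // When conversion is on, the reference becomes an ordinary space;
        // otherwise it decodes to the non-breaking space character U+00A0.
        String replacement = convertNonBreakingSpaces ? " " : "\u00A0";
        return text.replace("&nbsp;", replacement);
    }
}
```

With conversion enabled (the documented default), downstream tokenizers that split on ordinary spaces treat the two words as separate.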

setExcludeNonHTMLElements
public TextExtractor setExcludeNonHTMLElements(boolean excludeNonHTMLElements)
	Sets whether the content of non-HTML elements is excluded from the output. The default value is `false`, meaning that content from all elements meeting the other criteria is included.
	Parameters: excludeNonHTMLElements - specifies whether the content of non-HTML elements is excluded from the output.
	Returns: this `TextExtractor` instance, allowing multiple property setting methods to be chained in a single statement.
	See Also: TextExtractor.getExcludeNonHTMLElements()
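Because each property setter returns the instance it was called on, several properties can be configured in one statement. The sketch below shows that pattern with a hypothetical stand-in class, ExtractorSettingsSketch, which only holds the three documented properties and their documented defaults; it is not the TextExtractor class and performs no extraction.

```java
// Hypothetical stand-in showing the chained-setter pattern; NOT the Jericho class.
class ExtractorSettingsSketch {
    private boolean convertNonBreakingSpaces = true; // documented default
    private boolean excludeNonHTMLElements = false;  // documented default
    private boolean includeAttributes = false;       // documented default

    ExtractorSettingsSketch setConvertNonBreakingSpaces(boolean value) {
        this.convertNonBreakingSpaces = value;
        return this; // returning this is what makes chaining possible
    }

    ExtractorSettingsSketch setExcludeNonHTMLElements(boolean value) {
        this.excludeNonHTMLElements = value;
        return this;
    }

    ExtractorSettingsSketch setIncludeAttributes(boolean value) {
        this.includeAttributes = value;
        return this;
    }

    boolean getConvertNonBreakingSpaces() { return convertNonBreakingSpaces; }
    boolean getExcludeNonHTMLElements() { return excludeNonHTMLElements; }
    boolean getIncludeAttributes() { return includeAttributes; }
}
```

A chained call then reads new ExtractorSettingsSketch().setIncludeAttributes(true).setExcludeNonHTMLElements(true), and the same shape applies to the real setters described in this section.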

setIncludeAttributes

public TextExtractor setIncludeAttributes(boolean includeAttributes)

Sets whether the values of title, alt, label, summary, and content attributes of tags are to be included in the output.

The value of a content attribute is only included if a name attribute is also present, as the content attribute of a META tag (HTMLElementName.META) only contains human-readable text if the name attribute is used as opposed to an http-equiv attribute.

The default value is false.
Parameters:
includeAttributes - specifies whether the attribute values are included in the output.
Returns:
this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement.
See Also: TextExtractor.getIncludeAttributes()
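The rule for the content attribute can be written as a single check. This is a minimal sketch assuming the attributes of a tag are available as a map from lower-case attribute name to value; the class and method names are hypothetical and not part of the library.

```java
import java.util.Map;

// Hypothetical helper modelling the documented rule; NOT part of the Jericho API.
class MetaContentSketch {
    // A content value is included only when the tag also carries a name attribute.
    static boolean includeContentAttribute(Map<String, String> attributes) {
        return attributes.containsKey("content") && attributes.containsKey("name");
    }
}
```

Under this rule a tag with name="description" contributes its content value, while a tag with http-equiv="refresh" does not, since its content is not human-readable text.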

toString
public String toString()

writeTo
public void writeTo(Writer writer) throws IOException

Methods inherited from java.lang.Object

protected native Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
public final native Class getClass()
public native int hashCode()
public final native void notify()
public final native void notifyAll()
public String toString()
public final native void wait(long timeout) throws InterruptedException
public final void wait(long timeout, int nanos) throws InterruptedException
public final void wait() throws InterruptedException
