| java.lang.Object au.id.jericho.lib.html.TextExtractor
TextExtractor | public class TextExtractor implements CharStreamSource | | Extracts the textual content from HTML markup.
The output is ideal for feeding into a text search engine such as Apache Lucene,
especially when the IncludeAttributes property has been set to true (see TextExtractor.setIncludeAttributes(boolean)).
Use the toString() or writeTo(Writer) method to obtain the output.
The process removes all of the tags and decodes the result, collapsing all white space.
A space character is included in the output where a normal tag is present in the source,
unless the tag belongs to an inline-level element.
An exception to this is the BR element (HTMLElementName.BR), which is also converted to a space despite being an inline-level element.
Text inside SCRIPT and STYLE elements (HTMLElementName.SCRIPT, HTMLElementName.STYLE) contained within this segment is ignored.
Setting the ExcludeNonHTMLElements property to true (see TextExtractor.setExcludeNonHTMLElements(boolean)) results in the exclusion of any content within a non-HTML element.
See the TextExtractor.excludeElement(StartTag) method for details on how to implement a more complex mechanism to determine whether the content of each Element is to be excluded from the output.
All tags that are not normal tags, such as comments and server tags, are removed from the output without adding whitespace to the output.
Note that segments on which the
Segment.ignoreWhenParsing method has been called are treated as text rather than markup,
resulting in their inclusion in the output.
To remove specific segments before extracting the text, create an OutputDocument and call its
remove(Segment) or replaceWithSpaces(int begin, int end) method for each segment to be removed.
Then create a new source document using new Source(outputDocument.toString()) and perform the text extraction on this new source object.
Extracting the text from an entire Source object performs a full sequential parse automatically.
To perform a simple rendering of HTML markup into text, which is more readable than the output of this class, use the
Renderer class instead.
- Example:
- Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div> "
produces the text "One Two Three ".
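The tag-to-space rule described above can be modeled in a few lines. The following is a simplified, regex-based sketch and not the library's implementation: the class name `TagSpacingSketch`, its deliberately incomplete list of inline-level element names, and the trimming of the result are all assumptions made for illustration. It ignores attributes, character references, and non-normal tags.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified model of the documented spacing rule; NOT the library's implementation.
class TagSpacingSketch {
    // Hypothetical, deliberately incomplete list of inline-level element names.
    private static final Set<String> INLINE =
            new HashSet<>(Arrays.asList("a", "b", "i", "em", "strong", "span"));

    // Matches a whole SCRIPT or STYLE element, including its content.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // Matches a start or end tag and captures its element name.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    static String extract(String html) {
        // Text inside SCRIPT and STYLE elements is ignored.
        String input = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        Matcher m = TAG.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            String name = m.group(1).toLowerCase();
            // Inline-level tags vanish; BR and every other tag become a space.
            if (name.equals("br") || !INLINE.contains(name)) {
                out.append(' ');
            }
            last = m.end();
        }
        out.append(input.substring(last));
        // Collapse runs of white space; trimming the ends is an assumption of this sketch.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Under these assumptions, `TagSpacingSketch.extract("<div><b>O</b>ne</div><p>A<br>B</p>")` yields `One A B`: the B tags leave no gap inside "One", while the DIV, P, and BR tags each contribute a space.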
|
Method Summary | |
public boolean | excludeElement(StartTag startTag) Indicates whether the text inside the
Element of the specified start tag should be excluded from the output.
During the text extraction process, every start tag encountered in the segment is checked using this method to determine whether the text inside its Element should be excluded from the output.
The default implementation of this method is to always return false , so that every element is included,
but the method can be overridden in a subclass to perform a check of arbitrary complexity on each start tag.
All elements nested inside an excluded element are also implicitly excluded, as are all SCRIPT and STYLE elements (HTMLElementName.SCRIPT, HTMLElementName.STYLE).
Such elements are skipped over without calling this method, so there is no way to include them by overriding the method.
- Example:
- To extract the text from a segment, excluding any text inside elements with the attribute class="NotIndexed":
    TextExtractor textExtractor = new TextExtractor(segment) {
        // Exclude elements whose class attribute equals "NotIndexed", ignoring case.
        public boolean excludeElement(StartTag startTag) {
            return "NotIndexed".equalsIgnoreCase(startTag.getAttributeValue("class"));
        }
    };
    String extractedText = textExtractor.toString();
Parameters: startTag - the start tag of the element to check for inclusion. |
public boolean | getConvertNonBreakingSpaces() Indicates whether non-breaking space (CharacterEntityReference._nbsp) character entity references are converted to spaces. |
public long | getEstimatedMaximumOutputLength() |
public boolean | getExcludeNonHTMLElements() Indicates whether the content of non-HTML elements is excluded from the output. |
public boolean | getIncludeAttributes() Indicates whether the values of title, alt, label, summary, and content attributes of normal tags are to be included in the output. |
public TextExtractor | setConvertNonBreakingSpaces(boolean convertNonBreakingSpaces) Sets whether non-breaking space (CharacterEntityReference._nbsp) character entity references are converted to spaces.
The default value is true.
Parameters: convertNonBreakingSpaces - specifies whether non-breaking space (CharacterEntityReference._nbsp) character entity references are converted to spaces. |
public TextExtractor | setExcludeNonHTMLElements(boolean excludeNonHTMLElements) Sets whether the content of non-HTML elements is excluded from the output.
The default value is false, meaning that content from all elements meeting the other criteria is included.
Parameters: excludeNonHTMLElements - specifies whether the content of non-HTML elements is excluded from the output. |
public TextExtractor | setIncludeAttributes(boolean includeAttributes) Sets whether the values of title, alt, label, summary, and content attributes of normal tags are to be included in the output.
The value of a content attribute is only included if a name attribute is also present,
as the content attribute of a META tag (HTMLElementName.META) only contains human readable text if the name attribute is used as opposed to an http-equiv attribute.
The default value is false.
Parameters: includeAttributes - specifies whether the attribute values are included in the output. |
public String | toString() |
public void | writeTo(Writer writer) |
TextExtractor | public TextExtractor(Segment segment) | | Constructs a new TextExtractor based on the specified Segment.
Parameters: segment - the segment from which the text will be extracted. See Also: Segment.getTextExtractor |
excludeElement | public boolean excludeElement(StartTag startTag) | | Indicates whether the text inside the Element of the specified start tag should be excluded from the output.
During the text extraction process, every start tag encountered in the segment is checked using this method to determine whether the text inside its Element should be excluded from the output.
The default implementation of this method is to always return false , so that every element is included,
but the method can be overridden in a subclass to perform a check of arbitrary complexity on each start tag.
All elements nested inside an excluded element are also implicitly excluded, as are all SCRIPT and STYLE elements (HTMLElementName.SCRIPT, HTMLElementName.STYLE).
Such elements are skipped over without calling this method, so there is no way to include them by overriding the method.
- Example:
- To extract the text from a segment, excluding any text inside elements with the attribute class="NotIndexed":
    TextExtractor textExtractor = new TextExtractor(segment) {
        // Exclude elements whose class attribute equals "NotIndexed", ignoring case.
        public boolean excludeElement(StartTag startTag) {
            return "NotIndexed".equalsIgnoreCase(startTag.getAttributeValue("class"));
        }
    };
    String extractedText = textExtractor.toString();
Parameters: startTag - the start tag of the element to check for inclusion. Returns: true if the text inside the Element of the specified start tag should be excluded from the output, otherwise false. |
getEstimatedMaximumOutputLength | public long getEstimatedMaximumOutputLength() | | |
setExcludeNonHTMLElements | public TextExtractor setExcludeNonHTMLElements(boolean excludeNonHTMLElements) | | Sets whether the content of non-HTML elements is excluded from the output.
The default value is false, meaning that content from all elements meeting the other criteria is included.
Parameters: excludeNonHTMLElements - specifies whether the content of non-HTML elements is excluded from the output. Returns: this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement. See Also: TextExtractor.getExcludeNonHTMLElements() |
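The chaining that the returned instance allows can be illustrated with a small stand-in class. `ExtractorSettingsSketch` is hypothetical: it mirrors only the two documented boolean properties and their default values, and performs no extraction.

```java
// Hypothetical stand-in for TextExtractor that models only two boolean
// properties; it performs no text extraction.
class ExtractorSettingsSketch {
    private boolean includeAttributes = false;      // documented default
    private boolean excludeNonHTMLElements = false; // documented default

    // Each setter returns this instance so that calls can be chained.
    ExtractorSettingsSketch setIncludeAttributes(boolean includeAttributes) {
        this.includeAttributes = includeAttributes;
        return this;
    }

    ExtractorSettingsSketch setExcludeNonHTMLElements(boolean excludeNonHTMLElements) {
        this.excludeNonHTMLElements = excludeNonHTMLElements;
        return this;
    }

    boolean getIncludeAttributes() {
        return includeAttributes;
    }

    boolean getExcludeNonHTMLElements() {
        return excludeNonHTMLElements;
    }

    // Configures both properties in a single chained statement.
    static ExtractorSettingsSketch configured() {
        return new ExtractorSettingsSketch()
                .setIncludeAttributes(true)
                .setExcludeNonHTMLElements(true);
    }
}
```

Going by the signatures documented here, the equivalent statement with the real class would be `new TextExtractor(segment).setIncludeAttributes(true).setExcludeNonHTMLElements(true).toString()`.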
setIncludeAttributes | public TextExtractor setIncludeAttributes(boolean includeAttributes) | | Sets whether the values of title, alt, label, summary, and content attributes of normal tags are to be included in the output.
The value of a content attribute is only included if a name attribute is also present,
as the content attribute of a META tag (HTMLElementName.META) only contains human readable text if the name attribute is used as opposed to an http-equiv attribute.
The default value is false.
Parameters: includeAttributes - specifies whether the attribute values are included in the output. Returns: this TextExtractor instance, allowing multiple property setting methods to be chained in a single statement. See Also: TextExtractor.getIncludeAttributes() |
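The rule for the content attribute can be expressed as a small filter. The sketch below is a simplified model rather than the library's code: the class `AttributeTextSketch`, the use of a Map with lower-case keys to represent a tag's attributes, the fixed output order, and the space separator are all assumptions made for illustration.

```java
import java.util.Map;

// Simplified model of the documented attribute rule; NOT the library's implementation.
class AttributeTextSketch {
    // Attribute names whose values the documentation says are included.
    private static final String[] TEXT_ATTRIBUTES =
            {"title", "alt", "label", "summary", "content"};

    // Returns the attribute values that would contribute text, separated by spaces.
    // Keys are assumed to be lower-case attribute names.
    static String attributeText(Map<String, String> attributes) {
        StringBuilder out = new StringBuilder();
        for (String name : TEXT_ATTRIBUTES) {
            String value = attributes.get(name);
            if (value == null) {
                continue;
            }
            // A content value counts only when a name attribute is also present.
            if (name.equals("content") && !attributes.containsKey("name")) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(value);
        }
        return out.toString();
    }
}
```

In this model, a tag with name="description" and content="A page about cats" contributes its content value, whereas a tag with http-equiv="refresh" and content="5" contributes nothing.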
|
|