| java.lang.Object com.flexive.extractor.htmlExtractor.HtmlExtractor
HtmlExtractor | public class HtmlExtractor (Code) | | HTML Text Extractor.
Part of the fleXive 3.X Framework
author: Gregor Schober (gregor.schober@flexive.com), UCS - unique computing solutions gmbh (http://www.ucs.at) |
META_CREATED | final protected static String META_CREATED(Code) | | |
META_LAST_MODIFIED | final protected static String META_LAST_MODIFIED(Code) | | |
convertSpecialHtmlChars | boolean convertSpecialHtmlChars(Code) | | |
HtmlExtractor | public HtmlExtractor(String html, boolean convertSpecialHtmlChars)(Code) | | Constructor.
Parameters: convertSpecialHtmlChars - if set to true, special HTML characters (e.g. German umlauts) are replaced with a form that is readable in text files. Parameters: html - the HTML to parse |
HtmlExtractor | public HtmlExtractor(InputStream in, boolean convertSpecialHtmlChars)(Code) | | Constructor.
Parameters: convertSpecialHtmlChars - if set to true, special HTML characters (e.g. German umlauts) are replaced with a form that is readable in text files. Parameters: in - the HTML to parse |
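The effect of convertSpecialHtmlChars is only described loosely above. The sketch below shows one plausible reading, in which named HTML entities for German umlauts are decoded to the characters they stand for. The class SpecialCharSketch, its entity table, and the convert method are hypothetical illustrations and are not part of HtmlExtractor; the real extractor may cover more entities or use a different replacement form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SpecialCharSketch {
    // Hypothetical subset of named entities (assumption: the real table is not documented here).
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&auml;", "\u00e4");
        ENTITIES.put("&ouml;", "\u00f6");
        ENTITIES.put("&uuml;", "\u00fc");
        ENTITIES.put("&Auml;", "\u00c4");
        ENTITIES.put("&Ouml;", "\u00d6");
        ENTITIES.put("&Uuml;", "\u00dc");
        ENTITIES.put("&szlig;", "\u00df");
        // Decoded last, so that an escaped "&amp;auml;" ends up as the literal text "&auml;".
        ENTITIES.put("&amp;", "&");
    }

    /** Replaces each known entity with its character, in insertion order of the table. */
    public static String convert(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : ENTITIES.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(convert("Gr&uuml;&szlig;e aus K&ouml;ln"));
    }
}
```

With the flag set to false, the equivalent behaviour in this sketch would be to return the input unchanged, leaving the entity names in the extracted text.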
appendTagText | protected void appendTagText(String text)(Code) | | |
getCharacterCount | public int getCharacterCount()(Code) | | The total number of characters in the HTML file.
the total number of characters in the HTML file |
getError | public Exception getError()(Code) | | Returns null if parsing succeeded, otherwise the parser error.
null if parsing succeeded, otherwise the parser error |
getTagText | public String getTagText()(Code) | | Returns the text extracted from tag attributes like 'title' and 'alt'.
the text extracted from tag attributes like 'title' and 'alt'. |
getText | public String getText()(Code) | | Returns the extracted text.
the extracted text |
getWordCount | public int getWordCount()(Code) | | Returns the number of words in the EXTRACTED text.
the number of words in the EXTRACTED text. |
hadError | public boolean hadError()(Code) | | Returns true if an error occurred during parsing. In this case only the
text extracted up to the error is returned.
true if an error occurred during parsing |
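The error-reporting methods above suggest a check-then-read pattern: consult hadError() before trusting getText(), because a failed parse still yields the text gathered up to the failure. The sketch below illustrates that pattern without the fleXive library on the classpath. The Extraction interface and the stub factory are hypothetical stand-ins that mirror only the documented getters, and the stub's word counting is an assumption rather than the extractor's actual rule.

```java
public class ExtractorUsageSketch {
    /** Hypothetical view of the documented getters, so the sketch compiles without fleXive. */
    public interface Extraction {
        boolean hadError();
        Exception getError();
        String getText();
        int getWordCount();
    }

    /** Summarises a result, flagging the text as partial when parsing failed. */
    public static String summarize(Extraction ex) {
        if (ex.hadError()) {
            // Only the text extracted up to the error is available here.
            return "partial (" + ex.getError().getMessage() + "): " + ex.getText();
        }
        return "complete, " + ex.getWordCount() + " words: " + ex.getText();
    }

    /** Stand-in result for demonstration; a null error means the parse succeeded. */
    public static Extraction stub(final String text, final Exception error) {
        return new Extraction() {
            public boolean hadError() { return error != null; }
            public Exception getError() { return error; }
            public String getText() { return text; }
            public int getWordCount() {
                // Assumption: words are whitespace-separated tokens.
                String trimmed = text.trim();
                return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(summarize(stub("Hello world", null)));
        System.out.println(summarize(stub("Hello", new IllegalStateException("unexpected end of input"))));
    }
}
```

In real use, the stub would presumably be replaced by an adapter around an instance created with new HtmlExtractor(html, true), delegating each method to the getter of the same name.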