| java.lang.Object org.w3c.tidy.Clean
Clean | public class Clean | | Clean up misuse of presentation markup. Filters from other formats such as Microsoft Word often make excessive use of
presentation markup such as font tags, B, I, and the align attribute. By applying a set of production rules, it is
straightforward to transform this markup to use CSS. Some rules replace some of the children of an element by style
properties on the element, e.g.
...
.
...
Such rules are applied to the element's content and then to the element itself until no more rules apply.
Once all the rules have been applied, the element has a style attribute with one or more properties. Other rules
strip the element they apply to, replacing it by style properties on the contents, e.g.
...
.
... These rules are applied to an element before processing its content and replace the current element by the first
element in the exposed content. After applying both sets of rules, you can replace the style attribute by a class
value and style rule in the document head. To support this, an association of styles and class names is built. A
naive approach is to rely on string matching to test when two property lists are the same. A better approach would be
to first sort the properties before matching.
author: Dave Raggett dsr@w3.org
author: Andy Quick ac.quick@sympatico.ca (translation to Java)
author: Fabrizio Giustina
version: $Revision: 1.25 $ ($Author: fgiust $) |
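The class description suggests sorting the properties before matching, so that two style attributes listing the same properties in a different order share one class. The following is a minimal sketch of that idea, not JTidy's implementation: the class `StyleClassTable`, its methods, and the generated class names `c1`, `c2`, ... are all hypothetical, and the parser assumes simple `name: value` declarations with no semicolons inside values.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of JTidy): maps equivalent style strings to one class name. */
class StyleClassTable {
    private final Map<String, String> classByStyle = new LinkedHashMap<>();

    /** Rebuilds the style string with its declarations sorted by property name. */
    static String normalize(String style) {
        Map<String, String> props = new TreeMap<>();
        // Assumption: values contain no ';' (no quoted strings or data URLs).
        for (String decl : style.split(";")) {
            int colon = decl.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed declarations
            }
            String name = decl.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = decl.substring(colon + 1).trim();
            if (!name.isEmpty()) {
                props.put(name, value);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(e.getKey()).append(": ").append(e.getValue());
        }
        return sb.toString();
    }

    /** Returns the class name for a style, allocating a new one on first use. */
    String classFor(String style) {
        String key = normalize(style);
        String name = classByStyle.get(key);
        if (name == null) {
            name = "c" + (classByStyle.size() + 1); // assumed naming scheme
            classByStyle.put(key, name);
        }
        return name;
    }
}
```

With this table, `classFor("color: red; font-size: 12pt")` and `classFor("font-size: 12pt; color: red")` return the same class name, which plain string matching on the raw attribute values would not.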
Constructor Summary | |
public | Clean(TagTable tagTable) Instantiates a new Clean. |
Method Summary | |
public void | bQ2Div(Node node) Replace implicit blockquote by div with an indent taking care to reduce nested blockquotes to a single div with the indent set to match the nesting depth. |
static void | bumpObject(Lexer lexer, Node html) Where appropriate move object elements from head to body. |
public void | cleanTree(Lexer lexer, Node doc) Clean an html tree. |
public void | cleanWord2000(Lexer lexer, Node node) This is a major clean up to strip out all the extra stuff you get when you save as web page from Word 2000. |
public void | dropSections(Lexer lexer, Node node) Drop if/endif sections inserted by word2000. |
public void | emFromI(Node node) Replace i by em and b by strong. |
Node | findEnclosingCell(Node node) Find the enclosing table cell for the given node. |
public boolean | isWord2000(Node root) Check if the current document is a converted Word document. |
public void | list2BQ(Node node) Some people use dir or ul without an li to indent the content. |
public void | nestedEmphasis(Node node) simplifies ... |
boolean | noMargins(Node node) Used to hunt for hidden preformatted sections. |
public Node | pruneSection(Lexer lexer, Node node) node is <![if ...]> prune up to <![endif]> . |
public void | purgeWord2000Attributes(Node node) Remove word2000 attributes from node. |
boolean | singleSpace(Lexer lexer, Node node) |
public Node | stripSpan(Lexer lexer, Node span) Word2000 uses span excessively, so we strip span out. |
Clean | public Clean(TagTable tagTable) | | Instantiates a new Clean.
Parameters: tagTable - tag table instance |
bQ2Div | public void bQ2Div(Node node) | | Replace implicit blockquote by div with an indent taking care to reduce nested blockquotes to a single div with
the indent set to match the nesting depth.
Parameters: node - root Node |
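To illustrate how nested blockquotes can collapse into a single div whose indent matches the nesting depth, here is a small self-contained sketch. It does not use JTidy's `Node` class: `BqNode`, `BlockquoteToDiv`, and the choice of 2em per nesting level with a `margin-left` property are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parse-tree node; JTidy's own Node class is different. */
class BqNode {
    String name;
    String style;     // value of the style attribute, or null if absent
    boolean implicit; // true when the parser inferred the element
    final List<BqNode> children = new ArrayList<>();

    BqNode(String name, boolean implicit) {
        this.name = name;
        this.implicit = implicit;
    }

    BqNode add(BqNode child) {
        children.add(child);
        return this;
    }
}

class BlockquoteToDiv {
    // Assumption: each nesting level indents by 2em.
    static final int EM_PER_LEVEL = 2;

    static boolean isImplicitBlockquote(BqNode n) {
        return n.implicit && "blockquote".equals(n.name);
    }

    /** Rewrites implicit blockquotes below (and including) node as indented divs. */
    static void convert(BqNode node) {
        if (isImplicitBlockquote(node)) {
            int depth = 1;
            // Absorb a chain of blockquotes that each wrap exactly one implicit blockquote.
            while (node.children.size() == 1 && isImplicitBlockquote(node.children.get(0))) {
                BqNode inner = node.children.get(0);
                node.children.clear();
                node.children.addAll(inner.children);
                depth++;
            }
            String indent = "margin-left: " + (EM_PER_LEVEL * depth) + "em";
            node.style = (node.style == null) ? indent : node.style + "; " + indent;
            node.name = "div";
            node.implicit = false;
        }
        for (BqNode child : node.children) {
            convert(child);
        }
    }
}
```

In this sketch, two nested implicit blockquotes around a paragraph become one div with `margin-left: 4em`, while a blockquote written explicitly by the author is left untouched.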
bumpObject | static void bumpObject(Lexer lexer, Node html) | | Where appropriate move object elements from head to body.
Parameters: lexer - Lexer Parameters: html - html node |
cleanTree | public void cleanTree(Lexer lexer, Node doc) | | Clean an html tree.
Parameters: lexer - Lexer Parameters: doc - root node |
cleanWord2000 | public void cleanWord2000(Lexer lexer, Node node) | | This is a major clean up to strip out all the extra stuff you get when you save as web page from Word 2000. It
doesn't yet know what to do with VML tags, but these will appear as errors unless you declare them as new tags,
such as o:p which needs to be declared as inline.
Parameters: lexer - Lexer Parameters: node - node to clean up |
dropSections | public void dropSections(Lexer lexer, Node node) | | Drop if/endif sections inserted by word2000.
Parameters: lexer - Lexer Parameters: node - Node root node |
emFromI | public void emFromI(Node node) | | Replace i by em and b by strong.
Parameters: node - root Node |
findEnclosingCell | Node findEnclosingCell(Node node) | | Find the enclosing table cell for the given node.
Parameters: node - Node Returns: enclosing cell node |
isWord2000 | public boolean isWord2000(Node root) | | Check if the current document is a converted Word document.
Parameters: root - root Node Returns: true if the document has been generated by Microsoft Word. |
list2BQ | public void list2BQ(Node node) | | Some people use dir or ul without an li to indent the content. The pattern to look for is a list with a single
implicit li. This is recursively replaced by an implicit blockquote.
Parameters: node - root Node |
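The pattern described for list2BQ can be sketched on a toy tree as follows. This is an illustration under stated assumptions rather than JTidy's code: `LNode` and `ListToBlockquote` are hypothetical names, and only `ul` and `dir` are treated as indent lists because those are the elements the description mentions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parse-tree node; JTidy's own Node class is different. */
class LNode {
    String name;
    boolean implicit; // true when the parser inferred the element
    final List<LNode> children = new ArrayList<>();

    LNode(String name, boolean implicit) {
        this.name = name;
        this.implicit = implicit;
    }

    LNode add(LNode child) {
        children.add(child);
        return this;
    }
}

class ListToBlockquote {
    // Assumption: only the two list elements named in the description are considered.
    static boolean isIndentList(LNode n) {
        return "ul".equals(n.name) || "dir".equals(n.name);
    }

    /** Replaces each list holding a single implicit li with an implicit blockquote. */
    static void convert(LNode node) {
        if (isIndentList(node) && node.children.size() == 1) {
            LNode only = node.children.get(0);
            if ("li".equals(only.name) && only.implicit) {
                // Splice the li's content directly under the former list element.
                node.children.clear();
                node.children.addAll(only.children);
                node.name = "blockquote";
                node.implicit = true;
            }
        }
        // Recurse so that nested indent lists are converted as well.
        for (LNode child : node.children) {
            convert(child);
        }
    }
}
```

A list whose li was written explicitly by the author does not match the pattern and is left as a list.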
nestedEmphasis | public void nestedEmphasis(Node node) | | simplifies ... ... etc.
Parameters: node - root Node |
noMargins | boolean noMargins(Node node) | | Used to hunt for hidden preformatted sections.
Parameters: node - checked node Returns: true if the node has a "margin-top: 0" or "margin-bottom: 0" style |
pruneSection | public Node pruneSection(Lexer lexer, Node node) | | node is <![if ...]> prune up to <![endif]> .
Parameters: lexer - Lexer Parameters: node - Node Returns: cleaned up Node |
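The pruning of if/endif sections can be pictured on a flat list of tokens. The sketch below is a simplified model and not JTidy's implementation, which works on its own node tree: `SectionPruner`, the string token representation, and the handling of nested sections by recursion are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token-level model of dropping <![if ...]> ... <![endif]> sections. */
class SectionPruner {
    static boolean isIf(String token) {
        return token.startsWith("<![if");
    }

    static boolean isEndif(String token) {
        return token.startsWith("<![endif");
    }

    /**
     * tokens.get(start) is an if marker. Returns the index of the first token
     * after the matching endif, or tokens.size() if the section is unterminated.
     */
    static int pruneSection(List<String> tokens, int start) {
        int i = start + 1;
        while (i < tokens.size()) {
            String token = tokens.get(i);
            if (isIf(token)) {
                i = pruneSection(tokens, i); // skip a nested section as a unit
                continue;
            }
            i++;
            if (isEndif(token)) {
                break;
            }
        }
        return i;
    }

    /** Returns a copy of tokens with every if/endif section removed. */
    static List<String> dropSections(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String token = tokens.get(i);
            if (isIf(token)) {
                i = pruneSection(tokens, i);
            } else {
                kept.add(token);
                i++;
            }
        }
        return kept;
    }
}
```

In this model, `dropSections` plays the role of the caller that walks the content and hands each opening marker to `pruneSection`, which reports where the walk should resume.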
purgeWord2000Attributes | public void purgeWord2000Attributes(Node node) | | Remove word2000 attributes from node.
Parameters: node - node to clean up |
singleSpace | boolean singleSpace(Lexer lexer, Node node) | | Does element have a single space as its content?
Parameters: lexer - Lexer Parameters: node - checked node Returns: true if the element has a single space as its content |
stripSpan | public Node stripSpan(Lexer lexer, Node span) | | Word2000 uses span excessively, so we strip span out.
Parameters: lexer - Lexer Parameters: span - Node span Returns: cleaned node |