| java.lang.Object org.w3c.tidy.Lexer
Lexer | public class Lexer (Code) | | Lexer for html parser.
Given a file stream fp it returns a sequence of tokens:
- GetToken(fp) gets the next token.
- UngetToken(fp) provides one level of undo.
The tags include an attribute list:
- linked list of attribute/value nodes;
- each node has 2 null-terminated strings;
- entities are replaced in attribute values.
White space is compacted if not in preformatted mode: leading white space is discarded and subsequent white space
sequences are compacted to single space chars.
If XmlTags is no then tag names are folded to upper case and attribute names to lower case.
Not yet done:
- Doctype subset and marked sections.
author: Dave Raggett dsr@w3.org
author: Andy Quick ac.quick@sympatico.ca (translation to Java)
author: Fabrizio Giustina
version: $Revision: 1.93 $ ($Author: fgiust $) |
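The white space rule in the class description can be illustrated with a small standalone sketch. It is not JTidy code: the class WhitespaceCompactor and its compact method are hypothetical names, and the set of characters treated as white space is an assumption. The wasWhite flag plays the role that the waswhite field is described as having ("used to collapse contiguous white space").

```java
/** Hypothetical helper, not part of JTidy: collapses white space as the class description outlines. */
final class WhitespaceCompactor {

    /**
     * Drops leading white space and folds each later run of white space
     * into one space character, unless preformatted is true.
     */
    static String compact(String text, boolean preformatted) {
        if (preformatted) {
            return text; // preformatted mode keeps white space untouched
        }
        StringBuilder out = new StringBuilder(text.length());
        // Starting "inside white space" is what discards the leading run.
        boolean wasWhite = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumed white space set: space, tab, LF, CR, form feed.
            boolean white = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
            if (white) {
                if (!wasWhite) {
                    out.append(' '); // first white char of a run becomes one space
                }
                wasWhite = true;
            } else {
                out.append(c);
                wasWhite = false;
            }
        }
        return out.toString();
    }
}
```

Note that a trailing run of white space is kept as a single space in this sketch; whether the real lexer trims it depends on context that this page does not describe.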
Field Summary | |
final public static short | IGNORE_MARKUP state: ignore markup. | final public static short | IGNORE_WHITESPACE state: ignore whitespace. | final public static short | MIXED_CONTENT state: mixed content. | final public static short | PREFORMATTED state: preformatted. | protected short | badAccess for accessibility errors. | protected short | badChars for bad char encodings. | protected boolean | badDoctype set if html or PUBLIC is missing. | protected short | badForm for mismatched/mispositioned form tags. | protected short | badLayout for bad style errors. | protected int | columns at start of current token. | protected Configuration | configuration configuration. | protected int | doctype version as given by doctype (if any). | protected short | errors count of errors. | protected PrintWriter | errout error output stream. | protected boolean | excludeBlocks Netscape compatibility. | protected boolean | exiled true if moved out of table. | protected StreamIn | in file stream. | protected Node | inode Inline stack for compatibility with Mosaic. | protected int | insert for inferring inline tags. | protected boolean | insertspace when space is moved after end tag. | protected Stack | istack stack. | protected int | istackbase start of frame. | protected boolean | isvoyager true if xmlns attribute on html element. | protected byte[] | lexbuf Lexer character buffer parse tree nodes span onto this buffer which contains the concatenated text contents of
all of the elements. | protected int | lexlength allocated. | protected int | lexsize used. | protected int | lines lines seen. | protected boolean | pushed true after token has been pushed back. | protected Report | report report. | protected Node | root Root node is saved here. | protected boolean | seenEndBody | protected boolean | seenEndHtml | protected short | state state of lexer's finite state machine. | protected Style | styles used for cleaning up presentation markup. | protected Node | token current node. | protected int | txtend end of current node. | protected int | txtstart start of current node. | protected short | versions bit vector of HTML versions. | protected short | warnings count of warnings in this document. | protected boolean | waswhite used to collapse contiguous white space. |
Method Summary | |
public void | addByte(int c) Adds a byte to lexer buffer. | public void | addCharToLexer(int c) Store char c as UTF-8 encoded byte stream. | public boolean | addGenerator(Node root) Add meta element for Tidy. | public void | addStringLiteral(String str) calls addCharToLexer for any char in the string. | void | addStringLiteralLen(String str, int len) calls addCharToLexer for any char in the string till len is reached. | public void | addStringToLexer(String str) Adds a string to lexer buffer. | public short | apparentVersion() Return the html version used in document. | public boolean | canPrune(Node element) | public void | changeChar(byte c) Substitute the last char in buffer. | public boolean | checkDocTypeKeyWords(Node doctype) Check system keywords (keywords should be uppercase). | public AttVal | cloneAttributes(AttVal attrs) Clones an attribute value and add eventual asp or php node to node list. | public Node | cloneNode(Node node) Clones a node and add it to node list. | void | constrainVersion(int vers) Constraint the html version in the document to the given one. | public void | deferDup() Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated. | public boolean | endOfInput() | public short | findGivenVersion(Node doctype) Examine DOCTYPE to identify version. | public boolean | fixDocType(Node root) Fixup doctype if missing. | public void | fixHTMLNameSpace(Node root, String profile) Fix xhtml namespace. | public void | fixId(Node node) duplicate name attribute as an id and check if id and name match. | public boolean | fixXmlDecl(Node root) Ensure XML document starts with <?XML version="1.0"?> . | public Node | getCDATA(Node container) Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some
foo. | public Node | getToken(short mode) Gets a token. | public short | htmlVersion() Choose what version to use for new doctype. | public String | htmlVersionName() Choose what version to use for new doctype. | public Node | inferredTag(String name) Generates and inserts a new node. | public int | inlineDup(Node node) This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P,
TD, TH, DIV, PRE etc. | public Node | insertedToken() | public static boolean | isCSS1Selector(String buf) In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they
cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a
numeric code (see next item). | public boolean | isPushed(Node node) | public static boolean | isValidAttrName(String attr) Check if attr is a valid name. | public Node | newLineNode() Adds a new line node. | public Node | newNode() Creates a new node and add it to nodelist. | public Node | newNode(short type, byte[] textarray, int start, int end) Creates a new node and add it to nodelist.
Parameters: type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE |Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node. | public Node | newNode(short type, byte[] textarray, int start, int end, String element) Creates a new node and add it to nodelist.
Parameters: type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE |Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node. | Node | newXhtmlDocTypeNode(Node root) Put DOCTYPE declaration between the <:?xml version "1.0" ... | public Node | parseAsp() parser for ASP within start tags Some people use ASP for to customize attributes Tidy isn't really well suited to
dealing with ASP This is a workaround for attributes, but won't deal with the case where the ASP is used to
tailor the attribute value. | public String | parseAttribute(boolean[] isempty, Node[] asp, Node[] php) consumes the '>' terminating start tags. | public AttVal | parseAttrs(boolean[] isempty) Parse tag attributes. | public void | parseEntity(short mode) Parse an html entity. | public Node | parsePhp() PHP is like ASP but is based upon XML processing instructions, e.g. | public int | parseServerInstruction() Invoked when < is seen in place of attribute value but terminates on whitespace if not ASP, PHP or Tango this
routine recognizes ' and " quoted strings. | public char | parseTagName() Parses a tag name. | public String | parseValue(String name, boolean foldCase, boolean[] isempty, int[] pdelim) Parse an attribute value. | public void | popInline(Node node) Pop a copy of an inline node from the stack. | protected boolean | preContent(Node node) | public void | pushInline(Node node) Push a copy of an inline node onto stack but don't push if implicit or OBJECT or APPLET (implicit tags are ones
generated from the istack) One issue arises with pushing inlines when the tag is already pushed. | public boolean | setXHTMLDocType(Node root) Adds a new xhtml doctype to the document. | public void | ungetToken() | protected void | updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray) Update oldtextarray in the current nodes. |
IGNORE_MARKUP | final public static short IGNORE_MARKUP(Code) | | state: ignore markup.
|
IGNORE_WHITESPACE | final public static short IGNORE_WHITESPACE(Code) | | state: ignore whitespace.
|
MIXED_CONTENT | final public static short MIXED_CONTENT(Code) | | state: mixed content.
|
PREFORMATTED | final public static short PREFORMATTED(Code) | | state: preformatted.
|
badAccess | protected short badAccess(Code) | | for accessibility errors.
|
badChars | protected short badChars(Code) | | for bad char encodings.
|
badDoctype | protected boolean badDoctype(Code) | | set if html or PUBLIC is missing.
|
badForm | protected short badForm(Code) | | for mismatched/mispositioned form tags.
|
badLayout | protected short badLayout(Code) | | for bad style errors.
|
columns | protected int columns(Code) | | at start of current token.
|
doctype | protected int doctype(Code) | | version as given by doctype (if any).
|
errors | protected short errors(Code) | | count of errors.
|
excludeBlocks | protected boolean excludeBlocks(Code) | | Netscape compatibility.
|
exiled | protected boolean exiled(Code) | | true if moved out of table.
|
inode | protected Node inode(Code) | | Inline stack for compatibility with Mosaic. For deferring text node.
|
insert | protected int insert(Code) | | for inferring inline tags.
|
insertspace | protected boolean insertspace(Code) | | when space is moved after end tag.
|
istackbase | protected int istackbase(Code) | | start of frame.
|
isvoyager | protected boolean isvoyager(Code) | | true if xmlns attribute on html element.
|
lexbuf | protected byte[] lexbuf(Code) | | Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of
all of the elements. lexsize must be reset for each file. Byte buffer of UTF-8 chars.
|
lexlength | protected int lexlength(Code) | | allocated.
|
lexsize | protected int lexsize(Code) | | used.
|
lines | protected int lines(Code) | | lines seen.
|
pushed | protected boolean pushed(Code) | | true after token has been pushed back.
|
root | protected Node root(Code) | | Root node is saved here.
|
seenEndBody | protected boolean seenEndBody(Code) | | already seen end body tag?
|
seenEndHtml | protected boolean seenEndHtml(Code) | | already seen end html tag?
|
state | protected short state(Code) | | state of lexer's finite state machine.
|
styles | protected Style styles(Code) | | used for cleaning up presentation markup.
|
txtend | protected int txtend(Code) | | end of current node.
|
txtstart | protected int txtstart(Code) | | start of current node.
|
versions | protected short versions(Code) | | bit vector of HTML versions.
|
warnings | protected short warnings(Code) | | count of warnings in this document.
|
waswhite | protected boolean waswhite(Code) | | used to collapse contiguous white space.
|
Lexer | public Lexer(StreamIn in, Configuration configuration, Report report)(Code) | | Instantiates a new Lexer.
Parameters: in - StreamIn Parameters: configuration - configuration instance Parameters: report - report instance, for reporting errors |
addByte | public void addByte(int c)(Code) | | Adds a byte to lexer buffer.
Parameters: c - byte to add |
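The relationship between lexbuf, lexlength (allocated), lexsize (used) and the txtstart/txtend offsets can be sketched as a growable byte buffer. This is a minimal illustration under assumptions, not the JTidy implementation: the class LexBufferSketch, the initial capacity of 8 and the doubling policy are all invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for the lexer buffer; names and growth policy are assumptions. */
final class LexBufferSketch {
    private byte[] lexbuf = new byte[8]; // allocated storage (its length mirrors "lexlength")
    private int lexsize = 0;             // number of bytes in use

    /** Appends the low 8 bits of c, growing the array when it is full. */
    void addByte(int c) {
        if (lexsize == lexbuf.length) {
            // Growing replaces the array object, so anything holding the old
            // array reference would need to be pointed at the new one.
            lexbuf = Arrays.copyOf(lexbuf, lexbuf.length * 2);
        }
        lexbuf[lexsize++] = (byte) c;
    }

    int size() {
        return lexsize;
    }

    int capacity() {
        return lexbuf.length;
    }

    /** Decodes the bytes in [start, end) as UTF-8, like a node spanning part of the buffer. */
    String text(int start, int end) {
        return new String(lexbuf, start, end - start, StandardCharsets.UTF_8);
    }
}
```

Because a node only stores offsets into the shared buffer, replacing the array on growth is presumably the situation updateNodeTextArrays exists for; that reading is an inference from its description, not something this page states.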
addCharToLexer | public void addCharToLexer(int c)(Code) | | Store char c as UTF-8 encoded byte stream.
Parameters: c - char to store |
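Storing a char "as UTF-8 encoded byte stream" means turning one code point into one to four bytes. The sketch below shows the standard UTF-8 bit layout; the class Utf8Sketch and the method encode are hypothetical names, and the real method appends to the lexer buffer rather than returning an array.

```java
/** Hypothetical illustration of UTF-8 encoding for a single code point. */
final class Utf8Sketch {

    /** Returns the UTF-8 bytes for code point c (surrogate values are not rejected in this sketch). */
    static byte[] encode(int c) {
        if (c < 0 || c > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + c);
        }
        if (c < 0x80) {
            // 0xxxxxxx
            return new byte[] {(byte) c};
        }
        if (c < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {(byte) (0xC0 | (c >> 6)), (byte) (0x80 | (c & 0x3F))};
        }
        if (c < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (c >> 18)),
            (byte) (0x80 | ((c >> 12) & 0x3F)),
            (byte) (0x80 | ((c >> 6) & 0x3F)),
            (byte) (0x80 | (c & 0x3F))
        };
    }
}
```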
addGenerator | public boolean addGenerator(Node root)(Code) | | Add meta element for Tidy. If the meta tag is already present, update release date.
Parameters: root - root node Returns: true if the tag has been added |
addStringLiteral | public void addStringLiteral(String str)(Code) | | calls addCharToLexer for any char in the string.
Parameters: str - input String |
addStringLiteralLen | void addStringLiteralLen(String str, int len)(Code) | | calls addCharToLexer for any char in the string till len is reached.
Parameters: str - input String Parameters: len - length of the substring to be added |
addStringToLexer | public void addStringToLexer(String str)(Code) | | Adds a string to lexer buffer.
Parameters: str - String to add |
apparentVersion | public short apparentVersion()(Code) | | Return the html version used in document.
Returns: version code |
canPrune | public boolean canPrune(Node element)(Code) | | Can the given element be removed?
Parameters: element - node Returns: true if the element can be removed |
changeChar | public void changeChar(byte c)(Code) | | Substitute the last char in buffer.
Parameters: c - new char |
checkDocTypeKeyWords | public boolean checkDocTypeKeyWords(Node doctype)(Code) | | Check system keywords (keywords should be uppercase).
Parameters: doctype - doctype node Returns: true if doctype keywords are all uppercase |
cloneAttributes | public AttVal cloneAttributes(AttVal attrs)(Code) | | Clones an attribute value and adds any asp or php node to the node list.
Parameters: attrs - original AttVal Returns: cloned AttVal |
cloneNode | public Node cloneNode(Node node)(Code) | | Clones a node and adds it to the node list.
Parameters: node - Node Returns: cloned Node |
constrainVersion | void constrainVersion(int vers)(Code) | | Constrains the html version in the document to the given one. Everything is allowed in the proprietary version of
HTML; this is handled here rather than in the tag/attr dicts.
Parameters: vers - html version code |
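Since versions is described as a "bit vector of HTML versions", constraining it can be pictured as a bitwise AND. The sketch below is an assumption-laden illustration, not the JTidy source: the class VersionMaskSketch, the constant names and their bit values are invented, and keeping the proprietary bit set on every call is only one possible reading of "everything is allowed in proprietary version of HTML".

```java
/** Hypothetical model of a version bit vector; constants and values are assumptions. */
final class VersionMaskSketch {
    static final int HTML20 = 1;
    static final int HTML32 = 2;
    static final int HTML40_STRICT = 4;
    static final int HTML40_LOOSE = 8;
    static final int PROPRIETARY = 16;
    static final int ALL = HTML20 | HTML32 | HTML40_STRICT | HTML40_LOOSE | PROPRIETARY;

    private int versions = ALL; // every version is possible until something rules it out

    /** Keeps only the versions named in vers, plus the proprietary bit. */
    void constrainVersion(int vers) {
        versions &= (vers | PROPRIETARY);
    }

    /** True if at least one of the given version bits is still possible. */
    boolean allows(int vers) {
        return (versions & vers) != 0;
    }
}
```

For example, a tag that exists only in HTML 3.2 and HTML 4.0 loose would call constrainVersion(HTML32 | HTML40_LOOSE), after which allows(HTML20) is false.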
deferDup | public void deferDup()(Code) | | Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
|
endOfInput | public boolean endOfInput()(Code) | | Has end of input stream been reached?
Returns: true if the end of the input stream has been reached |
findGivenVersion | public short findGivenVersion(Node doctype)(Code) | | Examine DOCTYPE to identify version.
Parameters: doctype - doctype node Returns: version code |
fixDocType | public boolean fixDocType(Node root)(Code) | | Fixup doctype if missing.
Parameters: root - root node Returns: false if current version has not been identified |
fixHTMLNameSpace | public void fixHTMLNameSpace(Node root, String profile)(Code) | | Fix xhtml namespace.
Parameters: root - root Node Parameters: profile - current profile |
fixId | public void fixId(Node node)(Code) | | Duplicates the name attribute as an id and checks whether id and name match.
Parameters: node - Node to check for name/id attributes |
fixXmlDecl | public boolean fixXmlDecl(Node root)(Code) | | Ensure XML document starts with <?XML version="1.0"?>. Add encoding attribute if not using
ASCII or UTF-8 output.
Parameters: root - root node Returns: always true |
getCDATA | public Node getCDATA(Node container)(Code) | | Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some
foo.
Parameters: container - container node Returns: cdata node |
getToken | public Node getToken(short mode)(Code) | | Gets a token.
Parameters: mode - one of the following:
- MixedContent -- for elements which don't accept PCDATA
- Preformatted -- white space preserved as is
- IgnoreMarkup -- for CDATA elements such as script, style
Returns: next Node |
htmlVersion | public short htmlVersion()(Code) | | Choose what version to use for new doctype.
Returns: html version constant |
htmlVersionName | public String htmlVersionName()(Code) | | Choose what version to use for new doctype.
Returns: html version name |
inferredTag | public Node inferredTag(String name)(Code) | | Generates and inserts a new node.
Parameters: name - tag name Returns: generated node |
inlineDup | public int inlineDup(Node node)(Code) | | This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P,
TD, TH, DIV, PRE etc. This procedure is called at the start of ParseBlock when the inline stack is not empty, as
will be the case in:
<i><h1>italic heading</h1></i>
which is then treated as equivalent to
<h1><i>italic heading</i></h1>
This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from
the input stream.
Parameters: node - original node Returns: stack size |
isCSS1Selector | public static boolean isCSS1Selector(String buf)(Code) | | In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they
cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a
numeric code (see next item). The backslash followed by at most four hexadecimal digits (0..9A..F) stands for the
Unicode character with that number. Any character except a hexadecimal digit can be escaped to remove its special
meaning, by putting a backslash in front.
Parameters: buf - css selector name Returns: true if the given string is a valid css1 selector name |
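The rules quoted above can be turned into a small checker. The following is a sketch of those rules as literally stated, not a copy of JTidy's isCSS1Selector: the class CssNameSketch and the method isCss1Name are hypothetical, and three details are assumptions made for the example (lower-case a-z is accepted alongside A-Z, an escape may appear as the first character, and the empty string is rejected).

```java
/** Hypothetical checker for the CSS1 name rules quoted in the description. */
final class CssNameSketch {

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    static boolean isCss1Name(String s) {
        if (s.isEmpty()) {
            return false; // assumption: an empty name is not valid
        }
        int i = 0;
        boolean first = true;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++;
                if (i >= s.length()) {
                    return false; // dangling backslash
                }
                if (isHex(s.charAt(i))) {
                    // Numeric escape: backslash followed by at most four hex digits.
                    int digits = 0;
                    while (i < s.length() && digits < 4 && isHex(s.charAt(i))) {
                        i++;
                        digits++;
                    }
                } else {
                    i++; // any other single character may be escaped
                }
            } else {
                boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
                boolean digit = c >= '0' && c <= '9';
                boolean high = c >= 161 && c <= 255;
                if (first) {
                    if (!(letter || high)) {
                        return false; // cannot start with a dash or a digit
                    }
                } else if (!(letter || digit || high || c == '-')) {
                    return false;
                }
                i++;
            }
            first = false;
        }
        return true;
    }
}
```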
isPushed | public boolean isPushed(Node node)(Code) | | Is the node in the stack?
Parameters: node - Node Returns: true if the node is found in the stack |
isValidAttrName | public static boolean isValidAttrName(String attr)(Code) | | Check if attr is a valid name.
Parameters: attr - String to check, must be non-null Returns: true if attr is a valid name. |
newLineNode | public Node newLineNode()(Code) | | Adds a new line node. Used for creating preformatted text from Word2000.
Returns: new line node |
newNode | public Node newNode()(Code) | | Creates a new node and adds it to the nodelist.
Returns: Node |
newNode | public Node newNode(short type, byte[] textarray, int start, int end)(Code) | | Creates a new node and adds it to the nodelist.
Parameters: type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL Parameters: textarray - array of bytes contained in the Node Parameters: start - start position Parameters: end - end position Returns: Node |
newNode | public Node newNode(short type, byte[] textarray, int start, int end, String element)(Code) | | Creates a new node and adds it to the nodelist.
Parameters: type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL Parameters: textarray - array of bytes contained in the Node Parameters: start - start position Parameters: end - end position Parameters: element - tag name Returns: Node |
newXhtmlDocTypeNode | Node newXhtmlDocTypeNode(Node root)(Code) | | Put DOCTYPE declaration between the <?xml version="1.0" ... ?> declaration, if any, and the
html tag. Should also work for any comments, etc. that may precede the html tag.
Parameters: root - root node Returns: new doctype node |
parseAsp | public Node parseAsp()(Code) | | Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to
dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to
tailor the attribute value. Here is an example of a workaround for using ASP in attribute values:
href='<%=rsSchool.Fields("ID").Value%>'
where the ASP that generates the attribute value is masked from Tidy by the quotemarks.
Returns: parsed Node |
parseAttribute | public String parseAttribute(boolean[] isempty, Node[] asp, Node[] php)(Code) | | consumes the '>' terminating start tags.
Parameters: isempty - flag is passed as array so it can be modified Parameters: asp - asp Node, passed as array so it can be modified Parameters: php - php Node, passed as array so it can be modified Returns: parsed attribute |
parseAttrs | public AttVal parseAttrs(boolean[] isempty)(Code) | | Parse tag attributes.
Parameters: isempty - is tag empty? Returns: parsed attribute/value list |
parseEntity | public void parseEntity(short mode)(Code) | | Parse an html entity.
Parameters: mode - mode |
parsePhp | public Node parsePhp()(Code) | | PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.
Returns: parsed Node |
parseServerInstruction | public int parseServerInstruction()(Code) | | Invoked when < is seen in place of attribute value, but terminates on whitespace if not ASP, PHP or Tango. This
routine recognizes ' and " quoted strings.
Returns: delimiter |
parseTagName | public char parseTagName()(Code) | | Parses a tag name.
Returns: first char after the tag name |
parseValue | public String parseValue(String name, boolean foldCase, boolean[] isempty, int[] pdelim)(Code) | | Parse an attribute value.
Parameters: name - attribute name Parameters: foldCase - fold case? Parameters: isempty - is attribute empty? Passed as an array reference to allow modification Parameters: pdelim - delimiter, passed as an array reference to allow modification Returns: parsed value |
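Java has no by-reference parameters, which is why isempty and pdelim are one-element arrays: the callee writes to index 0 and the caller reads it back afterwards. The sketch below shows only that idiom. The class OutParamSketch, the method readValue and the semantics given to the two flags are hypothetical and much simpler than the real attribute parser.

```java
/** Hypothetical illustration of one-element arrays used as output parameters. */
final class OutParamSketch {

    /**
     * Splits "name=value" and returns the value with surrounding quotes removed.
     * isempty[0] is set to true when there is no '=' at all (value missing);
     * pdelim[0] receives the quote character used, or 0 if the value was unquoted.
     */
    static String readValue(String attribute, boolean[] isempty, int[] pdelim) {
        isempty[0] = false;
        pdelim[0] = 0;
        int eq = attribute.indexOf('=');
        if (eq < 0) {
            isempty[0] = true; // report the extra result through the array
            return null;
        }
        String raw = attribute.substring(eq + 1);
        if (raw.length() >= 2) {
            char q = raw.charAt(0);
            if ((q == '"' || q == '\'') && raw.charAt(raw.length() - 1) == q) {
                pdelim[0] = q;
                return raw.substring(1, raw.length() - 1);
            }
        }
        return raw;
    }
}
```

A caller allocates the holders once, for example `boolean[] isempty = new boolean[1]; int[] pdelim = new int[1];`, passes them in, and inspects `isempty[0]` and `pdelim[0]` after the call.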
popInline | public void popInline(Node node)(Code) | | Pop a copy of an inline node from the stack.
Parameters: node - Node to be popped |
preContent | protected boolean preContent(Node node)(Code) | | Is content acceptable for pre elements?
Parameters: node - content Returns: true if node is acceptable in pre elements |
pushInline | public void pushInline(Node node)(Code) | | Push a copy of an inline node onto stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones
generated from the istack). One issue arises with pushing inlines when the tag is already pushed. For instance:
<p><em> text <p><em> more text
shouldn't be mapped to
<p><em> text </em></p><p><em><em> more text </em></em>
Parameters: node - Node to be pushed |
setXHTMLDocType | public boolean setXHTMLDocType(Node root)(Code) | | Adds a new xhtml doctype to the document.
Parameters: root - root node Returns: true if a doctype has been added |
ungetToken | public void ungetToken()(Code) | | |
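The class description says that UngetToken "provides one level undo", and the pushed field is documented as "true after token has been pushed back". A minimal sketch of that one-slot pushback pattern follows. It is an assumption about the general technique, not the JTidy implementation: the class PushbackTokens is hypothetical and uses strings in place of Node objects.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical token reader with a single level of undo. */
final class PushbackTokens {
    private final Iterator<String> source;
    private String token;   // most recently returned token
    private boolean pushed; // true after the token has been pushed back

    PushbackTokens(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Returns the pushed-back token if there is one, otherwise the next token or null at the end. */
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = source.hasNext() ? source.next() : null;
        return token;
    }

    /** Makes the next getToken() call return the current token again. */
    void ungetToken() {
        pushed = true;
    }
}
```

Calling ungetToken twice in a row has the same effect as calling it once, which is what "one level" of undo implies: only the most recent token can be replayed.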
updateNodeTextArrays | protected void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)(Code) | | Update oldtextarray in the current nodes.
Parameters: oldtextarray - previous text array Parameters: newtextarray - new text array |
|
|