Java Doc for Lexer.java in » HTML-Parser » JTidy » org » w3c » tidy

java.lang.Object

org.w3c.tidy.Lexer

Lexer
public class Lexer
	Lexer for html parser. Given a file stream fp it returns a sequence of tokens. GetToken(fp) gets the next token; UngetToken(fp) provides one level of undo.
	The tags include an attribute list:
	- linked list of attribute/value nodes
	- each node has 2 null-terminated strings
	- entities are replaced in attribute values
	White space is compacted if not in preformatted mode. If not in preformatted mode then leading white space is discarded and subsequent white space sequences are compacted to single space chars.
	If XmlTags is no then tag names are folded to upper case and attribute names to lower case.
	Not yet done:
	- Doctype subset and marked sections
	author: Dave Raggett dsr@w3.org
	author: Andy Quick ac.quick@sympatico.ca (translation to Java)
	author: Fabrizio Giustina
	version: $Revision: 1.93 $ ($Author: fgiust $)
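	The white space rule above can be illustrated with a minimal sketch. It is not JTidy's code: the class and method names are hypothetical, and the local `wasWhite` flag only mirrors the idea behind the `waswhite` field.

```java
// Hypothetical illustration of white space compaction outside preformatted mode.
final class WhitespaceSketch {
    /** Drops leading white space and collapses each later run to a single space. */
    static String compact(String text) {
        StringBuilder out = new StringBuilder();
        boolean wasWhite = true; // starts true so that leading white space is discarded
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                if (!wasWhite) {
                    out.append(' '); // first white space char of a run becomes one space
                }
                wasWhite = true;
            } else {
                out.append(c);
                wasWhite = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + compact("  Hello \n\t world  ") + "]");
    }
}
```

	Note that a trailing run is also reduced to one space rather than removed, since the description only says that leading white space is discarded.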

Field Summary
final public static short	IGNORE_MARKUP state: ignore markup.
final public static short	IGNORE_WHITESPACE state: ignore whitespace.
final public static short	MIXED_CONTENT state: mixed content.
final public static short	PREFORMATTED state: preformatted.
protected short	badAccess for accessibility errors.
protected short	badChars for bad char encodings.
protected boolean	badDoctype set if html or PUBLIC is missing.
protected short	badForm for mismatched/mispositioned form tags.
protected short	badLayout for bad style errors.
protected int	columns at start of current token.
protected Configuration	configuration configuration.
protected int	doctype version as given by doctype (if any).
protected short	errors count of errors.
protected PrintWriter	errout error output stream.
protected boolean	excludeBlocks Netscape compatibility.
protected boolean	exiled true if moved out of table.
protected StreamIn	in file stream.
protected Node	inode Inline stack for compatibility with Mosaic.
protected int	insert for inferring inline tags.
protected boolean	insertspace when space is moved after end tag.
protected Stack	istack stack.
protected int	istackbase start of frame.
protected boolean	isvoyager true if xmlns attribute on html element.
protected byte[]	lexbuf Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements.
protected int	lexlength allocated.
protected int	lexsize used.
protected int	lines lines seen.
protected boolean	pushed true after token has been pushed back.
protected Report	report report.
protected Node	root Root node is saved here.
protected boolean	seenEndBody
protected boolean	seenEndHtml
protected short	state state of lexer's finite state machine.
protected Style	styles used for cleaning up presentation markup.
protected Node	token current node.
protected int	txtend end of current node.
protected int	txtstart start of current node.
protected short	versions bit vector of HTML versions.
protected short	warnings count of warnings in this document.
protected boolean	waswhite used to collapse contiguous white space.

Constructor Summary
public	Lexer(StreamIn in, Configuration configuration, Report report) Instantiates a new Lexer.

Method Summary
public void	addByte(int c) Adds a byte to lexer buffer.
public void	addCharToLexer(int c) Store char c as UTF-8 encoded byte stream.
public boolean	addGenerator(Node root) Add meta element for Tidy.
public void	addStringLiteral(String str) Calls addCharToLexer for each char in the string.
void	addStringLiteralLen(String str, int len) Calls addCharToLexer for each char in the string until len is reached.
public void	addStringToLexer(String str) Adds a string to lexer buffer.
public short	apparentVersion() Return the html version used in document.
public boolean	canPrune(Node element)
public void	changeChar(byte c) Substitute the last char in buffer.
public boolean	checkDocTypeKeyWords(Node doctype) Check system keywords (keywords should be uppercase).
public AttVal	cloneAttributes(AttVal attrs) Clones an attribute value and adds any ASP or PHP node to the node list.
public Node	cloneNode(Node node) Clones a node and adds it to the node list.
void	constrainVersion(int vers) Constrain the html version in the document to the given one.
public void	deferDup() Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
public boolean	endOfInput()
public short	findGivenVersion(Node doctype) Examine DOCTYPE to identify version.
public boolean	fixDocType(Node root) Fixup doctype if missing.
public void	fixHTMLNameSpace(Node root, String profile) Fix xhtml namespace.
public void	fixId(Node node) Duplicate name attribute as an id and check if id and name match.
public boolean	fixXmlDecl(Node root) Ensure XML document starts with `<?XML version="1.0"?>`.
public Node	getCDATA(Node container) Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
public Node	getToken(short mode) Gets a token.
public short	htmlVersion() Choose what version to use for new doctype.
public String	htmlVersionName() Choose what version to use for new doctype.
public Node	inferredTag(String name) Generates and inserts a new node.
public int	inlineDup(Node node) This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc.
public Node	insertedToken()
public static boolean	isCSS1Selector(String buf) In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item).
public boolean	isPushed(Node node)
public static boolean	isValidAttrName(String attr) Check if attr is a valid name.
public Node	newLineNode() Adds a new line node.
public Node	newNode() Creates a new node and adds it to the node list.
public Node	newNode(short type, byte[] textarray, int start, int end) Creates a new node and adds it to the node list.
public Node	newNode(short type, byte[] textarray, int start, int end, String element) Creates a new node and adds it to the node list.
Node	newXhtmlDocTypeNode(Node root) Put DOCTYPE declaration between the <?xml version "1.0" ...
public Node	parseAsp() Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to tailor the attribute value.
public String	parseAttribute(boolean[] isempty, Node[] asp, Node[] php) consumes the '>' terminating start tags.
public AttVal	parseAttrs(boolean[] isempty) Parse tag attributes.
public void	parseEntity(short mode) Parse an html entity.
public Node	parsePhp() PHP is like ASP but is based upon XML processing instructions, e.g.
public int	parseServerInstruction() Invoked when < is seen in place of attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
public char	parseTagName() Parses a tag name.
public String	parseValue(String name, boolean foldCase, boolean[] isempty, int[] pdelim) Parse an attribute value.
public void	popInline(Node node) Pop a copy of an inline node from the stack.
protected boolean	preContent(Node node)
public void	pushInline(Node node) Push a copy of an inline node onto stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed.
public boolean	setXHTMLDocType(Node root) Adds a new xhtml doctype to the document.
public void	ungetToken()
protected void	updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray) Update `oldtextarray` in the current nodes.

Field Detail

IGNORE_MARKUP
final public static short IGNORE_MARKUP
	state: ignore markup.

IGNORE_WHITESPACE
final public static short IGNORE_WHITESPACE
	state: ignore whitespace.

MIXED_CONTENT
final public static short MIXED_CONTENT
	state: mixed content.

PREFORMATTED
final public static short PREFORMATTED
	state: preformatted.

badAccess
protected short badAccess
	for accessibility errors.

badChars
protected short badChars
	for bad char encodings.

badDoctype
protected boolean badDoctype
	set if html or PUBLIC is missing.

badForm
protected short badForm
	for mismatched/mispositioned form tags.

badLayout
protected short badLayout
	for bad style errors.

columns
protected int columns
	at start of current token.

configuration
protected Configuration configuration
	configuration.

doctype
protected int doctype
	version as given by doctype (if any).

errors
protected short errors
	count of errors.

errout
protected PrintWriter errout
	error output stream.

excludeBlocks
protected boolean excludeBlocks
	Netscape compatibility.

exiled
protected boolean exiled
	true if moved out of table.

in
protected StreamIn in
	file stream.

inode
protected Node inode
	Inline stack for compatibility with Mosaic. For deferring text node.

insert
protected int insert
	for inferring inline tags.

insertspace
protected boolean insertspace
	when space is moved after end tag.

istack
protected Stack istack
	stack.

istackbase
protected int istackbase
	start of frame.

isvoyager
protected boolean isvoyager
	true if xmlns attribute on html element.

lexbuf
protected byte[] lexbuf
	Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements. Lexsize must be reset for each file. Byte buffer of UTF-8 chars.

lexlength
protected int lexlength
	allocated.

lexsize
protected int lexsize
	used.

lines
protected int lines
	lines seen.

pushed
protected boolean pushed
	true after token has been pushed back.

report
protected Report report
	report.

root
protected Node root
	Root node is saved here.

seenEndBody
protected boolean seenEndBody
	already seen end body tag?

seenEndHtml
protected boolean seenEndHtml
	already seen end html tag?

state
protected short state
	state of lexer's finite state machine.

styles
protected Style styles
	used for cleaning up presentation markup.

token
protected Node token
	current node.

txtend
protected int txtend
	end of current node.

txtstart
protected int txtstart
	start of current node.

versions
protected short versions
	bit vector of HTML versions.

warnings
protected short warnings
	count of warnings in this document.

waswhite
protected boolean waswhite
	used to collapse contiguous white space.

Constructor Detail

Lexer
public Lexer(StreamIn in, Configuration configuration, Report report)
	Instantiates a new Lexer.
	Parameters: in - StreamIn
	Parameters: configuration - configuration instance
	Parameters: report - report instance, for reporting errors

Method Detail

addByte
public void addByte(int c)
	Adds a byte to lexer buffer.
	Parameters: c - byte to add
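	As an illustration of how a byte buffer with separate "allocated" and "used" counts (compare `lexlength` and `lexsize`) can grow on demand, here is a minimal sketch. The class, its initial capacity and the doubling policy are assumptions and are not taken from the JTidy source.

```java
import java.util.Arrays;

// Hypothetical growable byte buffer; not JTidy's implementation.
final class ByteBufferSketch {
    private byte[] buf = new byte[4]; // plays the role of lexbuf; buf.length is the allocated size
    private int size = 0;             // number of bytes used so far

    /** Appends the low 8 bits of c, doubling the array when it is full. */
    void addByte(int c) {
        if (size == buf.length) {
            buf = Arrays.copyOf(buf, buf.length * 2);
        }
        buf[size++] = (byte) c;
    }

    int size() {
        return size;
    }

    int capacity() {
        return buf.length;
    }

    /** Returns a copy of the used part of the buffer. */
    byte[] toArray() {
        return Arrays.copyOf(buf, size);
    }
}
```

	Because growing replaces the underlying array, anything that holds a reference to the old array has to be pointed at the new one, which is presumably the job of `updateNodeTextArrays`.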

addCharToLexer
public void addCharToLexer(int c)
	Store char c as UTF-8 encoded byte stream.
	Parameters: c - char to store
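	For readers unfamiliar with the encoding, the following sketch shows how a single code point maps to one to four UTF-8 bytes. It is a generic illustration with a hypothetical class name, assumes a valid non-surrogate code point in the range 0 to 0x10FFFF, and is not the JTidy method body.

```java
import java.io.ByteArrayOutputStream;

// Generic UTF-8 encoder for one code point; hypothetical helper, not JTidy code.
final class Utf8Sketch {
    static byte[] encode(int c) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(4);
        if (c < 0x80) {                 // 1 byte: 0xxxxxxx
            out.write(c);
        } else if (c < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
            out.write(0xC0 | (c >> 6));
            out.write(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.write(0xE0 | (c >> 12));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        } else {                        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.write(0xF0 | (c >> 18));
            out.write(0x80 | ((c >> 12) & 0x3F));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        }
        return out.toByteArray();
    }
}
```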

addGenerator
public boolean addGenerator(Node root)
	Add meta element for Tidy. If the meta tag is already present, update release date.
	Parameters: root - root node
	Returns: `true` if the tag has been added

addStringLiteral
public void addStringLiteral(String str)
	Calls addCharToLexer for each char in the string.
	Parameters: str - input String

addStringLiteralLen
void addStringLiteralLen(String str, int len)
	Calls addCharToLexer for each char in the string until len is reached.
	Parameters: str - input String
	Parameters: len - length of the substring to be added

addStringToLexer
public void addStringToLexer(String str)
	Adds a string to lexer buffer.
	Parameters: str - String to add

apparentVersion
public short apparentVersion()
	Return the html version used in document.
	Returns: version code

canPrune
public boolean canPrune(Node element)
	Can the given element be removed?
	Parameters: element - node
	Returns: `true` if the element can be removed

changeChar
public void changeChar(byte c)
	Substitute the last char in buffer.
	Parameters: c - new char

checkDocTypeKeyWords
public boolean checkDocTypeKeyWords(Node doctype)
	Check system keywords (keywords should be uppercase).
	Parameters: doctype - doctype node
	Returns: true if doctype keywords are all uppercase

cloneAttributes
public AttVal cloneAttributes(AttVal attrs)
	Clones an attribute value and adds any ASP or PHP node to the node list.
	Parameters: attrs - original AttVal
	Returns: cloned AttVal

cloneNode
public Node cloneNode(Node node)
	Clones a node and adds it to the node list.
	Parameters: node - Node
	Returns: cloned Node

constrainVersion
void constrainVersion(int vers)
	Constrain the html version in the document to the given one. Everything is allowed in the proprietary version of HTML; this is handled here rather than in the tag/attr dicts.
	Parameters: vers - html version code
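	Since `versions` is described as a bit vector, constraining it can be pictured as a bitwise AND. The sketch below is one plausible reading of the description; the constant names and values are hypothetical, and keeping the proprietary bit set on every call is an assumption about what "handled here" means.

```java
// Hypothetical version bit mask; constant names and values are illustrative only.
final class VersionMaskSketch {
    static final int HTML32 = 1;
    static final int HTML40_STRICT = 2;
    static final int HTML40_LOOSE = 4;
    static final int PROPRIETARY = 8;
    static final int ALL = HTML32 | HTML40_STRICT | HTML40_LOOSE | PROPRIETARY;

    private int versions = ALL; // every version is possible until something rules it out

    /** Keeps only the versions in allowed; the proprietary bit is never cleared (assumption). */
    void constrainVersion(int allowed) {
        versions &= (allowed | PROPRIETARY);
    }

    boolean permits(int version) {
        return (versions & version) != 0;
    }
}
```

	For example, after `constrainVersion(HTML40_STRICT | HTML40_LOOSE)` the sketch no longer permits `HTML32` but still permits `PROPRIETARY`.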

deferDup
public void deferDup()
	Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.

endOfInput
public boolean endOfInput()
	Has end of input stream been reached?
	Returns: `true` if the end of the input stream has been reached

findGivenVersion
public short findGivenVersion(Node doctype)
	Examine DOCTYPE to identify version.
	Parameters: doctype - doctype node
	Returns: version code

fixDocType
public boolean fixDocType(Node root)
	Fixup doctype if missing.
	Parameters: root - root node
	Returns: `false` if current version has not been identified

fixHTMLNameSpace
public void fixHTMLNameSpace(Node root, String profile)
	Fix xhtml namespace.
	Parameters: root - root Node
	Parameters: profile - current profile

fixId
public void fixId(Node node)
	Duplicate name attribute as an id and check if id and name match.
	Parameters: node - Node to check for name/id attributes

fixXmlDecl
public boolean fixXmlDecl(Node root)
	Ensure XML document starts with `<?XML version="1.0"?>`. Add encoding attribute if not using ASCII or UTF-8 output.
	Parameters: root - root node
	Returns: always true

getCDATA
public Node getCDATA(Node container)
	Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo.
	Parameters: container - container node
	Returns: cdata node

getToken
public Node getToken(short mode)
	Gets a token.
	Parameters: mode - one of the following:
	`MixedContent` -- for elements which don't accept PCDATA
	`Preformatted` -- white space preserved as is
	`IgnoreMarkup` -- for CDATA elements such as script, style
	Returns: next Node

htmlVersion
public short htmlVersion()
	Choose what version to use for new doctype.
	Returns: html version constant

htmlVersionName
public String htmlVersionName()
	Choose what version to use for new doctype.
	Returns: html version name

inferredTag
public Node inferredTag(String name)
	Generates and inserts a new node.
	Parameters: name - tag name
	Returns: generated node

inlineDup
public int inlineDup(Node node)
	This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc. This procedure is called at the start of ParseBlock, when the inline stack is not empty, as will be the case in:
	`<i><h1>italic heading</h1></i>`
	which is then treated as equivalent to
	`<h1><i>italic heading</i></h1>`
	This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from the input stream.
	Parameters: node - original node
	Returns: stack size
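	The idea of replaying open inline elements inside a block can be sketched as follows. This is a deliberately simplified model with hypothetical names: it works on tag-name strings rather than `Node` objects and ignores attributes, end tags and the `insert`/`istackbase` bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of replaying the inline stack; not JTidy's implementation.
final class InlineDupSketch {
    private final List<String> istack = new ArrayList<>(); // open inline tags, outermost first

    void pushInline(String tag) {
        istack.add(tag);
    }

    /** Returns the block's start tag followed by one start tag per open inline. */
    List<String> startBlock(String block) {
        List<String> tokens = new ArrayList<>();
        tokens.add("<" + block + ">");
        for (String tag : istack) {
            tokens.add("<" + tag + ">"); // tokens come from the stack, not the input
        }
        return tokens;
    }

    public static void main(String[] args) {
        InlineDupSketch lexer = new InlineDupSketch();
        lexer.pushInline("i");                      // <i> was seen before the heading
        System.out.println(lexer.startBlock("h1")); // <i> is re-opened inside <h1>
    }
}
```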

insertedToken
public Node insertedToken()

isCSS1Selector
public static boolean isCSS1Selector(String buf)
	In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item). The backslash followed by at most four hexadecimal digits (0..9A..F) stands for the Unicode character with that number. Any character except a hexadecimal digit can be escaped to remove its special meaning, by putting a backslash in front.
	Parameters: buf - css selector name
	Returns: `true` if the given string is a valid css1 selector name
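	The rules above can be turned into a small validator. The sketch below is one reading of those rules and may differ from what `isCSS1Selector` actually does; the class name is hypothetical, "A-Z" is assumed to cover both upper and lower case, and an empty or null string is assumed to be invalid.

```java
// One possible reading of the CSS1 selector-name rules; not JTidy's implementation.
final class Css1SelectorSketch {
    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    static boolean isSelector(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        int i = 0;
        boolean first = true;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++;
                if (i >= s.length()) {
                    return false; // dangling backslash
                }
                if (isHex(s.charAt(i))) {
                    // numeric escape: at most four hexadecimal digits
                    int digits = 0;
                    while (i < s.length() && digits < 4 && isHex(s.charAt(i))) {
                        i++;
                        digits++;
                    }
                } else {
                    i++; // any other escaped character
                }
            } else {
                boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                        || (c >= 161 && c <= 255);
                boolean digitOrDash = (c >= '0' && c <= '9') || c == '-';
                // digits and dashes are only allowed after the first character
                if (!(letter || (!first && digitOrDash))) {
                    return false;
                }
                i++;
            }
            first = false;
        }
        return true;
    }
}
```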

isPushed
public boolean isPushed(Node node)
	Is the node in the stack?
	Parameters: node - Node
	Returns: `true` if the node is found in the stack

isValidAttrName
public static boolean isValidAttrName(String attr)
	Check if attr is a valid name.
	Parameters: attr - String to check, must be non-null
	Returns: `true` if attr is a valid name.

newLineNode
public Node newLineNode()
	Adds a new line node. Used for creating preformatted text from Word2000.
	Returns: new line node

newNode
public Node newNode()
	Creates a new node and adds it to the node list.
	Returns: Node

newNode
public Node newNode(short type, byte[] textarray, int start, int end)
	Creates a new node and adds it to the node list.
	Parameters: type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
	Parameters: textarray - array of bytes contained in the Node
	Parameters: start - start position
	Parameters: end - end position
	Returns: Node

newNode
public Node newNode(short type, byte[] textarray, int start, int end, String element)
	Creates a new node and adds it to the node list.
	Parameters: type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
	Parameters: textarray - array of bytes contained in the Node
	Parameters: start - start position
	Parameters: end - end position
	Parameters: element - tag name
	Returns: Node

newXhtmlDocTypeNode
Node newXhtmlDocTypeNode(Node root)
	Put DOCTYPE declaration between the <?xml version "1.0" ... ?> declaration, if any, and the `html` tag. Should also work for any comments, etc. that may precede the `html` tag.
	Parameters: root - root node
	Returns: new doctype node

parseAsp
public Node parseAsp()
	Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to tailor the attribute value. Here is an example of a workaround for using ASP in attribute values:
	`href='<%=rsSchool.Fields("ID").Value%>'`
	where the ASP that generates the attribute value is masked from Tidy by the quotemarks.
	Returns: parsed Node

parseAttribute
public String parseAttribute(boolean[] isempty, Node[] asp, Node[] php)
	Consumes the '>' terminating start tags.
	Parameters: isempty - flag is passed as array so it can be modified
	Parameters: asp - asp Node, passed as array so it can be modified
	Parameters: php - php Node, passed as array so it can be modified
	Returns: parsed attribute
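	Passing a one-element array is a common Java idiom for an "out" parameter, because Java passes primitives by value. The sketch below shows the idiom on a toy helper; `readTag` is hypothetical, expects input shaped like `<p>` or `<br />`, and has nothing to do with how `parseAttribute` works internally.

```java
// Toy illustration of the one-element-array "out parameter" idiom.
final class OutParamSketch {
    /** Returns the tag name and sets isEmpty[0] to true when the tag ends with "/>". */
    static String readTag(String tag, boolean[] isEmpty) {
        isEmpty[0] = tag.endsWith("/>"); // visible to the caller through the shared array
        int end = isEmpty[0] ? tag.length() - 2 : tag.length() - 1;
        return tag.substring(1, end).trim();
    }

    public static void main(String[] args) {
        boolean[] isEmpty = new boolean[1];
        String name = readTag("<br />", isEmpty);
        System.out.println(name + " empty=" + isEmpty[0]);
    }
}
```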

parseAttrs
public AttVal parseAttrs(boolean[] isempty)
	Parse tag attributes.
	Parameters: isempty - is tag empty?
	Returns: parsed attribute/value list

parseEntity
public void parseEntity(short mode)
	Parse an html entity.
	Parameters: mode - mode

parsePhp
public Node parsePhp()
	PHP is like ASP but is based upon XML processing instructions, e.g. `<?php ... ?>`.
	Returns: parsed Node

parseServerInstruction
public int parseServerInstruction()
	Invoked when < is seen in place of attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings.
	Returns: delimiter

parseTagName
public char parseTagName()
	Parses a tag name.
	Returns: first char after the tag name

parseValue
public String parseValue(String name, boolean foldCase, boolean[] isempty, int[] pdelim)
	Parse an attribute value.
	Parameters: name - attribute name
	Parameters: foldCase - fold case?
	Parameters: isempty - is attribute empty? Passed as an array reference to allow modification
	Parameters: pdelim - delimiter, passed as an array reference to allow modification
	Returns: parsed value

popInline
public void popInline(Node node)
	Pop a copy of an inline node from the stack.
	Parameters: node - Node to be popped

preContent
protected boolean preContent(Node node)
	Is content acceptable for pre elements?
	Parameters: node - content
	Returns: `true` if node is acceptable in pre elements

pushInline
public void pushInline(Node node)
	Push a copy of an inline node onto stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. For instance:
	`<p><em> text <p><em> more text`
	shouldn't be mapped to
	`<p><em> text </em></p><p><em><em> more text </em></em>`
	Parameters: node - Node to be pushed
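	A minimal sketch of the guard conditions described above is given below. It models the stack as a list of tag names, which is an assumption made for brevity; the real method works on `Node` objects and may apply further exceptions that the description does not mention.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the push guards; names and string-based tags are hypothetical.
final class PushInlineSketch {
    private final List<String> istack = new ArrayList<>();

    boolean isPushed(String tag) {
        return istack.contains(tag);
    }

    void pushInline(String tag, boolean implicit) {
        if (implicit || tag.equals("object") || tag.equals("applet")) {
            return; // never pushed, per the description
        }
        if (isPushed(tag)) {
            return; // avoids ending up with <em><em> for a repeated <em>
        }
        istack.add(tag);
    }

    int depth() {
        return istack.size();
    }
}
```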

setXHTMLDocType
public boolean setXHTMLDocType(Node root)
	Adds a new xhtml doctype to the document.
	Parameters: root - root node
	Returns: `true` if a doctype has been added

ungetToken
public void ungetToken()
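	The class description says that ungetting a token provides one level of undo, and the `pushed` field is "true after token has been pushed back". A minimal sketch of that pattern follows, using strings in place of `Node` tokens; the class name and the list-based token source are hypothetical.

```java
import java.util.Iterator;
import java.util.List;

// One-level token pushback; strings stand in for Node tokens.
final class PushbackSketch {
    private final Iterator<String> source;
    private String token;   // most recently returned token
    private boolean pushed; // true after the token has been pushed back

    PushbackSketch(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Returns the pushed-back token if there is one, otherwise the next token or null. */
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = source.hasNext() ? source.next() : null;
        return token;
    }

    /** Makes the next getToken() call return the current token again. */
    void ungetToken() {
        pushed = true;
    }
}
```

	Calling `ungetToken()` twice in a row has the same effect as calling it once, which is what "one level" of undo implies in this sketch.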

updateNodeTextArrays
protected void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)
	Update `oldtextarray` in the current nodes.
	Parameters: oldtextarray - previous text array
	Parameters: newtextarray - new text array

Methods inherited from java.lang.Object

native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException
