| java.lang.Object sunlabs.brazil.util.LexML
All known Subclasses: sunlabs.brazil.util.LexHTML
LexML | public class LexML | | This class breaks angle-bracket-delimited markup languages like SGML, XML,
and HTML into tokens. It understands three types of tokens:
- tags
- Formally known as "entities", tags are delimited by "<" and
">". The first word in the tag is the tag name and the
rest of the tag consists of the attributes, a set of
"name=value" or "name" data. Spaces in tags are not significant
except for quoted values in the attributes.
- string
- Plain strings that are not in angle-brackets. Spaces are
significant and preserved.
- comments
- Delimited by "<!--" and "-->". All text between the
delimiters is part of the comment. However, by convention,
some comments actually contain data and so the methods that
extract the fields from tags can be used to attempt to extract
the fields from comments, too. Spaces are significant and
preserved in a comment, unless the comment is treated as a
tag, in which case the tag rules apply.
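The three token types above can be illustrated with a simplified, self-contained sketch. The names `TokenSketch` and `tokenize` are hypothetical and were invented for this example; this is not the LexML source, and it does not model the tolerance rules for malformed input described in this class comment.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenSketch {
    /** Splits markup into "TYPE:text" entries. A stand-in, not the LexML implementation. */
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int end;
            String type;
            if (s.startsWith("<!--", i)) {
                // Comment: everything up to and including the next "-->".
                int close = s.indexOf("-->", i + 4);
                end = (close < 0) ? s.length() : close + 3;
                type = "COMMENT";
            } else if (s.charAt(i) == '<') {
                // Tag: everything up to and including the next ">".
                int close = s.indexOf('>', i + 1);
                end = (close < 0) ? s.length() : close + 1;
                type = "TAG";
            } else {
                // String: everything up to the next "<", spaces preserved.
                int open = s.indexOf('<', i);
                end = (open < 0) ? s.length() : open;
                type = "STRING";
            }
            out.add(type + ":" + s.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String t : tokenize("<p class=x> hi <!-- note --></p>")) {
            System.out.println(t);
        }
    }
}
```

In this sketch the string token keeps its surrounding spaces, matching the rule that spaces in plain strings are significant.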
This class is intended to parse markup languages, not to validate them.
"Malformed" data is interpreted as graciously as possible, in order to
extract as much information as possible. For instance: spaces are
allowed between the "<" and the tag name, values in tags do not need
to be quoted, and unbalanced quotes are accepted.
One type of "malformed" data specifically not handled is a quoted
">" character occurring within the body of a tag. Even if it is
quoted, a ">" in the attributes of a tag will be interpreted as the
end of the tag. For example, the single tag <img src='foo.jpg'
alt='xyz > abc'> will be erroneously broken by
this parser into two tokens:
- the tag
<img src='foo.jpg' alt='xyz >
- the string "abc'>" (and possibly whatever text follows after).
Unfortunately, this type of "malformed" data is known to occur regularly.
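The split described above is what results from ending a tag at the first ">" without tracking quotes. A minimal sketch of that rule follows; it assumes nothing about how LexML is actually implemented, and the class `QuotedGtSketch` and method `splitAtFirstGt` are hypothetical names used only for illustration.

```java
final class QuotedGtSketch {
    /** Returns {tag, remainder}, ending the tag at the first '>' regardless of quotes. */
    static String[] splitAtFirstGt(String markup) {
        int close = markup.indexOf('>');
        if (close < 0) {
            // No '>' at all: treat the whole input as the (unterminated) tag.
            return new String[] { markup, "" };
        }
        return new String[] {
            markup.substring(0, close + 1),
            markup.substring(close + 1)
        };
    }

    public static void main(String[] args) {
        String[] parts = splitAtFirstGt("<img src='foo.jpg' alt='xyz > abc'>");
        System.out.println("tag:    " + parts[0]);
        System.out.println("string: " + parts[1]);
    }
}
```

Because the scan never looks at quote characters, the closing quote and the real ">" end up in the remainder, which is then read as plain text.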
This class also may not properly parse all well-formed XML tags, such
as tags with extended paired delimiters <& and &>,
<? and ?>, or
<![CDATA[ and ]]>.
Additionally, XML tags that have embedded comments containing the
">" character will not be parsed correctly (for example:
<!DOCTYPE foo SYSTEM -- a > b -- foo.dtd>),
since the ">" in the comment will be interpreted as
the end of the declaration tag, for the same reason mentioned
above.
author: Colin Stevens (colin.stevens@sun.com) version: 1.6, 01/01/16 |
Constructor Summary | |
public | LexML(String str) Create a new ML parser, which can be used to iterate over the
tokens in the given string. |
Method Summary | |
public String | getArgs() Gets the name/value pairs in the body of the current tag as a string. |
public StringMap | getAttributes() Gets the name/value pairs in the body of the current tag as a table. Any quote marks in the body, either single or double quotes, are left on the values, so that the values can be easily re-emitted and still form a valid body. For names that have no associated value in the tag, the value is stored as the empty string "". |
public String | getBody() Gets the string making up the current token, not including the angle brackets or comment delimiters, if appropriate. |
public String | getTag() Gets the tag name at the beginning of the current tag. |
public String | getToken() Gets the string making up the whole current token, including the brackets or comment delimiters, if appropriate. |
public int | getType() Gets the type of the current token. |
public boolean | nextToken() Advances to the next token. |
public void | replace(String str) Changes the string that this LexML is parsing. Example use: the caller decided to parse part of the body, and now wants this LexML to pick up and parse the rest of it. Parameters: str - The string that this LexML should now parse. |
public String | rest() Gets the rest of the string that has not yet been parsed. |
COMMENT | final public static int COMMENT | | The value returned by getType for comment tokens.
|
STRING | final public static int STRING | | The value returned by getType for string tokens.
|
TAG | final public static int TAG | | The value returned by getType for tag tokens.
|
tokenStart | int tokenStart | | |
LexML | public LexML(String str) | | Creates a new ML parser, which can be used to iterate over the
tokens in the given string.
Parameters: str - The ML to parse. |
getArgs | public String getArgs() | | Gets the name/value pairs in the body of the current tag as a
string.
Returns: The name/value pairs, or null if the current token was a string. |
getAttributes | public StringMap getAttributes() | | Gets the name/value pairs in the body of the current tag as a
table.
Any quote marks in the body, either single or double quotes, are
left on the values, so that the values can be easily re-emitted
and still form a valid body.
For names that have no associated value in the tag, the value is
stored as the empty string "". Therefore, the two tags
<table border> and
<table border=""> cannot be distinguished
based on the result of calling getAttributes.
Returns: The table of name/value pairs, or null if the current token was a string. |
getBody | public String getBody() | | Gets the string making up the current token, not including the angle
brackets or comment delimiters, if appropriate.
Returns: The body of the token. |
getTag | public String getTag() | | Gets the tag name at the beginning of the current tag. In other
words, the tag name for <table border=3> is
"table". Any surrounding space characters are removed, but the
case of the tag is preserved.
For comments, the "tag" is the first word in the comment. This can
be used to help parse comments that are structured similarly to regular
tags, such as server-side include comments like
<!--#include virtual="file.inc" -->. The tag in
this case would be "!--#include".
Returns: The tag name, or null if the current token was a string. |
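The rule described for getTag (first word of the token, surrounding spaces removed, case preserved, null for plain strings) can be sketched as a standalone helper. `TagNameSketch` and `tagName` are hypothetical names for this illustration only. The helper works on the raw token text and is an approximation of the documented behavior, not the LexML code.

```java
final class TagNameSketch {
    /** Returns the first word after '<', or null when the token is a plain string. */
    static String tagName(String token) {
        if (token.isEmpty() || token.charAt(0) != '<') {
            return null; // plain strings have no tag name
        }
        // Drop the leading '<' and, if present, the trailing '>'.
        int end = token.endsWith(">") ? token.length() - 1 : token.length();
        String body = token.substring(1, end).trim();
        // The tag name runs up to the first whitespace character.
        int i = 0;
        while (i < body.length() && !Character.isWhitespace(body.charAt(i))) {
            i++;
        }
        return body.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(tagName("<table border=3>"));
        System.out.println(tagName("< Table border>"));
        System.out.println(tagName("<!--#include virtual=\"file.inc\" -->"));
        System.out.println(tagName("plain text"));
    }
}
```

Treating the comment like a tag is what makes the leading "!--" part of the returned name in the server-side include case.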
getToken | public String getToken() | | Gets the string making up the whole current token, including the
brackets or comment delimiters, if appropriate.
Returns: The current token. |
nextToken | public boolean nextToken() | | Advances to the next token. The user can then call the other methods
in this class to get information about the new current token.
Returns: true if a token was found, false if there were no more tokens left. |
replace | public void replace(String str) | | Changes the string that this LexML is parsing.
Example use: the caller decided to parse part of the body,
and now wants this LexML to pick up and parse the rest of it.
Parameters: str - The string that this LexML should now parse. Whatever string this LexML was parsing is forgotten, and it now starts parsing at the beginning of the new string. See Also: LexML.rest |
rest | public String rest() | | Gets the rest of the string that has not yet been parsed.
Example use: to help the parser in circumstances such as the HTML
"<script>" tag, where the script body doesn't obey the rules
because it might contain lone "<" or ">" characters, which this
parser would interpret as the start or end of funny-looking tags.
Returns: The unparsed remainder of the string. See Also: LexML.replace |
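One way a caller might combine rest and replace around a script body is sketched below. The helper only handles the string manipulation: it takes the text that rest() would return right after a "<script>" tag and splits it at the closing script tag, so that the second part can be handed back through replace(String). `ScriptSkipSketch`, `indexOfIgnoreCase`, and `splitAtScriptEnd` are hypothetical names, and the case-insensitive search for "</script" is an assumption of this sketch rather than documented LexML behavior.

```java
final class ScriptSkipSketch {
    /** Case-insensitive indexOf; returns -1 when needle is absent. */
    static int indexOfIgnoreCase(String s, String needle) {
        for (int i = 0; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Returns {script body, text from the closing script tag onward}. */
    static String[] splitAtScriptEnd(String rest) {
        int end = indexOfIgnoreCase(rest, "</script");
        if (end < 0) {
            // No closing tag: the whole remainder is script body.
            return new String[] { rest, "" };
        }
        return new String[] { rest.substring(0, end), rest.substring(end) };
    }

    public static void main(String[] args) {
        // Stand-in for the value rest() would return just after a <script> tag.
        String rest = "if (a < b) { run(); }</SCRIPT><p>done";
        String[] parts = splitAtScriptEnd(rest);
        System.out.println("script body:  " + parts[0]);
        System.out.println("to replace(): " + parts[1]);
    }
}
```

With a real LexML instance, the caller would pass the second element to replace so that parsing resumes at the closing script tag, and the lone "<" inside the script body is never seen by the tokenizer.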