de.susebox.jtopas.Tokenizer
All known Subclasses: de.susebox.jtopas.StandardTokenizer, de.susebox.jtopas.AbstractTokenizer
Tokenizer | public interface Tokenizer (Code) | |
The interface Tokenizer contains setup methods, parse operations
and other getter and setter methods for a tokenizer. A tokenizer splits a
stream of input data into various units like whitespaces, comments, keywords
etc. These units are the tokens that are reflected in the
Token class
of the de.susebox.jtopas package.
A Tokenizer is configured using a
TokenizerProperties
object that contains declarations for whitespaces, separators, comments,
keywords, special sequences and patterns. It is designed to enable a common
approach for parsing texts like program code, annotated documents like HTML
and so on.
To detect links in an HTML document, a tokenizer would be invoked like this (see
StandardTokenizerProperties and
StandardTokenizer for the
classes mentioned here):
Vector              links     = new Vector();
FileReader          reader    = new FileReader("index.html");
TokenizerProperties props     = new StandardTokenizerProperties();
Tokenizer           tokenizer = new StandardTokenizer();
Token               token;

props.setParseFlags(TokenizerProperties.F_NO_CASE);
props.setSeparators("=");
props.addString("\"", "\"", "\\");
props.addBlockComment(">", "<");
props.addKeyword("HREF");
tokenizer.setTokenizerProperties(props);
tokenizer.setSource(new ReaderSource(reader));

try {
  while (tokenizer.hasMoreToken()) {
    token = tokenizer.nextToken();
    if (token.getType() == Token.KEYWORD) {
      tokenizer.nextToken();                   // should be the '=' character
      links.addElement(tokenizer.nextImage()); // the quoted link target
    }
  }
} finally {
  tokenizer.close();
  reader.close();
}
This somewhat rough way of finding links should work fine on syntactically
correct HTML code. It finds common links as well as mail, FTP links etc. Note
the block comment: it starts with the ">" character, which is the closing
character of HTML tags, and ends with "<", the opening character of HTML tags.
The effect is that all the actual text is treated as a comment.
To extract the contents of an HTML file, one would write:
StringBuffer        contents  = new StringBuffer(4096);
FileReader          reader    = new FileReader("index.html");
TokenizerProperties props     = new StandardTokenizerProperties();
Tokenizer           tokenizer = new StandardTokenizer();
Token               token;

props.setParseFlags(TokenizerProperties.F_NO_CASE);
props.addBlockComment("<", ">");
props.addBlockComment("<HEAD>", "</HEAD>");
props.addBlockComment("<!--", "-->");
tokenizer.setTokenizerProperties(props);
tokenizer.setSource(new ReaderSource(reader));

try {
  while (tokenizer.hasMoreToken()) {
    token = tokenizer.nextToken();
    if (token.getType() != Token.BLOCK_COMMENT) {
      contents.append(token.getToken());
    }
  }
} finally {
  tokenizer.close();
  reader.close();
}
Here the block comment is the exact opposite of the first example. Now all the
HTML tags are skipped. Moreover, we declared the HTML header as a block
comment as well, so the information from the header is skipped altogether.
Parsing (tokenizing) is done on a well defined priority scheme. See
Tokenizer.nextToken for details.
NOTE: if a character sequence is registered for two categories of tokenizer
properties (e.g. as a line comment starting sequence as well as a special
sequence), the category with the higher priority wins (e.g. if the mentioned
sequence is found, it is interpreted as a line comment).
The tokenizer interface is clearly designed for "readable" data, that is,
ASCII or Unicode text. Parsing binary data has other characteristics that do
not necessarily fit into a scheme of comments, keywords, strings, identifiers
and operators.
Note that the interface has no methods that handle stream data sources. This
is left to the implementations, which may have quite different data sources, e.g.
java.io.InputStreamReader , database queries, string arrays etc. The
interface
TokenizerSource serves as an abstraction of such widely
varying data sources.
The Tokenizer interface partly replaces the older
de.susebox.java.util.Tokenizer interface which is deprecated.
See Also: Token See Also: TokenizerProperties author: Heiko Blau |
changeParseFlags | public void changeParseFlags(int flags, int mask) throws TokenizerException(Code) | | Setting the control flags of the Tokenizer . Use a
combination of the F_... flags declared in
TokenizerProperties for the flags parameter. The mask parameter contains a bit mask of
the F_... flags to change.
The parse flags for a tokenizer can be set through the associated
TokenizerProperties instance. These global settings take effect in all
Tokenizer instances that use the same TokenizerProperties
object. Flags related to the parsing process can also be set separately
for each tokenizer at runtime; these are the so-called dynamic flags.
Other flags can also be set for each tokenizer separately, but should be set
before tokenizing starts to be of any use.
The other flags should only be used on the TokenizerProperties
instance or on single
TokenizerProperty objects and influence all
Tokenizer instances sharing the same TokenizerProperties
object. For instance, using the flag
TokenizerProperties.F_NO_CASE
is an invalid operation on a Tokenizer . It affects the interpretation
of keywords and sequences by the associated TokenizerProperties
instance and, moreover, possibly the storage of these properties.
This method throws a
TokenizerException if a flag is passed that cannot
be handled by the Tokenizer object itself.
This method takes precedence over the
TokenizerProperties.setParseFlags method of the associated TokenizerProperties object. Even if
the global settings of one of the dynamic flags (see above) change after a
call to this method, the flags set separately for this tokenizer stay
active.
Parameters: flags - the parser control flags Parameters: mask - the mask for the flags to set or unset throws: TokenizerException - if one or more of the flags given cannot be honored See Also: Tokenizer.getParseFlags |
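The flags/mask semantics described above follow the usual read-modify-write bitmask pattern: bits selected by mask are taken from flags, while all other bits keep their current value. A minimal, self-contained sketch of that arithmetic (the flag constants here are hypothetical placeholders for illustration, not the actual JTopas values):

```java
// Demonstrates the flags/mask update rule used by changeParseFlags:
// bits selected by the mask come from 'flags', all others are preserved.
public class ParseFlagsDemo {
    // Hypothetical flag constants for illustration only.
    static final int F_COUNT_LINES        = 1 << 0;
    static final int F_RETURN_WHITESPACES = 1 << 1;
    static final int F_NO_CASE            = 1 << 2;

    // Apply 'flags' restricted to 'mask' on top of the current flag word.
    static int apply(int current, int flags, int mask) {
        return (current & ~mask) | (flags & mask);
    }

    public static void main(String[] args) {
        int current = F_COUNT_LINES | F_NO_CASE;
        // Set F_RETURN_WHITESPACES and clear F_NO_CASE in one call;
        // F_COUNT_LINES is not in the mask and stays untouched.
        int updated = apply(current,
                            F_RETURN_WHITESPACES,               // desired values
                            F_RETURN_WHITESPACES | F_NO_CASE);  // bits to touch
        System.out.println(updated == (F_COUNT_LINES | F_RETURN_WHITESPACES)); // prints true
    }
}
```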
close | public void close()(Code) | | This method is necessary to release memory and remove object references if
Tokenizer instances are frequently created for small tasks.
Generally, the method shouldn't throw any exceptions. It is also OK to call
it more than once.
It is an error to call any other method of the implementing class after
close has been called.
|
currentlyAvailable | public int currentlyAvailable()(Code) | | Retrieving the number of currently available characters. This includes
both characters already parsed by the Tokenizer and characters
still to be analyzed.
number of currently available characters |
getChar | public char getChar(int pos) throws IndexOutOfBoundsException(Code) | | Get a single character from the current text range.
Parameters: pos - position of the required character the character at the specified position throws: IndexOutOfBoundsException - if the parameter pos is not in the available text range (text window) |
getLineNumber | public int getLineNumber()(Code) | | If the flag
TokenizerProperties.F_COUNT_LINES is set, this method
will return the line number starting with 0 in the input stream. The
implementation of the Tokenizer interface can decide which
end-of-line sequences should be recognized. The most flexible approach is
to process the following end-of-line sequences:
- Carriage return (ASCII 13, '\r'). This EOL is used on the Apple Macintosh.
- Linefeed (ASCII 10, '\n'). This is the UNIX EOL character.
- Carriage return + linefeed ("\r\n"). This is used on MS Windows systems.
Another legitimate and in many cases satisfactory way is to use the system
property "line.separator".
Displaying information about lines usually means adding 1 to the zero-based
line number.
the current line number starting with 0 or -1 if no line numbers are supplied (TokenizerProperties.F_COUNT_LINES is not set). See Also: Tokenizer.getColumnNumber |
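The three end-of-line conventions listed above can be recognized with a short scan that treats "\r\n" as a single line break. A self-contained sketch of such zero-based line counting (this is illustrative code, not part of the JTopas API):

```java
// Zero-based line counting that recognizes '\n', '\r' and "\r\n"
// each as a single line break, as suggested for getLineNumber().
public class LineCounter {
    // Returns the zero-based line number of the last character in 'text'.
    static int lineNumberAtEnd(String text) {
        int line = 0;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '\n') {
                line++;
            } else if (ch == '\r') {
                line++;
                // Treat a following '\n' as part of the same "\r\n" break.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
            }
        }
        return line;
    }

    public static void main(String[] args) {
        // Three breaks: "\n", "\r\n" and "\r".
        System.out.println(lineNumberAtEnd("a\nb\r\nc\rd")); // prints 3
    }
}
```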
getParseFlags | public int getParseFlags()(Code) | | Retrieving the parser control flags. A bitmask containing the F_...
constants is returned. This method returns both the flags that are set
separately for this Tokenizer and the flags set for the
associated
TokenizerProperties object.
the current parser control flags See Also: Tokenizer.changeParseFlags |
getRangeStart | public int getRangeStart()(Code) | | This method returns the absolute offset in characters to the start of the
parsed stream. Together with
Tokenizer.currentlyAvailable it describes the
currently available text "window".
The positions returned by this method and by
Tokenizer.getReadPosition are absolute rather than relative to a text buffer, which gives the tokenizer
full control of how and when to refill that buffer.
the absolute offset of the current text window in characters from the start of the data source of the Tokenizer |
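Together, getRangeStart and currentlyAvailable define the half-open interval [getRangeStart, getRangeStart + currentlyAvailable) within which getChar and getText may operate. A hypothetical bounds check illustrating that relationship (not part of the JTopas API):

```java
// Checks whether a requested text range lies inside the tokenizer's
// current text window [rangeStart, rangeStart + available).
public class TextWindow {
    static boolean inWindow(int rangeStart, int available, int start, int length) {
        return start >= rangeStart && start + length <= rangeStart + available;
    }

    public static void main(String[] args) {
        // Window covers absolute positions 100..149 (50 characters available).
        System.out.println(inWindow(100, 50, 120, 20)); // prints true
        System.out.println(inWindow(100, 50, 120, 40)); // prints false
    }
}
```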
getReadPosition | public int getReadPosition()(Code) | | Getting the current read offset. This is the absolute position where the
next call to nextToken or nextImage will start. It is
therefore not the same as the position returned by
Token.getStartPosition of the current token (
Tokenizer.currentToken ).
It is the starting position of the token returned by the next call to
Tokenizer.nextToken , if that token is no whitespace or if whitespaces are
returned (
TokenizerProperties.F_RETURN_WHITESPACES ).
The positions returned by this method and by
Tokenizer.getRangeStart are absolute rather than relative to a text buffer, which gives the tokenizer
full control of how and when to refill that buffer.
the absolute offset in characters from the start of the data source of the Tokenizer where reading will be continued |
getText | public String getText(int start, int length) throws IndexOutOfBoundsException(Code) | | Retrieve text from the currently available range. The start and length
parameters must lie inside
Tokenizer.getRangeStart and
Tokenizer.getRangeStart +
Tokenizer.currentlyAvailable .
Example:
int    startPos = tokenizer.getReadPosition();
String source;

while (tokenizer.hasMoreToken()) {
  Token token = tokenizer.nextToken();

  switch (token.getType()) {
  case Token.LINE_COMMENT:
  case Token.BLOCK_COMMENT:
    source   = tokenizer.getText(startPos, token.getStartPosition() - startPos);
    startPos = token.getStartPosition();
    break;
  }
}
Parameters: start - position where the text begins Parameters: length - length of the text the text beginning at the given position with the given length throws: IndexOutOfBoundsException - if the starting position or the length is out of the current text window
hasMoreToken | public boolean hasMoreToken()(Code) | | Check if there are more tokens available. This method will return
true until an end-of-file condition is encountered during a
call to
Tokenizer.nextToken or
Tokenizer.nextImage .
That means that the EOF is returned exactly once; afterwards hasMoreToken
will return false . It also implies that the method
will return true at least once, even if the input data stream
is empty.
The method can be conveniently used in a while loop.
true if a call to Tokenizer.nextToken or Tokenizer.nextImage will succeed, false otherwise |
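The EOF-once contract described above, where hasMoreToken stays true until nextToken has delivered the EOF token exactly once, can be modeled with a tiny state machine. This sketch is illustrative only, not JTopas code, and uses single characters as stand-in "tokens":

```java
// Models the hasMoreToken/nextToken contract: even for empty input,
// nextToken yields one EOF token before hasMoreToken turns false.
public class EofContract {
    private final String data;
    private int pos = 0;
    private boolean eofDelivered = false;

    EofContract(String data) { this.data = data; }

    boolean hasMoreToken() { return !eofDelivered; }

    // Returns the next "token" (here: one character) or "EOF" exactly once.
    String nextToken() {
        if (pos < data.length()) {
            return String.valueOf(data.charAt(pos++));
        }
        eofDelivered = true;
        return "EOF";
    }

    public static void main(String[] args) {
        EofContract t = new EofContract("");  // empty input
        System.out.println(t.hasMoreToken()); // prints true (EOF not yet delivered)
        System.out.println(t.nextToken());    // prints EOF
        System.out.println(t.hasMoreToken()); // prints false
    }
}
```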
nextToken | public Token nextToken() throws TokenizerException(Code) | | Retrieving the next
Token . The method works in this order:
- Check for an end-of-file condition. If there is such a condition, return it.
- Try to collect a sequence of whitespaces. If such a sequence can be found, return it if the flag F_RETURN_WHITESPACES is set, or skip these whitespaces.
- Check the next characters against all known patterns. A pattern is usually a regular expression that is used by java.util.regex.Pattern . But implementations of de.susebox.jtopas.spi.PatternHandler may use other pattern syntaxes. Note that patterns are not recognized within "normal" text (see below for a more precise description).
- Check the next characters against all known line and block comments. If a line or block comment starting sequence matches, return it if the flag F_RETURN_WHITESPACES is set, or skip the comment. If comments are returned, they include their starting and ending sequences (the newline in case of a line comment).
- Check the next characters against all known string starting sequences. If a string begin could be identified, return the string up to and including the closing sequence.
- Check the next characters against all known special sequences, finding the longest possible match. If a special sequence could be identified, return it.
- Check for ordinary separators. If one could be found, return it.
- Check the next characters against all known keywords. If a keyword could be identified, return it.
- Return the text portion until the next whitespace, comment, special sequence or separator. Note that patterns are not recognized within "normal" text. A pattern match therefore always has a whitespace, comment, special sequence, separator or another pattern match in front of it, or starts at position 0 of the data.
The method will return the EOF token as long as
Tokenizer.hasMoreToken returns
false . It will not return null in such conditions.
found Token including the EOF token throws: TokenizerException - generic exception (list) for all problems that may occur while parsing (IOExceptions for instance) See Also: Tokenizer.nextImage |
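The priority order above can be pictured as a chain of prefix checks where the first matching category wins. A deliberately simplified, self-contained sketch (not the JTopas implementation; the registered sequences are made up for the example) that resolves the comment-vs-special-sequence conflict mentioned earlier in favor of the comment:

```java
// Toy classifier illustrating nextToken's priority scheme: categories are
// tried in a fixed order and the first match wins, so "//" registered both
// as a line comment and overlapping the special sequence "/" is read as
// a comment because comments rank higher.
public class PriorityDemo {
    static String classify(String input) {
        if (input.isEmpty())                         return "EOF";
        if (Character.isWhitespace(input.charAt(0))) return "WHITESPACE";
        if (input.startsWith("//"))                  return "LINE_COMMENT";     // checked before...
        if (input.startsWith("\""))                  return "STRING";
        if (input.startsWith("/"))                   return "SPECIAL_SEQUENCE"; // ...the shorter special sequence
        if (input.startsWith("if"))                  return "KEYWORD";
        return "NORMAL_TEXT";
    }

    public static void main(String[] args) {
        System.out.println(classify("// comment")); // prints LINE_COMMENT
        System.out.println(classify("/ 2"));        // prints SPECIAL_SEQUENCE
    }
}
```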
readMore | public int readMore() throws TokenizerException(Code) | | Try to read more data into the text buffer of the tokenizer. This can be
useful when a method needs to look ahead of the available data or a skip
operation should be performed.
The method returns the same value as an immediately following call to
Tokenizer.currentlyAvailable would return.
the number of character now available throws: TokenizerException - generic exception (list) for all problems that may occur while reading (IOExceptions for instance) |
setSource | public void setSource(TokenizerSource source)(Code) | | Setting the source of data. This method is usually called during setup of
the Tokenizer but may also be invoked while tokenizing
is in progress. It will reset the tokenizer's input buffer, line and column
counters etc.
It is allowed to pass null . Subsequent calls to
Tokenizer.hasMoreToken will then return false , while
Tokenizer.nextToken will return
an EOF token.
Parameters: source - a TokenizerSource to read data from See Also: Tokenizer.getSource |