| java.lang.Object it.unimi.dsi.mg4j.util.parser.BulletParser
BulletParser | public class BulletParser | | A fast, lightweight, on-demand (X)HTML parser.
The bullet parser has been written with two specific goals in mind:
web crawling and targeted data extraction from massive web data sets.
To be usable in such environments, a parser must obey a number of
restrictions:
- it should avoid excessive object creation (which, for instance,
forbids a significant usage of Java strings);
- it should tolerate invalid syntax and recover reasonably; in fact,
it should never throw exceptions;
- it should perform actual parsing only on a settable feature subset:
there is no reason to parse the attributes of a P
element while searching for links;
- it should parse HTML as a regular language, and leave context-free
properties (e.g., stack maintenance and repair) to suitably designed callbacks.
Thus, in fact the bullet parser is not a parser. It is a bunch of
spaghetti code that analyses a stream of characters pretending that
it is an (X)HTML document. It has a very defensive attitude against
the character stream it is parsing, but at the same time it is
forgiving of all typical (X)HTML mistakes.
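The restrictions above can be illustrated with a toy scanner. This is only a sketch under stated assumptions, not the BulletParser implementation: MiniCallback, MiniScanner and MiniScannerDemo are hypothetical names, and for brevity the sketch passes element names as Java strings, which the real parser avoids. It shows a single left-to-right pass that keeps no element stack, reports what it finds through a callback, and recovers from malformed input instead of throwing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical callback; NOT the MG4J Callback interface. */
interface MiniCallback {
    void startElement(String name);
    void endElement(String name);
    void characters(char[] text, int offset, int length);
}

/** Toy scanner: treats markup as a regular language (no stack, no exceptions). */
final class MiniScanner {
    static void scan(char[] text, int offset, int length, MiniCallback cb) {
        final int end = offset + length;
        int pos = offset;
        while (pos < end) {
            if (text[pos] == '<') {
                int close = pos + 1;
                while (close < end && text[close] != '>') close++;
                if (close == end) {
                    // Unterminated tag: report the remainder as text and stop.
                    cb.characters(text, pos, end - pos);
                    return;
                }
                boolean closing = pos + 1 < close && text[pos + 1] == '/';
                int nameStart = closing ? pos + 2 : pos + 1;
                int nameEnd = nameStart;
                while (nameEnd < close && Character.isLetterOrDigit(text[nameEnd])) nameEnd++;
                if (nameEnd > nameStart) {
                    String name = new String(text, nameStart, nameEnd - nameStart)
                            .toLowerCase(Locale.ROOT);
                    if (closing) cb.endElement(name); else cb.startElement(name);
                }
                // A tag without a name (e.g. "<>") is skipped silently.
                pos = close + 1;
            } else {
                int start = pos;
                while (pos < end && text[pos] != '<') pos++;
                cb.characters(text, start, pos - start);
            }
        }
    }
}

final class MiniScannerDemo {
    /** Collects the events produced for a document, for inspection. */
    static List<String> events(String html) {
        final List<String> out = new ArrayList<>();
        char[] a = html.toCharArray();
        MiniScanner.scan(a, 0, a.length, new MiniCallback() {
            public void startElement(String name) { out.add("start:" + name); }
            public void endElement(String name) { out.add("end:" + name); }
            public void characters(char[] t, int o, int l) { out.add("text:" + new String(t, o, l)); }
        });
        return out;
    }

    public static void main(String[] args) {
        // The unbalanced <b> and the unterminated tag are reported, not rejected.
        System.out.println(events("<P>Hi <b>there</P><unclosed"));
    }
}
```

Note that the scanner never checks whether the elements balance: repairing the missing closing tag of the b element is left to whoever implements the callback.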
The bullet parser is officially StringFree™.
MutableStrings are used for internal processing, and Java strings are used only
to return attribute values. All internal maps are from fastutil, which
helps to further accelerate the parsing process.
HTML data
The bullet parser uses attributes and methods of
it.unimi.dsi.mg4j.util.parser.HTMLFactory,
it.unimi.dsi.mg4j.util.parser.Element,
it.unimi.dsi.mg4j.util.parser.Attribute and
it.unimi.dsi.mg4j.util.parser.Entity.
Thus, for instance, whenever an element is to be passed around it is one
of the shared objects contained in
it.unimi.dsi.mg4j.util.parser.Element (e.g.,
it.unimi.dsi.mg4j.util.parser.Element.BODY).
Callbacks
The result of the parsing process is the invocation of a callback.
The Callback interface of the bullet parser closely resembles SAX2, but it has some additional
methods targeted at (X)HTML, such as
Callback.cdata(Element, char[], int, int),
which returns characters found in a CDATA section (e.g., a stylesheet).
Each callback must configure the parser by requesting the
analysis and the callbacks it requires. A callback that wants to
extract and tokenise text, for instance, will certainly require
parseText(true), but not parseTags(true).
On the other hand, a callback wishing to extract links will require
the parsing of certain attribute types (see BulletParser.parseAttribute(Attribute)).
A more precise description follows.
Writing callbacks
The first important issue is what must be requested of the parser. A newly
created parser does not invoke any callback. It is up to every callback
to add features so that it can do its job. Remember that since many
callbacks can be composed,
you must always add features, never remove them, and moreover
your callbacks must be ready to be invoked with features they did not
request (e.g., attribute types added by another callback).
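The "always add, never remove" rule can be sketched as follows. Everything here is hypothetical and self-contained: FeatureFlags stands in for a parser with fluent boolean setters, Configurable for a callback's configuration hook, and Composed for some way of combining callbacks; none of them is an MG4J type. The choice of flags requested by LinkExtractor is likewise an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parser exposing fluent boolean feature setters. */
final class FeatureFlags {
    boolean parseTags, parseAttributes, parseText; // all false by default
    FeatureFlags parseTags(boolean v) { parseTags = v; return this; }
    FeatureFlags parseAttributes(boolean v) { parseAttributes = v; return this; }
    FeatureFlags parseText(boolean v) { parseText = v; return this; }
}

/** Hypothetical configuration hook: each callback states what it needs. */
interface Configurable {
    void configure(FeatureFlags parser);
}

/** Needs text only; it never switches anything off. */
final class TextExtractor implements Configurable {
    public void configure(FeatureFlags parser) { parser.parseText(true); }
}

/** Needs tags and attributes (an assumption made for this sketch). */
final class LinkExtractor implements Configurable {
    public void configure(FeatureFlags parser) { parser.parseTags(true).parseAttributes(true); }
}

/** Lets every part add its own features to the same parser. */
final class Composed implements Configurable {
    private final List<Configurable> parts = new ArrayList<>();
    Composed add(Configurable c) { parts.add(c); return this; }
    public void configure(FeatureFlags parser) {
        for (Configurable c : parts) c.configure(parser);
    }
}

final class FeatureDemo {
    static FeatureFlags configured() {
        FeatureFlags flags = new FeatureFlags();
        new Composed().add(new TextExtractor()).add(new LinkExtractor()).configure(flags);
        return flags;
    }

    public static void main(String[] args) {
        FeatureFlags f = configured();
        System.out.println(f.parseTags + " " + f.parseAttributes + " " + f.parseText);
    }
}
```

Because each part only ever passes true, the order in which the parts are configured does not matter. If TextExtractor also called parseTags(false), it would silently break LinkExtractor whenever it happened to be configured second. The flip side is that TextExtractor will see element events it never asked for and must tolerate them.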
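The following parse features
may be configured; most of them are just boolean features, a.k.a. flags:
unless otherwise specified, by default all flags are set to false (e.g., by
default the parser will not parse tags):
- tags (BulletParser.parseTags(boolean)): whether the parser will parse tags and
invoke element handlers;
- attributes (BulletParser.parseAttributes(boolean)): whether the parser will parse
attributes; the attributes to be parsed are then added one by one with
BulletParser.parseAttribute(Attribute);
- text (BulletParser.parseText(boolean)): whether the parser will invoke the text handler;
- CDATA sections (BulletParser.parseCDATA(boolean)): whether the parser will invoke
the CDATA-section handler.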
Invoking the parser
After setting the callback with BulletParser.setCallback(Callback),
you just call
BulletParser.parse(char[],int,int).
|
Method Summary | |
protected char | entity2Char(MutableString name) Returns the character corresponding to a given entity name. Parameters: name - the name of an entity. |
protected int | handleMarkup(char[] text, int pos, int end) Handles markup. Parameters: text - the text. Parameters: pos - the first character in the markup after <!. Parameters: end - the end of text. |
protected int | handleProcessingInstruction(char[] text, int pos, int end) Handles processing instruction, ASP tags etc. Parameters: text - the text. Parameters: pos - the first character in the markup after <%. Parameters: end - the end of text. |
public void | parse(char[] text) Analyze the text document to extract information. |
public void | parse(char[] text, int offset, int length) Analyze the text document to extract information. |
public BulletParser | parseAttribute(Attribute attribute) Adds the given attribute to the set of attributes to be parsed. Parameters: attribute - an attribute that should be parsed. Throws: IllegalStateException - if parseAttributes(true) has not been invoked on this parser. |
public boolean | parseAttributes() Returns whether this parser will parse attributes. |
public BulletParser | parseAttributes(boolean parseAttributes) Sets the attribute parsing flag. Parameters: parseAttributes - the new value for the flag. |
public boolean | parseCDATA() Returns whether this parser will invoke the CDATA-section handler. |
public BulletParser | parseCDATA(boolean parseCDATA) Sets the CDATA-section handler flag. Parameters: parseCDATA - the new value. |
public boolean | parseTags() Returns whether this parser will parse tags and invoke element handlers. |
public BulletParser | parseTags(boolean parseTags) Sets whether this parser will parse tags and invoke element handlers. Parameters: parseTags - the new value. |
public boolean | parseText() Returns whether this parser will invoke the text handler. |
public BulletParser | parseText(boolean parseText) Sets the text handler flag. Parameters: parseText - the new value. |
protected void | replaceEntities(MutableString s, MutableString entity, boolean loose) Replaces entities with the corresponding characters. |
protected int | scanEntity(char[] a, int offset, int length, boolean loose, MutableString entity) Searches for the end of an entity. This method will search for the end of an entity starting at the given offset (the offset must correspond to the ampersand). Real-world HTML pages often contain hundreds of misplaced ampersands, due to the unfortunate idea of using the ampersand as query separator (please use the comma in new code!). |
public BulletParser | setCallback(Callback callback) Sets the callback for this parser, resetting at the same time all parsing flags. Parameters: callback - the new callback. |
CLOSED_CDATA | final protected static TextPattern CLOSED_CDATA | | Closed section (conditional, CDATA, etc.).
|
CLOSED_COMMENT | final protected static TextPattern CLOSED_COMMENT | | Closed comment. It should be "-->", but mistakes are common.
|
CLOSED_PERCENT | final protected static TextPattern CLOSED_PERCENT | | Closed ASP or similar tag.
|
CLOSED_PIC | final protected static TextPattern CLOSED_PIC | | Closed processing instruction.
|
CLOSED_SECTION | final protected static TextPattern CLOSED_SECTION | | Closed section (conditional, etc.).
|
HEXADECIMAL | final protected static int HEXADECIMAL | | The base for non-decimal entities.
|
MAX_DEC_ENTITY_LENGTH | final protected static int MAX_DEC_ENTITY_LENGTH | | The maximum number of digits of a decimal numeric entity.
|
MAX_ENTITY_VALUE | final protected static int MAX_ENTITY_VALUE | | The maximum Unicode value accepted for a numeric entity.
|
MAX_HEX_ENTITY_LENGTH | final protected static int MAX_HEX_ENTITY_LENGTH | | The maximum number of digits of a hexadecimal numeric entity.
|
NONSPACE_WHITESPACE | final protected static char[] NONSPACE_WHITESPACE | | An array containing the non-space whitespace.
|
SCRIPT_CLOSE_TAG_PATTERN | final protected static TextPattern SCRIPT_CLOSE_TAG_PATTERN | | Closing tag for a script element.
|
STATE_BEFORE_END_TAG_NAME | final protected static int STATE_BEFORE_END_TAG_NAME | | Scanning a closing tag.
|
STATE_BEFORE_START_TAG_NAME | final protected static int STATE_BEFORE_START_TAG_NAME | | Scanning attribute name/value pairs.
|
STATE_IN_END_TAG | final protected static int STATE_IN_END_TAG | | Scanning a closing tag.
|
STATE_IN_START_TAG | final protected static int STATE_IN_START_TAG | | Scanning attribute name/value pairs.
|
STATE_TEXT | final protected static int STATE_TEXT | | Scanning text.
|
STYLE_CLOSE_TAG_PATTERN | final protected static TextPattern STYLE_CLOSE_TAG_PATTERN | | Closing tag for a style element.
|
callback | protected Callback callback | | The callback of this parser.
|
lastEntity | protected char lastEntity | | The character represented by the last scanned entity.
|
parseAttributes | protected boolean parseAttributes | | Whether we should parse attributes.
|
parseCDATA | protected boolean parseCDATA | | Whether we should invoke the CDATA section handler.
|
parseTags | protected boolean parseTags | | Whether we should parse tags.
|
parseText | protected boolean parseText | | Whether we should invoke the text handler.
|
parsedAttributes | public ReferenceSet<Attribute> parsedAttributes | | An externally visible, immutable subset of attributes whose values will
actually be parsed.
|
entity2Char | protected char entity2Char(MutableString name) | | Returns the character corresponding to a given entity name.
Parameters: name - the name of an entity. Returns: the character corresponding to the entity, or an ASCII NUL if no entity with that name was found. |
handleMarkup | protected int handleMarkup(char[] text, int pos, int end) | | Handles markup.
Parameters: text - the text. Parameters: pos - the first character in the markup after <!. Parameters: end - the end of text. Returns: the position of the first character after the markup. |
handleProcessingInstruction | protected int handleProcessingInstruction(char[] text, int pos, int end) | | Handles processing instruction, ASP tags etc.
Parameters: text - the text. Parameters: pos - the first character in the markup after <%. Parameters: end - the end of text. Returns: the position of the first character after the processing instruction. |
parse | public void parse(char[] text) | | Analyze the text document to extract information.
Parameters: text - a char array of text to be parsed. |
parse | public void parse(char[] text, int offset, int length) | | Analyze the text document to extract information.
Parameters: text - a char array of text to be parsed. Parameters: offset - the offset in the array from which the parsing will begin. Parameters: length - the number of characters to be parsed. |
parseAttributes | public BulletParser parseAttributes(boolean parseAttributes) | | Sets the attribute parsing flag.
Parameters: parseAttributes - the new value for the flag. Returns: this parser. |
parseCDATA | public boolean parseCDATA() | | Returns whether this parser will invoke the CDATA-section handler.
Returns: whether this parser will invoke the CDATA-section handler. See Also: BulletParser.parseCDATA(boolean) |
parseCDATA | public BulletParser parseCDATA(boolean parseCDATA) | | Sets the CDATA-section handler flag.
Parameters: parseCDATA - the new value. Returns: this parser. |
parseTags | public boolean parseTags() | | Returns whether this parser will parse tags and invoke element handlers.
Returns: whether this parser will parse tags and invoke element handlers. See Also: BulletParser.parseTags(boolean) |
parseTags | public BulletParser parseTags(boolean parseTags) | | Sets whether this parser will parse tags and invoke element handlers.
Parameters: parseTags - the new value. Returns: this parser. |
parseText | public boolean parseText() | | Returns whether this parser will invoke the text handler.
Returns: whether this parser will invoke the text handler. See Also: BulletParser.parseText(boolean) |
parseText | public BulletParser parseText(boolean parseText) | | Sets the text handler flag.
Parameters: parseText - the new value. Returns: this parser. |
replaceEntities | protected void replaceEntities(MutableString s, MutableString entity, boolean loose) | | Replaces entities with the corresponding characters.
This method will modify the mutable string s so that all legal occurrences
of entities are replaced by the corresponding character.
Parameters: s - a mutable string whose entities will be replaced by the corresponding characters. Parameters: entity - a support mutable string used by BulletParser.scanEntity(char[],int,int,boolean,MutableString). Parameters: loose - a parameter that will be passed to BulletParser.scanEntity(char[],int,int,boolean,MutableString). |
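As a rough illustration of in-place entity replacement, the sketch below rewrites a StringBuilder (standing in for MutableString) without allocating a second buffer. It is not the BulletParser code: EntitySketch and its four-entry table are hypothetical, only named entities terminated by a semicolon are recognised, and the loose parameter is not modelled. The lookup mirrors the documented contract of entity2Char by returning an ASCII NUL for unknown names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; the entity table and all names are assumptions. */
final class EntitySketch {
    private static final Map<String, Character> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", '&');
        ENTITIES.put("lt", '<');
        ENTITIES.put("gt", '>');
        ENTITIES.put("quot", '"');
    }

    /** Returns the character for an entity name, or ASCII NUL if the name is unknown. */
    static char entityToChar(String name) {
        Character c = ENTITIES.get(name);
        return c == null ? '\0' : c;
    }

    /** Replaces recognised entities in place; everything else is left untouched. */
    static void replaceEntities(StringBuilder s) {
        final int n = s.length();
        int read = 0, write = 0; // write never overtakes read, so compaction is safe
        while (read < n) {
            char c = s.charAt(read);
            if (c == '&') {
                int semi = read + 1;
                while (semi < n && Character.isLetter(s.charAt(semi))) semi++;
                if (semi > read + 1 && semi < n && s.charAt(semi) == ';') {
                    char r = entityToChar(s.substring(read + 1, semi));
                    if (r != '\0') {
                        s.setCharAt(write++, r);
                        read = semi + 1;
                        continue;
                    }
                }
                // Misplaced ampersand or unknown entity: fall through and copy it.
            }
            s.setCharAt(write++, c);
            read++;
        }
        s.setLength(write);
    }

    public static void main(String[] args) {
        StringBuilder s = new StringBuilder("a &lt; b &amp;&amp; c=1&d=2 &bogus;");
        replaceEntities(s);
        System.out.println(s);
    }
}
```

The misplaced ampersand in c=1&d=2 and the unknown entity are copied through unchanged, in keeping with the parser's forgiving attitude towards typical mistakes.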
scanEntity | protected int scanEntity(char[] a, int offset, int length, boolean loose, MutableString entity) | | Searches for the end of an entity.
This method will search for the end of an entity starting at the given offset (the offset
must correspond to the ampersand).
Real-world HTML pages often contain hundreds of misplaced ampersands, due to the
unfortunate idea of using the ampersand as query separator (please use the comma
in new code!). All such ampersands should be specified as &amp;.
If named entities are delimited using a transition
from alphabetical to non-alphabetical characters, we can easily get false positives. If the parameter
loose is false, named entities can be delimited only by whitespace or by a comma.
Parameters: a - a character array containing the entity. Parameters: offset - the offset at which the entity starts (the offset must point at the ampersand). Parameters: length - an upper bound to the maximum returned position. Parameters: loose - if true, named entities can be terminated by any non-alphabetical character (instead of whitespace or comma). Parameters: entity - a support mutable string used to query ParsingFactory.getEntity(MutableString). Returns: the position of the last character of the entity, or -1 if no entity was found. |
setCallback | public BulletParser setCallback(Callback callback) | | Sets the callback for this parser, resetting at the same time all parsing flags.
Parameters: callback - the new callback. Returns: this parser. |
|
|