Java Doc for BulletParser.java (Search Engine » mg4j » it.unimi.dsi.mg4j.util.parser)


java.lang.Object
   it.unimi.dsi.mg4j.util.parser.BulletParser

BulletParser
public class BulletParser
A fast, lightweight, on-demand (X)HTML parser.

The bullet parser has been written with two specific goals in mind: web crawling and targeted data extraction from massive web data sets. To be usable in such environments, a parser must obey a number of restrictions:

  • it should avoid excessive object creation (which, for instance, forbids a significant usage of Java strings);
  • it should tolerate invalid syntax and recover reasonably; in fact, it should never throw exceptions;
  • it should perform actual parsing only on a settable feature subset: there is no reason to parse the attributes of a P element while searching for links;
  • it should parse HTML as a regular language, and leave context-free properties (e.g., stack maintenance and repair) to suitably designed callbacks.

Thus, in fact, the bullet parser is not a parser. It is a bunch of spaghetti code that analyses a stream of characters pretending that it is an (X)HTML document. It has a very defensive attitude towards the character stream it is parsing, but at the same time it is forgiving of all typical (X)HTML mistakes.

The bullet parser is officially StringFree™. MutableStrings are used for internal processing, and Java strings are used only to return attribute values. All internal maps are from fastutil, which helps to speed up the parsing process further.

HTML data

The bullet parser uses attributes and methods of it.unimi.dsi.mg4j.util.parser.HTMLFactory, it.unimi.dsi.mg4j.util.parser.Element, it.unimi.dsi.mg4j.util.parser.Attribute and it.unimi.dsi.mg4j.util.parser.Entity. Thus, for instance, whenever an element is to be passed around, it is one of the shared objects contained in it.unimi.dsi.mg4j.util.parser.Element (e.g., it.unimi.dsi.mg4j.util.parser.Element.BODY).

Callbacks

The result of the parsing process is the invocation of a callback. The callback interface of the bullet parser closely resembles SAX2, but it has some additional methods targeted at (X)HTML, such as Callback.cdata(Element, char[], int, int), which returns characters found in a CDATA section (e.g., a stylesheet).

Each callback must configure the parser by requesting the analyses and the callbacks it requires. A callback that wants to extract and tokenise text, for instance, will certainly require parseText(true), but not parseTags(true). On the other hand, a callback wishing to extract links will need to parse certain attribute types (see parseAttribute(Attribute)).
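
As a rough illustration of this configure-then-parse pattern, here is a self-contained sketch. SketchCallback, SketchScanner and TextCollector are hypothetical stand-ins, not mg4j classes, and the real Callback interface has more methods. The sketch assumes that setting a callback resets the flags and then lets the callback switch on what it needs.

```java
/** Hypothetical stand-in for a parser callback; NOT the mg4j Callback interface. */
interface SketchCallback {
    /** Lets the callback switch on the features it needs. */
    void configure(SketchScanner scanner);
    void startElement(String name);
    void characters(char[] a, int offset, int length);
}

/** A toy scanner that treats markup as a flat character stream and does only the requested work. */
final class SketchScanner {
    boolean parseTags; // all flags are off by default
    boolean parseText;
    private SketchCallback callback;

    SketchScanner setCallback(SketchCallback callback) {
        parseTags = parseText = false; // reset, then let the callback add features
        this.callback = callback;
        callback.configure(this);
        return this;
    }

    void parse(char[] a, int offset, int length) {
        final int end = offset + length;
        int pos = offset;
        while (pos < end) {
            if (a[pos] == '<') {
                int close = pos + 1;
                while (close < end && a[close] != '>') close++; // an unclosed tag simply runs to the end
                if (parseTags) {
                    int nameEnd = pos + 1;
                    while (nameEnd < close && Character.isLetterOrDigit(a[nameEnd])) nameEnd++;
                    // Only opening tags with a name are reported in this sketch.
                    if (nameEnd > pos + 1) callback.startElement(new String(a, pos + 1, nameEnd - pos - 1));
                }
                pos = Math.min(close + 1, end);
            } else {
                final int start = pos;
                while (pos < end && a[pos] != '<') pos++;
                if (parseText) callback.characters(a, start, pos - start);
            }
        }
    }
}

/** A callback that only wants text, so it never asks for tag parsing. */
final class TextCollector implements SketchCallback {
    final StringBuilder text = new StringBuilder();
    int tags;

    public void configure(SketchScanner scanner) { scanner.parseText = true; }
    public void startElement(String name) { tags++; }
    public void characters(char[] a, int offset, int length) { text.append(a, offset, length); }
}
```

Passing a TextCollector to setCallback and then parsing the characters of `<p>Hello <b>world` collects the text `Hello world` and leaves the tag counter at zero, because the tag branch of the scanner is never taken.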

A more precise description follows.

Writing callbacks

The first important issue is what to request from the parser. A newly created parser does not invoke any callback. It is up to every callback to add features so that it can do its job. Remember that since many callbacks can be composed, you must always add features, never remove them; moreover, your callbacks must be ready to be invoked with features they did not request (e.g., attribute types added by another callback).
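
The "add, never remove" rule can be sketched with a small composition example. Everything here (Feature, FeatureCallback, ComposedSketch, TextCounter, LinkWatcher) is hypothetical and only models the idea; the real parser exposes features through its parseXxx(boolean) setters rather than through a set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical feature names used only by this sketch. */
enum Feature { TAGS, ATTRIBUTES, TEXT, CDATA }

/** Hypothetical stand-in for a composable callback. */
interface FeatureCallback {
    /** Adds the features this callback needs; it must never remove any. */
    void configure(Set<Feature> features);
    /** May be invoked even if this callback did not request text. */
    void characters(String text);
}

/** Forwards every call to all of its parts, so the feature set is the union of their requests. */
final class ComposedSketch implements FeatureCallback {
    private final List<FeatureCallback> parts = new ArrayList<>();

    ComposedSketch add(FeatureCallback part) { parts.add(part); return this; }

    public void configure(Set<Feature> features) {
        for (FeatureCallback part : parts) part.configure(features);
    }

    public void characters(String text) {
        for (FeatureCallback part : parts) part.characters(text);
    }
}

/** Requests text only. */
final class TextCounter implements FeatureCallback {
    int chars;
    public void configure(Set<Feature> features) { features.add(Feature.TEXT); }
    public void characters(String text) { chars += text.length(); }
}

/** Requests tags and attributes, and tolerates text events it never asked for. */
final class LinkWatcher implements FeatureCallback {
    public void configure(Set<Feature> features) {
        features.add(Feature.TAGS);
        features.add(Feature.ATTRIBUTES);
    }
    public void characters(String text) { /* not requested, silently ignored */ }
}
```

If either callback cleared a flag during configuration, the other would silently stop receiving events, which is why each part only ever adds to the shared set.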

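The following parse features may be configured; most of them are just boolean features, a.k.a. flags. Unless otherwise specified, all flags are set to false by default (e.g., by default the parser will not parse tags):

  • tags (parseTags(boolean)): whether the parser will parse tags and invoke element handlers;
  • attributes (parseAttributes(boolean) and parseAttribute(Attribute)): whether the parser will parse attributes, and which attributes will have their values parsed;
  • text (parseText(boolean)): whether the parser will invoke the text handler;
  • CDATA sections (parseCDATA(boolean)): whether the parser will invoke the CDATA-section handler.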

Invoking the parser

After setting the callback with BulletParser.setCallback(Callback), you just call BulletParser.parse(char[],int,int).



Field Summary
final protected static TextPattern CLOSED_CDATA
     Closed section (conditional, CDATA, etc.).
final protected static TextPattern CLOSED_COMMENT
     Closed comment.
final protected static TextPattern CLOSED_PERCENT
     Closed ASP or similar tag.
final protected static TextPattern CLOSED_PIC
     Closed processing instruction.
final protected static TextPattern CLOSED_SECTION
     Closed section (conditional, etc.).
final protected static int HEXADECIMAL
     The base for non-decimal entity.
final protected static int MAX_DEC_ENTITY_LENGTH
     The maximum number of digits of a decimal numeric entity.
final protected static int MAX_ENTITY_VALUE
     The maximum Unicode value accepted for a numeric entity.
final protected static int MAX_HEX_ENTITY_LENGTH
     The maximum number of digits of a hexadecimal numeric entity.
final protected static char[] NONSPACE_WHITESPACE
     An array containing the non-space whitespace.
final protected static TextPattern SCRIPT_CLOSE_TAG_PATTERN
     Closing tag for a script element.
final protected static char[] SPACE
     An array, parallel to BulletParser.NONSPACE_WHITESPACE, containing spaces.
final protected static int STATE_BEFORE_END_TAG_NAME
     Scanning a closing tag.
final protected static int STATE_BEFORE_START_TAG_NAME
     Scanning attribute name/value pairs.
final protected static int STATE_IN_END_TAG
     Scanning a closing tag.
final protected static int STATE_IN_START_TAG
     Scanning attribute name/value pairs.
final protected static int STATE_TEXT
     Scanning text.
final protected static TextPattern STYLE_CLOSE_TAG_PATTERN
     Closing tag for a style element.
protected Reference2ObjectMap<Attribute, MutableString> attrMap
     A map from attributes to attribute values.
protected Callback callback
     The callback of this parser.
final public ParsingFactory factory
     The parsing factory used by this parser.
protected char lastEntity
     The character represented by the last scanned entity.
protected boolean parseAttributes
     Whether we should parse attributes.
protected boolean parseCDATA
     Whether we should invoke the CDATA section handler.
protected boolean parseTags
     Whether we should parse tags.
protected boolean parseText
     Whether we should invoke the text handler.
public ReferenceSet<Attribute> parsedAttributes
     An externally visible, immutable subset of attributes whose values will be actually parsed.
protected ReferenceArraySet<Attribute> parsedAttrs
     The subset of attributes whose values will be actually parsed (if, of course, BulletParser.parseAttributes is true).

Constructor Summary
public  BulletParser(ParsingFactory factory)
     Creates a new bullet parser.
public  BulletParser()
     Creates a new bullet parser using the default factory HTMLFactory.INSTANCE.

Method Summary
protected char entity2Char(MutableString name)
     Returns the character corresponding to a given entity name.
Parameters:
  name - the name of an entity.
protected int handleMarkup(char[] text, int pos, int end)
     Handles markup.
Parameters:
  text - the text.
  pos - the first character in the markup after <!.
  end - the end of text.
protected int handleProcessingInstruction(char[] text, int pos, int end)
     Handles processing instruction, ASP tags etc.
Parameters:
  text - the text.
  pos - the first character in the markup after <%.
  end - the end of text.
public void parse(char[] text)
     Analyze the text document to extract information.
public void parse(char[] text, int offset, int length)
     Analyze the text document to extract information.
public BulletParser parseAttribute(Attribute attribute)
     Adds the given attribute to the set of attributes to be parsed.
Parameters:
  attribute - an attribute that should be parsed.
Throws:
  IllegalStateException - if parseAttributes(true) has not been invoked on this parser.
public boolean parseAttributes()
     Returns whether this parser will parse attributes.
public BulletParser parseAttributes(boolean parseAttributes)
     Sets the attribute parsing flag.
Parameters:
  parseAttributes - the new value for the flag.
public boolean parseCDATA()
     Returns whether this parser will invoke the CDATA-section handler.
public BulletParser parseCDATA(boolean parseCDATA)
     Sets the CDATA-section handler flag.
Parameters:
  parseCDATA - the new value.
public boolean parseTags()
     Returns whether this parser will parse tags and invoke element handlers.
public BulletParser parseTags(boolean parseTags)
     Sets whether this parser will parse tags and invoke element handlers.
Parameters:
  parseTags - the new value.
public boolean parseText()
     Returns whether this parser will invoke the text handler.
public BulletParser parseText(boolean parseText)
     Sets the text handler flag.
Parameters:
  parseText - the new value.
protected void replaceEntities(MutableString s, MutableString entity, boolean loose)
     Replaces entities with the corresponding characters.
protected int scanEntity(char[] a, int offset, int length, boolean loose, MutableString entity)
     Searches for the end of an entity.
public BulletParser setCallback(Callback callback)
     Sets the callback for this parser, resetting at the same time all parsing flags.
Parameters:
  callback - the new callback.

Field Detail
CLOSED_CDATA
final protected static TextPattern CLOSED_CDATA
Closed section (conditional, CDATA, etc.).

CLOSED_COMMENT
final protected static TextPattern CLOSED_COMMENT
Closed comment. It should be "-->", but mistakes are common.

CLOSED_PERCENT
final protected static TextPattern CLOSED_PERCENT
Closed ASP or similar tag.

CLOSED_PIC
final protected static TextPattern CLOSED_PIC
Closed processing instruction.

CLOSED_SECTION
final protected static TextPattern CLOSED_SECTION
Closed section (conditional, etc.).

HEXADECIMAL
final protected static int HEXADECIMAL
The base for non-decimal entity.

MAX_DEC_ENTITY_LENGTH
final protected static int MAX_DEC_ENTITY_LENGTH
The maximum number of digits of a decimal numeric entity.

MAX_ENTITY_VALUE
final protected static int MAX_ENTITY_VALUE
The maximum Unicode value accepted for a numeric entity.

MAX_HEX_ENTITY_LENGTH
final protected static int MAX_HEX_ENTITY_LENGTH
The maximum number of digits of a hexadecimal numeric entity.
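
To show how limits of this kind can bound the decoding of a numeric entity, here is a minimal sketch. The constant values are assumptions chosen for illustration (this page does not list the actual values), and NumericEntitySketch is a hypothetical class, not the parser's own code. It borrows the convention of returning an ASCII NUL on failure instead of throwing.

```java
final class NumericEntitySketch {
    // Assumed values, for illustration only; the real constants are not shown on this page.
    static final int HEXADECIMAL = 16;
    static final int MAX_DEC_ENTITY_LENGTH = 5;
    static final int MAX_HEX_ENTITY_LENGTH = 4;
    static final int MAX_ENTITY_VALUE = 0xFFFF;

    /**
     * Decodes the body of a numeric entity, i.e. the characters between "&#" and the terminator.
     * Returns the decoded character, or 0 (ASCII NUL) if the body is empty, malformed,
     * too long or out of range. It never throws on bad digits.
     */
    static char decode(char[] a, int offset, int length) {
        if (length <= 0) return 0;
        int base = 10;
        int maxDigits = MAX_DEC_ENTITY_LENGTH;
        if (a[offset] == 'x' || a[offset] == 'X') { // hexadecimal form, e.g. &#xE9;
            base = HEXADECIMAL;
            maxDigits = MAX_HEX_ENTITY_LENGTH;
            offset++;
            length--;
        }
        if (length <= 0 || length > maxDigits) return 0; // bounded length also rules out int overflow
        int value = 0;
        for (int i = offset; i < offset + length; i++) {
            final int digit = Character.digit(a[i], base);
            if (digit < 0) return 0; // not a digit in this base
            value = value * base + digit;
        }
        return value > MAX_ENTITY_VALUE ? 0 : (char) value;
    }

    /** Convenience overload for experiments; a string-free parser would use the char[] form. */
    static char decode(String body) {
        return decode(body.toCharArray(), 0, body.length());
    }
}
```

With these assumed limits, both `233` and `xE9` decode to U+00E9, while `99999` (out of range) and `123456` (too many digits) yield NUL.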



NONSPACE_WHITESPACE
final protected static char[] NONSPACE_WHITESPACE
An array containing the non-space whitespace.

SCRIPT_CLOSE_TAG_PATTERN
final protected static TextPattern SCRIPT_CLOSE_TAG_PATTERN
Closing tag for a script element.

SPACE
final protected static char[] SPACE
An array, parallel to BulletParser.NONSPACE_WHITESPACE, containing spaces.
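
Two parallel arrays like these are presumably meant for a character-for-character replacement that turns every kind of whitespace into a plain space. The sketch below shows that idea with a hand-written in-place loop; the array contents and the WhitespaceSketch class are assumptions, since this page does not list the actual characters or say how the arrays are used.

```java
import java.util.Arrays;

final class WhitespaceSketch {
    // Assumed contents, for illustration only.
    static final char[] NONSPACE_WHITESPACE = { '\n', '\r', '\t' };
    // Same length as NONSPACE_WHITESPACE, filled with spaces.
    static final char[] SPACE = new char[NONSPACE_WHITESPACE.length];
    static { Arrays.fill(SPACE, ' '); }

    /** Replaces, in place and without allocating, every occurrence of from[j] with to[j]. */
    static void replace(char[] text, int offset, int length, char[] from, char[] to) {
        for (int i = offset; i < offset + length; i++) {
            for (int j = 0; j < from.length; j++) {
                if (text[i] == from[j]) {
                    text[i] = to[j];
                    break;
                }
            }
        }
    }
}
```

Applied to the characters of `a\tb\nc`, the call `replace(text, 0, text.length, NONSPACE_WHITESPACE, SPACE)` leaves `a b c` in the same array.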



STATE_BEFORE_END_TAG_NAME
final protected static int STATE_BEFORE_END_TAG_NAME
Scanning a closing tag.

STATE_BEFORE_START_TAG_NAME
final protected static int STATE_BEFORE_START_TAG_NAME
Scanning attribute name/value pairs.

STATE_IN_END_TAG
final protected static int STATE_IN_END_TAG
Scanning a closing tag.

STATE_IN_START_TAG
final protected static int STATE_IN_START_TAG
Scanning attribute name/value pairs.

STATE_TEXT
final protected static int STATE_TEXT
Scanning text.

STYLE_CLOSE_TAG_PATTERN
final protected static TextPattern STYLE_CLOSE_TAG_PATTERN
Closing tag for a style element.

attrMap
protected Reference2ObjectMap<Attribute, MutableString> attrMap
A map from attributes to attribute values.

callback
protected Callback callback
The callback of this parser.

factory
final public ParsingFactory factory
The parsing factory used by this parser.

lastEntity
protected char lastEntity
The character represented by the last scanned entity.

parseAttributes
protected boolean parseAttributes
Whether we should parse attributes.

parseCDATA
protected boolean parseCDATA
Whether we should invoke the CDATA section handler.

parseTags
protected boolean parseTags
Whether we should parse tags.

parseText
protected boolean parseText
Whether we should invoke the text handler.

parsedAttributes
public ReferenceSet<Attribute> parsedAttributes
An externally visible, immutable subset of attributes whose values will be actually parsed.

parsedAttrs
protected ReferenceArraySet<Attribute> parsedAttrs
The subset of attributes whose values will be actually parsed (if, of course, BulletParser.parseAttributes is true).
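
The pairing of a mutable internal set with an immutable public one can be sketched with standard collections. This is only an analogy: AttributeSetSketch is hypothetical, plain objects stand in for Attribute instances, and an identity-based JDK set stands in for the fastutil reference sets, which compare elements by identity rather than by equals().

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class AttributeSetSketch {
    /** Internal, mutable, identity-based set; stands in for parsedAttrs. */
    private final Set<Object> parsedAttrs = Collections.newSetFromMap(new IdentityHashMap<>());

    /** Read-only live view of the set above; stands in for parsedAttributes. */
    final Set<Object> parsedAttributes = Collections.unmodifiableSet(parsedAttrs);

    /** Registers an attribute; the change is immediately visible through the public view. */
    AttributeSetSketch parseAttribute(Object attribute) {
        parsedAttrs.add(attribute);
        return this;
    }
}
```

Callers can test membership through the public view, but any attempt to modify it throws UnsupportedOperationException, so registration has to go through parseAttribute.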




Constructor Detail
BulletParser
public BulletParser(ParsingFactory factory)
Creates a new bullet parser.

BulletParser
public BulletParser()
Creates a new bullet parser using the default factory HTMLFactory.INSTANCE.




Method Detail
entity2Char
protected char entity2Char(MutableString name)
Returns the character corresponding to a given entity name.
Parameters:
  name - the name of an entity.
Returns:
  the character corresponding to the entity, or an ASCII NUL if no entity with that name was found.

handleMarkup
protected int handleMarkup(char[] text, int pos, int end)
Handles markup.
Parameters:
  text - the text.
  pos - the first character in the markup after <!.
  end - the end of text.
Returns:
  the position of the first character after the markup.

handleProcessingInstruction
protected int handleProcessingInstruction(char[] text, int pos, int end)
Handles processing instruction, ASP tags etc.
Parameters:
  text - the text.
  pos - the first character in the markup after <%.
  end - the end of text.
Returns:
  the position of the first character after the processing instruction.

parse
public void parse(char[] text)
Analyze the text document to extract information.
Parameters:
  text - a char array of text to be parsed.

parse
public void parse(char[] text, int offset, int length)
Analyze the text document to extract information.
Parameters:
  text - a char array of text to be parsed.
  offset - the offset in the array from which the parsing will begin.
  length - the number of characters to be parsed.

parseAttribute
public BulletParser parseAttribute(Attribute attribute)
Adds the given attribute to the set of attributes to be parsed.
Parameters:
  attribute - an attribute that should be parsed.
Returns:
  this parser.
Throws:
  IllegalStateException - if parseAttributes(true) has not been invoked on this parser.
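
Since the setters return the parser itself, configuration calls can be chained, and the ordering constraint on parseAttribute shows up as an exception. The following sketch models just that contract with a hypothetical FluentConfigSketch class; attribute names are plain strings here, whereas the real method takes an Attribute.

```java
import java.util.HashSet;
import java.util.Set;

final class FluentConfigSketch {
    private boolean parseTags;
    private boolean parseAttributes;
    private final Set<String> attributes = new HashSet<>();

    FluentConfigSketch parseTags(boolean parseTags) {
        this.parseTags = parseTags;
        return this; // returning this allows chaining
    }

    FluentConfigSketch parseAttributes(boolean parseAttributes) {
        this.parseAttributes = parseAttributes;
        return this;
    }

    /** Registers one attribute; attribute parsing must have been enabled first. */
    FluentConfigSketch parseAttribute(String attribute) {
        if (!parseAttributes) throw new IllegalStateException("parseAttributes(true) has not been invoked");
        attributes.add(attribute);
        return this;
    }

    /** An attribute value is extracted only if tags, attributes and that attribute were all requested. */
    boolean willParse(String attribute) {
        return parseTags && parseAttributes && attributes.contains(attribute);
    }
}
```

A link extractor would typically chain `parseTags(true).parseAttributes(true).parseAttribute("href")`; calling parseAttribute on a fresh instance fails fast instead of silently registering an attribute that would never be parsed.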



parseAttributes
public boolean parseAttributes()
Returns whether this parser will parse attributes.
Returns:
  whether this parser will parse attributes.
See Also:   BulletParser.parseAttributes(boolean)

parseAttributes
public BulletParser parseAttributes(boolean parseAttributes)
Sets the attribute parsing flag.
Parameters:
  parseAttributes - the new value for the flag.
Returns:
  this parser.

parseCDATA
public boolean parseCDATA()
Returns whether this parser will invoke the CDATA-section handler.
Returns:
  whether this parser will invoke the CDATA-section handler.
See Also:   BulletParser.parseCDATA(boolean)

parseCDATA
public BulletParser parseCDATA(boolean parseCDATA)
Sets the CDATA-section handler flag.
Parameters:
  parseCDATA - the new value.
Returns:
  this parser.

parseTags
public boolean parseTags()
Returns whether this parser will parse tags and invoke element handlers.
Returns:
  whether this parser will parse tags and invoke element handlers.
See Also:   BulletParser.parseTags(boolean)

parseTags
public BulletParser parseTags(boolean parseTags)
Sets whether this parser will parse tags and invoke element handlers.
Parameters:
  parseTags - the new value.
Returns:
  this parser.

parseText
public boolean parseText()
Returns whether this parser will invoke the text handler.
Returns:
  whether this parser will invoke the text handler.
See Also:   BulletParser.parseText(boolean)

parseText
public BulletParser parseText(boolean parseText)
Sets the text handler flag.
Parameters:
  parseText - the new value.
Returns:
  this parser.

replaceEntities
protected void replaceEntities(MutableString s, MutableString entity, boolean loose)
Replaces entities with the corresponding characters.

This method will modify the mutable string s so that all legal occurrences of entities are replaced by the corresponding character.
Parameters:
  s - a mutable string whose entities will be replaced by the corresponding characters.
  entity - a support mutable string used by BulletParser.scanEntity(char[],int,int,boolean,MutableString).
  loose - a parameter that will be passed to BulletParser.scanEntity(char[],int,int,boolean,MutableString).

scanEntity
protected int scanEntity(char[] a, int offset, int length, boolean loose, MutableString entity)
Searches for the end of an entity.

This method will search for the end of an entity starting at the given offset (the offset must correspond to the ampersand).

Real-world HTML pages often contain hundreds of misplaced ampersands, due to the unfortunate idea of using the ampersand as query separator (please use the comma in new code!). All such ampersands should be specified as &amp;. If named entities are delimited using a transition from alphabetical to non-alphabetical characters, we can easily get false positives. If the parameter loose is false, named entities can be delimited only by whitespace or by a comma.
Parameters:
  a - a character array containing the entity.
  offset - the offset at which the entity starts (the offset must point at the ampersand).
  length - an upper bound to the maximum returned position.
  loose - if true, named entities can be terminated by any non-alphabetical character (instead of whitespace or comma).
  entity - a support mutable string used to query ParsingFactory.getEntity(MutableString).
Returns:
  the position of the last character of the entity, or -1 if no entity was found.
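
The false-positive problem with loose termination can be illustrated with a small sketch. EntityScanSketch, its tiny entity table and its return convention (the index just past the entity) are all assumptions made for this example and do not reproduce the parser's code. The strict-mode terminator is passed in as a parameter because the sketch does not try to pin down the parser's exact rule, and unlike the real parser it allocates a String for the lookup.

```java
final class EntityScanSketch {
    /** A tiny assumed entity table; returns 0 (ASCII NUL) for unknown names. */
    static char lookup(String name) {
        switch (name) {
            case "amp": return '&';
            case "lt": return '<';
            case "gt": return '>';
            case "copy": return '\u00A9';
            default: return 0;
        }
    }

    /**
     * Looks for a named entity starting at a[offset], which must be an ampersand.
     * In strict mode the name must be followed by the terminator or by whitespace;
     * in loose mode any character that is not a letter or digit ends the name.
     * Returns the index just past the entity (terminator included), or -1 if none is recognised.
     */
    static int scanNamedEntity(char[] a, int offset, int end, boolean loose, char terminator) {
        int i = offset + 1;
        while (i < end && Character.isLetterOrDigit(a[i])) i++;
        if (i == offset + 1) return -1; // a bare ampersand
        final boolean delimited = i < end && (a[i] == terminator || Character.isWhitespace(a[i]));
        if (!loose && !delimited) return -1; // strict mode requires an explicit delimiter
        if (lookup(new String(a, offset + 1, i - offset - 1)) == 0) return -1; // unknown name
        return i < end && a[i] == terminator ? i + 1 : i;
    }
}
```

With `;` as the terminator, the ampersand in `a &amp; b` is recognised in both modes. In the query string `?x=1&copy=2`, loose mode wrongly reads `&copy` as the copyright sign, while strict mode rejects it because `=` is not a delimiter.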




setCallback
public BulletParser setCallback(Callback callback)
Sets the callback for this parser, resetting at the same time all parsing flags.
Parameters:
  callback - the new callback.
Returns:
  this parser.



Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException
