Java Doc for UURIFactory.java (heritrix web crawler, package org.archive.net)


org.archive.net.UURIFactory

UURIFactory
public class UURIFactory extends URI
Factory that returns UURIs. Escapes and fixes up URIs in accordance with RFC 2396 and to match browser practice. For example, it removes any '..' that comes first in the path (as IE does), converts backslashes to forward slashes, and discards any fragment (anchor) portion of the URI. This class also fails URIs that are longer than IE's maximum allowed length.

TODO: Test logging.
author: stack
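The fixups listed in the class description can be sketched with plain JDK string and regex calls. This is a minimal illustration only, not the Heritrix implementation: the class name UriFixupSketch, the method fixup, and the regex are all assumptions made for the example.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the described fixups; not the Heritrix code. */
final class UriFixupSketch {
    // One or more "/.." segments directly after the authority (assumed regex).
    private static final Pattern LEADING_DOTDOT =
        Pattern.compile("^(\\w+://[^/]*)(?:/\\.\\.)+(?=/|$)");

    static String fixup(String uri) {
        // Convert backslashes to forward slashes.
        String s = uri.replace('\\', '/');
        // Discard any fragment (anchor) portion.
        int hash = s.indexOf('#');
        if (hash >= 0) {
            s = s.substring(0, hash);
        }
        // Remove any '..' that comes first in the path.
        return LEADING_DOTDOT.matcher(s).replaceFirst("$1");
    }

    public static void main(String[] args) {
        // prints "http://example.com/a/b.html"
        System.out.println(fixup("http://example.com/../a\\b.html#top"));
    }
}
```

The real factory also handles escaping and length checks, which this sketch leaves out.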



Field Summary
final static String ACCEPTABLE_ASCII_DOMAIN
     Characters we'll accept in the domain label part of a URI authority: ASCII letters-digits-hyphen (LDH) plus underscore, with single intervening '.' characters.
final public static String APOSTROPH
final public static String BACKSLASH
final public static String BACKSLASH_PATTERN
final public static String CIRCUMFLEX
final public static String CIRCUMFLEX_PATTERN
final public static char COLON
final public static String COMMERCIAL_AT
final public static String DOT
final public static String EMPTY_STRING
final public static String ESCAPED_APOSTROPH
final public static String ESCAPED_BACKSLASH
final public static String ESCAPED_CIRCUMFLEX
final public static String ESCAPED_LCURBRACKET
final public static String ESCAPED_LSQRBRACKET
final public static String ESCAPED_PIPE
final public static String ESCAPED_QUOT
final public static String ESCAPED_RCURBRACKET
final public static String ESCAPED_RSQRBRACKET
final public static String ESCAPED_SPACE
final public static String ESCAPED_SQUOT
final public static String HTTP
final public static String HTTPS
final public static String HTTPS_PORT
final public static String HTTP_PORT
final static Pattern HTTP_SCHEME_SLASHES
     Pattern that looks for case of three or more slashes after the scheme.
final public static int IGNORED_SCHEME
final public static String IMPROPERESC
final public static String IMPROPERESC_REPLACE
final public static String LCURBRACKET
final public static String LCURBRACKET_PATTERN
final public static String LSQRBRACKET
final public static String LSQRBRACKET_PATTERN
final static Pattern MULTIPLE_SLASHES
     Pattern that looks for case of two or more slashes in a path.
final public static String NBSP
final public static char PERCENT_SIGN
final public static String PIPE
final public static String PIPE_PATTERN
final static Pattern PORTREGEX
     Authority port number regex.
final public static String QUOT
final public static String RCURBRACKET
final public static String RCURBRACKET_PATTERN
final static Pattern RFC2396REGEX
     RFC 2396-inspired regex.
final public static String RSQRBRACKET
final public static String RSQRBRACKET_PATTERN
final public static String SLASH
final public static String SLASHDOTDOTSLASH
final public static String SPACE
final public static String SQUOT
final public static String STRAY_SPACING
final public static String TRAILING_ESCAPED_SPACE
final public static String URI_HEX_ENCODING
     First percent sign in string followed by two hex chars.


Method Summary
protected void checkHttpSchemeSpecificPartSlashPrefix(URI base, String scheme, String schemeSpecificPart)
     If http(s) scheme, check scheme specific part begins '//'.
throws:
  URIException -
See Also: http://www.faqs.org/rfcs/rfc1738.html Section 3.1.
protected String escapeWhitespace(String uri)
     Escape any whitespace found. The parent class takes care of the bulk of escaping.
public static UURI getInstance(String uri)
Parameters:
  uri - URI as string.
public static UURI getInstance(String uri, String charset)
Parameters:
  uri - URI as string.
  charset - Character encoding of the passed uri string.
public static UURI getInstance(UURI base, String relative)
Parameters:
  base - Base uri to use resolving passed relative uri.
  relative - URI as string.
public static boolean hasSupportedScheme(String possibleUrl)
     Test of whether passed String has an allowed URI scheme. First tests if likely scheme suffix.
protected UURI validityCheck(UURI uuri)
     Check the generated UURI. At the least look at length of uuri string.

Field Detail
ACCEPTABLE_ASCII_DOMAIN
final static String ACCEPTABLE_ASCII_DOMAIN
Characters we'll accept in the domain label part of a URI authority: ASCII letters-digits-hyphen (LDH) plus underscore, with single intervening '.' characters. (We accept '_' because DNS servers have tolerated it for many years, counter to spec; we also accept dash patterns and ACE prefixes that will be rejected by an IDN-punycoding attempt.)
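As an illustration of that rule, a regex for LDH-plus-underscore labels joined by single dots might look like the following. The actual value of ACCEPTABLE_ASCII_DOMAIN is not shown on this page, so the pattern and the class name DomainLabelSketch are assumptions.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch; the real constant's value is not shown here. */
final class DomainLabelSketch {
    // Assumed pattern: labels of letters, digits, hyphen, or underscore,
    // separated by single '.' characters.
    private static final Pattern ACCEPTABLE =
        Pattern.compile("^[a-zA-Z0-9_-]+(?:\\.[a-zA-Z0-9_-]+)*$");

    static boolean isAcceptable(String domain) {
        return ACCEPTABLE.matcher(domain).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("www.example_site.com")); // prints "true"
        System.out.println(isAcceptable("www..example.com"));     // prints "false"
    }
}
```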



APOSTROPH
final public static String APOSTROPH

BACKSLASH
final public static String BACKSLASH

BACKSLASH_PATTERN
final public static String BACKSLASH_PATTERN

CIRCUMFLEX
final public static String CIRCUMFLEX

CIRCUMFLEX_PATTERN
final public static String CIRCUMFLEX_PATTERN

COLON
final public static char COLON

COMMERCIAL_AT
final public static String COMMERCIAL_AT

DOT
final public static String DOT

EMPTY_STRING
final public static String EMPTY_STRING

ESCAPED_APOSTROPH
final public static String ESCAPED_APOSTROPH

ESCAPED_BACKSLASH
final public static String ESCAPED_BACKSLASH

ESCAPED_CIRCUMFLEX
final public static String ESCAPED_CIRCUMFLEX

ESCAPED_LCURBRACKET
final public static String ESCAPED_LCURBRACKET

ESCAPED_LSQRBRACKET
final public static String ESCAPED_LSQRBRACKET

ESCAPED_PIPE
final public static String ESCAPED_PIPE

ESCAPED_QUOT
final public static String ESCAPED_QUOT

ESCAPED_RCURBRACKET
final public static String ESCAPED_RCURBRACKET

ESCAPED_RSQRBRACKET
final public static String ESCAPED_RSQRBRACKET

ESCAPED_SPACE
final public static String ESCAPED_SPACE

ESCAPED_SQUOT
final public static String ESCAPED_SQUOT

HTTP
final public static String HTTP

HTTPS
final public static String HTTPS

HTTPS_PORT
final public static String HTTPS_PORT

HTTP_PORT
final public static String HTTP_PORT



HTTP_SCHEME_SLASHES
final static Pattern HTTP_SCHEME_SLASHES
Pattern that looks for three or more slashes after the scheme. If found, we replace them with just two, as Mozilla does.
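A minimal sketch of that replacement follows. The regex is an assumption (the real pattern's value is not shown on this page), as is the class name SchemeSlashesSketch.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of collapsing extra slashes after an http(s) scheme. */
final class SchemeSlashesSketch {
    // Assumed regex: "http:" or "https:" followed by three or more slashes.
    private static final Pattern EXTRA_SLASHES =
        Pattern.compile("^(https?:)/{3,}", Pattern.CASE_INSENSITIVE);

    static String collapse(String uri) {
        // Keep the scheme and replace the run of slashes with exactly two.
        return EXTRA_SLASHES.matcher(uri).replaceFirst("$1//");
    }

    public static void main(String[] args) {
        // prints "http://example.com/a"
        System.out.println(collapse("http:////example.com/a"));
    }
}
```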



IGNORED_SCHEME
final public static int IGNORED_SCHEME

IMPROPERESC
final public static String IMPROPERESC

IMPROPERESC_REPLACE
final public static String IMPROPERESC_REPLACE

LCURBRACKET
final public static String LCURBRACKET

LCURBRACKET_PATTERN
final public static String LCURBRACKET_PATTERN

LSQRBRACKET
final public static String LSQRBRACKET

LSQRBRACKET_PATTERN
final public static String LSQRBRACKET_PATTERN



MULTIPLE_SLASHES
final static Pattern MULTIPLE_SLASHES
Pattern that looks for case of two or more slashes in a path.



NBSP
final public static String NBSP

PERCENT_SIGN
final public static char PERCENT_SIGN

PIPE
final public static String PIPE

PIPE_PATTERN
final public static String PIPE_PATTERN



PORTREGEX
final static Pattern PORTREGEX
Authority port number regex.



QUOT
final public static String QUOT

RCURBRACKET
final public static String RCURBRACKET

RCURBRACKET_PATTERN
final public static String RCURBRACKET_PATTERN



RFC2396REGEX
final static Pattern RFC2396REGEX
RFC 2396-inspired regex. From the RFC Appendix B:
 URI Generic Syntax                August 1998
 B. Parsing a URI Reference with a Regular Expression
 As described in Section 4.3, the generic URI syntax is not sufficient
 to disambiguate the components of some forms of URI.  Since the
 "greedy algorithm" described in that section is identical to the
 disambiguation method used by POSIX regular expressions, it is
 natural and commonplace to use a regular expression for parsing the
 potential four components and fragment identifier of a URI reference.
 The following line is the regular expression for breaking-down a URI
 reference into its components.
 ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
 The numbers in the second line above are only to assist readability;
 they indicate the reference points for each subexpression (i.e., each
 paired parenthesis).  We refer to the value matched for subexpression
 <n> as $<n>.  For example, matching the above expression to
 http://www.ics.uci.edu/pub/ietf/uri/#Related
 results in the following subexpression matches:
 $1 = http:
 $2 = http
 $3 = //www.ics.uci.edu
 $4 = www.ics.uci.edu
 $5 = /pub/ietf/uri/
 $6 = <undefined>
 $7 = <undefined>
 $8 = #Related
 $9 = Related
 where <undefined> indicates that the component is not present, as is
 the case for the query component in the above example.  Therefore, we
 can determine the value of the four components and fragment as
 scheme    = $2
 authority = $4
 path      = $5
 query     = $7
 fragment  = $9
 
--

The pattern used here differs from the RFC regex in that it has Java escaping of regex characters and allows a URI made of a fragment only. (An extra group was added, so group indexes are off by one after the scheme.)
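The quoted Appendix B expression can be tried directly with java.util.regex. The sketch below uses the RFC's own regex with Java string escaping, not the modified Heritrix pattern (whose value is not shown on this page), so the group numbers follow the RFC. The class name Rfc2396RegexSketch and the method split are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch applying the RFC 2396 Appendix B regex as published. */
final class Rfc2396RegexSketch {
    // The RFC's expression; only the backslash is doubled for the Java literal.
    static final Pattern URI_REFERENCE = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    /** Returns {scheme, authority, path, query, fragment}; absent parts are null. */
    static String[] split(String uri) {
        Matcher m = URI_REFERENCE.matcher(uri);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a URI reference: " + uri);
        }
        return new String[] {
            m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)
        };
    }

    public static void main(String[] args) {
        String[] parts = split("http://www.ics.uci.edu/pub/ietf/uri/#Related");
        System.out.println("scheme    = " + parts[0]);
        System.out.println("authority = " + parts[1]);
        System.out.println("path      = " + parts[2]);
        System.out.println("query     = " + parts[3]);
        System.out.println("fragment  = " + parts[4]);
    }
}
```

An absent component comes back as null from Matcher.group, which corresponds to the RFC's <undefined>.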




RSQRBRACKET
final public static String RSQRBRACKET

RSQRBRACKET_PATTERN
final public static String RSQRBRACKET_PATTERN

SLASH
final public static String SLASH

SLASHDOTDOTSLASH
final public static String SLASHDOTDOTSLASH

SPACE
final public static String SPACE

SQUOT
final public static String SQUOT

STRAY_SPACING
final public static String STRAY_SPACING

TRAILING_ESCAPED_SPACE
final public static String TRAILING_ESCAPED_SPACE



URI_HEX_ENCODING
final public static String URI_HEX_ENCODING
First percent sign in string followed by two hex chars.
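One way to express that test with java.util.regex is sketched below. The regex here is an assumption, since the constant's value is not shown on this page, and the class name HexEncodingSketch is hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: is the first '%' in the string followed by two hex chars? */
final class HexEncodingSketch {
    // Assumed regex: skip non-'%' characters, then require '%' plus two hex digits.
    private static final Pattern FIRST_PERCENT_IS_ESCAPE =
        Pattern.compile("^[^%]*%\\p{XDigit}{2}");

    static boolean looksEscaped(String uri) {
        return FIRST_PERCENT_IS_ESCAPE.matcher(uri).find();
    }

    public static void main(String[] args) {
        System.out.println(looksEscaped("http://example.com/a%20b")); // prints "true"
        System.out.println(looksEscaped("http://example.com/100%"));  // prints "false"
    }
}
```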





Method Detail
checkHttpSchemeSpecificPartSlashPrefix
protected void checkHttpSchemeSpecificPartSlashPrefix(URI base, String scheme, String schemeSpecificPart) throws URIException
If the scheme is http(s), check that the scheme-specific part begins with '//'.
throws:
  URIException -
See Also: http://www.faqs.org/rfcs/rfc1738.html Section 3.1. Common Internet Scheme Syntax
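The check itself is small. Here is a simplified sketch that ignores the base parameter and throws IllegalArgumentException in place of URIException, both of which are assumptions made to keep the example free of external dependencies. The class name HttpSlashPrefixSketch is hypothetical.

```java
/** Hypothetical, simplified sketch of the '//' prefix check for http(s). */
final class HttpSlashPrefixSketch {
    static void check(String scheme, String schemeSpecificPart) {
        if (scheme == null) {
            return; // nothing to check without a scheme
        }
        boolean http = scheme.equalsIgnoreCase("http")
            || scheme.equalsIgnoreCase("https");
        if (http && (schemeSpecificPart == null
                || !schemeSpecificPart.startsWith("//"))) {
            throw new IllegalArgumentException(
                "http(s) scheme-specific part must begin with '//': "
                + schemeSpecificPart);
        }
    }

    public static void main(String[] args) {
        check("http", "//example.com/index.html"); // passes silently
        check("mailto", "user@example.com");       // not http(s), so not checked
        System.out.println("ok");
    }
}
```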



escapeWhitespace
protected String escapeWhitespace(String uri)
Escape any whitespace found. The parent class takes care of the bulk of escaping, but if any instance of escaping is found in the URI, we ask the parent to do NO escaping. Here we escape any whitespace found, irrespective of whether the uri has already been escaped. This covers the case where the uri has been judged already-escaped but the escaping was done incompletely and whitespace remains. Spaces and the like in a URI are a real pain: their presence will break log file and ARC parsing.
Parameters:
  uri - URI string to check.
Returns:
  uri with spaces escaped if any found.
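A minimal sketch of whitespace-only escaping follows. It percent-encodes every character that Character.isWhitespace reports, using its UTF-8 bytes; which characters the real method treats as whitespace is not stated on this page, so that choice and the class name WhitespaceEscapeSketch are assumptions.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: escape whitespace only, leaving existing escapes alone. */
final class WhitespaceEscapeSketch {
    static String escapeWhitespace(String uri) {
        StringBuilder out = new StringBuilder(uri.length());
        for (int i = 0; i < uri.length(); i++) {
            char c = uri.charAt(i);
            if (Character.isWhitespace(c)) {
                // Percent-encode each UTF-8 byte of the whitespace character.
                for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "http://example.com/a%20b%20c"
        System.out.println(escapeWhitespace("http://example.com/a%20b c"));
    }
}
```

Because only whitespace is touched, a partially escaped input keeps its existing %XX sequences.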



getInstance
public static UURI getInstance(String uri) throws URIException

Parameters:
  uri - URI as string.
Returns:
  An instance of UURI
throws:
  URIException -



getInstance
public static UURI getInstance(String uri, String charset) throws URIException

Parameters:
  uri - URI as string.
  charset - Character encoding of the passed uri string.
Returns:
  An instance of UURI
throws:
  URIException -



getInstance
public static UURI getInstance(UURI base, String relative) throws URIException

Parameters:
  base - Base uri to use resolving passed relative uri.
  relative - URI as string.
Returns:
  An instance of UURI
throws:
  URIException -
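To show what resolving a relative URI against a base means, here is a stand-in that uses the JDK's java.net.URI rather than UURIFactory, so it runs without the Heritrix library. It does not apply any of the factory's fixups, and the class name RelativeResolveSketch is hypothetical.

```java
import java.net.URI;

/** Stand-in sketch using java.net.URI, not UURIFactory. */
final class RelativeResolveSketch {
    static String resolve(String base, String relative) {
        // URI.resolve merges the relative reference into the base.
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // prints "http://example.com/a/d.html"
        System.out.println(resolve("http://example.com/a/b.html", "d.html"));
        // prints "http://example.com/c.html"
        System.out.println(resolve("http://example.com/a/b.html", "../c.html"));
    }
}
```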



hasSupportedScheme
public static boolean hasSupportedScheme(String possibleUrl)
Tests whether the passed String has an allowed URI scheme. First tests whether there is a likely scheme suffix; if so, then tests whether it is one of the supported schemes.
Parameters:
  possibleUrl - URL string to examine.
Returns:
  True if passed string looks like it could be an URL.
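The two-step test can be sketched as below. The set of supported schemes is invented for the example (this page does not list the schemes Heritrix supports), and the scheme regex and class name SchemeCheckSketch are likewise assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a two-step scheme test. */
final class SchemeCheckSketch {
    // Assumed list, for illustration only.
    private static final Set<String> SUPPORTED =
        new HashSet<>(Arrays.asList("http", "https", "ftp"));

    // Step 1: does the string start with something shaped like "scheme:"?
    private static final Pattern SCHEME =
        Pattern.compile("^([a-zA-Z][a-zA-Z0-9+.-]*):");

    static boolean hasSupportedScheme(String possibleUrl) {
        Matcher m = SCHEME.matcher(possibleUrl);
        if (!m.find()) {
            return false;
        }
        // Step 2: is that scheme one we support?
        return SUPPORTED.contains(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(hasSupportedScheme("http://example.com/")); // prints "true"
        System.out.println(hasSupportedScheme("mailto:a@example.com")); // prints "false"
    }
}
```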



validityCheck
protected UURI validityCheck(UURI uuri) throws URIException
Check the generated UURI. At the least, look at the length of the uuri string. We were seeing cases where the string was shorter than MAX_URL_LENGTH before escaping but longer after it. Letting a too-big uuri out was causing us trouble later down the processing chain.
Parameters:
  uuri - Created uuri to check.
Returns:
  The passed uuri, so this check can easily be inlined.
throws:
  URIException -
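A sketch of a post-escaping length check that returns its argument, so the call can be inlined, is shown below. The limit of 2083 is an assumed value for illustration (this page does not state MAX_URL_LENGTH), strings stand in for UURI, and IllegalArgumentException stands in for URIException.

```java
/** Hypothetical sketch of a length check applied after escaping. */
final class LengthCheckSketch {
    // Assumed value; the real MAX_URL_LENGTH is not given on this page.
    static final int MAX_URL_LENGTH = 2083;

    static String validityCheck(String escapedUri) {
        if (escapedUri.length() > MAX_URL_LENGTH) {
            throw new IllegalArgumentException(
                "Created URI too long: " + escapedUri.length());
        }
        // Return the argument so callers can write: return validityCheck(x);
        return escapedUri;
    }

    public static void main(String[] args) {
        System.out.println(validityCheck("http://example.com/a%20b"));
    }
}
```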


