| org.archive.net.LaxURI
All known Subclasses: org.archive.net.UURI
LaxURI | public class LaxURI extends URI | | URI subclass which allows partial/inconsistent encoding, matching
the URIs which will be relayed in requests from popular web
browsers (esp. Mozilla Firefox and MS IE).
author: gojomo |
Method Summary | |
protected static String | decode(char[] component, String charset) |
protected static String | decode(String component, String charset) |
public String | getPath() |
public String | getPathQuery() |
public String | getURI() |
protected BitSet | lax(BitSet generous) |
protected void | parseAuthority(String original, boolean escaped) Coalesce the _host and _authority fields where possible. In the web crawl/http domain, most URIs have an identical _host and _authority. |
protected void | parseUriReference(String original, boolean escaped) IA OVERRIDDEN IN LaxURI TO INCLUDE FIX FOR http://issues.apache.org/jira/browse/HTTPCLIENT-588 In order to avoid any possibility of conflict with non-ASCII characters, parse a URI reference as a String with the character encoding of the local system or the document. |
protected void | setURI() Coalesce _scheme to existing instances, where appropriate. In the web-crawl domain, most _schemes are 'http' or 'https', but the superclass always creates a new char[] instance. |
protected boolean | validate(char[] component, BitSet generous) |
protected boolean | validate(char[] component, int soffset, int eoffset, BitSet generous) |
HTTPS_SCHEME | final protected static char[] HTTPS_SCHEME | | |
HTTP_SCHEME | final protected static char[] HTTP_SCHEME | | |
lax_abs_path | final protected static BitSet lax_abs_path | | |
lax_rel_segment | final protected static BitSet lax_rel_segment | | |
LaxURI | public LaxURI(String uri, boolean escaped, String charset) throws URIException | | |
LaxURI | public LaxURI(URI base, URI relative) throws URIException | | |
LaxURI | public LaxURI(String uri, boolean escaped) throws URIException | | |
decode | protected static String decode(char[] component, String charset) throws URIException | | |
getPathQuery | public String getPathQuery() throws URIException | | |
lax | protected BitSet lax(BitSet generous) | | Given a BitSet -- typically one of the URI superclass's
predefined static variables -- possibly replace it with
a more-lax version to better match the character sets
actually left unencoded in web browser requests.
Parameters: generous - original BitSet Returns: (possibly more lax) BitSet to use |
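The substitution described for lax can be illustrated with a minimal sketch. This is not the LaxURI source: the class name LaxSets, the sets STRICT_PATH and LAX_PATH, and the choice of '|' as the extra permitted character are all assumptions made for illustration.

```java
import java.util.BitSet;

class LaxSets {
    // Hypothetical strict set: ASCII letters, digits and '/'.
    static final BitSet STRICT_PATH = new BitSet(256);
    // Hypothetical lax variant: everything in STRICT_PATH plus '|'.
    static final BitSet LAX_PATH = new BitSet(256);

    static {
        for (char c = 'a'; c <= 'z'; c++) STRICT_PATH.set(c);
        for (char c = 'A'; c <= 'Z'; c++) STRICT_PATH.set(c);
        for (char c = '0'; c <= '9'; c++) STRICT_PATH.set(c);
        STRICT_PATH.set('/');
        LAX_PATH.or(STRICT_PATH);
        LAX_PATH.set('|');
    }

    // Return the lax counterpart when one is known, else the original set.
    static BitSet lax(BitSet generous) {
        return generous == STRICT_PATH ? LAX_PATH : generous;
    }

    // True if every character of the component is permitted by the set.
    static boolean allowed(String component, BitSet permitted) {
        for (int i = 0; i < component.length(); i++) {
            if (!permitted.get(component.charAt(i))) return false;
        }
        return true;
    }
}
```

In this sketch, a component such as "/a|b" is rejected by the strict set but accepted once the set has been passed through lax, while sets without a known lax counterpart are returned unchanged.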
parseAuthority | protected void parseAuthority(String original, boolean escaped) throws URIException | | Coalesce the _host and _authority fields where
possible.
In the web crawl/http domain, most URIs have an
identical _host and _authority. (There is no port
or user info.) However, the superclass always
creates two separate char[] instances.
Notably, the lengths of these char[] fields are
equal if and only if their values are identical.
This method makes use of this fact to reduce the
two instances to one where possible, slimming
instances.
See Also: org.apache.commons.httpclient.URI.parseAuthority(java.lang.String, boolean) |
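A minimal sketch of the coalescing idea, assuming a much-simplified authority parser: the class AuthorityCoalescer and its parse method are hypothetical stand-ins for the superclass logic (no IPv6 or escaping support), and the fields host and authority stand in for _host and _authority.

```java
class AuthorityCoalescer {
    char[] host;       // stands in for _host
    char[] authority;  // stands in for _authority

    // Simplified parse: authority is [userinfo@]host[:port].
    // Like the superclass, it always allocates two separate arrays.
    void parse(String original) {
        authority = original.toCharArray();
        int at = original.indexOf('@');
        String rest = at >= 0 ? original.substring(at + 1) : original;
        int colon = rest.indexOf(':');
        host = (colon >= 0 ? rest.substring(0, colon) : rest).toCharArray();
        coalesce();
    }

    // The host is always a slice of the authority, so equal lengths
    // imply equal contents and no element-by-element comparison is needed.
    void coalesce() {
        if (host != null && authority != null && host.length == authority.length) {
            host = authority; // share one array; the duplicate becomes garbage
        }
    }
}
```

With this sketch, parsing "example.com" leaves both fields pointing at the same array, while "example.com:8080" keeps two distinct arrays because the lengths differ.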
parseUriReference | protected void parseUriReference(String original, boolean escaped) throws URIException | | IA OVERRIDDEN IN LaxURI TO INCLUDE FIX FOR
http://issues.apache.org/jira/browse/HTTPCLIENT-588
In order to avoid any possibility of conflict with non-ASCII characters,
parse a URI reference as a String with the character
encoding of the local system or the document.
The following line is the regular expression for breaking-down a URI
reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
For example, matching the above expression to
http://jakarta.apache.org/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
scheme = $2 = http
$3 = //jakarta.apache.org
authority = $4 = jakarta.apache.org
path = $5 = /ietf/uri/
$6 = <undefined>
query = $7 = <undefined>
$8 = #Related
fragment = $9 = Related
Parameters: original - the original character sequence Parameters: escaped - true if original is escaped throws: URIException - If an error occurs. |
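The breakdown above can be reproduced with java.util.regex. This is only a sketch of the regular expression at work, not the parsing code used by LaxURI or its superclass; the class name UriReferenceSplitter and its split method are hypothetical. Note that groups which do not participate in the match (the undefined $6 and $7 in the example) are reported by Java as null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UriReferenceSplitter {
    // The expression from the description, escaped for a Java string literal.
    static final Pattern URI_REFERENCE = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns {scheme, authority, path, query, fragment}; absent parts are null.
    static String[] split(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a URI reference: " + reference);
        }
        return new String[] {
            m.group(2), // scheme
            m.group(4), // authority
            m.group(5), // path
            m.group(7), // query
            m.group(9)  // fragment
        };
    }
}
```

Calling split on http://jakarta.apache.org/ietf/uri/#Related yields the scheme, authority, path and fragment listed above, with a null query.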
setURI | protected void setURI() | | Coalesce _scheme to existing instances, where appropriate.
In the web-crawl domain, most _schemes are 'http' or 'https',
but the superclass always creates a new char[] instance. For
these two cases, we replace the created instance with a
long-lived instance from a static field, saving 12-14 bytes
per instance.
See Also: org.apache.commons.httpclient.URI.setURI |
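A minimal sketch of scheme coalescing, under stated assumptions: the class SchemeCoalescer and its coalesce method are hypothetical, and the contents of HTTP_SCHEME and HTTPS_SCHEME are assumed to be the characters of "http" and "https".

```java
import java.util.Arrays;

class SchemeCoalescer {
    // Long-lived shared instances (contents assumed for this sketch).
    static final char[] HTTP_SCHEME = {'h', 't', 't', 'p'};
    static final char[] HTTPS_SCHEME = {'h', 't', 't', 'p', 's'};

    // Swap a freshly allocated scheme array for the shared instance
    // when its contents match; otherwise keep the array as given.
    static char[] coalesce(char[] scheme) {
        if (Arrays.equals(scheme, HTTP_SCHEME)) return HTTP_SCHEME;
        if (Arrays.equals(scheme, HTTPS_SCHEME)) return HTTPS_SCHEME;
        return scheme;
    }
}
```

Because the shared arrays are mutable, this approach is only safe if no caller modifies a scheme array in place.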
validate | protected boolean validate(char[] component, BitSet generous) | | |
validate | protected boolean validate(char[] component, int soffset, int eoffset, BitSet generous) | | |