| java.lang.Object sunlabs.brazil.util.regexp.Regexp
Regexp | public class Regexp implements java.io.Serializable(Code) | | The Regexp class can be used to match a pattern against a
string and optionally replace the matched parts with new strings.
Regular expressions were implemented by translating Henry Spencer's
regular expression package for tcl8.0.
Much of the description below is copied verbatim from the tcl8.0 regsub
manual entry.
REGULAR EXPRESSIONS
A regular expression is zero or more branches, separated by
"|". It matches anything that matches one of the branches.
A branch is zero or more pieces, concatenated.
It matches a match for the first piece, followed by a match for the
second piece, etc.
A piece is an atom, possibly followed by "*", "+", or "?".
- An atom followed by "*" matches a sequence of 0 or more matches of
the atom.
- An atom followed by "+" matches a sequence of 1 or more matches of
the atom.
- An atom followed by "?" matches either 0 or 1 matches of the atom.
An atom is
- a regular expression in parentheses (matching a match for the
regular expression)
- a range (see below)
- "." (matching any single character)
- "^" (matching the null string at the beginning of the input string)
- "$" (matching the null string at the end of the input string)
- a "\" followed by a single character (matching that character)
- a single character with no other significance (matching that
character).
A range is a sequence of characters enclosed in "[]".
The range normally matches any single character from the sequence.
If the sequence begins with "^", the range matches any single character
not from the rest of the sequence.
If two characters in the sequence are separated by "-", this is shorthand
for the full list of characters between them (e.g. "[0-9]" matches any
decimal digit). To include a literal "]" in the sequence, make it the
first character (following a possible "^"). To include a literal "-",
make it the first or last character.
In general there may be more than one way to match a regular expression
to an input string. For example, consider the command
String[] match = new String[2];
new Regexp("(a*)b*").match("aabaaabb", match);
Considering only the rules given so far, match[0] and
match[1] could end up with the values
- "aabb" and "aa"
- "aaab" and "aaa"
- "ab" and "a"
or any of several other combinations. To resolve this potential ambiguity,
Regexp chooses among alternatives using the rule "first then longest".
In other words, it considers the possible matches in order working
from left to right across the input string and the pattern, and it
attempts to match longer pieces of the input string before shorter
ones. More specifically, the following rules apply in decreasing
order of priority:
1. If a regular expression could match two different parts of an input
string then it will match the one that begins earliest.
2. If a regular expression contains "|" operators then the
leftmost matching sub-expression is chosen.
3. In "*", "+", and "?" constructs, longer matches are chosen in
preference to shorter ones.
4. In sequences of expression components the components are considered
from left to right.
In the example from above, "(a*)b*" therefore matches exactly "aab"; the
"(a*)" portion of the pattern is matched first and it consumes the leading
"aa", then the "b*" portion of the pattern consumes the next "b". Or,
consider the following example:
String[] match = new String[3];
new Regexp("(ab|a)(b*)c").match("abc", match);
After this command, match[0] will be "abc",
match[1] will be "ab", and match[2] will be an
empty string.
Rule 4 specifies that the "(ab|a)" component gets first shot at the input
string and Rule 2 specifies that the "ab" sub-expression
is checked before the "a" sub-expression.
Thus the "b" has already been claimed before the "(b*)"
component is checked and therefore "(b*)" must match an empty string.
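The two worked examples above can be checked with a short program. This is only a sketch: it uses the JDK's java.util.regex as a stand-in for Regexp (an assumption made so the code runs without the Brazil classes), since the JDK engine also prefers the earliest match, the leftmost alternative, and greedy repetition for these patterns. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstThenLongestDemo {
    // Returns groups 0..n of the first match, or null when nothing matches.
    // Stand-in for Regexp.match(String, String[]); not the Brazil API itself.
    static String[] firstMatch(String pattern, String input) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] a = firstMatch("(a*)b*", "aabaaabb");
        System.out.println(a[0] + " / " + a[1]);                        // aab / aa

        String[] b = firstMatch("(ab|a)(b*)c", "abc");
        System.out.println(b[0] + " / " + b[1] + " / [" + b[2] + "]");  // abc / ab / []
    }
}
```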
REGULAR EXPRESSION SUBSTITUTION
Regular expression substitution matches a string against a regular
expression, transforming the string by replacing the matched region(s)
with new substring(s).
What gets substituted into the result is controlled by a
subspec. The subspec is a formatting string that specifies
what portions of the matched region should be substituted into the
result.
- "&" or "\0" is replaced with a copy of the entire matched region.
- "\n", where n is a digit from 1 to 9, is replaced with a copy of the
nth subexpression.
- "\&" or "\\" are replaced with just "&" or "\" to escape their
special meaning.
- any other character is passed through.
In the above, strings like "\2" represent the two characters
backslash and "2", not the Unicode character 0002.
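The subspec rules can be sketched as a small expansion routine. The code below is an illustration under stated assumptions, not the library's implementation: it uses java.util.regex.Matcher in place of Regsub to supply the matched groups, the names SubspecSketch, expand, and subFirst are hypothetical, and it assumes that a backslash followed by any other character passes both characters through.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubspecSketch {
    // Expands subspec against the current match held by m.
    static String expand(Matcher m, String subspec) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < subspec.length(); i++) {
            char c = subspec.charAt(i);
            if (c == '&') {
                sb.append(m.group());                  // entire matched region
            } else if (c == '\\' && i + 1 < subspec.length()) {
                char next = subspec.charAt(++i);
                if (next >= '0' && next <= '9') {
                    int n = next - '0';                // "\0" is the whole match
                    String g = n <= m.groupCount() ? m.group(n) : null;
                    if (g != null) {
                        sb.append(g);
                    }
                } else if (next == '&' || next == '\\') {
                    sb.append(next);                   // escaped "&" or "\"
                } else {
                    sb.append(c).append(next);         // assumption: pass both through
                }
            } else {
                sb.append(c);                          // any other character
            }
        }
        return sb.toString();
    }

    // Replaces the first match only; returns null when there is no match.
    static String subFirst(String pattern, String input, String subspec) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        if (!m.find()) {
            return null;
        }
        return input.substring(0, m.start()) + expand(m, subspec)
                + input.substring(m.end());
    }

    public static void main(String[] args) {
        System.out.println(subFirst("^([^,]+),([^,]+),([^,]+),(.*)",
                "abc,def,ghi,klm,nop,pqr", "\\3,\\1,\\4"));   // ghi,abc,klm,nop,pqr
        System.out.println(subFirst("b+", "aabbbcc", "[&]")); // aa[bbb]cc
    }
}
```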
Here is an example of how to use Regexp:
public static void main(String[] args) throws Exception {
    Regexp re;
    String[] matches;
    String s;

    /*
     * A regular expression to match the first line of an HTTP request.
     *
     * 1. ^               - starting at the beginning of the line
     * 2. ([A-Z]+)        - match and remember some upper case characters
     * 3. [ \t]+          - skip blank space
     * 4. ([^ \t]+)       - match and remember up to the next blank space
     * 5. [ \t]+          - skip more blank space
     * 6. (HTTP/1\\.[01]) - match and remember HTTP/1.0 or HTTP/1.1
     * 7. $               - end of string - no chars left.
     */
    s = "GET http://a.b.com:1234/index.html HTTP/1.1";
    re = new Regexp("^([A-Z]+)[ \t]+([^ \t]+)[ \t]+(HTTP/1\\.[01])$");
    matches = new String[4];
    if (re.match(s, matches)) {
        System.out.println("METHOD " + matches[1]);
        System.out.println("URL " + matches[2]);
        System.out.println("VERSION " + matches[3]);
    }

    /*
     * A regular expression to extract some simple comma-separated data,
     * reorder some of the columns, and discard column 2.
     */
    s = "abc,def,ghi,klm,nop,pqr";
    re = new Regexp("^([^,]+),([^,]+),([^,]+),(.*)");
    System.out.println(re.sub(s, "\\3,\\1,\\4"));
}
author: Colin Stevens (colin.stevens@sun.com) version: 1.10, 00/11/06 See Also: Regsub |
Inner Class :public interface Filter | |
Inner Class :static class Compiler | |
Inner Class :static class Match | |
Field Summary | |
final static char | ANY | final static char | ANYBUT | final static char | ANYOF | final static char | BACK | final static char | BOL | final static char | BRANCH | final static char | CLOSE | final static char | END | final static char | EOL | final static char | EXACTLY | final static char | NOTHING | final static int | NSUBEXP | final static char | OPEN | final static char | PLUS | final static char | STAR | boolean | anchored true if the pattern must match the beginning of the
string, so we don't have to waste time matching against all possible
starting locations in the string. | boolean | ignoreCase Whether the regexp matching should be case insensitive. | String | must | int | npar The number of parenthesized subexpressions in the regexp pattern,
plus 1 for the match of the whole pattern itself. | final static String[] | opnames | char[] | program The bytecodes making up the regexp program. | int | startChar |
Constructor Summary | |
public | Regexp(String pat) Compiles a new Regexp object from the given regular expression
pattern.
It takes a certain amount of time to parse and validate a regular
expression pattern before it can be used to perform matches
or substitutions. | public | Regexp(String pat, boolean ignoreCase) Compiles a new Regexp object from the given regular expression
pattern.
Parameters: pat - The string holding the regular expression pattern. Parameters: ignoreCase - If true then this regular expression will do case-insensitive matching. |
Method Summary | |
public static void | applySubspec(Regsub rs, String subspec, StringBuffer sb) Utility method to give access to the standard substitution algorithm
used by sub and subAll . | Match | exec(String str, int start, int off) | public static void | main(String[] args) | public String | match(String str) Matches the given string against this regular expression. | public boolean | match(String str, String[] substrs) Matches the given string against this regular expression, and computes
the set of substrings that matched the parenthesized subexpressions.
substrs[0] is set to the range of str
that matched the entire regular expression.
substrs[1] is set to the range of str
that matched the first (leftmost) parenthesized subexpression.
substrs[n] is set to the range that matched the
n th subexpression, and so on.
If subexpression n did not match, then
substrs[n] is set to null . | public boolean | match(String str, int[] indices) Matches the given string against this regular expression, and computes
the set of substrings that matched the parenthesized subexpressions.
For the indices specified below, the range extends from the character
at the starting index up to, but not including, the character at the
ending index.
indices[0] and indices[1] are set to
starting and ending indices of the range of str
that matched the entire regular expression.
indices[2] and indices[3] are set to the
starting and ending indices of the range of str that
matched the first (leftmost) parenthesized subexpression.
indices[n * 2] and indices[n * 2 + 1]
are set to the range that matched the n th
subexpression, and so on.
If subexpression n did not match, then
indices[n * 2] and indices[n * 2 + 1]
are both set to -1 .
The length that the caller should use when allocating the
indices array is twice the return value of
Regexp.subspecs . | public String | sub(String str, String subspec) Matches a string against a regular expression and replaces the first
match with the string generated from the substitution parameter.
Parameters: str - The string to match against this regular expression. Parameters: subspec - The substitution parameter, described in REGULAR EXPRESSION SUBSTITUTION. Returns: The string formed by replacing the first match in str with the string generated from subspec. | public String | sub(String str, Filter rf) | public String | subAll(String str, String subspec) Matches a string against a regular expression and replaces all
matches with the string generated from the substitution parameter.
After each substitution is done, the portions of the string already
examined, including the newly substituted region, are not checked
again for new matches -- only the rest of the string is examined.
Parameters: str - The string to match against this regular expression. Parameters: subspec - The substitution parameter, described in REGULAR EXPRESSION SUBSTITUTION. Returns: The string formed by replacing all the matches in str with the strings generated from subspec. | public int | subspecs() Returns the number of parenthesized subexpressions in this regular
expression, plus one more for this expression itself. | public String | toString() Returns a string representation of this compiled regular
expression. |
ANY | final static char ANY(Code) | | |
ANYBUT | final static char ANYBUT(Code) | | |
ANYOF | final static char ANYOF(Code) | | |
BACK | final static char BACK(Code) | | |
BOL | final static char BOL(Code) | | |
BRANCH | final static char BRANCH(Code) | | |
CLOSE | final static char CLOSE(Code) | | |
END | final static char END(Code) | | |
EOL | final static char EOL(Code) | | |
EXACTLY | final static char EXACTLY(Code) | | |
NOTHING | final static char NOTHING(Code) | | |
NSUBEXP | final static int NSUBEXP(Code) | | |
OPEN | final static char OPEN(Code) | | |
PLUS | final static char PLUS(Code) | | |
STAR | final static char STAR(Code) | | |
anchored | boolean anchored(Code) | | true if the pattern must match the beginning of the
string, so we don't have to waste time matching against all possible
starting locations in the string.
|
ignoreCase | boolean ignoreCase(Code) | | Whether the regexp matching should be case insensitive.
|
npar | int npar(Code) | | The number of parenthesized subexpressions in the regexp pattern,
plus 1 for the match of the whole pattern itself.
|
program | char[] program(Code) | | The bytecodes making up the regexp program.
|
Regexp | public Regexp(String pat) throws IllegalArgumentException(Code) | | Compiles a new Regexp object from the given regular expression
pattern.
It takes a certain amount of time to parse and validate a regular
expression pattern before it can be used to perform matches
or substitutions. If the caller caches the new Regexp object, that
parsing time will be saved because the same Regexp can be used with
respect to many different strings.
Parameters: pat - The string holding the regular expression pattern. Throws: IllegalArgumentException - if the pattern is malformed. The detail message for the exception will be set to a string indicating how the pattern was malformed. |
Regexp | public Regexp(String pat, boolean ignoreCase) throws IllegalArgumentException(Code) | | Compiles a new Regexp object from the given regular expression
pattern.
Parameters: pat - The string holding the regular expression pattern. Parameters: ignoreCase - If true then this regular expression will do case-insensitive matching. If false, then the matches are case-sensitive. Regular expressions generated by Regexp(String) are case-sensitive. Throws: IllegalArgumentException - if the pattern is malformed. The detail message for the exception will be set to a string indicating how the pattern was malformed. |
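The advice about caching a compiled expression, and the effect of the ignoreCase flag, can be illustrated as follows. This is a sketch that assumes java.util.regex.Pattern as a stand-in for Regexp (Pattern.CASE_INSENSITIVE plays the role of ignoreCase); the class and method names are hypothetical. With the Brazil classes on the classpath, the same idea would be a static final Regexp field built with new Regexp(pat) or new Regexp(pat, true).

```java
import java.util.regex.Pattern;

class CachedPatternSketch {
    private static final String REQUEST_LINE =
            "^([A-Z]+)[ \t]+([^ \t]+)[ \t]+(HTTP/1\\.[01])$";

    // Compiled once when the class loads, then reused for every input line.
    private static final Pattern CASE_SENSITIVE = Pattern.compile(REQUEST_LINE);
    private static final Pattern IGNORE_CASE =
            Pattern.compile(REQUEST_LINE, Pattern.CASE_INSENSITIVE);

    static boolean isRequestLine(String line) {
        return CASE_SENSITIVE.matcher(line).find();
    }

    static boolean isRequestLineIgnoreCase(String line) {
        return IGNORE_CASE.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(isRequestLine("GET /index.html HTTP/1.1"));            // true
        System.out.println(isRequestLine("get /index.html http/1.1"));            // false
        System.out.println(isRequestLineIgnoreCase("get /index.html http/1.1"));  // true
    }
}
```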
applySubspec | public static void applySubspec(Regsub rs, String subspec, StringBuffer sb)(Code) | | Utility method to give access to the standard substitution algorithm
used by sub and subAll . Appends to the
string buffer the string generated by applying the substitution
parameter to the matched region.
Parameters: rs - Information about the matched region. Parameters: subspec - The substitution parameter. Parameters: sb - StringBuffer to which the generated string is appended. |
match | public String match(String str)(Code) | | Matches the given string against this regular expression.
Parameters: str - The string to match. Returns: The substring of str that matched the entire regular expression, or null if the string did not match this regular expression. |
match | public boolean match(String str, String[] substrs)(Code) | | Matches the given string against this regular expression, and computes
the set of substrings that matched the parenthesized subexpressions.
substrs[0] is set to the range of str
that matched the entire regular expression.
substrs[1] is set to the range of str
that matched the first (leftmost) parenthesized subexpression.
substrs[n] is set to the range that matched the
n th subexpression, and so on.
If subexpression n did not match, then
substrs[n] is set to null . Not to
be confused with "", which is a valid value for a
subexpression that matched 0 characters.
The length that the caller should use when allocating the
substrs array is the return value of
Regexp.subspecs . The array
can be shorter (in which case not all the information will
be returned), or longer (in which case the remainder of the
elements are initialized to null ), or
null (to ignore the subexpressions).
Parameters: str - The string to match. Parameters: substrs - An array of strings allocated by the caller, and filled in with information about the portions of str that matched the regular expression. May be null. Returns: true if str matched this regular expression, false otherwise. If false is returned, then the contents of substrs are unchanged. See Also: Regexp.subspecs |
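The difference between a subexpression that did not match (null) and one that matched zero characters ("") can be seen in a short sketch. It assumes java.util.regex as a stand-in for Regexp; SubstrsSketch and its match method are hypothetical names that only mirror the contract described above (caller-allocated array, extra elements set to null, array untouched on failure).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubstrsSketch {
    // Fills substrs with the whole match and each subexpression.
    static boolean match(Pattern re, String str, String[] substrs) {
        Matcher m = re.matcher(str);
        if (!m.find()) {
            return false;                  // substrs is left unchanged
        }
        if (substrs != null) {
            for (int n = 0; n < substrs.length; n++) {
                // Elements past the last subexpression are set to null.
                substrs[n] = n <= m.groupCount() ? m.group(n) : null;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] substrs = new String[3];
        match(Pattern.compile("(x)?(b*)c"), "c", substrs);
        System.out.println(substrs[0]);             // c
        System.out.println(substrs[1] == null);     // true: "(x)?" did not match
        System.out.println(substrs[2].isEmpty());   // true: "(b*)" matched 0 characters
    }
}
```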
match | public boolean match(String str, int[] indices)(Code) | | Matches the given string against this regular expression, and computes
the set of substrings that matched the parenthesized subexpressions.
For the indices specified below, the range extends from the character
at the starting index up to, but not including, the character at the
ending index.
indices[0] and indices[1] are set to
starting and ending indices of the range of str
that matched the entire regular expression.
indices[2] and indices[3] are set to the
starting and ending indices of the range of str that
matched the first (leftmost) parenthesized subexpression.
indices[n * 2] and indices[n * 2 + 1]
are set to the range that matched the n th
subexpression, and so on.
If subexpression n did not match, then
indices[n * 2] and indices[n * 2 + 1]
are both set to -1 .
The length that the caller should use when allocating the
indices array is twice the return value of
Regexp.subspecs . The array
can be shorter (in which case not all the information will
be returned), or longer (in which case the remainder of the
elements are initialized to -1 ), or
null (to ignore the subexpressions).
Parameters: str - The string to match. Parameters: indices - An array of integers allocated by the caller, and filled in with information about the portions of str that matched all the parts of the regular expression. May be null. Returns: true if the string matched the regular expression, false otherwise. If false is returned, then the contents of indices are unchanged. See Also: Regexp.subspecs |
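The layout of the indices array (a start/end pair per subexpression, with -1 for subexpressions that did not match) can be sketched like this. The code assumes java.util.regex as a stand-in for Regexp; IndicesSketch and its match method are hypothetical and only imitate the contract described above.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndicesSketch {
    // indices[2n] is the start and indices[2n + 1] the end (exclusive) of subexpression n.
    static boolean match(Pattern re, String str, int[] indices) {
        Matcher m = re.matcher(str);
        if (!m.find()) {
            return false;                        // indices is left unchanged
        }
        if (indices != null) {
            Arrays.fill(indices, -1);            // covers any unused trailing slots
            int pairs = Math.min(indices.length / 2, m.groupCount() + 1);
            for (int n = 0; n < pairs; n++) {
                indices[2 * n] = m.start(n);     // -1 when group n did not match
                indices[2 * n + 1] = m.end(n);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two subexpressions plus the whole match: 2 * 3 = 6 slots.
        int[] indices = new int[6];
        match(Pattern.compile("(a)|(b)"), "xb", indices);
        System.out.println(Arrays.toString(indices));  // [1, 2, -1, -1, 1, 2]
    }
}
```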
sub | public String sub(String str, String subspec)(Code) | | Matches a string against a regular expression and replaces the first
match with the string generated from the substitution parameter.
Parameters: str - The string to match against this regular expression. Parameters: subspec - The substitution parameter, described in REGULAR EXPRESSION SUBSTITUTION. Returns: The string formed by replacing the first match in str with the string generated from subspec. If no matches were found, then the return value is null. |
subAll | public String subAll(String str, String subspec)(Code) | | Matches a string against a regular expression and replaces all
matches with the string generated from the substitution parameter.
After each substitution is done, the portions of the string already
examined, including the newly substituted region, are not checked
again for new matches -- only the rest of the string is examined.
Parameters: str - The string to match against this regular expression. Parameters: subspec - The substitution parameter, described in REGULAR EXPRESSION SUBSTITUTION. Returns: The string formed by replacing all the matches in str with the strings generated from subspec. If no matches were found, then the return value is a copy of str. |
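Two points from the sub and subAll descriptions are easy to miss: text that has already been substituted is not rescanned, and the two methods differ in what they return when nothing matches. The sketch below shows both, assuming java.util.regex as a stand-in for Regexp and a literal replacement string instead of a subspec; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubAllSketch {
    // Replaces the first match; null when there is no match (as described for sub).
    static String sub(Pattern re, String str, String replacement) {
        Matcher m = re.matcher(str);
        return m.find() ? m.replaceFirst(Matcher.quoteReplacement(replacement)) : null;
    }

    // Replaces every match, scanning only the not-yet-examined remainder after each one.
    // With no match the result has the same content as str (as described for subAll).
    static String subAll(Pattern re, String str, String replacement) {
        return re.matcher(str).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        Pattern aa = Pattern.compile("aa");
        // Matches at 0-2 and 2-4; the inserted "a" characters are never re-examined.
        System.out.println(subAll(aa, "aaaaa", "a"));              // aaa
        System.out.println(sub(aa, "aaaaa", "a"));                 // aaaa
        System.out.println(sub(Pattern.compile("z"), "abc", "x")); // null
        System.out.println(subAll(Pattern.compile("z"), "abc", "x")); // abc
    }
}
```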
subspecs | public int subspecs()(Code) | | Returns the number of parenthesized subexpressions in this regular
expression, plus one more for this expression itself.
Returns: The number of parenthesized subexpressions, plus one. |
toString | public String toString()(Code) | | Returns a string representation of this compiled regular
expression. The format of the string representation is a
symbolic dump of the bytecodes.
Returns: A string representation of this regular expression. |