| java.lang.Object com.jcorporate.expresso.ext.regexp.RE
RE | public class RE (Code) | | RE is an efficient, lightweight regular expression evaluator/matcher class.
Regular expressions are pattern descriptions which enable sophisticated matching of
strings. In addition to being able to match a string against a pattern, you
can also extract parts of the match. This is especially useful in text parsing!
Details on the syntax of regular expression patterns are given below.
To compile a regular expression (RE), you can simply construct an RE matcher
object from the string specification of the pattern, like this:
RE r = new RE("a*b");
Once you have done this, you can call either of the RE.match methods to
perform matching on a String. For example:
boolean matched = r.match("aaaab");
will cause the boolean matched to be set to true because the
pattern "a*b" matches the string "aaaab".
If you were interested in the number of a's which matched the first
part of our example expression, you could change the expression to
"(a*)b". Then when you compiled the expression and matched it against
something like "xaaaab", you would get results like this:
RE r = new RE("(a*)b"); // Compile expression
boolean matched = r.match("xaaaab"); // Match against "xaaaab"
String wholeExpr = r.getParen(0); // wholeExpr will be 'aaaab'
String insideParens = r.getParen(1); // insideParens will be 'aaaa'
int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
int endWholeExpr = r.getParenEnd(0); // endWholeExpr will be index 6
int lenWholeExpr = r.getParenLength(0); // lenWholeExpr will be 5
int startInside = r.getParenStart(1); // startInside will be index 1
int endInside = r.getParenEnd(1); // endInside will be index 5
int lenInside = r.getParenLength(1); // lenInside will be 4
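The indices in the example above can be cross-checked with a small runnable sketch. RE is not part of the JDK, so this sketch assumes java.util.regex as a stand-in: Matcher.group, Matcher.start and Matcher.end play the roles of getParen, getParenStart and getParenEnd, and find() is used because the example above matches "aaaab" inside "xaaaab" rather than requiring the whole input to match. The class ParenDemo and its helper are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not RE itself.
final class ParenDemo {
    // Returns a Matcher positioned on the first match of regex within input.
    static Matcher firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            throw new IllegalArgumentException("no match for " + regex);
        }
        return m;
    }

    public static void main(String[] args) {
        Matcher m = firstMatch("(a*)b", "xaaaab");
        // Group 0 is the whole match, group 1 the first parenthesized part.
        System.out.println(m.group(0) + " [" + m.start(0) + ", " + m.end(0) + ")"); // aaaab [1, 6)
        System.out.println(m.group(1) + " [" + m.start(1) + ", " + m.end(1) + ")"); // aaaa [1, 5)
    }
}
```

End indices are exclusive, which is why the length of each group is simply end minus start.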
You can also refer to the contents of a parenthesized expression within
a regular expression itself. This is called a 'backreference'. The first
backreference in a regular expression is denoted by \1, the second by \2
and so on. So the expression:
([0-9]+)=\1
will match any string of the form n=n (like 0=0 or 2=2).
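A minimal sketch of the backreference behaviour, assuming java.util.regex as a stand-in for RE (which is not in the JDK). The class name BackrefDemo is hypothetical. Note that in Java source the backslash has to be doubled inside a string literal, which applies equally when passing a pattern string to the RE constructor.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not RE itself.
final class BackrefDemo {
    // "\\1" in Java source is the two-character regex token \1.
    private static final Pattern SAME_NUMBER = Pattern.compile("([0-9]+)=\\1");

    // True if input contains "n=n" for some run of digits n.
    static boolean containsSameNumber(String input) {
        return SAME_NUMBER.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSameNumber("0=0"));   // true
        System.out.println(containsSameNumber("42=42")); // true
        System.out.println(containsSameNumber("1=2"));   // false
    }
}
```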
The full regular expression syntax accepted by RE is described here:
Characters
unicodeChar Matches any identical unicode character
\ Used to quote a meta-character (like '*')
\\ Matches a single '\' character
\0nnn Matches a given octal character
\xhh Matches a given 8-bit hexadecimal character
\uhhhh Matches a given 16-bit hexadecimal character
\t Matches an ASCII tab character
\n Matches an ASCII newline character
\r Matches an ASCII return character
\f Matches an ASCII form feed character
Character Classes
[abc] Simple character class
[a-zA-Z] Character class with ranges
[^abc] Negated character class
Standard POSIX Character Classes
[:alnum:] Alphanumeric characters.
[:alpha:] Alphabetic characters.
[:blank:] Space and tab characters.
[:cntrl:] Control characters.
[:digit:] Numeric characters.
[:graph:] Characters that are printable and are also visible. (A space is printable, but not visible, while an `a' is both.)
[:lower:] Lower-case alphabetic characters.
[:print:] Printable characters (characters that are not control characters.)
[:punct:] Punctuation characters (characters that are not letter, digits, control characters, or space characters).
[:space:] Space characters (such as space, tab, and formfeed, to name a few).
[:upper:] Upper-case alphabetic characters.
[:xdigit:] Characters that are hexadecimal digits.
Non-standard POSIX-style Character Classes
[:javastart:] Start of a Java identifier
[:javapart:] Part of a Java identifier
Predefined Classes
. Matches any character other than newline
\w Matches a "word" character (alphanumeric plus "_")
\W Matches a non-word character
\s Matches a whitespace character
\S Matches a non-whitespace character
\d Matches a digit character
\D Matches a non-digit character
Boundary Matchers
^ Matches only at the beginning of a line
$ Matches only at the end of a line
\b Matches only at a word boundary
\B Matches only at a non-word boundary
Greedy Closures
A* Matches A 0 or more times (greedy)
A+ Matches A 1 or more times (greedy)
A? Matches A 1 or 0 times (greedy)
A{n} Matches A exactly n times (greedy)
A{n,} Matches A at least n times (greedy)
A{n,m} Matches A at least n but not more than m times (greedy)
Reluctant Closures
A*? Matches A 0 or more times (reluctant)
A+? Matches A 1 or more times (reluctant)
A?? Matches A 0 or 1 times (reluctant)
Logical Operators
AB Matches A followed by B
A|B Matches either A or B
(A) Used for subexpression grouping
Backreferences
\1 Backreference to 1st parenthesized subexpression
\2 Backreference to 2nd parenthesized subexpression
\3 Backreference to 3rd parenthesized subexpression
\4 Backreference to 4th parenthesized subexpression
\5 Backreference to 5th parenthesized subexpression
\6 Backreference to 6th parenthesized subexpression
\7 Backreference to 7th parenthesized subexpression
\8 Backreference to 8th parenthesized subexpression
\9 Backreference to 9th parenthesized subexpression
All closure operators (+, *, ?, {m,n}) are greedy by default, meaning that they
match as many elements of the string as possible without causing the overall
match to fail. If you want a closure to be reluctant (non-greedy), you can
simply follow it with a '?'. A reluctant closure will match as few elements
of the string as possible when finding matches. {m,n} closures don't currently
support reluctancy.
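The difference between greedy and reluctant closures is easiest to see on one input. The sketch below assumes java.util.regex as a stand-in for RE (not in the JDK); the `*` and `*?` operators it uses are the ones listed in the tables above. ClosureDemo is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not RE itself.
final class ClosureDemo {
    // Returns the text of the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: .* takes as much as it can while still letting 'b' match.
        System.out.println(firstMatch("a.*b", "aXbYb"));  // aXbYb
        // Reluctant: .*? takes as little as it can.
        System.out.println(firstMatch("a.*?b", "aXbYb")); // aXb
    }
}
```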
RE runs programs compiled by the RECompiler class. But the RE matcher class
does not include the actual regular expression compiler for reasons of
efficiency. In fact, if you want to pre-compile one or more regular expressions,
the 'recompile' class can be invoked from the command line to produce compiled
output like this:
// Pre-compiled regular expression "a*b"
char[] re1Instructions =
{
0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
0x0000,
};
REProgram re1 = new REProgram(re1Instructions);
You can then construct a regular expression matcher (RE) object from the pre-compiled
expression re1 and thus avoid the overhead of compiling the expression at runtime.
If you require more dynamic regular expressions, you can construct a single RECompiler
object and re-use it to compile each expression. Similarly, you can change the
program run by a given matcher object at any time. However, RE and RECompiler are
not threadsafe (for efficiency reasons, and because requiring thread safety in this
class is deemed to be a rare requirement), so you will need to construct a separate
compiler or matcher object for each thread (unless you do thread synchronization
yourself).
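One way to follow the one-matcher-per-thread advice is a ThreadLocal. The sketch below is an analogy only: it assumes java.util.regex, where the compiled Pattern is immutable and safe to share but each Matcher holds per-match state and must not be shared, much as the text describes for RE. PerThreadMatcher and its members are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not RE itself.
final class PerThreadMatcher {
    // Compiled once and shared: Pattern instances are immutable.
    private static final Pattern PATTERN = Pattern.compile("a*b");

    // Each thread lazily gets its own Matcher, so match state is never shared.
    private static final ThreadLocal<Matcher> MATCHER =
        ThreadLocal.withInitial(() -> PATTERN.matcher(""));

    // Reuses the calling thread's matcher on a new input.
    static boolean containsMatch(String input) {
        return MATCHER.get().reset(input).find();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> System.out.println(
            Thread.currentThread().getName() + ": " + containsMatch("aaaab"));
        Thread t1 = new Thread(work, "worker-1");
        Thread t2 = new Thread(work, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```

The alternative the text mentions, doing the synchronization yourself, trades the extra objects for lock contention.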
ISSUES:
- com.weusours.util.re is not currently compatible with all standard POSIX regcomp flags
- com.weusours.util.re does not support POSIX equivalence classes ([=foo=] syntax) (I18N/locale issue)
- com.weusours.util.re does not support nested POSIX character classes (definitely should, but not completely trivial)
- com.weusours.util.re does not support POSIX character collation concepts ([.foo.] syntax) (I18N/locale issue)
- Should there be different matching styles (simple, POSIX, Perl etc?)
- Should RE support character iterators (for backwards RE matching!)?
- Should RE support reluctant {m,n} closures (does anyone care)?
- Not *all* possibilities are considered for greediness when backreferences
are involved (as POSIX suggests should be the case). The POSIX RE
"(ac*)c*d[ac]*\1", when matched against "acdacaa" should yield a match
of acdacaa where \1 is "a". This is not the case in this RE package,
and actually Perl doesn't go to this extent either! Until someone
actually complains about this, I'm not sure it's worth "fixing".
If it ever is fixed, test #137 in RETest.txt should be updated.
author: Jonathan Locke version: $Id: RE.java,v 1.9 2004/11/18 02:03:28 lhamel Exp $ See Also: RECompiler |
Constructor Summary | |
public | RE() Constructs a regular expression matcher with no initial program. |
public | RE(REProgram program) Construct a matcher for a pre-compiled regular expression from program (bytecode) data. |
public | RE(REProgram program, int matchFlags) Construct a matcher for a pre-compiled regular expression from program (bytecode) data. |
public | RE(String pattern) Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. |
public | RE(String pattern, int matchFlags) Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. |
Method Summary | |
public int | getMatchFlags() Returns the current match behaviour flags. |
public String | getParen(int which) Gets the contents of a parenthesized subexpression after a successful match. |
public int | getParenCount() Returns the number of parenthesized subexpressions available after a successful match. |
final public int | getParenEnd(int which) Returns the end index of a given paren level. |
final public int | getParenLength(int which) Returns the length of a given paren level. |
final public int | getParenStart(int which) Returns the start index of a given paren level. |
public REProgram | getProgram() Returns the current regular expression program in use by this matcher object. |
public String[] | grep(Object[] search) Returns an array of Strings, whose toString representation matches a regular expression. |
protected void | internalError(String s) Throws an Error representing an internal error condition probably resulting from a bug in the regular expression compiler (or possibly data corruption). |
public boolean | match(CharacterIterator search, int i) Matches the current regular expression program against a character array, starting at a given index. |
public boolean | match(String search) Matches the current regular expression program against a String. |
public boolean | match(String search, int i) Matches the current regular expression program against a character array, starting at a given index. |
protected boolean | matchAt(int i) Match the current regular expression program against the current input string, starting at index i of the input string. |
protected int | matchNodes(int firstNode, int lastNode, int idxStart) Try to match a string against a subset of nodes in the program. |
public void | setMatchFlags(int matchFlags) Sets match behaviour flags which alter the way RE does matching. |
final protected void | setParenEnd(int which, int i) |
final protected void | setParenStart(int which, int i) |
public void | setProgram(REProgram program) Sets the current regular expression program used by this matcher object. |
public static String | simplePatternToFullRegularExpression(String pattern) |
public String[] | split(String s) Splits a string into an array of strings on regular expression boundaries. |
public String | subst(String substituteIn, String substitution) Substitutes a string for this regular expression in another string. |
public String | subst(String substituteIn, String substitution, int flags) Substitutes a string for this regular expression in another string. |
E_ALNUM | final static char E_ALNUM(Code) | | |
E_BOUND | final static char E_BOUND(Code) | | |
E_DIGIT | final static char E_DIGIT(Code) | | |
E_NALNUM | final static char E_NALNUM(Code) | | |
E_NBOUND | final static char E_NBOUND(Code) | | |
E_NDIGIT | final static char E_NDIGIT(Code) | | |
E_NSPACE | final static char E_NSPACE(Code) | | |
E_SPACE | final static char E_SPACE(Code) | | |
MATCH_CASEINDEPENDENT | final public static int MATCH_CASEINDEPENDENT(Code) | | Flag to indicate that matching should be case-independent (folded)
|
MATCH_MULTILINE | final public static int MATCH_MULTILINE(Code) | | Newlines should match as BOL/EOL (^ and $)
|
MATCH_NORMAL | final public static int MATCH_NORMAL(Code) | | Specifies normal, case-sensitive matching behaviour.
|
OP_ANY | final static char OP_ANY(Code) | | |
OP_ANYOF | final static char OP_ANYOF(Code) | | |
OP_ATOM | final static char OP_ATOM(Code) | | |
OP_BACKREF | final static char OP_BACKREF(Code) | | |
OP_BOL | final static char OP_BOL(Code) | | |
OP_BRANCH | final static char OP_BRANCH(Code) | | |
OP_CLOSE | final static char OP_CLOSE(Code) | | |
OP_END | final static char OP_END(Code) | | The format of a node in a program is:
[ OPCODE ] [ OPDATA ] [ OPNEXT ] [ OPERAND ]
char OPCODE - instruction
char OPDATA - modifying data
char OPNEXT - next node (relative offset)
|
OP_EOL | final static char OP_EOL(Code) | | |
OP_ESCAPE | final static char OP_ESCAPE(Code) | | |
OP_GOTO | final static char OP_GOTO(Code) | | |
OP_MAYBE | final static char OP_MAYBE(Code) | | |
OP_NOTHING | final static char OP_NOTHING(Code) | | |
OP_OPEN | final static char OP_OPEN(Code) | | |
OP_PLUS | final static char OP_PLUS(Code) | | |
OP_POSIXCLASS | final static char OP_POSIXCLASS(Code) | | |
OP_RELUCTANTMAYBE | final static char OP_RELUCTANTMAYBE(Code) | | |
OP_RELUCTANTPLUS | final static char OP_RELUCTANTPLUS(Code) | | |
OP_RELUCTANTSTAR | final static char OP_RELUCTANTSTAR(Code) | | |
OP_STAR | final static char OP_STAR(Code) | | |
POSIX_CLASS_ALNUM | final static char POSIX_CLASS_ALNUM(Code) | | |
POSIX_CLASS_ALPHA | final static char POSIX_CLASS_ALPHA(Code) | | |
POSIX_CLASS_BLANK | final static char POSIX_CLASS_BLANK(Code) | | |
POSIX_CLASS_CNTRL | final static char POSIX_CLASS_CNTRL(Code) | | |
POSIX_CLASS_DIGIT | final static char POSIX_CLASS_DIGIT(Code) | | |
POSIX_CLASS_GRAPH | final static char POSIX_CLASS_GRAPH(Code) | | |
POSIX_CLASS_JPART | final static char POSIX_CLASS_JPART(Code) | | |
POSIX_CLASS_JSTART | final static char POSIX_CLASS_JSTART(Code) | | |
POSIX_CLASS_LOWER | final static char POSIX_CLASS_LOWER(Code) | | |
POSIX_CLASS_PRINT | final static char POSIX_CLASS_PRINT(Code) | | |
POSIX_CLASS_PUNCT | final static char POSIX_CLASS_PUNCT(Code) | | |
POSIX_CLASS_SPACE | final static char POSIX_CLASS_SPACE(Code) | | |
POSIX_CLASS_UPPER | final static char POSIX_CLASS_UPPER(Code) | | |
POSIX_CLASS_XDIGIT | final static char POSIX_CLASS_XDIGIT(Code) | | |
REPLACE_ALL | final public static int REPLACE_ALL(Code) | | Flag bit that indicates that subst should replace all occurrences of this
regular expression.
|
REPLACE_FIRSTONLY | final public static int REPLACE_FIRSTONLY(Code) | | Flag bit that indicates that subst should only replace the first occurrence
of this regular expression.
|
endBackref | int[] endBackref(Code) | | |
matchFlags | int matchFlags(Code) | | |
maxNode | final static int maxNode(Code) | | |
maxParen | final static int maxParen(Code) | | |
nodeSize | final static int nodeSize(Code) | | |
offsetNext | final static int offsetNext(Code) | | |
offsetOpcode | final static int offsetOpcode(Code) | | |
offsetOpdata | final static int offsetOpdata(Code) | | |
parenCount | int parenCount(Code) | | |
startBackref | int[] startBackref(Code) | | |
RE | public RE()(Code) | | Constructs a regular expression matcher with no initial program.
This is likely to be an uncommon practice, but is still supported.
|
RE | public RE(REProgram program)(Code) | | Construct a matcher for a pre-compiled regular expression from program
(bytecode) data.
Parameters: program - Compiled regular expression program See Also: RECompiler |
RE | public RE(REProgram program, int matchFlags)(Code) | | Construct a matcher for a pre-compiled regular expression from program
(bytecode) data. Permits special flags to be passed in to modify matching
behaviour.
Parameters: program - Compiled regular expression program (see RECompiler and/or recompile) Parameters: matchFlags - One or more of the RE match behaviour flags (RE.MATCH_*):
MATCH_NORMAL // Normal (case-sensitive) matching
MATCH_CASEINDEPENDENT // Case folded comparisons
MATCH_MULTILINE // Newline matches as BOL/EOL
See Also: RECompiler See Also: REProgram |
RE | public RE(String pattern) throws RESyntaxException(Code) | | Constructs a regular expression matcher from a String by compiling it
using a new instance of RECompiler. If you will be compiling many
expressions, you may prefer to use a single RECompiler object instead.
Parameters: pattern - The regular expression pattern to compile. throws: RESyntaxException - Thrown if the regular expression has invalid syntax. See Also: RECompiler |
RE | public RE(String pattern, int matchFlags) throws RESyntaxException(Code) | | Constructs a regular expression matcher from a String by compiling it
using a new instance of RECompiler. If you will be compiling many
expressions, you may prefer to use a single RECompiler object instead.
Parameters: pattern - The regular expression pattern to compile. Parameters: matchFlags - The matching style throws: RESyntaxException - Thrown if the regular expression has invalid syntax. See Also: RECompiler |
getMatchFlags | public int getMatchFlags()(Code) | | Returns the current match behaviour flags.
Returns: Current match behaviour flags (RE.MATCH_*):
MATCH_NORMAL // Normal (case-sensitive) matching
MATCH_CASEINDEPENDENT // Case folded comparisons
MATCH_MULTILINE // Newline matches as BOL/EOL
See Also: RE.setMatchFlags |
getParen | public String getParen(int which)(Code) | | Gets the contents of a parenthesized subexpression after a successful match.
Parameters: which - Nesting level of subexpression Returns: String |
getParenCount | public int getParenCount()(Code) | | Returns the number of parenthesized subexpressions available after a successful match.
Returns: Number of available parenthesized subexpressions |
getParenEnd | final public int getParenEnd(int which)(Code) | | Returns the end index of a given paren level.
Parameters: which - Nesting level of subexpression Returns: String index |
getParenLength | final public int getParenLength(int which)(Code) | | Returns the length of a given paren level.
Parameters: which - Nesting level of subexpression Returns: Number of characters in the parenthesized subexpression |
getParenStart | final public int getParenStart(int which)(Code) | | Returns the start index of a given paren level.
Parameters: which - Nesting level of subexpression Returns: String index |
getProgram | public REProgram getProgram()(Code) | | Returns the current regular expression program in use by this matcher object.
Returns: Regular expression program See Also: RE.setProgram |
grep | public String[] grep(Object[] search)(Code) | | Returns an array of Strings, whose toString representation matches a regular
expression. This method works like the Perl function of the same name. Given
a regular expression of "a*b" and an array of String objects of [foo, aab, zzz,
aaaab], the array of Strings returned by grep would be [aab, aaaab].
Parameters: search - Array of Objects to search Returns: Array of Strings, one for each Object whose toString value matches this regular expression. |
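The filtering that grep describes can be sketched in a few lines. This is not RE's implementation: it assumes java.util.regex as a stand-in (RE is not in the JDK), and GrepDemo is a hypothetical name. Each element is converted with toString semantics and kept if the pattern matches anywhere in it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not RE itself.
final class GrepDemo {
    // Returns the string form of every element in which regex finds a match.
    static String[] grep(String regex, Object[] search) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (Object o : search) {
            String s = String.valueOf(o);
            if (p.matcher(s).find()) {
                hits.add(s);
            }
        }
        return hits.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Object[] input = {"foo", "aab", "zzz", "aaaab"};
        System.out.println(Arrays.toString(grep("a*b", input))); // [aab, aaaab]
    }
}
```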
internalError | protected void internalError(String s) throws Error(Code) | | Throws an Error representing an internal error condition probably resulting
from a bug in the regular expression compiler (or possibly data corruption).
In practice, this should be very rare.
Parameters: s - Error description |
match | public boolean match(CharacterIterator search, int i)(Code) | | Matches the current regular expression program against a CharacterIterator,
starting at a given index.
Parameters: search - CharacterIterator to match against Parameters: i - Index to start searching at Returns: True if string matched |
match | public boolean match(String search)(Code) | | Matches the current regular expression program against a String.
Parameters: search - String to match against Returns: True if string matched |
match | public boolean match(String search, int i)(Code) | | Matches the current regular expression program against a String,
starting at a given index.
Parameters: search - String to match against Parameters: i - Index to start searching at Returns: True if string matched |
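Starting a search at a given index is useful for stepping through successive matches. The sketch below illustrates the idea with java.util.regex, an assumed stand-in for RE (not in the JDK), where Matcher.find(int) searches from the given index. StartIndexDemo is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not RE itself.
final class StartIndexDemo {
    // Returns the start index of the first match at or after 'from', or -1.
    static int matchStartFrom(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        String input = "ab--aab";
        System.out.println(matchStartFrom("a*b", input, 0)); // 0  ("ab")
        System.out.println(matchStartFrom("a*b", input, 2)); // 4  ("aab")
        System.out.println(matchStartFrom("a*b", input, 7)); // -1 (nothing left)
    }
}
```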
matchAt | protected boolean matchAt(int i)(Code) | | Match the current regular expression program against the current
input string, starting at index i of the input string. This method
is only meant for internal use.
Parameters: i - The input string index to start matching at Returns: True if the input matched the expression |
matchNodes | protected int matchNodes(int firstNode, int lastNode, int idxStart)(Code) | | Try to match a string against a subset of nodes in the program.
Parameters: firstNode - Node to start at in program Parameters: lastNode - Last valid node (used for matching a subexpression without matching the rest of the program as well). Parameters: idxStart - Starting position in character array Returns: Final input array index if match succeeded. -1 if not. |
setMatchFlags | public void setMatchFlags(int matchFlags)(Code) | | Sets match behaviour flags which alter the way RE does matching.
Parameters: matchFlags - One or more of the RE match behaviour flags (RE.MATCH_*):
MATCH_NORMAL // Normal (case-sensitive) matching
MATCH_CASEINDEPENDENT // Case folded comparisons
MATCH_MULTILINE // Newline matches as BOL/EOL |
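The effect of case-independent and multiline matching can be illustrated with a short sketch. It assumes java.util.regex as a stand-in for RE (not in the JDK): Pattern.CASE_INSENSITIVE and Pattern.MULTILINE are used here as rough counterparts of MATCH_CASEINDEPENDENT and MATCH_MULTILINE, and 0 as the counterpart of MATCH_NORMAL. FlagsDemo is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not RE itself.
final class FlagsDemo {
    // True if regex, compiled with the given flags, matches somewhere in input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Case-sensitive by default, case folded with the flag.
        System.out.println(found("abc", 0, "xABCx"));                        // false
        System.out.println(found("abc", Pattern.CASE_INSENSITIVE, "xABCx")); // true

        // Without the multiline flag, ^ and $ only anchor at the ends of the input.
        System.out.println(found("^b$", 0, "a\nb\nc"));                 // false
        System.out.println(found("^b$", Pattern.MULTILINE, "a\nb\nc")); // true

        // Flags are bit values and can be combined with bitwise OR.
        int both = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(found("^B$", both, "a\nb\nc")); // true
    }
}
```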
setParenEnd | final protected void setParenEnd(int which, int i)(Code) | | Sets the end of a paren level
Parameters: which - Which paren level Parameters: i - Index in input array |
setParenStart | final protected void setParenStart(int which, int i)(Code) | | Sets the start of a paren level
Parameters: which - Which paren level Parameters: i - Index in input array |
setProgram | public void setProgram(REProgram program)(Code) | | Sets the current regular expression program used by this matcher object.
Parameters: program - Regular expression program compiled by RECompiler. See Also: RECompiler See Also: REProgram |
simplePatternToFullRegularExpression | public static String simplePatternToFullRegularExpression(String pattern)(Code) | | Converts a 'simplified' regular expression to a full regular expression
Parameters: pattern - The pattern to convert Returns: The full regular expression |
split | public String[] split(String s)(Code) | | Splits a string into an array of strings on regular expression boundaries.
This function works the same way as the Perl function of the same name.
Given a regular expression of "[ab]+" and a string to split of
"xyzzyababbayyzabbbab123", the result would be the array of Strings
"[xyzzy, yyz, 123]".
Parameters: s - String to split on this regular expression Returns: Array of strings |
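The split example above can be reproduced with a runnable sketch. It assumes java.util.regex as a stand-in for RE (not in the JDK), so edge cases such as leading or trailing separators may be handled differently by RE.split. SplitDemo is a hypothetical name.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not RE itself.
final class SplitDemo {
    // Splits input around every match of regex.
    static String[] split(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    public static void main(String[] args) {
        String[] parts = split("[ab]+", "xyzzyababbayyzabbbab123");
        System.out.println(Arrays.toString(parts)); // [xyzzy, yyz, 123]
    }
}
```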
subst | public String subst(String substituteIn, String substitution)(Code) | | Substitutes a string for this regular expression in another string.
This method works like the Perl function of the same name.
Given a regular expression of "a*b", a String to substituteIn of
"aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the
resulting String returned by subst would be "-foo-garply-wacky-".
Parameters: substituteIn - String to substitute within Parameters: substitution - String to substitute for all matches of this regular expression. Returns: The string substituteIn with zero or more occurrences of the current regular expression replaced with the substitution String (if this regular expression object doesn't match at any position, the original String is returned unchanged). |
subst | public String subst(String substituteIn, String substitution, int flags)(Code) | | Substitutes a string for this regular expression in another string.
This method works like the Perl function of the same name.
Given a regular expression of "a*b", a String to substituteIn of
"aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the
resulting String returned by subst would be "-foo-garply-wacky-".
Parameters: substituteIn - String to substitute within Parameters: substitution - String to substitute for matches of this regular expression Parameters: flags - One or more bitwise flags from REPLACE_*. If the REPLACE_FIRSTONLY flag bit is set, only the first occurrence of this regular expression is replaced. If the bit is not set (REPLACE_ALL), all occurrences of this pattern will be replaced. Returns: The string substituteIn with zero or more occurrences of the current regular expression replaced with the substitution String (if this regular expression object doesn't match at any position, the original String is returned unchanged). |
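The replace-all and first-only behaviours can be compared side by side. The sketch below assumes java.util.regex as a stand-in for RE (not in the JDK); the boolean firstOnly parameter is a hypothetical substitute for the REPLACE_FIRSTONLY and REPLACE_ALL flag bits, and SubstDemo is a hypothetical name. Matcher.quoteReplacement keeps the substitution literal, since java.util.regex would otherwise interpret '$' and '\' in it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses java.util.regex, not RE itself.
final class SubstDemo {
    // Replaces the first match, or every match, of regex in substituteIn.
    static String subst(String regex, String substituteIn,
                        String substitution, boolean firstOnly) {
        Matcher m = Pattern.compile(regex).matcher(substituteIn);
        String literal = Matcher.quoteReplacement(substitution);
        return firstOnly ? m.replaceFirst(literal) : m.replaceAll(literal);
    }

    public static void main(String[] args) {
        String in = "aaaabfooaaabgarplyaaabwackyb";
        System.out.println(subst("a*b", in, "-", false)); // -foo-garply-wacky-
        System.out.println(subst("a*b", in, "-", true));  // -fooaaabgarplyaaabwackyb
        System.out.println(subst("a*b", "xyz", "-", false)); // xyz (no match, unchanged)
    }
}
```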