| RE is an efficient, lightweight regular expression evaluator/matcher
class. Regular expressions are pattern descriptions which enable
sophisticated matching of strings. In addition to being able to
match a string against a pattern, you can also extract parts of the
match. This is especially useful in text parsing! Details on the
syntax of regular expression patterns are given below.
To compile a regular expression (RE), you can simply construct an RE
matcher object from the string specification of the pattern, like this:
RE r = new RE("a*b");
Once you have done this, you can call either of the RE.match methods to
perform matching on a String. For example:
boolean matched = r.match("aaaab");
will cause the boolean matched to be set to true because the
pattern "a*b" matches the string "aaaab".
If you were interested in the number of a's which matched the
first part of our example expression, you could change the expression to
"(a*)b". Then when you compiled the expression and matched it against
something like "xaaaab", you would get results like this:
RE r = new RE("(a*)b"); // Compile expression
boolean matched = r.match("xaaaab"); // Match against "xaaaab"
String wholeExpr = r.getParen(0); // wholeExpr will be 'aaaab'
String insideParens = r.getParen(1); // insideParens will be 'aaaa'
int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
int endWholeExpr = r.getParenEnd(0); // endWholeExpr will be index 6
int lenWholeExpr = r.getParenLength(0); // lenWholeExpr will be 5
int startInside = r.getParenStart(1); // startInside will be index 1
int endInside = r.getParenEnd(1); // endInside will be index 5
int lenInside = r.getParenLength(1); // lenInside will be 4
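If you do not have this package on the classpath, the offsets above can be cross-checked with the JDK's java.util.regex, used here purely as a stand-in (it is a different engine from RE). The sketch assumes that RE.match searches for the pattern anywhere in the input, which corresponds to Matcher.find(). The class name GroupOffsets and its helper method are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, not the RE class itself.
class GroupOffsets {
    // Returns {start, end, length} of the given group for the first match,
    // or null when the pattern is not found anywhere in the input.
    static int[] offsets(String pattern, String input, int group) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        if (!m.find()) {
            return null;
        }
        int start = m.start(group);
        int end = m.end(group); // index one past the last matched character
        return new int[] { start, end, end - start };
    }

    public static void main(String[] args) {
        // Group 0 is the whole match, group 1 is the first parenthesized part.
        System.out.println(Arrays.toString(offsets("(a*)b", "xaaaab", 0)));
        System.out.println(Arrays.toString(offsets("(a*)b", "xaaaab", 1)));
    }
}
```

Note that the end index is exclusive, which is why the end of the whole match (6) minus its start (1) gives the length (5).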
You can also refer to the contents of a parenthesized expression
within a regular expression itself. This is called a
'backreference'. The first backreference in a regular expression is
denoted by \1, the second by \2 and so on. So the expression:
([0-9]+)=\1
will match any string of the form n=n (like 0=0 or 2=2).
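The backreference behaviour can be tried out with the JDK's java.util.regex as a stand-in for RE (an assumption made so the sketch runs without this package). The class name BackrefDemo is hypothetical. Pattern.matches requires the whole input to match, so the check below is stricter than an unanchored search.

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, not the RE class itself.
class BackrefDemo {
    // True only when the entire input has the form n=n, with the same
    // run of digits on both sides of the '='.
    static boolean sameOnBothSides(String input) {
        // "\\1" in Java source is the regex backreference \1.
        return Pattern.matches("([0-9]+)=\\1", input);
    }

    public static void main(String[] args) {
        System.out.println(sameOnBothSides("12=12"));
        System.out.println(sameOnBothSides("12=13"));
    }
}
```

With an unanchored search, an input such as "12=2" would still contain a match ("2=2"), so anchor the expression with ^ and $ if the whole string must have the form n=n.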
The full regular expression syntax accepted by RE is described here:
Characters
unicodeChar Matches any identical unicode character
\ Used to quote a meta-character (like '*')
\\ Matches a single '\' character
\0nnn Matches a given octal character
\xhh Matches a given 8-bit hexadecimal character
\uhhhh Matches a given 16-bit hexadecimal character
\t Matches an ASCII tab character
\n Matches an ASCII newline character
\r Matches an ASCII return character
\f Matches an ASCII form feed character
Character Classes
[abc] Simple character class
[a-zA-Z] Character class with ranges
[^abc] Negated character class
NOTE: Incomplete ranges are interpreted as "starts from zero" or
"ends with last character". That is, [-a] is the same as [\u0000-a],
[a-] is the same as [a-\uFFFF], and [-] means "all characters".
Standard POSIX Character Classes
[:alnum:] Alphanumeric characters.
[:alpha:] Alphabetic characters.
[:blank:] Space and tab characters.
[:cntrl:] Control characters.
[:digit:] Numeric characters.
[:graph:] Characters that are printable and are also visible.
(A space is printable, but not visible, while an
`a' is both.)
[:lower:] Lower-case alphabetic characters.
[:print:] Printable characters (characters that are not
control characters.)
[:punct:] Punctuation characters (characters that are not letters,
digits, control characters, or space characters).
[:space:] Space characters (such as space, tab, and formfeed,
to name a few).
[:upper:] Upper-case alphabetic characters.
[:xdigit:] Characters that are hexadecimal digits.
Non-standard POSIX-style Character Classes
[:javastart:] Start of a Java identifier
[:javapart:] Part of a Java identifier
Predefined Classes
. Matches any character other than newline
\w Matches a "word" character (alphanumeric plus "_")
\W Matches a non-word character
\s Matches a whitespace character
\S Matches a non-whitespace character
\d Matches a digit character
\D Matches a non-digit character
Boundary Matchers
^ Matches only at the beginning of a line
$ Matches only at the end of a line
\b Matches only at a word boundary
\B Matches only at a non-word boundary
Greedy Closures
A* Matches A 0 or more times (greedy)
A+ Matches A 1 or more times (greedy)
A? Matches A 1 or 0 times (greedy)
A{n} Matches A exactly n times (greedy)
A{n,} Matches A at least n times (greedy)
A{n,m} Matches A at least n but not more than m times (greedy)
Reluctant Closures
A*? Matches A 0 or more times (reluctant)
A+? Matches A 1 or more times (reluctant)
A?? Matches A 0 or 1 times (reluctant)
Logical Operators
AB Matches A followed by B
A|B Matches either A or B
(A) Used for subexpression grouping
(?:A) Used for subexpression clustering (just like grouping but
no backrefs)
Backreferences
\1 Backreference to 1st parenthesized subexpression
\2 Backreference to 2nd parenthesized subexpression
\3 Backreference to 3rd parenthesized subexpression
\4 Backreference to 4th parenthesized subexpression
\5 Backreference to 5th parenthesized subexpression
\6 Backreference to 6th parenthesized subexpression
\7 Backreference to 7th parenthesized subexpression
\8 Backreference to 8th parenthesized subexpression
\9 Backreference to 9th parenthesized subexpression
All closure operators (+, *, ?, {n,m}) are greedy by default, meaning
that they match as many elements of the string as possible without
causing the overall match to fail. If you want a closure to be
reluctant (non-greedy), you can simply follow it with a '?'. A
reluctant closure will match as few elements of the string as
possible when finding matches. {n,m} closures don't currently
support reluctant matching.
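The difference between greedy and reluctant closures is easiest to see side by side. The sketch below uses the JDK's java.util.regex as a stand-in for RE (an assumption, so that it runs without this package). The class name GreedyVsReluctant and the sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, not the RE class itself.
class GreedyVsReluctant {
    // Returns the text of the first match found in the input, or null if none.
    static String firstMatch(String pattern, String input) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>x</b>";
        // Greedy: .+ grabs as much as it can while still ending on a '>'.
        System.out.println(firstMatch("<.+>", input));
        // Reluctant: .+? stops at the first '>' that completes the match.
        System.out.println(firstMatch("<.+?>", input));
    }
}
```

Here the greedy form consumes the whole input, while the reluctant form stops after the first tag.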
Line terminators
A line terminator is a one- or two-character sequence that marks
the end of a line of the input character sequence. The following
are recognized as line terminators:
- A newline (line feed) character ('\n'),
- A carriage-return character followed immediately by a newline character ("\r\n"),
- A standalone carriage-return character ('\r'),
- A next-line character ('\u0085'),
- A line-separator character ('\u2028'), or
- A paragraph-separator character ('\u2029').
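The list above can be written as a single expression. The sketch below splits text on exactly these terminators, using the JDK's java.util.regex as a stand-in for RE (an assumption, so that it runs without this package). The class name LineSplitter is hypothetical. The two-character sequence is listed first so that "\r\n" counts as one terminator rather than two.

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, not the RE class itself.
class LineSplitter {
    // "\r\n" comes first in the alternation so it is consumed as a unit;
    // the character class covers the five single-character terminators.
    private static final Pattern TERMINATOR =
            Pattern.compile("\\r\\n|[\\n\\r\\u0085\\u2028\\u2029]");

    // Splits the input at every line terminator listed above.
    static String[] lines(String input) {
        return TERMINATOR.split(input);
    }

    public static void main(String[] args) {
        for (String line : lines("first\r\nsecond\nthird\rfourth")) {
            System.out.println(line);
        }
    }
}
```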
RE runs programs compiled by the RECompiler class. For reasons of
efficiency, however, the RE matcher class does not include the actual
regular expression compiler. If you want to pre-compile one
or more regular expressions, the 'recompile' class can be invoked
from the command line to produce compiled output like this:
// Pre-compiled regular expression "a*b"
char[] re1Instructions =
{
0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
0x0000,
};
REProgram re1 = new REProgram(re1Instructions);
You can then construct a regular expression matcher (RE) object from
the pre-compiled expression re1 and thus avoid the overhead of
compiling the expression at runtime. If you require more dynamic
regular expressions, you can construct a single RECompiler object and
re-use it to compile each expression. Similarly, you can change the
program run by a given matcher object at any time. However, RE and
RECompiler are not threadsafe (for efficiency reasons, and because
requiring thread safety in this class is deemed to be a rare
requirement), so you will need to construct a separate compiler or
matcher object for each thread (unless you do thread synchronization
yourself). Once an expression is compiled into an REProgram object, that
REProgram can be safely shared across multiple threads and RE objects.
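The "compile once, match per thread" arrangement can be sketched as follows. The code uses the JDK's java.util.regex as a stand-in (an assumption, so that it runs without this package): the immutable Pattern plays the role of the shared REProgram, and each task's own Matcher plays the role of a per-thread RE object. The class name SharedProgramDemo and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, not RE/REProgram themselves.
class SharedProgramDemo {
    // Compiled once and shared by all threads (analogue of REProgram).
    private static final Pattern SHARED = Pattern.compile("a*b");

    // Counts how many inputs contain a match, checking them on a thread pool.
    static int countMatches(List<String> inputs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (final String input : inputs) {
                // Each task builds its own Matcher (analogue of a per-thread RE);
                // matcher objects are never shared between threads.
                Callable<Boolean> task = () -> SHARED.matcher(input).find();
                results.add(pool.submit(task));
            }
            int count = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The point of the design is that the expensive, immutable part is built once, while the cheap, stateful part is created wherever it is needed, so no synchronization is required.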
ISSUES:
- com.weusours.util.re is not currently compatible with all
standard POSIX regcomp flags
- com.weusours.util.re does not support POSIX equivalence classes
([=foo=] syntax) (I18N/locale issue)
- com.weusours.util.re does not support nested POSIX character
classes (definitely should, but not completely trivial)
- com.weusours.util.re does not support POSIX character collation
concepts ([.foo.] syntax) (I18N/locale issue)
- Should there be different matching styles (simple, POSIX, Perl etc?)
- Should RE support character iterators (for backwards RE matching!)?
- Should RE support reluctant {m,n} closures (does anyone care)?
- Not *all* possibilities are considered for greediness when backreferences
are involved (as POSIX suggests should be the case). The POSIX RE
"(ac*)c*d[ac]*\1", when matched against "acdacaa" should yield a match
of acdacaa where \1 is "a". This is not the case in this RE package,
and actually Perl doesn't go to this extent either! Until someone
actually complains about this, I'm not sure it's worth "fixing".
If it ever is fixed, test #137 in RETest.txt should be updated.
See Also: recompile
See Also: RECompiler
author: Jonathan Locke
author: Tobias Schäfer
version: $Id: RE.java 518156 2007-03-14 14:31:26Z vgritsenko $
|