RuleBasedTransliterator is a transliterator
that reads a set of rules in order to determine how to perform
translations. Rule sets are stored in resource bundles indexed by
name. Rules within a rule set are separated by semicolons (';').
To include a literal semicolon, prefix it with a backslash ('\').
Whitespace, as defined by UCharacterProperty.isRuleWhiteSpace(),
is ignored. If the first non-blank character on a line is '#',
the entire line is ignored as a comment.
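The separator, escape, and comment conventions above can be sketched in plain Java. This is only an illustration, not ICU's parser: the class and method names are hypothetical, Character.isWhitespace stands in for UCharacterProperty.isRuleWhiteSpace(), and quoting is not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class RuleSplitterSketch {
    /** Splits rule text into statements; a hypothetical helper, not ICU's parser. */
    static List<String> split(String ruleText) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : ruleText.split("\n")) {
            // A line whose first non-blank character is '#' is a comment.
            if (line.trim().startsWith("#")) {
                continue;
            }
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (c == '\\' && i + 1 < line.length()) {
                    // Keep the escape pair intact so an escaped semicolon stays literal.
                    current.append(c).append(line.charAt(++i));
                } else if (c == ';') {
                    // An unescaped semicolon ends the current statement.
                    if (current.length() > 0) {
                        statements.add(current.toString());
                    }
                    current.setLength(0);
                } else if (!Character.isWhitespace(c)) {
                    // Assumption: Character.isWhitespace approximates rule whitespace.
                    current.append(c);
                }
            }
        }
        if (current.length() > 0) {
            statements.add(current.toString());
        }
        return statements;
    }

    public static void main(String[] args) {
        String rules = "# sample rules\nai > x; c > \\;;\n d > e";
        System.out.println(split(rules));
    }
}
```

A statement may span several lines in this sketch, because the buffer is only flushed at an unescaped semicolon or at the end of the input.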
Each set of rules consists of two groups, one forward, and one
reverse. This is a convention that is not enforced; rules for one
direction may be omitted, with the result that translations in
that direction will not modify the source text. In addition,
bidirectional forward-reverse rules may be specified for
symmetrical transformations.
Rule syntax
Rule statements take one of the following forms:
$alefmadda=\u0622;
- Variable definition. The name on the
left is assigned the text on the right. In this example,
after this statement, instances of the left hand name,
"$alefmadda", will be replaced by
the Unicode character U+0622. Variable names must begin
with a letter and consist only of letters, digits, and
underscores. Case is significant. Duplicate names cause
an exception to be thrown; that is, variables cannot be
redefined. The right hand side may contain well-formed
text of any length, including no text at all ("$empty=;").
The right hand side may contain embedded UnicodeSet
patterns, for example, "$softvowel=[eiyEIY]".
ai>$alefmadda;
- Forward translation rule. This rule
states that the string on the left will be changed to the
string on the right when performing forward
transliteration.
ai<$alefmadda;
- Reverse translation rule. This rule
states that the string on the right will be changed to
the string on the left when performing reverse
transliteration.
ai<>$alefmadda;
- Bidirectional translation rule. This
rule states that the string on the left will be changed
to the string on the right when performing forward
transliteration, and vice versa when performing reverse
transliteration.
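The four statement forms above can be told apart with a few string checks. The following is a minimal sketch under stated assumptions, not ICU's implementation: the class and method names are hypothetical, variable names are limited to ASCII letters, digits, and underscores, statements are assumed to be whitespace-free with the trailing semicolon removed, and quoting is ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StatementFormsSketch {
    // A variable name is a letter followed by letters, digits, or underscores.
    // Assumption: ASCII only, for brevity.
    private static final Pattern VAR_DEF =
            Pattern.compile("\\$([A-Za-z][A-Za-z0-9_]*)=(.*)");
    private static final Pattern VAR_REF =
            Pattern.compile("\\$([A-Za-z][A-Za-z0-9_]*)");

    private final Map<String, String> variables = new LinkedHashMap<>();

    /** Classifies one statement and records it if it is a variable definition. */
    String classify(String statement) {
        Matcher def = VAR_DEF.matcher(statement);
        if (def.matches()) {
            if (variables.containsKey(def.group(1))) {
                // Variables cannot be redefined.
                throw new IllegalArgumentException("duplicate variable: " + def.group(1));
            }
            variables.put(def.group(1), expand(def.group(2)));
            return "variable";
        }
        // Test "<>" first, since it contains both single-character operators.
        if (statement.contains("<>")) return "bidirectional";
        if (statement.contains(">")) return "forward";
        if (statement.contains("<")) return "reverse";
        throw new IllegalArgumentException("not a rule statement: " + statement);
    }

    /** Replaces each variable reference with the text assigned to it. */
    String expand(String text) {
        Matcher ref = VAR_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (ref.find()) {
            String value = variables.get(ref.group(1));
            if (value == null) {
                throw new IllegalArgumentException("undefined variable: " + ref.group(1));
            }
            ref.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        ref.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        StatementFormsSketch s = new StatementFormsSketch();
        System.out.println(s.classify("$softvowel=[eiyEIY]"));
        System.out.println(s.classify("ai<>$softvowel"));
        System.out.println(s.expand("$softvowel>x"));
    }
}
```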
Translation rules consist of a match pattern and an output
string. The match pattern consists of literal characters,
optionally preceded by context, and optionally followed by
context. Context characters, like literal pattern characters,
must be matched in the text being transliterated. However, unlike
literal pattern characters, they are not replaced by the output
text. For example, the pattern "abc{def}"
indicates the characters "def" must be
preceded by "abc" for a successful match.
If there is a successful match, "def" will
be replaced, but not "abc". The final '}'
is optional, so "abc{def" is equivalent to
"abc{def}". Another example is "{123}456"
(or "123}456") in which the literal
pattern "123" must be followed by "456".
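Context behaves much like lookbehind and lookahead in a regular expression: it must match, but it is not replaced. The sketch below uses java.util.regex as an analogy only. The helper name is hypothetical, the strings are treated as literals, and a single regex pass does not reproduce the transliterator's cursor behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContextSketch {
    /** Replaces key only where it is preceded by before and followed by after. */
    static String replaceInContext(String text, String before, String key,
                                   String after, String output) {
        StringBuilder regex = new StringBuilder();
        if (!before.isEmpty()) {
            // Preceding context: must be present, is not consumed.
            regex.append("(?<=").append(Pattern.quote(before)).append(")");
        }
        regex.append(Pattern.quote(key));
        if (!after.isEmpty()) {
            // Following context: must be present, is not consumed.
            regex.append("(?=").append(Pattern.quote(after)).append(")");
        }
        return Pattern.compile(regex.toString())
                .matcher(text)
                .replaceAll(Matcher.quoteReplacement(output));
    }

    public static void main(String[] args) {
        // Analogue of the pattern abc{def}: only the second "def" follows "abc".
        System.out.println(replaceInContext("adefabcdefz", "abc", "def", "", "X"));
        // Analogue of the pattern {123}456: only the first "123" precedes "456".
        System.out.println(replaceInContext("123456 1237", "", "123", "456", "x"));
    }
}
```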
The output string of a forward or reverse rule consists of
characters to replace the literal pattern characters. If the
output string contains the character '|', this is
taken to indicate the location of the cursor after
replacement. The cursor is the point in the text at which the
next replacement, if any, will be applied. The cursor is usually
placed within the replacement text; however, it can actually be
placed into the preceding or following context by using the
special character '@'. Examples:
a {foo} z > | @ bar; # foo -> bar, move cursor before a
{foo} xyz > bar @@|; # foo -> bar, cursor between y and z
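The cursor markers in the two examples above can be read as an offset relative to the start of the replacement text. The following sketch computes that offset. It is an illustration with hypothetical names, and it assumes that an output string without '|' leaves the cursor after the replacement, which is consistent with the examples in this document but not stated as a rule.

```java
final class CursorSketch {
    /** Replacement text plus the cursor position relative to its start. */
    static final class Output {
        final String text;
        final int cursorOffset;

        Output(String text, int cursorOffset) {
            this.text = text;
            this.cursorOffset = cursorOffset;
        }
    }

    static Output parse(String spec) {
        String s = spec.replaceAll("\\s+", ""); // whitespace is ignored in rules
        int bar = s.indexOf('|');
        if (bar < 0) {
            // Assumption: with no '|', the cursor ends up after the replacement.
            return new Output(s, s.length());
        }
        String left = s.substring(0, bar);
        String right = s.substring(bar + 1);
        if (left.isEmpty() && right.startsWith("@")) {
            // Each '@' after the bar moves the cursor one place into the preceding context.
            int n = 0;
            while (n < right.length() && right.charAt(n) == '@') {
                n++;
            }
            return new Output(right.substring(n), -n);
        }
        if (right.isEmpty() && left.endsWith("@")) {
            // Each '@' before the bar moves the cursor one place into the following context.
            int end = left.length();
            while (end > 0 && left.charAt(end - 1) == '@') {
                end--;
            }
            String text = left.substring(0, end);
            return new Output(text, text.length() + (left.length() - end));
        }
        // The bar sits inside the replacement text itself.
        return new Output(left + right, left.length());
    }

    public static void main(String[] args) {
        Output before = parse("| @ bar");
        Output after = parse("bar @@|");
        System.out.println(before.text + " " + before.cursorOffset);
        System.out.println(after.text + " " + after.cursorOffset);
    }
}
```

For "| @ bar" the offset is -1, one character before the replacement, which is the position before the 'a' of the preceding context. For "bar @@|" the offset is 5, three characters of "bar" plus two of the following context, which is the position between 'y' and 'z'.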
UnicodeSet
UnicodeSet patterns may appear anywhere that
makes sense. They may appear in variable definitions.
Contrariwise, UnicodeSet patterns may themselves
contain variable references, such as "$a=[a-z];$not_a=[^$a]",
or "$range=a-z;$ll=[$range]".
UnicodeSet patterns may also be embedded directly
into rule strings. Thus, the following two rules are equivalent:
$vowel=[aeiou]; $vowel>'*'; # One way to do this
[aeiou]>'*'; # Another way
See UnicodeSet for more documentation and examples.
Segments
Segments of the input string can be matched and copied to the
output string. This makes certain sets of rules simpler and more
general, and makes reordering possible. For example:
([a-z]) > $1 $1; # double lowercase letters
([:Lu:]) ([:Ll:]) > $2 $1; # reverse order of Lu-Ll pairs
The segment of the input string to be copied is delimited by
"(" and ")". Up to
nine segments may be defined. Segments may not overlap. In the
output string, "$1" through "$9"
represent the input string segments, in left-to-right order of
definition.
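Segments correspond closely to capturing groups and backreferences in java.util.regex. As a rough analogy, not the transliterator itself, the two segment rules above behave like the following replacements. The class and method names are hypothetical, and each rule is applied on its own rather than as part of one rule set.

```java
final class SegmentSketch {
    /** Analogue of ([a-z]) > $1 $1: doubles each lowercase ASCII letter. */
    static String doubleLowercase(String s) {
        return s.replaceAll("([a-z])", "$1$1");
    }

    /** Analogue of ([:Lu:]) ([:Ll:]) > $2 $1: swaps each uppercase-lowercase pair. */
    static String swapUpperLower(String s) {
        return s.replaceAll("(\\p{Lu})(\\p{Ll})", "$2$1");
    }

    public static void main(String[] args) {
        System.out.println(doubleLowercase("aBc"));   // prints "aaBcc"
        System.out.println(swapUpperLower("Ab Cd"));  // prints "bA dC"
    }
}
```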
Anchors
Patterns can be anchored to the beginning or the end of the text. This is done with the
special characters '^' and '$'. For example:
^ a > 'BEG_A'; # match 'a' at start of text
a > 'A'; # match other instances of 'a'
z $ > 'END_Z'; # match 'z' at end of text
z > 'Z'; # match other instances of 'z'
It is also possible to match the beginning or the end of the text using a UnicodeSet.
This is done by including a virtual anchor character '$' at the end of the
set pattern. Although this is usually the match character for the end anchor, the set will
match either the beginning or the end of the text, depending on its placement. For
example:
$x = [a-z$]; # match 'a' through 'z' OR anchor
$x 1 > 2; # match '1' after a-z or at the start
3 $x > 4; # match '3' before a-z or at the end
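The four '^' and '$' rules shown earlier in this section can be imitated with one ordered regular-expression alternation, in which the anchored alternatives come first, just as the anchored rules precede the unanchored ones. This is an analogy with hypothetical names. It does not cover the set-based virtual anchor, and it uses \A and \z for the start and end of the whole text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorSketch {
    // Alternatives are tried in order, like rules: anchored forms come first.
    private static final Pattern RULES =
            Pattern.compile("(\\Aa)|(a)|(z\\z)|(z)");

    static String apply(String text) {
        Matcher m = RULES.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            if (m.group(1) != null) {
                replacement = "BEG_A";   // 'a' at the start of the text
            } else if (m.group(2) != null) {
                replacement = "A";       // any other 'a'
            } else if (m.group(3) != null) {
                replacement = "END_Z";   // 'z' at the end of the text
            } else {
                replacement = "Z";       // any other 'z'
            }
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(apply("abaz")); // prints "BEG_AbAEND_Z"
        System.out.println(apply("zaz"));  // prints "ZAEND_Z"
    }
}
```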
Example
The following example rules illustrate many of the features of
the rule language.
Rule 1.  abc{def}>x|y
Rule 2.  xyz>r
Rule 3.  yz>q
Applying these rules to the string "adefabcdefz"
yields the following results:
|adefabcdefz    Initial state, no rules match. Advance cursor.
a|defabcdefz    Still no match. Rule 1 does not match because the
                preceding context is not present.
ad|efabcdefz    Still no match. Keep advancing until there is a match...
ade|fabcdefz    ...
adef|abcdefz    ...
adefa|bcdefz    ...
adefab|cdefz    ...
adefabc|defz    Rule 1 matches; replace "def" with "xy" and back up
                the cursor to before the 'y'.
adefabcx|yz     Although "xyz" is present, rule 2 does not match
                because the cursor is before the 'y', not before
                the 'x'. Rule 3 does match. Replace "yz" with "q".
adefabcxq|      The cursor is at the end; transliteration is complete.
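The trace above can be reproduced with a very small rule interpreter. The following is a sketch of the matching loop as described in this example, not ICU's implementation. The class names are hypothetical, patterns are plain literals, and only the '|' cursor marker is supported.

```java
import java.util.Arrays;
import java.util.List;

final class MiniTransliteratorSketch {
    /** One forward rule: optional contexts around a literal key. */
    static final class Rule {
        final String before, key, after, output;
        final int cursorOffset;

        Rule(String before, String key, String after, String outputSpec) {
            this.before = before;
            this.key = key;
            this.after = after;
            int bar = outputSpec.indexOf('|');
            this.output = outputSpec.replace("|", "");
            // Without '|', assume the cursor lands after the replacement.
            this.cursorOffset = bar >= 0 ? bar : output.length();
        }

        boolean matches(String text, int pos) {
            return text.startsWith(key, pos)
                    && text.substring(0, pos).endsWith(before)
                    && text.startsWith(after, pos + key.length());
        }
    }

    /** The three rules of the example, in order. */
    static List<Rule> exampleRules() {
        return Arrays.asList(
                new Rule("abc", "def", "", "x|y"),  // Rule 1: abc{def}>x|y
                new Rule("", "xyz", "", "r"),       // Rule 2: xyz>r
                new Rule("", "yz", "", "q"));       // Rule 3: yz>q
    }

    static String transliterate(String text, List<Rule> rules) {
        int cursor = 0;
        while (cursor < text.length()) {
            Rule hit = null;
            for (Rule r : rules) {
                // The first matching rule wins.
                if (r.matches(text, cursor)) {
                    hit = r;
                    break;
                }
            }
            if (hit == null) {
                cursor++; // no rule matches here; advance
                continue;
            }
            text = text.substring(0, cursor) + hit.output
                    + text.substring(cursor + hit.key.length());
            cursor += hit.cursorOffset;
        }
        return text;
    }

    public static void main(String[] args) {
        // prints "adefabcxq"
        System.out.println(transliterate("adefabcdefz", exampleRules()));
    }
}
```

This loop does not guard against a rule that leaves the cursor in place and matches again, so it is suitable only for rule sets that always make progress, as these three do.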
The order of rules is significant. If multiple rules may match
at some point, the first matching rule is applied.
Forward and reverse rules may have an empty output string.
Otherwise, an empty left or right hand side of any statement is a
syntax error.
Single quotes are used to quote any character other than a
digit or letter. To specify a single quote itself, inside or
outside of quotes, use two single quotes in a row. For example,
the rule "'>'>o''clock" changes the
string ">" to the string "o'clock".
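The quoting convention can be sketched as follows. This is an illustration with hypothetical names and it handles forward rules only. It finds the first '>' that lies outside quotes, then removes the quoting from each side, treating two single quotes in a row as one literal quote.

```java
final class QuoteSketch {
    /** Removes quoting: a doubled quote becomes one quote, lone quotes vanish. */
    static String unquote(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\'') {
                out.append(c);
            } else if (i + 1 < s.length() && s.charAt(i + 1) == '\'') {
                out.append('\'');
                i++; // skip the second quote of the pair
            }
            // A lone quote only opens or closes a quoted run; emit nothing.
        }
        return out.toString();
    }

    /** Splits a forward rule at its first unquoted '>' and unquotes both sides. */
    static String[] splitForwardRule(String rule) {
        boolean inQuote = false;
        for (int i = 0; i < rule.length(); i++) {
            char c = rule.charAt(i);
            if (c == '\'') {
                if (i + 1 < rule.length() && rule.charAt(i + 1) == '\'') {
                    i++; // doubled quote is a literal and does not toggle quoting
                } else {
                    inQuote = !inQuote;
                }
            } else if (c == '>' && !inQuote) {
                return new String[] {
                    unquote(rule.substring(0, i)),
                    unquote(rule.substring(i + 1))
                };
            }
        }
        throw new IllegalArgumentException("no unquoted '>' in: " + rule);
    }

    public static void main(String[] args) {
        String[] sides = splitForwardRule("'>'>o''clock");
        System.out.println(sides[0] + " becomes " + sides[1]);
    }
}
```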
Notes
While a RuleBasedTransliterator is being built, it checks that
the rules are added in proper order. For example, if the rule
"a>x" is followed by the rule "ab>y",
then the second rule will throw an exception. The reason is that
the second rule can never be triggered, since the first rule
always matches anything it matches. In other words, the first
rule masks the second rule.
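The masking check can be approximated for the simplest case, literal patterns without context, by testing whether a new pattern begins with an earlier one. The source does not describe how the real check is performed, so the following is only a sketch under that assumption, with hypothetical names and exception type.

```java
import java.util.ArrayList;
import java.util.List;

final class MaskingCheckSketch {
    private final List<String> patterns = new ArrayList<>();

    /** Adds a literal pattern, rejecting it if an earlier pattern masks it. */
    void addRule(String pattern) {
        for (String earlier : patterns) {
            // An earlier, shorter prefix always matches first,
            // so the new rule could never be triggered.
            if (pattern.startsWith(earlier)) {
                throw new IllegalArgumentException(
                        "rule \"" + pattern + "\" is masked by \"" + earlier + "\"");
            }
        }
        patterns.add(pattern);
    }

    public static void main(String[] args) {
        MaskingCheckSketch ok = new MaskingCheckSketch();
        ok.addRule("ab"); // the longer pattern first is fine
        ok.addRule("a");

        MaskingCheckSketch bad = new MaskingCheckSketch();
        bad.addRule("a");
        try {
            bad.addRule("ab"); // masked by "a"
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Listing the longer pattern first avoids the problem, since "ab" then gets a chance to match before "a" does.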
Copyright (c) IBM Corporation 1999-2000. All rights reserved.
author: Alan Liu