StringSearch is the concrete subclass of
SearchIterator that provides language-sensitive text searching
based on the comparison rules defined in a
RuleBasedCollator object.
StringSearch uses a version of the fast Boyer-Moore search
algorithm that has been adapted to work with the large character set of
Unicode. Refer to
"Efficient Text Searching in Java", published in the
Java Report in February 1999, for further information on the
algorithm.
Users are also strongly encouraged to read the section on
String Search and
Collation in the user guide before attempting to use this class.
String searching gets a little complicated when accents are encountered at
match boundaries. If a match is found and it has preceding or trailing
accents not part of the match, the result returned will include the
preceding accents up to the first base character, if the pattern searched
for starts with an accent. Likewise,
if the pattern ends with an accent, all trailing accents up to the first
base character will be included in the result.
For example, if a match is found in target text "a\u0325\u0300" for
the pattern
"a\u0325", the result returned by StringSearch will be the index 0 and
length 3 <0, 3>. If a match is found in the target
"a\u0325\u0300"
for the pattern "\u0300", then the result will be index 1 and length 2
<1, 2>.
In the case where the decomposition mode is on for the RuleBasedCollator,
all matches that start or end with an accent will have their results include
preceding or following accents respectively. For example, if pattern "a" is
looked for in the target text "á\u0325", the result will be
index 0 and length 2 <0, 2>.
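The two boundary results given above for the target "a\u0325\u0300" can be reproduced with a small model. This is only a sketch of the rule as described, not the StringSearch implementation: the class BoundaryAccentSketch and its helpers isAccent and widen are hypothetical names, and "accent" is approximated here as any character of general category Mn.

```java
public class BoundaryAccentSketch {
    /** Assumption: an "accent" is a non-spacing combining mark (category Mn). */
    static boolean isAccent(char c) {
        return Character.getType(c) == Character.NON_SPACING_MARK;
    }

    /**
     * Widens a raw match [start, start + length) of a non-empty pattern over
     * neighbouring accents and returns {start, length}.
     */
    static int[] widen(String target, String pattern, int start, int length) {
        int end = start + length;
        // Pattern starts with an accent: pull in preceding accents,
        // stopping before the first base character.
        if (isAccent(pattern.charAt(0))) {
            while (start > 0 && isAccent(target.charAt(start - 1))) {
                start--;
            }
        }
        // Pattern ends with an accent: pull in trailing accents,
        // stopping before the next base character.
        if (isAccent(pattern.charAt(pattern.length() - 1))) {
            while (end < target.length() && isAccent(target.charAt(end))) {
                end++;
            }
        }
        return new int[] {start, end - start};
    }

    public static void main(String[] args) {
        String target = "a\u0325\u0300";
        int[] r1 = widen(target, "a\u0325", 0, 2); // raw match covers indices 0..1
        int[] r2 = widen(target, "\u0300", 2, 1);  // raw match covers index 2
        System.out.println("<" + r1[0] + ", " + r1[1] + ">"); // prints <0, 3>
        System.out.println("<" + r2[0] + ", " + r2[1] + ">"); // prints <1, 2>
    }
}
```

The model ignores collation entirely, so it says nothing about which raw matches a collator would accept; it only shows how a result is widened once a raw match is known.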
The StringSearch class provides two options to handle accent matching
described below:
Let S' be the sub-string of a text string S between the offsets start and
end <start, end>.
A pattern string P matches a text string S at the offsets <start,
length>
if
option 1. P matches some canonical equivalent string of S'. Suppose the
RuleBasedCollator used for searching has a collation strength of
TERTIARY, so that all accents are non-ignorable. If the pattern
"a\u0300" is searched in the target text
"a\u0325\u0300",
a match will be found, since the target text is canonically
equivalent to "a\u0300\u0325".
option 2. P matches S' and if P starts or ends with a combining mark,
there exists no non-ignorable combining mark before or after S'
in S respectively. Following the example above, the pattern
"a\u0300" will not find a match in "a\u0325\u0300",
since
there exists a non-ignorable accent '\u0325' in the middle of
'a' and '\u0300'. Even with a target text of
"a\u0300\u0325" a match will not be found because of the
non-ignorable trailing accent \u0325.
Option 2 is the default mode for dealing with boundary accents unless
option 1 is selected via the API setCanonical(boolean).
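The canonical equivalence that option 1 relies on can be checked with the JDK alone. The sketch below is illustrative and does not use StringSearch: it compares the NFD forms of two strings with java.text.Normalizer, and the class and method names are hypothetical.

```java
import java.text.Normalizer;

public class CanonicalEquivalenceSketch {
    /** Two strings are canonically equivalent when their NFD forms are identical. */
    static boolean canonicallyEquivalent(String a, String b) {
        String nfdA = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nfdB = Normalizer.normalize(b, Normalizer.Form.NFD);
        return nfdA.equals(nfdB);
    }

    public static void main(String[] args) {
        String target = "a\u0325\u0300";    // a, ring below, grave
        String reordered = "a\u0300\u0325"; // a, grave, ring below

        // The two marks have different non-zero combining classes, so
        // canonical reordering puts both strings into the same NFD form.
        System.out.println(canonicallyEquivalent(target, reordered)); // prints true

        // The pattern alone is not equivalent to the whole target.
        System.out.println(canonicallyEquivalent("a\u0300", target)); // prints false
    }
}
```

Under option 1 the pattern "a\u0300" can therefore match, because the reordered form of the target begins with it; under option 2 the intervening or trailing \u0325 blocks the match.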
One restriction is to be noted for option 1. Currently there are no
composite characters that consist of a character with combining class > 0
before a character with combining class == 0. However, if such a character
exists in the future, the StringSearch may not work correctly with option 1
when such characters are encountered.
SearchIterator provides APIs to specify the starting position
within the text string to be searched, e.g. setIndex,
preceding and following. Since the starting position will
be set as it is specified, please take note that there are some dangerous
positions at which the search may return incorrect results:
- The midst of a substring that requires decomposition.
- If the following match is to be found, the position should not be the
second character of a pair that must be swapped with the preceding
character. Conversely, if the preceding match is to be found, the
position to search from should not be the first character of a pair that
must be swapped with the next character. For example, certain Thai and
Lao characters require swapping.
- If a following pattern match is to be found, any position within a
contracting sequence except the first will fail. Conversely, if a
preceding pattern match is to be found, an invalid starting point
would be any character within a contracting sequence except the last.
Though collator attributes will be taken into consideration while
performing matches, there are no APIs provided in StringSearch for setting
and getting the attributes. These attributes can be set by getting the
collator from getCollator and using the APIs in
com.ibm.icu.text.Collator. To update StringSearch to the new
collator attributes, reset() or
setCollator(RuleBasedCollator) has to be called.
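A minimal sketch of that sequence follows, assuming ICU4J is on the classpath. It changes the collation strength through the collator returned by getCollator and then calls reset() so that StringSearch picks up the new attribute. The class name, the helper methods findAll and newSearch, and the sample strings are illustrative only.

```java
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.text.SearchIterator;
import com.ibm.icu.text.StringSearch;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class StringSearchStrengthDemo {
    /** Collects {start, length} for every match, scanning from the beginning. */
    static List<int[]> findAll(StringSearch search) {
        List<int[]> matches = new ArrayList<>();
        for (int pos = search.first(); pos != SearchIterator.DONE; pos = search.next()) {
            matches.add(new int[] {pos, search.getMatchLength()});
        }
        return matches;
    }

    /** Builds a search over target using an English collator (an arbitrary choice). */
    static StringSearch newSearch(String pattern, String target) {
        RuleBasedCollator collator =
                (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        return new StringSearch(pattern, new StringCharacterIterator(target), collator);
    }

    public static void main(String[] args) {
        StringSearch search = newSearch("resume", "resume, Resume, RESUME");
        System.out.println("default strength: " + findAll(search).size() + " match(es)");

        // Attributes are set on the collator, not on StringSearch itself.
        search.getCollator().setStrength(Collator.PRIMARY);
        // Without reset() (or setCollator) the search keeps its old settings.
        search.reset();
        System.out.println("primary strength: " + findAll(search).size() + " match(es)");
    }
}
```

At the default tertiary strength case differences are significant, so only the lower-case occurrence is expected to match; at primary strength case is ignored and all three occurrences should be found.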
Consult the
String Search user guide and the SearchIterator
documentation for more information and examples of use.
This class is not subclassable.
See Also: SearchIterator, RuleBasedCollator
Author: Laura Werner, synwee