Source Code for BreakIteratorRules_en_US_TEST.java (icu4j, package com.ibm.icu.dev.test.rbbi)

/*
 *******************************************************************************
 * Copyright (C) 1996-2004, International Business Machines Corporation and    *
 * others. All Rights Reserved.                                                *
 *******************************************************************************
 */
package com.ibm.icu.dev.test.rbbi;

import java.util.ListResourceBundle;

/**
 * This resource bundle is included for testing and demonstration purposes only.
 * It applies the dictionary-based algorithm to English text that has had all the
 * spaces removed.  Once we have good test cases for Thai, we will replace this
 * with good resource data (and a good dictionary file) for Thai.
 */
public class BreakIteratorRules_en_US_TEST extends ListResourceBundle {
    private static final String DATA_NAME = "/com/ibm/icu/dev/data/rbbi/english.dict";

    // calling code will handle the case where the dictionary does not exist

    public Object[][] getContents() {
        return new Object[][] {
            // names of classes to instantiate for the different kinds of break
            // iterator.  Notice we're now using DictionaryBasedBreakIterator
            // for word and line breaking.
            { "BreakIteratorClasses",
                new String[] {
                    "RuleBasedBreakIterator",       // character-break iterator class
                    "DictionaryBasedBreakIterator", // word-break iterator class
                    "DictionaryBasedBreakIterator", // line-break iterator class
                    "RuleBasedBreakIterator"        // sentence-break iterator class
                }
            },

            // These are the same word-breaking rules as are specified in the default
            // resource, except that the Latin letters, apostrophe, and hyphen are
            // specified as dictionary characters
            { "WordBreakRules",
                // ignore non-spacing marks, enclosing marks, and format characters,
                // all of which should not influence the algorithm
                "$_ignore_=[[:Mn:][:Me:][:Cf:]];"

                // lower and upper case Roman letters, apostrophe and dash are
                // in the English dictionary
                + "$_dictionary_=[a-zA-Z\\'\\-];"

                // Hindi phrase separator, kanji, katakana, hiragana, CJK diacriticals,
                // other letters, and digits
                + "$danda=[\u0964\u0965];"
                + "$kanji=[\u3005\u4e00-\u9fa5\uf900-\ufa2d];"
                + "$kata=[\u3099-\u309c\u30a1-\u30fe];"
                + "$hira=[\u3041-\u309e\u30fc];"
                + "$let=[[[:L:][:Mc:]]-[$kanji$kata$hira]];"
                + "$dgt=[:N:];"

                // punctuation that can occur in the middle of a word: currently
                // dashes, apostrophes, and quotation marks
                + "$mid_word=[[:Pd:]\u00ad\u2027\\\"\\\'];"

                // punctuation that can occur in the middle of a number: currently
                // apostrophes, quotation marks, periods, commas, and the Arabic
                // decimal point
                + "$mid_num=[\\\"\\\'\\,\u066b\\.];"

                // punctuation that can occur at the beginning of a number: currently
                // the period, the number sign, and all currency symbols except the cents sign
                + "$pre_num=[[[:Sc:]-[\u00a2]]\\#\\.];"

                // punctuation that can occur at the end of a number: currently
                // the percent, per-thousand, per-ten-thousand, and Arabic percent
                // signs, the cents sign, and the ampersand
                + "$post_num=[\\%\\&\u00a2\u066a\u2030\u2031];"

                // line separators: currently LF, FF, PS, and LS
                + "$ls=[\n\u000c\u2028\u2029];"

                // whitespace: all space separators and the tab character
                + "$ws=[[:Zs:]\t];"

                // a word is a sequence of letters that may contain internal
                // punctuation, as long as it begins and ends with a letter and
                // never contains two punctuation marks in a row
                + "$word=($let+($mid_word$let+)*$danda?);"

                // a number is a sequence of digits that may contain internal
                // punctuation, as long as it begins and ends with a digit and
                // never contains two punctuation marks in a row.
                + "$number=($dgt+($mid_num$dgt+)*);"

                // break after every character, with the following exceptions
                // (this will cause punctuation marks that aren't considered
                // part of words or numbers to be treated as words unto themselves)
                + ".;"

                // keep together any sequence of contiguous words and numbers
                // (including just one of either), plus an optional trailing
                // number-suffix character
                + "$word?($number$word)*($number$post_num?)?;"

                // keep together any sequence of contiguous words and numbers
                // that starts with a number-prefix character and a number,
                // and may end with a number-suffix character
                + "$pre_num($number$word)*($number$post_num?)?;"

                // keep together runs of whitespace (optionally with a single trailing
                // line separator or CRLF sequence)
                + "$ws*\r?$ls?;"

                // keep together runs of Katakana
                + "$kata*;"

                // keep together runs of Hiragana
                + "$hira*;"

                // keep together runs of Kanji
                + "$kanji*;"
            },

            // These are the same line-breaking rules as are specified in the default
            // resource, except that the Latin letters, apostrophe, and hyphen are
            // specified as dictionary characters
            { "LineBreakRules",
                // ignore non-spacing marks, enclosing marks, and format characters
                "$_ignore_=[[:Mn:][:Me:][:Cf:]];"

                // lower and upper case Roman letters, apostrophe and dash
                // are in the English dictionary
                + "$_dictionary_=[a-zA-Z\\'\\-];"

                // Hindi phrase separators
                + "$danda=[\u0964\u0965];"

                // characters that always cause a break: ETX, tab, LF, FF, LS, and PS
                + "$break=[\u0003\t\n\f\u2028\u2029];"

                // characters that always prevent a break: the non-breaking space
                // and similar characters
                + "$nbsp=[\u00a0\u2007\u2011\ufeff];"

                // whitespace: space separators and control characters, except for
                // CR and the other characters mentioned above
                + "$space=[[[:Zs:][:Cc:]]-[$nbsp$break\r]];"

                // dashes: dash punctuation and the discretionary hyphen, except for
                // non-breaking hyphens
                + "$dash=[[[:Pd:]\u00ad]-[$nbsp]];"

                // characters that stick to a word if they precede it: currency symbols
                // (except the cents sign) and starting punctuation
                + "$pre_word=[[[:Sc:]-[\u00a2]][:Ps:]\\\"\\\'];"

                // characters that stick to a word if they follow it: ending punctuation,
                // other punctuation that usually occurs at the end of a sentence,
                // small Kana characters, some CJK diacritics, etc.
                + "$post_word=[[:Pe:]\\!\\\"\\\'\\%\\.\\,\\:\\;\\?\u00a2\u00b0\u066a\u2030-\u2034"
                + "\u2103\u2105\u2109\u3001\u3002\u3005\u3041\u3043\u3045\u3047\u3049\u3063"
                + "\u3083\u3085\u3087\u308e\u3099-\u309e\u30a1\u30a3\u30a5\u30a7\u30a9"
                + "\u30c3\u30e3\u30e5\u30e7\u30ee\u30f5\u30f6\u30fc-\u30fe\uff01\uff0c"
                + "\uff0e\uff1f];"

                // Kanji: actually includes both Kanji and Kana, except for small Kana and
                // CJK diacritics
                + "$kanji=[[\u4e00-\u9fa5\uf900-\ufa2d\u3041-\u3094\u30a1-\u30fa]-[$post_word$_ignore_]];"

                // digits
                + "$digit=[[:Nd:][:No:]];"

                // punctuation that can occur in the middle of a number: periods and commas
                + "$mid_num=[\\.\\,];"

                // everything not mentioned above, plus the quote marks (which are both
                // <pre-word>, <post-word>, and <char>)
                + "$char=[^$break$space$dash$kanji$nbsp$_ignore_$pre_word$post_word$mid_num$danda\r\\\"\\\'];"

                // a "number" is a run of prefix characters and dashes, followed by one or
                // more digits with isolated number-punctuation characters interspersed
                + "$number=([$pre_word$dash]*$digit+($mid_num$digit+)*);"

                // the basic core of a word can be either a "number" as defined above, a single
                // "Kanji" character, or a run of any number of not-explicitly-mentioned
                // characters (this includes Latin letters)
                + "$word_core=([$pre_word$char]*|$kanji|$number);"

                // a word may end with an optional suffix that can be either a run of one or
                // more dashes or a run of word-suffix characters, followed by an optional
                // run of whitespace
                + "$word_suffix=(($dash+|$post_word*)$space*);"

                // a word, thus, is an optional run of word-prefix characters, followed by
                // a word core and a word suffix (the syntax of <word-core> and <word-suffix>
                // actually allows either of them to match the empty string, putting a break
                // between things like ")(" or "aaa(aaa")
                + "$word=($pre_word*$word_core$word_suffix);"

                // finally, the rule that does the work: Keep together any run of words that
                // are joined by runs of one or more non-breaking characters ($nbsp).  Also
                // keep a trailing line-break character or CRLF combination with the word.
                // (line separators "win" over nbsp's)
                + "$word($nbsp+$word)*\r?$break?;"
            },

            // these two resources specify the pathnames of the dictionary files to
            // use for word breaking and line breaking.  Both currently refer to
            // the file named by DATA_NAME (english.dict under /com/ibm/icu/dev/data/rbbi),
            // which must be somewhere in the class path.  It's important to note that
            // english.dict was created for testing purposes only, and doesn't
            // come anywhere close to being an exhaustive dictionary of English
            // words (basically, it contains all the words in the Declaration of
            // Independence, and the Revised Standard Version of the book of Genesis,
            // plus a few other words thrown in to show more interesting cases).
            // { "WordBreakDictionary", "com\\ibm\\text\\resources\\english.dict" },
            // { "LineBreakDictionary", "com\\ibm\\text\\resources\\english.dict" }
            { "WordBreakDictionary", DATA_NAME },
            { "LineBreakDictionary", DATA_NAME }
        };
    }
}
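
The class above exposes its data as key/value pairs returned from getContents(). As a rough sketch of how entries of this shape can be read back through the standard java.util.ResourceBundle API, the example below uses a small stand-in bundle instead of the ICU test class. The names BundleLookupSketch and DemoRules, the shortened rule string, and the path /demo/english.dict are all hypothetical and exist only for illustration. The sketch does not show how ICU4J itself consumes these keys.

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

public class BundleLookupSketch {

    // Hypothetical stand-in bundle: same key names as the listing, shortened values.
    static class DemoRules extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                { "BreakIteratorClasses",
                    new String[] {
                        "RuleBasedBreakIterator",       // character
                        "DictionaryBasedBreakIterator", // word
                        "DictionaryBasedBreakIterator", // line
                        "RuleBasedBreakIterator"        // sentence
                    }
                },
                // Rule text is built by concatenation, as in the listing.
                { "WordBreakRules", "$_dictionary_=[a-zA-Z\\'\\-];" + ".;" },
                // Assumed path, for illustration only.
                { "WordBreakDictionary", "/demo/english.dict" }
            };
        }
    }

    public static void main(String[] args) {
        ResourceBundle bundle = new DemoRules();

        // String[] values are fetched with getStringArray, plain strings with getString.
        String[] classes = bundle.getStringArray("BreakIteratorClasses");
        String rules = bundle.getString("WordBreakRules");
        String dictionary = bundle.getString("WordBreakDictionary");

        System.out.println("word-break class: " + classes[1]);
        System.out.println("rules: " + rules);
        System.out.println("dictionary: " + dictionary);
    }
}
```

Looking up a key that the bundle does not define throws java.util.MissingResourceException, so a caller that treats a dictionary entry as optional would need to catch it.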
