Source code for BreakIteratorRules.java (6.0 JDK Modules » j2me » sun.text.resources)

/*
 *
 * @(#)BreakIteratorRules.java	1.22 06/10/10
 *
 * Portions Copyright  2000-2006 Sun Microsystems, Inc. All Rights
 * Reserved.  Use is subject to license terms.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License version
 * 2 only, as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * General Public License version 2 for more details (a copy is
 * included at /legal/license.txt).
 *
 * You should have received a copy of the GNU General Public License
 * version 2 along with this work; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
 * 02110-1301 USA
 *
 * Please contact Sun Microsystems, Inc., 4150 Network Circle, Santa
 * Clara, CA 95054 or visit www.sun.com if you need additional
 * information or have any questions.
 */

/*
 * Licensed Materials - Property of IBM
 *
 * (C) Copyright IBM Corp. 1999 All Rights Reserved.
 * (C) IBM Corp. 1997-1998.  All Rights Reserved.
 *
 * The program is provided "as is" without any warranty express or
 * implied, including the warranty of non-infringement and the implied
 * warranties of merchantibility and fitness for a particular purpose.
 * IBM will not be liable for any damages suffered by you as a result
 * of using the Program. In no event will IBM be liable for any
 * special, indirect or consequential damages or lost profits even if
 * IBM has been advised of the possibility of their occurrence. IBM
 * will not be liable for any third party claims against you.
 */

package sun.text.resources;

import java.util.ListResourceBundle;

/**
 * Default break-iterator rules.  These rules are more or less general for
 * all locales, although there are probably a few we're missing.  The
 * behavior currently mimics the behavior of BreakIterator in JDK 1.2.
 * There are known deficiencies in this behavior, including the fact that
 * the logic for handling CJK characters works for Japanese but not for
 * Chinese, and that we don't currently have an appropriate locale for
 * Thai.  The resources will eventually be updated to fix these problems.
 */

/* Modified for Hindi 3/1/99. */

public class BreakIteratorRules extends ListResourceBundle {
    public Object[][] getContents() {
        return contents;
    }

    static final Object[][] contents = {
        // BreakIteratorClasses lists the class names to instantiate for each
        // built-in type of BreakIterator
        { "BreakIteratorClasses",
            new String[] { "RuleBasedBreakIterator", // character-break iterator class
                    "RuleBasedBreakIterator", // word-break iterator class
                    "RuleBasedBreakIterator", // line-break iterator class
                    "RuleBasedBreakIterator" } // sentence-break iterator class
        },

        // rules describing how to break between logical characters
        { "CharacterBreakRules",

            // ignore non-spacing marks and enclosing marks (since we never
            // put a break before ignore characters, this keeps combining
            // accents with the base characters they modify)
            "<enclosing>=[:Mn::Me:];"

            // other category definitions
            + "<choseong>=[\u1100-\u115f];"
            + "<jungseong>=[\u1160-\u11a7];"
            + "<jongseong>=[\u11a8-\u11ff];"
            + "<surr-hi>=[\ud800-\udbff];"
            + "<surr-lo>=[\udc00-\udfff];"

            // break after every character, except as follows:
            + ".;"

            // keep base and combining characters together
            + "<base>=[^<enclosing>^[:Cc::Cf::Zl::Zp:]];"
            + "<base><enclosing><enclosing>*;"

            // keep CRLF sequences together
            + "\r\n;"

            // keep surrogate pairs together
            + "<surr-hi><surr-lo>;"

            // keep Hangul syllables spelled out using conjoining jamo together
            + "<choseong>*<jungseong>*<jongseong>*;"

            // various additions for Hindi support
            + "<nukta>=[\u093c];"
            + "<danda>=[\u0964\u0965];"
            + "<virama>=[\u094d];"
            + "<devVowelSign>=[\u093e-\u094c\u0962\u0963];"
            + "<devConsonant>=[\u0915-\u0939];"
            + "<devNuktaConsonant>=[\u0958-\u095f];"
            + "<devCharEnd>=[\u0902\u0903\u0951-\u0954];"
            + "<devCAMN>=(<devConsonant>{<nukta>});"
            + "<devConsonant1>=(<devNuktaConsonant>|<devCAMN>);"
            + "<zwj>=[\u200d];"
            + "<devConjunct>=({<devConsonant1><virama>{<zwj>}}<devConsonant1>);"
            + "<devConjunct>{<devVowelSign>}{<devCharEnd>};"
            + "<danda><nukta>;" },

        // default rules for finding word boundaries
        { "WordBreakRules",
            // ignore format characters, which should not influence the
            // algorithm (an earlier definition, kept commented out below,
            // also ignored non-spacing marks and enclosing marks)
            //"<ignore>=[:Mn::Me::Cf:];"
            "<ignore>=[:Cf:];"

            + "<enclosing>=[:Mn::Me:];"

            // Hindi phrase separator, kanji, katakana, hiragana, CJK diacriticals,
            // other letters, and digits
            + "<danda>=[\u0964\u0965];"
            + "<kanji>=[\u3005\u4e00-\u9fa5\uf900-\ufa2d];"
            + "<kata>=[\u30a1-\u30fa\u30fd\u30fe];"
            + "<hira>=[\u3041-\u3094\u309d\u309e];"
            + "<cjk-diacrit>=[\u3099-\u309c\u30fb\u30fc];"
            + "<letter-base>=[:L::Mc:^[<kanji><kata><hira><cjk-diacrit>]];"
            + "<let>=(<letter-base><enclosing>*);"
            + "<digit-base>=[:N:];"
            + "<dgt>=(<digit-base><enclosing>*);"

            // punctuation that can occur in the middle of a word: currently
            // dashes, apostrophes, quotation marks, and periods
            + "<mid-word>=[:Pd::Pc:\u00ad\u2027\\\"\\\'\\.];"

            // punctuation that can occur in the middle of a number: currently
            // apostrophes, quotation marks, periods, commas, and the Arabic
            // decimal point
            + "<mid-num>=[\\\"\\\'\\,\u066b\\.];"

            // punctuation that can occur at the beginning of a number: currently
            // the period, the number sign, and all currency symbols except the cents sign
            + "<pre-num>=[:Sc:\\#\\.^\u00a2];"

            // punctuation that can occur at the end of a number: currently
            // the percent, per-thousand, per-ten-thousand, and Arabic percent
            // signs, the cents sign, and the ampersand
            + "<post-num>=[\\%\\&\u00a2\u066a\u2030\u2031];"

            // line separators: currently LF, FF, LS, and PS
            + "<ls>=[\n\u000c\u2028\u2029];"

            // whitespace: all space separators and the tab character
            + "<ws-base>=[:Zs:\t];"
            + "<ws>=(<ws-base><enclosing>*);"

            // a word is a sequence of letters that may contain internal
            // punctuation, as long as it begins and ends with a letter and
            // never contains two punctuation marks in a row
            + "<word>=((<let><let>*(<mid-word><let><let>*)*){<danda>});"

            // a number is a sequence of digits that may contain internal
            // punctuation, as long as it begins and ends with a digit and
            // never contains two punctuation marks in a row.
            + "<number>=(<dgt><dgt>*(<mid-num><dgt><dgt>*)*);"

            // break after every character, with the following exceptions
            // (this will cause punctuation marks that aren't considered
            // part of words or numbers to be treated as words unto themselves)
            + ".;"

            // keep together any sequence of contiguous words and numbers
            // (including just one of either), plus an optional trailing
            // number-suffix character
            + "{<word>}(<number><word>)*{<number>{<post-num>}};"

            // keep together any sequence of contiguous words and numbers
            // that starts with a number-prefix character and a number,
            // and may end with a number-suffix character
            + "<pre-num>(<number><word>)*{<number>{<post-num>}};"

            // keep together runs of whitespace (optionally with a single trailing
            // line separator or CRLF sequence)
            + "<ws>*{\r}{<ls>};"

            // keep together runs of Katakana and CJK diacritical marks
            + "[<kata><cjk-diacrit>]*;"

            // keep together runs of Hiragana and CJK diacritical marks
            + "[<hira><cjk-diacrit>]*;"

            // keep together runs of Kanji
            + "<kanji>*;"

            // keep together anything else and an enclosing mark
            + "<base>=[^<enclosing>^[:Cc::Cf::Zl::Zp:]];"
            + "<base><enclosing><enclosing>*;" },

        // default rules for determining legal line-breaking positions
        {
            "LineBreakRules",
            // characters that always cause a break: ETX, tab, LF, FF, LS, and PS
            "<break>=[\u0003\t\n\f\u2028\u2029];"

            // ignore format characters and control characters EXCEPT for breaking chars
            + "<ignore>=[:Cf:[:Cc:^[<break>\r]]];"

            // enclosing marks
            + "<enclosing>=[:Mn::Me:];"

            // Hindi phrase separators
            + "<danda>=[\u0964\u0965];"

            // characters that always prevent a break: the non-breaking space
            // and similar characters
            + "<glue>=[\u00a0\u0f0c\u2007\u2011\u202f\ufeff];"

            // whitespace: space separators and control characters, except for
            // CR and the other characters mentioned above
            + "<space>=[:Zs::Cc:^[<glue><break>\r]];"

            // dashes: dash punctuation and the discretionary hyphen, except for
            // non-breaking hyphens
            + "<dash>=[:Pd:\u00ad^<glue>];"

            // characters that stick to a word if they precede it: currency symbols
            // (except the cents sign) and starting punctuation
            + "<pre-word>=[:Sc::Ps::Pi:^[\u00a2]\\\"\\\'];"

            // characters that stick to a word if they follow it: ending punctuation,
            // other punctuation that usually occurs at the end of a sentence,
            // small Kana characters, some CJK diacritics, etc.
            + "<post-word>=[\\\":Pe::Pf:\\!\\%\\.\\,\\:\\;\\?\u00a2\u00b0\u066a\u2030-\u2034\u2103"
            + "\u2105\u2109\u3001\u3002\u3005\u3041\u3043\u3045\u3047\u3049\u3063"
            + "\u3083\u3085\u3087\u308e\u3099-\u309e\u30a1\u30a3\u30a5\u30a7\u30a9"
            + "\u30c3\u30e3\u30e5\u30e7\u30ee\u30f5\u30f6\u30fc-\u30fe\uff01\uff05"
            + "\uff0c\uff0e\uff1a\uff1b\uff1f];"

            // Kanji: actually includes both Kanji and Kana, except for small Kana and
            // CJK diacritics
            + "<kanji>=[\u4e00-\u9fa5\uf900-\ufa2d\u3041-\u3094\u30a1-\u30fa^[<post-word><ignore>]];"

            // digits
            + "<digit>=[:Nd::No:];"

            // punctuation that can occur in the middle of a number: periods and commas
            + "<mid-num>=[\\.\\,];"

            // everything not mentioned above
            + "<char>=[^[<break><space><dash><kanji><glue><ignore><pre-word><post-word><mid-num>\r<danda>]];"

            // a "number" is a run of prefix characters and dashes, followed by one or
            // more digits with isolated number-punctuation characters interspersed
            + "<number>=([<pre-word><dash>]*<digit><digit>*(<mid-num><digit><digit>*)*);"

            // the basic core of a word can be either a "number" as defined above, a single
            // "Kanji" character, or a run of any number of not-explicitly-mentioned
            // characters (this includes Latin letters)
            + "<word-core>=(<char>*|<kanji>|<number>);"

            // a word may end with an optional suffix that may be either a run of one or
            // more dashes or a run of word-suffix characters
            + "<word-suffix>=((<dash><dash>*|<post-word>*));"

            // a word, thus, is an optional run of word-prefix characters, followed by
            // a word core and a word suffix (the syntax of <word-core> and <word-suffix>
            // actually allows either of them to match the empty string, putting a break
            // between things like ")(" or "aaa(aaa")
            + "<word>=(<pre-word>*<word-core><word-suffix>);"

            + "<hack1>=[\\(];"
            + "<hack2>=[\\)];"
            + "<hack3>=[\\$\\'];"

            // finally, the rule that does the work: Keep together any run of words that
            // are joined by runs of one or more non-spacing marks.  Also keep a trailing
            // line-break character or CRLF combination with the word.  (line separators
            // "win" over nbsp's)
            + "<word>(((<space>*<glue><glue>*{<space>})|<hack3>)<word>)*<space>*{<enclosing>*}{<hack1><hack2><post-word>*}{<enclosing>*}{\r}{<break>};"
            + "\r<break>;" },

        // default rules for finding sentence boundaries
        {
            "SentenceBreakRules",
            // ignore non-spacing marks, enclosing marks, and format characters
            "<ignore>=[:Mn::Me::Cf:];"

            // letters
            + "<letter>=[:L:];"

            // lowercase letters
            + "<lc>=[:Ll:];"

            // uppercase letters
            + "<uc>=[:Lu:];"

            // NOT lowercase letters
            + "<notlc>=[<letter>^<lc>];"

            // whitespace (line separators are treated as whitespace)
            + "<space>=[\t\r\f\n\u2028:Zs:];"

            // punctuation which may occur at the beginning of a sentence: "starting
            // punctuation" and quotation marks
            + "<start-punctuation>=[:Ps::Pi:\\\"\\\'];"

            // punctuation which may occur at the end of a sentence: "ending punctuation"
            // and quotation marks
            + "<end>=[:Pe::Pf:\\\"\\\'];"

            // digits
            + "<digit>=[:N:];"

            // characters that unambiguously signal the end of a sentence
            + "<term>=[\\!\\?\u3002\uff01\uff1f];"

            // periods, which MAY signal the end of a sentence
            + "<period>=[\\.\uff0e];"

            // characters that may occur at the beginning of a sentence: basically anything
            // not mentioned above (letters and digits are specifically excluded)
            + "<sent-start>=[^[:L:<space><start-punctuation><end><digit><term><period>\u2029<ignore>]];"

            // Hindi phrase separator
            + "<danda>=[\u0964\u0965];"

            // always break sentences after paragraph separators
            + ".*?{\u2029};"

            // always break after a danda, if it's followed by whitespace
            + ".*?<danda><space>*;"

            // if you see a period, skip over additional periods and ending punctuation
            // and if the next character is a paragraph separator, break after the
            // paragraph separator
            //+ ".*?<period>[<period><end>]*<space>*\u2029;"
            //+ ".*?[<period><end>]*<space>*\u2029;"

            // if you see a period, skip over additional periods and ending punctuation,
            // followed by optional whitespace, followed by optional starting punctuation,
            // and if the next character is something that can start a sentence
            // (basically, a capital letter), then put the sentence break between the
            // whitespace and the opening punctuation
            + ".*?<period>[<period><end>]*<space><space>*/<notlc>;"
            + ".*?<period>[<period><end>]*<space>*/[<start-punctuation><sent-start>][<start-punctuation><sent-start>]*<letter>;"

            // if you see a sentence-terminating character, skip over any additional
            // terminators, periods, or ending punctuation, followed by any whitespace,
            // followed by a SINGLE optional paragraph separator, and put the break there
            + ".*?<term>[<term><period><end>]*<space>*{\u2029};"

            // The following rules are here to aid in backwards iteration.  The automatically
            // generated backwards state table will rewind to the beginning of the
            // paragraph all the time (or all the way to the beginning of the document
            // if the document doesn't use the Unicode PS character) because the only
            // unambiguous character pairs are those involving paragraph separators.
            // These specify a few more unambiguous breaking situations.

            // if you see a sentence-starting character, followed by starting punctuation
            // (remember, we're iterating backwards), followed by an optional run of
            // whitespace, followed by an optional run of ending punctuation, followed
            // by a period, this is a safe place to turn around
            + "!<sent-start><start-punctuation>*<space>*<end>*<period>;"

            // if you see a letter or a digit, followed by an optional run of
            // starting punctuation, followed by an optional run of whitespace,
            // followed by an optional run of ending punctuation, followed by
            // a sentence terminator, this is a safe place to turn around
            + "![<sent-start><lc><digit>]<start-punctuation>*<space>*<end>*<term>;" } };
}
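
The rule tables above are only data; applications normally see their effect through the public java.text.BreakIterator API. The following is a minimal sketch of that usage, not part of the original file. The class name BreakIteratorDemo and the helper segments are hypothetical, and the JDK you run it on may load different rule data than this resource bundle. It therefore exercises only behaviors that the comments above describe and that standard implementations are expected to share: a combining mark stays with its base character, a CRLF pair is not split, and a sentence break falls after a terminator and its trailing whitespace.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical demo class; not part of the JDK sources. */
final class BreakIteratorDemo {

    /** Returns the substrings lying between consecutive boundaries of the iterator. */
    static List<String> segments(BreakIterator it, String text) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        // next() returns BreakIterator.DONE once the last boundary has been passed
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        BreakIterator chars = BreakIterator.getCharacterInstance(Locale.ROOT);
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);

        // "e" followed by U+0301 (a non-spacing mark), then a CRLF pair:
        // two logical characters, although the string holds four chars
        System.out.println(segments(chars, "e\u0301\r\n").size());

        // each sentence keeps its trailing whitespace
        for (String s : segments(sentences, "It works. Does it? Yes.")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

The helper walks boundaries with first() and next() rather than indexing characters directly, so the same loop works unchanged for word and line iterators obtained from getWordInstance and getLineInstance.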