Source Code Cross Referenced for BreakIteratorRules.java (6.0 JDK Modules sun, package sun.text.resources)

/*
 * Portions Copyright 1999-2007 Sun Microsystems, Inc.  All Rights Reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.  Sun designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Sun in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Sun Microsystems, Inc., 4150 Network Circle, Santa Clara,
 * CA 95054 USA or visit www.sun.com if you need additional information or
 * have any questions.
 */

/*
 * @(#)BreakIteratorRules.java	1.34 07/05/05
 */

/*
 * Licensed Materials - Property of IBM
 *
 * (C) Copyright IBM Corp. 1999 All Rights Reserved.
 * (C) IBM Corp. 1997-1998.  All Rights Reserved.
 *
 * The program is provided "as is" without any warranty express or
 * implied, including the warranty of non-infringement and the implied
 * warranties of merchantibility and fitness for a particular purpose.
 * IBM will not be liable for any damages suffered by you as a result
 * of using the Program. In no event will IBM be liable for any
 * special, indirect or consequential damages or lost profits even if
 * IBM has been advised of the possibility of their occurrence. IBM
 * will not be liable for any third party claims against you.
 */

package sun.text.resources;

import java.util.ListResourceBundle;

/**
 * Default break-iterator rules.  These rules are more or less general for
 * all locales, although there are probably a few we're missing.  The
 * behavior currently mimics the behavior of BreakIterator in JDK 1.2.
 * There are known deficiencies in this behavior, including the fact that
 * the logic for handling CJK characters works for Japanese but not for
 * Chinese, and that we don't currently have an appropriate locale for
 * Thai.  The resources will eventually be updated to fix these problems.
 */

/* Modified for Hindi 3/1/99. */

/*
 * Since JDK 1.5.0, this file is no longer shipped with the runtime.  It is
 * used during the J2SE build phase to create the
 * [Character|Word|Line|Sentence]BreakIteratorData files, which are used at
 * runtime instead.
 */

public class BreakIteratorRules extends ListResourceBundle {
    protected final Object[][] getContents() {
        return new Object[][] {
                // rules describing how to break between logical characters
                { "CharacterBreakRules",

                        // ignore non-spacing marks and enclosing marks (since we never
                        // put a break before ignore characters, this keeps combining
                        // accents with the base characters they modify)
                        "<enclosing>=[:Mn::Me:];"

                                // other category definitions
                                + "<choseong>=[\u1100-\u115f];"
                                + "<jungseong>=[\u1160-\u11a7];"
                                + "<jongseong>=[\u11a8-\u11ff];"
                                + "<surr-hi>=[\ud800-\udbff];"
                                + "<surr-lo>=[\udc00-\udfff];"

                                // break after every character, except as follows:
                                + ".;"

                                // keep base and combining characters together
                                + "<base>=[^<enclosing>^[:Cc::Cf::Zl::Zp:]];"
                                + "<base><enclosing><enclosing>*;"

                                // keep CRLF sequences together
                                + "\r\n;"

                                // keep surrogate pairs together
                                + "<surr-hi><surr-lo>;"

                                // keep Hangul syllables spelled out using conjoining jamo together
                                + "<choseong>*<jungseong>*<jongseong>*;"

                                // various additions for Hindi support
                                + "<nukta>=[\u093c];"
                                + "<danda>=[\u0964\u0965];"
                                + "<virama>=[\u094d];"
                                + "<devVowelSign>=[\u093e-\u094c\u0962\u0963];"
                                + "<devConsonant>=[\u0915-\u0939];"
                                + "<devNuktaConsonant>=[\u0958-\u095f];"
                                + "<devCharEnd>=[\u0902\u0903\u0951-\u0954];"
                                + "<devCAMN>=(<devConsonant>{<nukta>});"
                                + "<devConsonant1>=(<devNuktaConsonant>|<devCAMN>);"
                                + "<zwj>=[\u200d];"
                                + "<devConjunct>=({<devConsonant1><virama>{<zwj>}}<devConsonant1>);"
                                + "<devConjunct>{<devVowelSign>}{<devCharEnd>};"
                                + "<danda><nukta>;" },

                // default rules for finding word boundaries
                { "WordBreakRules",
                        // ignore non-spacing marks, enclosing marks, and format characters,
                        // all of which should not influence the algorithm
                        //"<ignore>=[:Mn::Me::Cf:];"
                        "<ignore>=[:Cf:];"

                                + "<enclosing>=[:Mn::Me:];"

                                // Hindi phrase separator, kanji, katakana, hiragana, CJK diacriticals,
                                // other letters, and digits
                                + "<danda>=[\u0964\u0965];"
                                + "<kanji>=[\u3005\u4e00-\u9fa5\uf900-\ufa2d];"
                                + "<kata>=[\u30a1-\u30fa\u30fd\u30fe];"
                                + "<hira>=[\u3041-\u3094\u309d\u309e];"
                                + "<cjk-diacrit>=[\u3099-\u309c\u30fb\u30fc];"
                                + "<letter-base>=[:L::Mc:^[<kanji><kata><hira><cjk-diacrit>]];"
                                + "<let>=(<letter-base><enclosing>*);"
                                + "<digit-base>=[:N:];"
                                + "<dgt>=(<digit-base><enclosing>*);"

                                // punctuation that can occur in the middle of a word: currently
                                // dashes, apostrophes, quotation marks, and periods
                                + "<mid-word>=[:Pd::Pc:\u00ad\u2027\\\"\\\'\\.];"

                                // punctuation that can occur in the middle of a number: currently
                                // apostrophes, quotation marks, periods, commas, and the Arabic
                                // decimal point
                                + "<mid-num>=[\\\"\\\'\\,\u066b\\.];"

                                // punctuation that can occur at the beginning of a number: currently
                                // the period, the number sign, and all currency symbols except the cents sign
                                + "<pre-num>=[:Sc:\\#\\.^\u00a2];"

                                // punctuation that can occur at the end of a number: currently
                                // the percent, per-thousand, per-ten-thousand, and Arabic percent
                                // signs, the cents sign, and the ampersand
                                + "<post-num>=[\\%\\&\u00a2\u066a\u2030\u2031];"

                                // line separators: currently LF, FF, PS, and LS
                                + "<ls>=[\n\u000c\u2028\u2029];"

                                // whitespace: all space separators and the tab character
                                + "<ws-base>=[:Zs:\t];"
                                + "<ws>=(<ws-base><enclosing>*);"

                                // a word is a sequence of letters that may contain internal
                                // punctuation, as long as it begins and ends with a letter and
                                // never contains two punctuation marks in a row
                                + "<word>=((<let><let>*(<mid-word><let><let>*)*){<danda>});"

                                // a number is a sequence of digits that may contain internal
                                // punctuation, as long as it begins and ends with a digit and
                                // never contains two punctuation marks in a row.
                                + "<number>=(<dgt><dgt>*(<mid-num><dgt><dgt>*)*);"

                                // break after every character, with the following exceptions
                                // (this will cause punctuation marks that aren't considered
                                // part of words or numbers to be treated as words unto themselves)
                                + ".;"

                                // keep together any sequence of contiguous words and numbers
                                // (including just one of either), plus an optional trailing
                                // number-suffix character
                                + "{<word>}(<number><word>)*{<number>{<post-num>}};"

                                // keep together any sequence of contiguous words and numbers
                                // that starts with a number-prefix character and a number,
                                // and may end with a number-suffix character
                                + "<pre-num>(<number><word>)*{<number>{<post-num>}};"

                                // keep together runs of whitespace (optionally with a single trailing
                                // line separator or CRLF sequence)
                                + "<ws>*{\r}{<ls>};"

                                // keep together runs of Katakana and CJK diacritical marks
                                + "[<kata><cjk-diacrit>]*;"

                                // keep together runs of Hiragana and CJK diacritical marks
                                + "[<hira><cjk-diacrit>]*;"

                                // keep together runs of Kanji
                                + "<kanji>*;"

                                // keep together anything else and an enclosing mark
                                + "<base>=[^<enclosing>^[:Cc::Cf::Zl::Zp:]];"
                                + "<base><enclosing><enclosing>*;" },

                // default rules for determining legal line-breaking positions
                {
                        "LineBreakRules",
                        // characters that always cause a break: ETX, tab, LF, FF, LS, and PS
                        "<break>=[\u0003\t\n\f\u2028\u2029];"

                                // ignore format characters and control characters EXCEPT for breaking chars
                                + "<ignore>=[:Cf:[:Cc:^[<break>\r]]];"

                                // enclosing marks
                                + "<enclosing>=[:Mn::Me:];"

                                // Hindi phrase separators
                                + "<danda>=[\u0964\u0965];"

                                // characters that always prevent a break: the non-breaking space
                                // and similar characters
                                + "<glue>=[\u00a0\u0f0c\u2007\u2011\u202f\ufeff];"

                                // whitespace: space separators and control characters, except for
                                // CR and the other characters mentioned above
                                + "<space>=[:Zs::Cc:^[<glue><break>\r]];"

                                // dashes: dash punctuation and the discretionary hyphen, except for
                                // non-breaking hyphens
                                + "<dash>=[:Pd:\u00ad^<glue>];"

                                // characters that stick to a word if they precede it: currency symbols
                                // (except the cents sign) and starting punctuation
                                + "<pre-word>=[:Sc::Ps::Pi:^[\u00a2]\\\"\\\'];"

                                // characters that stick to a word if they follow it: ending punctuation,
                                // other punctuation that usually occurs at the end of a sentence,
                                // small Kana characters, some CJK diacritics, etc.
                                + "<post-word>=[\\\":Pe::Pf:\\!\\%\\.\\,\\:\\;\\?\u00a2\u00b0\u066a\u2030-\u2034\u2103"
                                + "\u2105\u2109\u3001\u3002\u3005\u3041\u3043\u3045\u3047\u3049\u3063"
                                + "\u3083\u3085\u3087\u308e\u3099-\u309e\u30a1\u30a3\u30a5\u30a7\u30a9"
                                + "\u30c3\u30e3\u30e5\u30e7\u30ee\u30f5\u30f6\u30fc-\u30fe\uff01\uff05"
                                + "\uff0c\uff0e\uff1a\uff1b\uff1f];"

                                // Kanji: actually includes Kanji, Kana and Hangul syllables,
                                // except for small Kana and CJK diacritics
                                + "<kanji>=[\u4e00-\u9fa5\uac00-\ud7a3\uf900-\ufa2d\ufa30-\ufa6a\u3041-\u3094\u30a1-\u30fa^[<post-word><ignore>]];"

                                // digits
                                + "<digit>=[:Nd::No:];"

                                // punctuation that can occur in the middle of a number: periods and commas
                                + "<mid-num>=[\\.\\,];"

                                // everything not mentioned above
                                + "<char>=[^[<break><space><dash><kanji><glue><ignore><pre-word><post-word><mid-num>\r<danda>]];"

                                // a "number" is a run of prefix characters and dashes, followed by one or
                                // more digits with isolated number-punctuation characters interspersed
                                + "<number>=([<pre-word><dash>]*<digit><digit>*(<mid-num><digit><digit>*)*);"

                                // the basic core of a word can be either a "number" as defined above, a single
                                // "Kanji" character, or a run of any number of not-explicitly-mentioned
                                // characters (this includes Latin letters)
                                + "<word-core>=(<char>*|<kanji>|<number>);"

                                // a word may end with an optional suffix that can be either a run of one or
                                // more dashes or a run of word-suffix characters
                                + "<word-suffix>=((<dash><dash>*|<post-word>*));"

                                // a word, thus, is an optional run of word-prefix characters, followed by
                                // a word core and a word suffix (the syntax of <word-core> and <word-suffix>
                                // actually allows either of them to match the empty string, putting a break
                                // between things like ")(" or "aaa(aaa")
                                + "<word>=(<pre-word>*<word-core><word-suffix>);"

                                + "<hack1>=[\\(];"
                                + "<hack2>=[\\)];"
                                + "<hack3>=[\\$\\'];"

                                // finally, the rule that does the work: Keep together any run of words that
                                // are joined by runs of one or more non-spacing marks.  Also keep a trailing
                                // line-break character or CRLF combination with the word.  (line separators
                                // "win" over nbsp's)
                                + "<word>(((<space>*<glue><glue>*{<space>})|<hack3>)<word>)*<space>*{<enclosing>*}{<hack1><hack2><post-word>*}{<enclosing>*}{\r}{<break>};"
                                + "\r<break>;" },

                // default rules for finding sentence boundaries
                {
                        "SentenceBreakRules",
                        // ignore non-spacing marks, enclosing marks, and format characters
                        "<ignore>=[:Mn::Me::Cf:];"

                                // letters
                                + "<letter>=[:L:];"

                                // lowercase letters
                                + "<lc>=[:Ll:];"

                                // uppercase letters
                                + "<uc>=[:Lu:];"

                                // NOT lowercase letters
                                + "<notlc>=[<letter>^<lc>];"

                                // whitespace (line separators are treated as whitespace)
                                + "<space>=[\t\r\f\n\u2028:Zs:];"

                                // punctuation which may occur at the beginning of a sentence: "starting
                                // punctuation" and quotation marks
                                + "<start-punctuation>=[:Ps::Pi:\\\"\\\'];"

                                // punctuation which may occur at the end of a sentence: "ending punctuation"
                                // and quotation marks
                                + "<end>=[:Pe::Pf:\\\"\\\'];"

                                // digits
                                + "<digit>=[:N:];"

                                // characters that unambiguously signal the end of a sentence
                                + "<term>=[\\!\\?\u3002\uff01\uff1f];"

                                // periods, which MAY signal the end of a sentence
                                + "<period>=[\\.\uff0e];"

                                // characters that may occur at the beginning of a sentence: basically anything
                                // not mentioned above (letters and digits are specifically excluded)
                                + "<sent-start>=[^[:L:<space><start-punctuation><end><digit><term><period>\u2029<ignore>]];"

                                // Hindi phrase separator
                                + "<danda>=[\u0964\u0965];"

                                // always break sentences after paragraph separators
                                + ".*?{\u2029};"

                                // always break after a danda, if it's followed by whitespace
                                + ".*?<danda><space>*;"

                                // if you see a period, skip over additional periods and ending punctuation
                                // and if the next character is a paragraph separator, break after the
                                // paragraph separator
                                //+ ".*?<period>[<period><end>]*<space>*\u2029;"
                                //+ ".*?[<period><end>]*<space>*\u2029;"

                                // if you see a period, skip over additional periods and ending punctuation,
                                // followed by optional whitespace, followed by optional starting punctuation,
                                // and if the next character is something that can start a sentence
                                // (basically, a capital letter), then put the sentence break between the
                                // whitespace and the opening punctuation
                                + ".*?<period>[<period><end>]*<space><space>*/<notlc>;"
                                + ".*?<period>[<period><end>]*<space>*/[<start-punctuation><sent-start>][<start-punctuation><sent-start>]*<letter>;"

                                // if you see a sentence-terminating character, skip over any additional
                                // terminators, periods, or ending punctuation, followed by any whitespace,
                                // followed by a SINGLE optional paragraph separator, and put the break there
                                + ".*?<term>[<term><period><end>]*<space>*{\u2029};"

                                // The following rules are here to aid in backwards iteration.  The automatically
                                // generated backwards state table will rewind to the beginning of the
                                // paragraph all the time (or all the way to the beginning of the document
                                // if the document doesn't use the Unicode PS character) because the only
                                // unambiguous character pairs are those involving paragraph separators.
                                // These specify a few more unambiguous breaking situations.

                                // if you see a sentence-starting character, followed by starting punctuation
                                // (remember, we're iterating backwards), followed by an optional run of
                                // whitespace, followed by an optional run of ending punctuation, followed
                                // by a period, this is a safe place to turn around
                                + "!<sent-start><start-punctuation>*<space>*<end>*<period>;"

                                // if you see a letter or a digit, followed by an optional run of
                                // starting punctuation, followed by an optional run of whitespace,
                                // followed by an optional run of ending punctuation, followed by
                                // a sentence terminator, this is a safe place to turn around
                                + "![<sent-start><lc><digit>]<start-punctuation>*<space>*<end>*<term>;" } };
    }
}
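
The rule sets above are compiled at build time into the data that java.text.BreakIterator uses. As a rough illustration of the behavior the comments describe (combining marks and CRLF kept with their neighbors, apostrophes allowed inside words, sentence breaks placed after trailing whitespace), the sketch below uses only the public BreakIterator API. The class name BreakIteratorDemo and the helper segments() are hypothetical names introduced for this example, and it assumes a JDK whose default break-iterator data is generated from these rules; other providers may place boundaries differently.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of the JDK or of BreakIteratorRules.java.
class BreakIteratorDemo {

    // Returns the pieces of text delimited by consecutive boundaries
    // reported by the given iterator.
    static List<String> segments(BreakIterator it, String text) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        Locale loc = Locale.ROOT; // assumption: root locale, so no locale-specific overrides

        // CharacterBreakRules: an accent stays with its base letter, CRLF stays together.
        System.out.println(segments(BreakIterator.getCharacterInstance(loc), "e\u0301x\r\ny"));

        // WordBreakRules: an apostrophe between letters does not end the word.
        System.out.println(segments(BreakIterator.getWordInstance(loc), "can't stop"));

        // SentenceBreakRules: a period, whitespace, then a non-lowercase letter.
        System.out.println(segments(BreakIterator.getSentenceInstance(loc), "Hello world. How are you?"));
    }
}
```

The helper walks boundaries with first() and next() until BreakIterator.DONE, which is the usual forward-iteration idiom; getLineInstance(loc) can be passed to the same helper to inspect the LineBreakRules behavior.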