comparison cWnn/manual.en/chap3 @ 0:bbc77ca4def5

initial import
author Yoshiki Yazawa <yaz@cc.rim.or.jp>
date Thu, 13 Dec 2007 04:30:14 +0900

        *************************************************
        *          Chapter 3 PINYIN INPUT AND           *
        *            PINYIN HANZI CONVERSION            *
        *************************************************


3.1 CONCEPT OF PINYIN INPUT
===========================

Chapter 2 gave a brief introduction to Pinyin input. This chapter explains it
in greater detail.

The Pinyin input method refers to the input of Chinese characters via Pinyin and
Pinyin-Hanzi conversion. A good Pinyin input method should provide users with a
convenient Pinyin input environment as well as a highly accurate conversion
mechanism.

Pinyin-Hanzi conversion falls into the following three categories:
     (a) Conversion based on character
     (b) Conversion based on word
     (c) Conversion based on phrase or any arbitrary Pinyin string

(a) Conversion based on character
    The result of this conversion is a single Chinese character (Hanzi) that has
    the same pronunciation as the input Pinyin.
    Note that several Hanzi share the same pronunciation, so one Pinyin
    corresponds to many Hanzi. To obtain the correct Hanzi, the user has to
    select it from among all the candidates, which makes this a rather
    inconvenient way of conversion.

(b) Conversion based on word
    In this conversion, the result is a word, which may consist of two or more
    characters. Hence, the number of candidates is much reduced. However, the
    need to select among candidates still exists. Note also that in such a
    system only registered words can be found, and users need to be familiar
    with the concept of words.

(c) Conversion based on phrase or any arbitrary Pinyin string
    For this conversion, the user can input a Pinyin string of arbitrary length
    and can perform the conversion at any position in the input string. The
    system analyses the input string, performs the necessary grammatical
    analysis and word segmentation, and subsequently produces a more accurate
    conversion output.

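The difference between categories (a) and (b) can be illustrated with a small
sketch. The two lookup tables and both function names are hypothetical and
exist only for this example; the entries are not taken from a cWnn dictionary.

```python
# Hypothetical toy tables; entries are illustrative, not from a cWnn dictionary.
CHAR_TABLE = {
    "shi": ["是", "时", "事", "十", "市", "世"],
    "jie": ["节", "街", "姐", "借", "界"],
}
WORD_TABLE = {
    ("shi", "jie"): ["世界"],
}


def char_candidates(syllable):
    """Category (a): every Hanzi registered for one Pinyin syllable."""
    return list(CHAR_TABLE.get(syllable, []))


def word_candidates(syllables):
    """Category (b): every registered word for a sequence of syllables."""
    return list(WORD_TABLE.get(tuple(syllables), []))


# Character-based conversion: the user must choose from each candidate list.
for syllable in ("shi", "jie"):
    print(syllable, char_candidates(syllable))

# Word-based conversion: far fewer candidates, but only registered words.
print(word_candidates(["shi", "jie"]))
print(word_candidates(["jie", "shi"]))  # not registered in this toy table
```

With these toy tables the character lookups return several candidates each,
while the word lookup returns a single entry, and an unregistered syllable
sequence returns nothing.
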
The diagram below shows the conversion process for the entire system.

<Table-c-3.1>

The following sections explain the Pinyin input environment, Pinyin-Hanzi
conversion and the related environment operations.

3.2 PINYIN INPUT ENVIRONMENT
============================

Pinyin Input and its Internal/External Representations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pinyin can be input via three methods: Quanpin, Erpin and Sanpin (as described
in Chapter 2).
These input methods are not built into the system itself; they are defined in
the system's external environment. This external environment definition is known
as the Input Automaton (refer to Chapter 6). It provides different input
environments for different users (clients), according to their needs.

Through the input automaton, the user input is converted into the standard
Pinyin defined by the system. For example:

<Table-c-3.2>

The system does not require the user to segment the Pinyin input string. The
user only needs to input the correct Pinyin, and the system segments the input
automatically. For example, the input "hanyuyuyin" is segmented into
"han yu yu yin".

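A minimal sketch of such automatic segmentation is given below. It is not the
algorithm cWnn actually uses, which this chapter does not describe; it simply
tries the longest syllable first and backtracks on failure. The syllable set
and the function name `segment` are assumptions made for this example.

```python
# Small, hypothetical subset of valid Pinyin syllables. A real system would
# use the full syllable inventory defined by the input automaton.
SYLLABLES = {"han", "ha", "yu", "yin", "yi", "a", "an", "xi", "xian"}


def segment(text, syllables=SYLLABLES):
    """Split an unsegmented Pinyin string into syllables.

    Tries the longest possible syllable first and backtracks when the rest
    of the string cannot be segmented. Returns None if no split exists.
    """
    if not text:
        return []
    max_len = max(len(s) for s in syllables)
    for size in range(min(max_len, len(text)), 0, -1):
        head = text[:size]
        if head in syllables:
            rest = segment(text[size:], syllables)
            if rest is not None:
                return [head] + rest
    return None


print(" ".join(segment("hanyuyuyin")))  # han yu yu yin
```

Note that some strings are ambiguous: with this syllable set, "xian" is kept
as one syllable because the longest match wins, although "xi an" would also be
a valid split. This is one reason manual segmentation is still needed.
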
The Pinyin input interface is an editor in its own right. Besides input,
facilities such as cursor movement, insertion and deletion on the input string
are also provided. To the user, one Pinyin behaves like an individual
character. For example, "han" is not treated as the three characters "h", "a",
"n", but as the single unit "han".

At the user interface, the Pinyin input is displayed as it is. Within the
system, however, each Pinyin is represented by an internal code defined by the
system. Hence, during the conversion process, these internal representations are
used instead of the external representations of the Pinyin.

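The relation between the external and internal representations can be sketched
as follows. The code values and helper names are hypothetical; the actual
internal codes used by cWnn are not given in this chapter.

```python
# Hypothetical internal code table: one integer per Pinyin syllable.
SYLLABLES = ["han", "yu", "yin"]
ENCODE = {s: i + 1 for i, s in enumerate(SYLLABLES)}
DECODE = {code: s for s, code in ENCODE.items()}


def to_internal(units):
    """Convert external Pinyin units into internal codes."""
    return [ENCODE[s] for s in units]


def to_external(codes):
    """Convert internal codes back into a displayable Pinyin string."""
    return " ".join(DECODE[c] for c in codes)


line = to_internal(["han", "yu", "yu", "yin"])
print(line)               # [1, 2, 2, 3]

# Because each Pinyin is one unit, deleting at the cursor removes a whole
# syllable rather than a single letter.
del line[1]
print(to_external(line))  # han yu yin
```

Storing one code per syllable is what lets the editor move the cursor and
delete by whole Pinyin units.
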
3.3 PINYIN HANZI CONVERSION
===========================

The cWnn system has two types of conversion:
     (1) Forward conversion
     (2) Reverse conversion
Forward conversion refers to Pinyin-Hanzi conversion, whereas reverse conversion
refers to Hanzi-Pinyin conversion, i.e., the input is Hanzi and the conversion
result is the corresponding Pinyin. Only Pinyin-Hanzi conversion is explained
here.

Note that Pinyin-Hanzi conversion does not always produce the correct result.
Hence, besides providing a multi-phrase conversion mechanism, cWnn also provides
facilities for re-editing, re-conversion, and manual word and phrase
segmentation.


1. Conversion Command
~~~~~~~~~~~~~~~~~~~~~
There are five conversion methods for Pinyin-Hanzi conversion. The first three
methods listed below are the most commonly used. The last two methods are meant
for system developers to check the grammatical analysis.

(a) Multi-phrase conversion
    Once a Pinyin string is sent for conversion, the system performs the
    conversion based on the current environment (refer to Chapter 5) and its
    conversion parameters. After conversion, the result appears on the input
    line, with the cursor positioned at the first word of the sentence. If a
    re-conversion is required (done by pressing the confirm key again), the
    conversion method in (c) is performed.

(b) Word conversion
    Treats the portion of the input string indicated by the cursor as a word
    and performs word conversion. The candidate word with the highest
    assessment value is output as the result.

(c) Word candidates extraction
    Treats the portion of the input string indicated by the cursor as a word
    and performs word conversion. The possible word candidates under the
    current environment are output.

(d) Phrase conversion
    Treats the portion of the input string indicated by the cursor as a phrase
    and performs phrase conversion. The candidate phrase with the highest
    assessment value is output as the result.

(e) Phrase candidates extraction
    Treats the portion of the input string indicated by the cursor as a phrase
    and performs phrase conversion. The possible phrase candidates under the
    current environment are output.

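The relation between the "conversion" commands (b) and (d) and the "candidates
extraction" commands (c) and (e) can be sketched as below. The candidate table,
the assessment values and both function names are hypothetical. Returning the
candidates sorted by assessment value is also an assumption; the manual only
states that the possible candidates are output.

```python
# Hypothetical candidates with precomputed assessment values.
CANDIDATES = {
    ("shi", "jian"): [("事件", 40), ("时间", 75), ("实践", 55)],
}


def extract_candidates(syllables):
    """Commands (c)/(e): all candidates, highest assessment value first."""
    found = CANDIDATES.get(tuple(syllables), [])
    return sorted(found, key=lambda item: item[1], reverse=True)


def convert(syllables):
    """Commands (b)/(d): only the candidate with the highest value."""
    ranked = extract_candidates(syllables)
    return ranked[0][0] if ranked else None


print(convert(["shi", "jian"]))
print([word for word, _ in extract_candidates(["shi", "jian"])])
```

In this sketch conversion is simply the first element of the extracted list,
which is why a re-conversion can fall back to candidate extraction.
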

2. Manual Word Segmentation
~~~~~~~~~~~~~~~~~~~~~~~~~~~
One difficulty in Pinyin-Hanzi conversion is automatic word segmentation.
When the conversion result is incorrect, the user needs to segment the words
manually by using the segmentation keys (^O or ^I). The word indicated by the
cursor will be segmented. To complete the manual segmentation process, press the
conversion key again.


3. Assessment Formula for Multi-Phrase Conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Multi-phrase conversion plays a major role in Pinyin-Hanzi conversion. Its
accuracy has a direct effect on the effectiveness of the system.
Several factors affect the result of Pinyin-Hanzi conversion, and their
influence varies with the conditions. The assessment parameters and formulas for
multi-phrase conversion are given below. Users can change the corresponding
conversion parameters to obtain the most suitable conversion environment.

(a) Assessment parameters
    parameter (0)  Number of phrases "n"
                   During the assessment process, this is the maximum number
                   of phrases that can be assessed at one time.

    parameter (1)  Number of words "m"
                   During the assessment process, this is the maximum number
                   of words that can be in a phrase.

(b) Word assessment parameters
    parameter (2)  Usage frequency weight
                   Usage frequency is recorded for each word in the
                   dictionary. When a user uses a dictionary, the system
                   creates and manages a usage frequency file for that user.
                   As the user works with the system, the usage frequency of
                   each word is updated according to how often the user uses
                   it. Hence, each user has an individual usage frequency
                   file.

    parameter (3)  Word length weight
                   Word length refers to the number of characters in a word.

    parameter (4)  Tone correctness weight
                   This measures how accurately the four tones in the Pinyin
                   input by the user match those in the dictionary. cWnn
                   allows input with or without the four tones.

    parameter (5)  Last used weight
                   Last used refers to the most recently used word for a
                   Pinyin. By increasing the weight of this parameter, the
                   assessment value of the most recently used word can be
                   raised dynamically.

    parameter (6)  Dictionary priority weight
                   Each dictionary has a priority defined by the environment.
                   By changing this value, assessment values may be biased
                   towards certain dictionaries.

(c) Phrase assessment parameters
    parameter (7)  Average word assessment value weight
                   A phrase consists of several words, and each word has its
                   own word assessment value as described above. The average
                   of these values is the average word assessment value.

    parameter (8)  Phrase length weight
                   Phrase length refers to the number of characters in a
                   phrase.

    parameter (9)  Number of words weight
                   This refers to the number of words in a phrase. A larger
                   number of words in a phrase indicates greater grammatical
                   certainty among the words, and hence higher reliability.

(d) Other parameters
    Characters other than Pinyin that appear on the input line have their own
    individual usage frequency values.

(e) Assessment formula for multi-phrase conversion
    Assessment value for a word:
        f = (c1 x frequency) + (c2 x word length) + (c3 x tone correctness)
            + (c4 x last used) + (c5 x dictionary priority)

    Assessment value for a phrase:
        F = (k1 x avg( f1, f2, ..., fm )) + (k2 x phrase length)
            + (k3 x number of words in phrase)

    Total assessment value for the candidates of a phrase:
        Vi = avg( Fi1, Fi2, ..., Fin )

    Best assessment value for a phrase:
        MAX( V1, V2, ..., Vk )

    Note:  * c1 = parameter (2)
             c2 = parameter (3)
             c3 = parameter (4)
             c4 = parameter (5)
             c5 = parameter (6)
           * k1 = parameter (7)
             k2 = parameter (8)
             k3 = parameter (9)

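The assessment formulas can be written out as a short sketch. This is an
illustration under stated assumptions, not the cWnn implementation: the `Word`
class, the function names and all numeric values are hypothetical, phrase
length is assumed to be the sum of the word lengths, and avg is taken to be the
arithmetic mean.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Word:
    # Hypothetical per-word factors that enter the word assessment value f.
    frequency: float
    length: int
    tone_correctness: float
    last_used: float
    dictionary_priority: float


def word_value(word, c):
    """f: weighted sum of the word factors, c = (c1, ..., c5)."""
    c1, c2, c3, c4, c5 = c
    return (c1 * word.frequency + c2 * word.length
            + c3 * word.tone_correctness + c4 * word.last_used
            + c5 * word.dictionary_priority)


def phrase_value(words, c, k):
    """F: weighted average word value, phrase length and word count."""
    k1, k2, k3 = k
    phrase_length = sum(w.length for w in words)  # assumption: sum of lengths
    return (k1 * mean(word_value(w, c) for w in words)
            + k2 * phrase_length + k3 * len(words))


def candidate_value(phrases, c, k):
    """Vi: average of the phrase values of one candidate."""
    return mean(phrase_value(p, c, k) for p in phrases)


def best_candidate(candidates, c, k):
    """Candidate whose total assessment value is MAX(V1, ..., Vk)."""
    return max(candidates, key=lambda phrases: candidate_value(phrases, c, k))


C = (1, 1, 1, 1, 1)  # hypothetical weights for parameters (2) to (6)
K = (1, 1, 1)        # hypothetical weights for parameters (7) to (9)
w1, w2, w3 = Word(5, 2, 1, 0, 1), Word(3, 1, 1, 1, 1), Word(2, 3, 1, 0, 1)
candidate_a = [[w1, w2]]  # one phrase made of two words
candidate_b = [[w3]]      # one phrase made of a single three-character word
print(candidate_value(candidate_a, C, K), candidate_value(candidate_b, C, K))
```

With all weights set to 1, the two-word candidate scores higher than the
single-word one; raising c2 or lowering k3 would shift the balance, which is
the effect of tuning the conversion parameters.
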
3.4 ENVIRONMENT OPERATING FUNCTIONS
===================================

The cserver manages several resources such as dictionaries and grammar files.
In addition, it creates an environment for every user (client). One user may
have more than one environment. For each input mode, an environment defines its
dictionary files, the corresponding usage frequency files and the grammar files.
When a user starts up uum (the client), cserver creates an environment and
sets up the dictionaries for the user. After that, the user can obtain the usage
status of the dictionaries from the system.

1. Environment Operation
~~~~~~~~~~~~~~~~~~~~~~~~
<Table-c-3.3>

2. Parameter Update
~~~~~~~~~~~~~~~~~~~

<Table-c-3.4>

These are the assessment parameters for multi-phrase conversion described in
Section 3.3. The number in square brackets indicates the current parameter
value. To change a value, move the cursor to the parameter, press Return, and
then enter the new parameter value.

Input on the input line is not restricted to Pinyin. Other characters are also
allowed, for example numbers, ASCII characters, punctuation marks and brackets.
These characters undergo conversion together with the Pinyin input. Just like
Pinyin, these characters have parameters that can be defined externally. The
parameters are classified into the following categories.

(a) Usage frequency for numbers
    This covers the digits 0,1,2,3,4,5,6,7,8,9. In addition, the system can
    change numbers into other formats. For example, "1234567" can be changed to
    1,234,567,
    <Table-c-3.5>

(b) Usage frequency for ASCII characters
(c) Usage frequency for punctuation marks
(d) Usage frequency for open brackets
(e) Usage frequency for close brackets

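The comma-grouped number format mentioned under (a) can be reproduced with a
one-line sketch. The function name `group_digits` is hypothetical, and this
shows only the single format named in the text, not the others listed in the
table.

```python
def group_digits(digits):
    """Insert a comma between every group of three digits."""
    # int() drops leading zeros, so this suits plain decimal numbers only.
    return f"{int(digits):,}"


print(group_digits("1234567"))  # 1,234,567
```
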