*************************************************
*      Chapter 3   PINYIN INPUT AND             *
*                  PINYIN HANZI CONVERSION      *
*************************************************

3.1 CONCEPT ON PINYIN INPUT
===========================

Chapter 2 gave a brief introduction to Pinyin input. This chapter explains it
in greater detail.

The Pinyin input method refers to entering Chinese characters by typing Pinyin
and converting it to Hanzi (Pinyin-Hanzi conversion). A good Pinyin input method
should provide users with a convenient Pinyin input environment as well as a
highly accurate conversion mechanism.

Pinyin-Hanzi conversion falls into the following three categories:
    (a) Conversion based on character
    (b) Conversion based on word
    (c) Conversion based on phrase or any arbitrary Pinyin string

(a) Conversion based on character
    The result of this conversion is a single Chinese character (Hanzi) that has
    the same pronunciation as the input Pinyin.
    Note that many Hanzi share the same pronunciation, so one Pinyin corresponds
    to many Hanzi. To obtain the correct Hanzi, the user has to select it from
    among all the candidates. This is a rather inconvenient way of conversion.

(b) Conversion based on word
    In this conversion, the result is a word, which may consist of two or more
    characters. Hence, the number of candidates is much reduced. However, the
    need to select among candidates still exists. Also note that in such a
    system only registered words can be found, and users need to be familiar
    with the concept of words.

(c) Conversion based on phrase or any arbitrary Pinyin string
    For this conversion, the user can input a Pinyin string of arbitrary length
    and can perform the conversion at any position in the input string. The
    system analyses the input string, performs the necessary grammatical
    analysis and word segmentation, and subsequently produces a more accurate
    conversion output.
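
The difference between character-based and word-based conversion can be
illustrated with a toy lookup. The sketch below is only an illustration: the
names CHAR_DICT, WORD_DICT, char_candidates and word_candidates, and the few
entries shown, are hypothetical stand-ins and not cWnn's actual dictionaries
or functions.

```python
# Toy dictionaries (hypothetical and heavily abridged), keyed by toneless Pinyin.
CHAR_DICT = {
    "shi": ["是", "时", "事", "世", "市"],
    "jie": ["界", "街", "节", "接", "姐"],
}
WORD_DICT = {
    ("shi", "jie"): ["世界"],
}


def char_candidates(syllable):
    """(a) Character-based: every Hanzi pronounced as the given syllable."""
    return CHAR_DICT.get(syllable, [])


def word_candidates(syllables):
    """(b) Word-based: only words registered in the dictionary are found."""
    return WORD_DICT.get(tuple(syllables), [])


# One syllable alone leaves several characters to choose from ...
print(char_candidates("shi"))
# ... while the two-syllable word narrows the choice considerably.
print(word_candidates(["shi", "jie"]))
# An unregistered syllable sequence yields no word at all.
print(word_candidates(["jie", "shi"]))
```

Category (c) goes further by analysing a whole string, which a lookup table
alone cannot show.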

The diagram below shows the conversion process for the entire system.

<Table-c-3.1>

The following sections explain the Pinyin input environment, Pinyin-Hanzi
conversion and the related environment operations.


3.2 PINYIN INPUT ENVIRONMENT
============================

Pinyin Input and its Internal/External Representations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pinyin can be input via three methods: Quanpin, Erpin and Sanpin (as described
in Chapter 2).
These input methods are not built into the system internally; they are provided
through the definition of the system's external environment. This external
environment definition is known as the Input Automaton (refer to Chapter 6). It
provides different input environments for different users (clients), according
to their needs.

Through the input automaton, the user input is converted into the standard
Pinyin defined by the system. For example:

<Table-c-3.2>

The system does not require the user to segment the Pinyin input string. The
user only needs to input the correct Pinyin, and the system will segment the
input. For example, the input "hanyuyuyin" is automatically segmented into
"han yu yu yin".
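
One way such automatic segmentation could work is sketched below. This is an
assumption for illustration only and is not taken from cWnn itself: the
function segment and the tiny syllable table SYLLABLES are hypothetical. The
sketch tries the longest syllable first and backtracks whenever the remainder
of the string cannot be segmented.

```python
# A tiny, hypothetical subset of valid Pinyin syllables. A real table would
# list every syllable the system accepts.
SYLLABLES = {"han", "ha", "yu", "yi", "yin", "xi", "an", "xian"}
MAX_LEN = max(len(s) for s in SYLLABLES)


def segment(text):
    """Split an unsegmented Pinyin string into a list of syllables.

    The longest matching syllable is tried first; if the rest of the string
    cannot be segmented, shorter matches are tried. Returns None when no
    segmentation exists.
    """
    if not text:
        return []
    for size in range(min(MAX_LEN, len(text)), 0, -1):
        head = text[:size]
        if head in SYLLABLES:
            rest = segment(text[size:])
            if rest is not None:
                return [head] + rest
    return None


print(" ".join(segment("hanyuyuyin")))  # han yu yu yin
```

Plain backtracking is enough for short input lines; for long strings the
recursion could be memoised. Ambiguous input such as "xian" ("xian" or
"xi an") is resolved here in favour of the longer syllable, which is a choice
of this sketch rather than documented system behaviour.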

The Pinyin input interface is an editor in itself. Besides input, it provides
facilities such as cursor movement and insertion and deletion operations on the
input string. To the user, one Pinyin behaves like an individual character. For
example, "han" is not treated as the three characters "h", "a", "n", but as a
single unit "han".

At the user interface, the Pinyin input is displayed as typed. Within the
system, however, each Pinyin is represented by an internal code defined by the
system. Hence, during the conversion process, these internal representations are
used instead of the external representations of the Pinyin.
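
The relationship between the external and internal representations can be
pictured with a small sketch. The code values and the names SYLLABLE_TO_CODE,
encode and decode are invented for illustration; the actual internal codes are
defined by the system and are not given in this chapter.

```python
# Hypothetical internal codes; the real values are defined by the system.
SYLLABLE_TO_CODE = {"han": 1, "yu": 2, "yin": 3}
CODE_TO_SYLLABLE = {code: syl for syl, code in SYLLABLE_TO_CODE.items()}


def encode(syllables):
    """External representation (strings) to internal codes (integers)."""
    return [SYLLABLE_TO_CODE[s] for s in syllables]


def decode(codes):
    """Internal codes back to the external representation shown to the user."""
    return " ".join(CODE_TO_SYLLABLE[c] for c in codes)


buffer = encode(["han", "yu", "yu", "yin"])
print(buffer)          # [1, 2, 2, 3]

# Because each Pinyin is one unit, deleting at the cursor removes the whole
# syllable "han", not just the letter "h".
del buffer[0]
print(decode(buffer))  # yu yu yin
```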

3.3 PINYIN HANZI CONVERSION
===========================

In the cWnn system, there are two types of conversion:
    (1) Forward conversion
    (2) Reverse conversion
Forward conversion refers to Pinyin-Hanzi conversion, whereas reverse conversion
refers to Hanzi-Pinyin conversion, i.e., the input is Hanzi and the conversion
result is the corresponding Pinyin. This section explains only Pinyin-Hanzi
conversion.

Note that Pinyin-Hanzi conversion does not always produce the correct result.
Hence, besides providing a multi-phrase conversion mechanism, cWnn also provides
facilities for re-editing, re-conversion, and manual word and phrase
segmentation.


1. Conversion Command
~~~~~~~~~~~~~~~~~~~~~
There are five conversion methods for Pinyin-Hanzi conversion. The first three
methods listed below are the most commonly used. The last two are meant for
system developers to check the grammatical analysis.

(a) Multi-phrase conversion
    Once a Pinyin string is sent for conversion, the system performs the
    conversion based on the current environment (refer to Chapter 5) and the
    conversion parameters of that environment. After conversion, the result
    appears on the input line, with the cursor positioned at the first word of
    the sentence. If a re-conversion is requested (by pressing the confirm key
    again), the conversion method described in (c) is performed.

(b) Word conversion
    Treats the portion of the input string indicated by the cursor as a word
    and performs word conversion. Outputs the candidate word with the highest
    assessment value as the result.

(c) Word candidates extraction
    Treats the portion of the input string indicated by the cursor as a word
    and performs word conversion. Outputs the possible word candidates under
    the current environment.

(d) Phrase conversion
    Treats the portion of the input string indicated by the cursor as a phrase
    and performs phrase conversion. Outputs the candidate phrase with the
    highest assessment value as the result.

(e) Phrase candidates extraction
    Treats the portion of the input string indicated by the cursor as a phrase
    and performs phrase conversion. Outputs the possible phrase candidates
    under the current environment.
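
The pairing of "conversion" and "candidates extraction" commands can be
summarised in a short sketch: conversion returns only the candidate with the
highest assessment value, whereas extraction returns every candidate. The
table CANDIDATES, its assessment values and the two function names are
hypothetical; ordering the extracted list best-first is also an assumption of
this sketch.

```python
# Hypothetical candidates with made-up assessment values.
CANDIDATES = {
    ("shi", "jie"): [("视界", 0.2), ("世界", 0.9)],
}


def extract_candidates(syllables):
    """Like (c) and (e): every candidate for the selected portion."""
    found = CANDIDATES.get(tuple(syllables), [])
    ranked = sorted(found, key=lambda item: item[1], reverse=True)
    return [text for text, _value in ranked]


def convert(syllables):
    """Like (b) and (d): only the candidate with the highest value."""
    ranked = extract_candidates(syllables)
    return ranked[0] if ranked else None


print(convert(["shi", "jie"]))             # 世界
print(extract_candidates(["shi", "jie"]))  # ['世界', '视界']
```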


2. Manual Word Segmentation
~~~~~~~~~~~~~~~~~~~~~~~~~~~
One difficulty in Pinyin-Hanzi conversion is automatic word segmentation. When
the conversion result is incorrect, the user needs to segment the words manually
using the segmentation keys (^O or ^I). The word indicated by the cursor will be
segmented. To complete the manual segmentation process, press the conversion key
again.


3. Assessment Formula for Multi-Phrase Conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Multi-phrase conversion plays a major role in Pinyin-Hanzi conversion. Its
accuracy directly affects the effectiveness of the system. Several factors
affect the result of Pinyin-Hanzi conversion, and their importance differs
according to the conditions. The assessment formula for multi-phrase conversion
is given below. Users can change the corresponding conversion parameters to
obtain the most suitable conversion environment.

(a) Assessment parameters
    parameter (0)  Number of phrases "n"
                   During the assessment process, this is the maximum number of
                   phrases that can be assessed at one time.

    parameter (1)  Number of words "m"
                   During the assessment process, this is the maximum number of
                   words that can be in a phrase.

(b) Word assessment parameters
    parameter (2)  Usage frequency weight
                   This is the usage frequency of each word in the dictionary.
                   When a user uses a dictionary, the system creates and
                   manages a usage frequency file for that user. As the user
                   works with the system, the usage frequency of each word in
                   the dictionary is updated according to how often the user
                   uses that word. Hence, each user has an individual usage
                   frequency file.

    parameter (3)  Word length weight
                   Word length refers to the number of characters in a word.

    parameter (4)  Tone correctness weight
                   This measures how accurately the four tones in the Pinyin
                   input by the user match those in the dictionary. cWnn allows
                   input with or without the four tones.

    parameter (5)  Last used weight
                   "Last used" refers to the most recently used word for a
                   Pinyin. By increasing the weight of this parameter, the
                   assessment value of each word can be increased dynamically.

    parameter (6)  Dictionary priority weight
                   Each dictionary has a priority defined by the environment.
                   By changing this value, assessment values may be biased
                   towards certain dictionaries.

(c) Phrase assessment parameters
    parameter (7)  Average word assessment value weight
                   A phrase consists of several words, and each word has its
                   own word assessment value as described above. The average
                   of these values is the average word assessment value.

    parameter (8)  Phrase length weight
                   Phrase length refers to the number of characters in a phrase.

    parameter (9)  Number of words weight
                   This refers to the number of words in a phrase. A larger
                   number of words in a phrase indicates greater grammatical
                   certainty among the words, and hence higher reliability.

(d) Other parameters
    Characters other than Pinyin that appear on the input line have their own
    individual usage frequency values.

(e) Assessment formula for multi-phrase conversion
    Assessment value for a word:
        f = (c1 x frequency) + (c2 x word length) + (c3 x tone correctness)
          + (c4 x last used) + (c5 x dictionary priority)

    Assessment value for a phrase:
        F = (k1 x avg( f1, f2, ..., fm )) + (k2 x phrase length)
          + (k3 x number of words in phrase)

    Total assessment value for candidates of a phrase:
        Vi = avg( Fi1 + Fi2 + ... + Fin )

    Best assessment value for a phrase:
        MAX( V1, V2, ..., Vk )

    Note: * c1 = parameter (2)
            c2 = parameter (3)
            c3 = parameter (4)
            c4 = parameter (5)
            c5 = parameter (6)
          * k1 = parameter (7)
            k2 = parameter (8)
            k3 = parameter (9)
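
The formulas above can be worked through numerically. The sketch below is a
minimal illustration under stated assumptions, not cWnn's implementation: the
function names, the dictionary keys and all input values are invented, the
phrase length is taken to be the sum of the word lengths, and Vi is read as
the mean of the phrase values Fi1 ... Fin.

```python
from statistics import mean


def word_value(word, c):
    """f = c1*frequency + c2*word length + c3*tone correctness
         + c4*last used + c5*dictionary priority"""
    return (c[0] * word["frequency"] + c[1] * word["length"]
            + c[2] * word["tone"] + c[3] * word["last_used"]
            + c[4] * word["priority"])


def phrase_value(words, c, k):
    """F = k1*avg(f1..fm) + k2*phrase length + k3*number of words"""
    avg_f = mean(word_value(w, c) for w in words)
    length = sum(w["length"] for w in words)  # assumption: characters in phrase
    return k[0] * avg_f + k[1] * length + k[2] * len(words)


def candidate_value(phrases, c, k):
    """Vi, read here (assumption) as the mean of Fi1..Fin."""
    return mean(phrase_value(p, c, k) for p in phrases)


def best_candidate(candidates, c, k):
    """Index i of the candidate with MAX(V1..Vk)."""
    return max(range(len(candidates)),
               key=lambda i: candidate_value(candidates[i], c, k))


# Made-up weights (parameters 2-6 and 7-9) and word features.
c = (1, 1, 1, 1, 1)
k = (1, 1, 1)
w1 = {"frequency": 5, "length": 2, "tone": 1, "last_used": 0, "priority": 1}
w2 = {"frequency": 1, "length": 1, "tone": 1, "last_used": 0, "priority": 1}

candidate_a = [[w1, w2]]    # one phrase made of two words
candidate_b = [[w2], [w2]]  # two one-word phrases

print(word_value(w1, c))                              # 9
print(phrase_value([w1, w2], c, k))                   # 11.5
print(best_candidate([candidate_a, candidate_b], c, k))  # 0
```

Raising k3 (parameter 9) favours candidates whose phrases contain more words,
which is one way a user could tune the conversion environment.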

3.4 ENVIRONMENT OPERATING FUNCTIONS
===================================

The cserver manages several resources such as dictionaries and grammar files.
In addition, it creates an environment for every user (client). One user may
have more than one environment. In each input mode, every environment has its
own dictionary files, corresponding usage frequency files and grammar files
defined.
When a user starts up uum (the client), cserver creates an environment and sets
the dictionaries for the user. After that, the user can obtain the usage status
of the dictionaries from the system.

1. Environment Operation
~~~~~~~~~~~~~~~~~~~~~~~~
<Table-c-3.3>

2. Parameter Update
~~~~~~~~~~~~~~~~~~~

<Table-c-3.4>

These are the assessment parameters for multi-phrase conversion described above.
The number in square brackets indicates the current parameter value. To change
a value, move the cursor to the parameter, press Return, and then enter the new
parameter value.

Input at the input line is not restricted to Pinyin. Other characters are also
allowed, for example numbers, ASCII characters, punctuation marks and brackets.
These characters undergo conversion together with the Pinyin input. Just like
Pinyin, these characters have parameters which can be defined externally. The
parameters are classified into the following categories.

(a) Usage frequency for numbers
    This covers the digits 0,1,2,3,4,5,6,7,8,9. In addition, the system can
    convert numbers into other formats. For example, "1234567" can be changed
    to 1,234,567,
    <Table-c-3.5>

(b) Usage frequency for ASCII characters
(c) Usage frequency for punctuation marks
(d) Usage frequency for open brackets
(e) Usage frequency for close brackets
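
The comma-grouped number format mentioned under (a) can be reproduced with a
one-line sketch. The function name group_thousands is hypothetical and only
this one format is shown; it does not cover the other formats the system
offers.

```python
def group_thousands(digits):
    """Return a digit string with a comma between each group of three digits."""
    return f"{int(digits):,}"


print(group_thousands("1234567"))  # 1,234,567
```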