comparison cWnn/manual.en/chap3 @ 0:bbc77ca4def5
initial import
author | Yoshiki Yazawa <yaz@cc.rim.or.jp> |
---|---|
date | Thu, 13 Dec 2007 04:30:14 +0900 |
parents | |
children |
-1:000000000000 | 0:bbc77ca4def5 |
---|---|
1 | |
2 ************************************************* | |
3 * Chapter 3 PINYIN INPUT AND * | |
4 * PINYIN HANZI CONVERSION * | |
5 ************************************************* | |
6 | |
7 | |
8 | |
9 | |
10 3.1 CONCEPT ON PINYIN INPUT | |
11 =========================== | |
12 | |
13 In Chapter 2, we have already given a brief introduction on Pinyin input. This | |
14 will be explained in greater detail now. | |
15 | |
16 The Pinyin input method refers to the input of Chinese characters via Pinyin and | |
17 Pinyin-Hanzi conversion. A good Pinyin input method should provide users with a good | |
18 Pinyin input environment as well as a highly accurate conversion mechanism. | |
19 | |
20 Pinyin-Hanzi conversion falls into the following three categories: | |
21 (a) Conversion based on character | |
22 (b) Conversion based on word | |
23 (c) Conversion based on phrase or any arbitrary Pinyin string | |
24 | |
25 (a) Conversion based on character | |
26 The result of this conversion is a Chinese character (Hanzi), which has the same | |
27 pronunciation as the input Pinyin. | |
28 Note that several Hanzi can share the same pronunciation, so one Pinyin | |
29 corresponds to many Hanzi. To obtain the correct Hanzi, the user has to select | |
30 it from among all the candidates, which makes this a rather inconvenient way | |
31 of conversion. | |
32 | |
33 (b) Conversion based on word | |
34 In this conversion, the result is a word, which may consist of two or | |
35 more characters. Hence, the number of candidates is much reduced. However, the | |
36 need to select candidates still exists. Note also that in such a | |
37 system, only words that are registered can be found, and users need to be | |
38 familiar with the concept of words. | |
39 | |
40 (c) Conversion based on phrase or any arbitrary Pinyin string | |
41 For this conversion, the user can input a Pinyin string of arbitrary length and | |
42 can perform the conversion at any position of the input string. The system | |
43 analyses the input string, performs the necessary grammatical analysis and word | |
44 segmentation, and subsequently produces a more accurate conversion output. | |
45 | |
51 The diagram below shows the conversion process for the entire system. | |
52 | |
53 <Table-c-3.1> | |
54 | |
72 We will now explain the Pinyin input environment, Pinyin-Hanzi conversion and the | |
73 related environment operations. | |
101 | |
102 3.2 PINYIN INPUT ENVIRONMENT | |
103 ============================ | |
104 | |
105 Pinyin Input and its Internal/External Representations | |
106 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
107 | |
108 Pinyin can be input via three methods: Quanpin, Erpin and Sanpin (as described | |
109 in Chapter 2). | |
110 These input methods are not built into the system; they are defined through an | |
111 external environment definition of the system, known as the | |
112 Input Automaton (refer to Chapter 6). It provides different input environments for | |
113 different users (clients), according to their needs. | |
114 | |
115 Through the input automaton, the user input will be converted into the standard | |
116 Pinyin defined by the system. For example : | |
117 | |
118 <Table-c-3.2> | |
119 | |
120 The system does not require the user to segment the Pinyin input string. The user only | |
121 needs to input the correct Pinyin and the system will perform the segmentation on | |
122 the input. For example, the input "hanyuyuyin" will be segmented into "han yu yu yin" | |
123 automatically by the system. | |
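
The automatic segmentation described above can be illustrated with a minimal sketch.
The manual does not describe the algorithm cWnn actually uses, so the approach here
(longest match first, with backtracking) is only an assumption, and the syllable
table SYLLABLES is a hypothetical, deliberately tiny subset of the real Pinyin set.

```python
from functools import lru_cache

# Hypothetical, tiny syllable table (an assumption for illustration);
# the real system defines the complete set of standard Pinyin.
SYLLABLES = frozenset({"a", "an", "ha", "han", "yi", "yin", "yu"})
MAX_LEN = max(len(s) for s in SYLLABLES)


def segment(pinyin):
    """Split an unsegmented Pinyin string into syllables.

    Tries the longest possible syllable first and backtracks when the
    remainder cannot be segmented. Returns a list of syllables, or None
    if no complete segmentation exists.
    """

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(pinyin):
            return ()
        longest_end = min(len(pinyin), start + MAX_LEN)
        for end in range(longest_end, start, -1):
            piece = pinyin[start:end]
            if piece in SYLLABLES:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None


print(" ".join(segment("hanyuyuyin")))  # prints "han yu yu yin"
```

Backtracking matters because a greedy choice can leave a remainder that is not a
valid syllable; in that case the sketch retries with a shorter first syllable.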
124 | |
125 The Pinyin input interface is an editor in itself. Besides the input | |
126 feature, facilities such as cursor movement and insert and delete operations on the | |
127 input string are also provided. To the user, one Pinyin is just like an individual | |
128 character. For example, "han" is not treated as three characters "h", "a", "n", but | |
129 as a single unit "han". | |
130 | |
131 At the user interface, the Pinyin input is represented as it is. However, within | |
132 the system, each Pinyin is represented by an internal code defined by the system. | |
133 Hence during the conversion process, these internal representations are used instead of | |
134 the external representations of the Pinyin. | |
151 | |
152 3.3 PINYIN HANZI CONVERSION | |
153 =========================== | |
154 | |
155 In the cWnn system, there are two types of conversion: (1) Forward conversion | |
156 (2) Reverse conversion | |
157 Forward conversion refers to Pinyin-Hanzi conversion, whereas reverse conversion | |
158 refers to Hanzi-Pinyin conversion, i.e., the input is Hanzi and the conversion result is | |
159 the corresponding Pinyin. Only Pinyin-Hanzi conversion is explained here. | |
160 | |
161 Note that Pinyin-Hanzi conversion does not always produce an | |
162 accurate result. Hence, besides providing a multi-phrase conversion mechanism, cWnn | |
163 also provides facilities to perform re-editing, re-conversion as well as manual word and | |
164 phrase segmentation. | |
165 | |
166 | |
167 1. Conversion Command | |
168 ~~~~~~~~~~~~~~~~~~~~~ | |
169 There are five conversion methods for Pinyin-Hanzi conversion. The first three methods | |
170 listed below are most commonly used. The last two methods are meant for system | |
171 developers to check on grammatical analysis. | |
172 | |
173 (a) Multi-phrase conversion | |
174 Once a Pinyin string is sent for conversion, the system will perform the | |
175 conversion based on the current environment (refer to Chapter 5) as well as the | |
176 conversion parameters of the current environment. After conversion, the result | |
177 will appear on the input line, with the cursor positioned at the first word of | |
178 the sentence. If a re-conversion is required (done by pressing the confirm key | |
179 again), the conversion method as in (c) will be performed. | |
180 | |
181 (b) Word conversion | |
182 Treat the portion of the input string indicated by the cursor as a word and | |
183 perform word conversion. Output the candidate word that has the highest | |
184 assessment value as the result. | |
185 | |
186 (c) Word candidates extraction | |
187 Treat the portion of the input string indicated by the cursor as a word and | |
188 perform word conversion. Output the possible word candidates under the | |
189 particular environment. | |
190 | |
191 (d) Phrase conversion | |
192 Treat the portion of the input string indicated by the cursor as a phrase and | |
193 perform phrase conversion. Output the candidate phrase that has the highest | |
194 assessment value as the result. | |
195 | |
201 (e) Phrase candidates extraction | |
202 Treat the portion of the input string indicated by the cursor as a phrase and | |
203 perform phrase conversion. Output the possible phrase candidates under the | |
204 particular environment. | |
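
The difference between a conversion command (b, d) and a candidates extraction
command (c, e) can be sketched as follows. This is an illustration only: the
table CANDIDATES, its words and its assessment values are invented, and returning
the candidates sorted best-first is an assumption, since the manual only says
that the possible candidates are output.

```python
# Hypothetical candidate table: Pinyin -> list of (Hanzi, assessment value).
# The entries and the values are invented for illustration only.
CANDIDATES = {
    "hanyu": [("韩语", 0.6), ("汉语", 0.9), ("寒雨", 0.2)],
}


def convert(pinyin):
    """Conversion: return the one candidate with the highest assessment value."""
    entries = CANDIDATES.get(pinyin, [])
    if not entries:
        return None
    return max(entries, key=lambda entry: entry[1])[0]


def extract_candidates(pinyin):
    """Candidates extraction: return every candidate (sorted best-first here)."""
    entries = CANDIDATES.get(pinyin, [])
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return [hanzi for hanzi, _ in ranked]


print(convert("hanyu"))
print(extract_candidates("hanyu"))
```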
205 | |
206 | |
207 2. Manual Word Segmentation | |
208 ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
209 One difficulty in Pinyin-Hanzi conversion is automatic word segmentation. | |
210 When the conversion result is incorrect, the user needs to segment the words by using | |
211 the segmentation keys (^O or ^I). The word indicated by the cursor will be segmented. | |
212 To complete the manual segmentation process, press the conversion key again. | |
213 | |
214 | |
215 3. Assessment Formula for Multi-Phrase Conversion | |
216 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
217 Multi-phrase conversion plays a major role in Pinyin-Hanzi conversion. The level | |
218 of accuracy of this conversion has a direct effect on the effectiveness of the system. | |
219 Several factors affect the result of Pinyin-Hanzi conversion, and | |
220 their influence differs according to the conditions. The assessment | |
221 formulas for multi-phrase conversion are given below. Users are able to change the corresponding | |
222 conversion parameters in order to obtain the most suitable conversion environment. | |
223 | |
224 (a) Assessment parameters | |
225 parameter (0) Number of phrases "n" | |
226 During the assessment process, this is the maximum number of | |
227 phrases that can be assessed at one time. | |
228 | |
229 parameter (1) Number of words "m" | |
230 During the assessment process, this is the maximum number of | |
231 words that can be in a phrase. | |
232 | |
233 (b) Word assessment parameters | |
234 parameter (2) Usage frequency weight | |
235 This is the usage frequency for each word in the dictionary. | |
236 When a user uses the dictionary, the system will create as | |
237 well as manage a usage frequency file for the user. As the | |
238 user uses the system, the usage frequency of each word in | |
239 the dictionary will be updated according to how often the | |
240 user uses each word. Hence, each user will have his | |
241 individual usage frequency file. | |
242 | |
243 parameter (3) Word length weight | |
244 Word length refers to the number of characters in a word. | |
245 | |
251 parameter (4) Tone correctness weight | |
252 This is the accuracy of the four tones in the Pinyin input | |
253 by the user compared to that in the dictionary. cWnn allows | |
254 input with or without four tones. | |
255 | |
256 parameter (5) Last used weight | |
257 Last used refers to the most recently used word for a Pinyin. | |
258 By increasing the weight of this parameter, the assessment | |
259 value of each word can be increased dynamically. | |
260 | |
261 parameter (6) Dictionary priority weight | |
262 Each dictionary has a priority defined by the environment. | |
263 By changing this value, assessment values may be biased | |
264 towards certain dictionaries. | |
265 | |
266 (c) Phrase assessment parameters | |
267 parameter (7) Average word assessment value weight | |
268 A phrase consists of several words, and each word has its | |
269 own word assessment value as described above. The average | |
270 of these values is the average word assessment value. | |
271 | |
272 parameter (8) Phrase length weight | |
273 Phrase length refers to the number of characters in a phrase. | |
274 | |
275 parameter (9) Number of words weight | |
276 This refers to the number of words in a phrase. A larger | |
277 number of words in a phrase indicates greater grammatical | |
278 certainty among the words, and hence higher reliability. | |
279 | |
280 (d) Other parameters | |
281 Characters other than Pinyin that appear at the input line have their own | |
282 individual usage frequency values. | |
283 | |
284 (e) Assessment formula for multi-phrase conversion | |
285 Assessment value for word : | |
286 f = (c1 x frequency) + (c2 x word length) + (c3 x tone correctness) | |
287 + (c4 x last used) + (c5 x dictionary priority) | |
288 | |
289 Assessment value for phrase : | |
290 F = (k1 x avg( f1, f2, ..., fm )) + (k2 x phrase length) | |
291 + (k3 x number of words in phrase) | |
292 | |
293 Total assessment value for candidates of a phrase : | |
294 Vi = avg( Fi1, Fi2, ..., Fin ) | |
295 | |
296 Best assessment value for a phrase : | |
297 MAX( V1, V2, ..., Vk ) | |
298 | |
301 Note : * c1 = parameter (2) | |
302 c2 = parameter (3) | |
303 c3 = parameter (4) | |
304 c4 = parameter (5) | |
305 c5 = parameter (6) | |
306 * k1 = parameter (7) | |
307 k2 = parameter (8) | |
308 k3 = parameter (9) | |
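
The assessment formulas above can be written out as a minimal sketch. It is not
taken from the cWnn sources: the field names (frequency, length, tone, last_used,
dict_priority), the sample words and all weights are hypothetical. Two readings
are assumed: phrase length is taken as the sum of the word lengths, and the total
value Vi is taken as the average of the phrase values Fi1 ... Fin.

```python
from statistics import mean


def word_value(word, c):
    """f = c1*frequency + c2*word length + c3*tone correctness
    + c4*last used + c5*dictionary priority."""
    return (c[0] * word["frequency"] + c[1] * word["length"]
            + c[2] * word["tone"] + c[3] * word["last_used"]
            + c[4] * word["dict_priority"])


def phrase_value(words, c, k):
    """F = k1*avg(f1..fm) + k2*phrase length + k3*number of words."""
    average_f = mean(word_value(w, c) for w in words)
    phrase_length = sum(w["length"] for w in words)  # assumption: characters
    return k[0] * average_f + k[1] * phrase_length + k[2] * len(words)


def candidate_value(phrases, c, k):
    """Vi, read here as the average of the phrase values (an assumption)."""
    return mean(phrase_value(p, c, k) for p in phrases)


def best_candidate(candidates, c, k):
    """Pick the candidate whose total value equals MAX(V1, ..., Vk)."""
    return max(candidates, key=lambda phrases: candidate_value(phrases, c, k))


# Hypothetical sample data; all numbers are invented for illustration.
C = (1, 1, 1, 1, 1)   # c1..c5 = parameters (2)..(6)
K = (1, 1, 1)         # k1..k3 = parameters (7)..(9)
WORD_A = {"frequency": 2, "length": 2, "tone": 1, "last_used": 0, "dict_priority": 1}
WORD_B = {"frequency": 4, "length": 1, "tone": 1, "last_used": 1, "dict_priority": 1}

ONE_PHRASE = [[WORD_A, WORD_B]]      # both words in a single phrase
TWO_PHRASES = [[WORD_A], [WORD_B]]   # each word as its own phrase

print(candidate_value(ONE_PHRASE, C, K))   # 12
print(candidate_value(TWO_PHRASES, C, K))  # 9.5
```

With these invented weights the single two-word phrase scores higher, which
matches the remark under parameter (9) that more words in a phrase indicate
higher reliability.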
351 | |
352 3.4 ENVIRONMENT OPERATING FUNCTIONS | |
353 =================================== | |
354 | |
355 The cserver manages several resources such as dictionaries and grammar files. | |
356 In addition, it creates an environment for every user (client). One user may have more | |
357 than one environment. For each input mode, every environment defines its own | |
358 dictionary files, the corresponding usage frequency files and the grammar files. | |
359 When a user starts up uum (the client), cserver creates an environment and | |
360 sets the dictionaries for the user. After that, the user is able to obtain the usage | |
361 status of the dictionaries from the system. | |
362 | |
363 1. Environment Operation | |
364 ~~~~~~~~~~~~~~~~~~~~~~~~ | |
365 <Table-c-3.3> | |
366 | |
401 2. Parameter Update | |
402 ~~~~~~~~~~~~~~~~~~~ | |
403 | |
404 <Table-c-3.4> | |
405 | |
406 | |
407 | |
408 | |
409 These are the assessment parameters for multi-phrase conversion mentioned above. | |
410 The number in square brackets indicates the current parameter value. To change | |
411 the value, simply move the cursor to the parameter and press return, then enter the | |
412 new parameter value. | |
413 | |
414 | |
415 | |
416 | |
417 Input at the input line is not restricted to Pinyin. Other | |
418 characters are also allowed, for example numbers, ASCII characters, punctuation marks | |
419 and brackets. These characters will undergo conversion together with the Pinyin | |
420 input. Just like Pinyin, these characters have parameters which can be defined | |
421 externally. The parameters are classified into the following categories. | |
422 | |
423 (a) Usage frequency for numbers | |
424 This covers the digits 0,1,2,3,4,5,6,7,8,9. In addition, the system has a facility to change | |
425 numbers into other formats. For example, "1234567" can be changed to "1,234,567". | |
426 <Table-c-3.5> | |
427 | |
428 (b) Usage frequency for ASCII characters | |
429 (c) Usage frequency for punctuation marks | |
430 (d) Usage frequency for open brackets | |
431 (e) Usage frequency for close brackets | |
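
The number format change mentioned under (a), "1234567" to "1,234,567", can be
reproduced with a short sketch. The function name group_thousands is
hypothetical, and the manual does not say how cWnn performs this conversion
internally; the sketch only shows the resulting format.

```python
def group_thousands(digits):
    """Insert a comma between every group of three digits, from the right."""
    if not digits.isdigit():
        raise ValueError("expected a string of decimal digits")
    # The ',' option of Python's format specification groups thousands.
    return f"{int(digits):,}"


print(group_thousands("1234567"))  # prints "1,234,567"
```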