┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃
┃ ┃ Chapter 4 PINYIN-HANZI CONVERSION ┃ ┃
┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛


┏━━━━━━━━━━━━━━┓
┃ 4.1 OVERVIEW ┃
┗━━━━━━━━━━━━━━┛

Chapter 3 described Pinyin input. This chapter explains the Pinyin input environment in
greater detail and how the input is processed within the system. It also explains the
general concepts of Pinyin-Hanzi conversion and the conversion methods used in cWnn.


┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 4.2 PINYIN INPUT ENVIRONMENT ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

As described in Chapter 3, Pinyin can be input via three methods: Quanpin, Erpin and
Sanpin. These methods are not implemented inside the system itself; they are defined in
an environment external to the system. The facility in the external environment that
allows such definitions is known as the input automaton. Refer to Chapter 7 for details.

The input automaton provides different input environments for different users. For
example, a user who needs to input Pinyin may use the Pinyin-centred input environment,
as explained in Chapter 3.

In addition to the input automaton, users may specify their own Pinyin input methods in
certain environment files.

Internal/External Representations of Chinese Pronunciations
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Pinyin is the external representation of Chinese pronunciation. When a user enters
Pinyin at the user interface, the input is first processed by the input automaton, which
converts it into the standard Pinyin defined in the system. This standard Pinyin is
known as the internal representation (音码). For example:

┌────────────────────────────────────────────────────────────────────────┐
│ The phrase 汉语语音在计算机中的表现 during user input will be:         │
│     han4yu3yu3yin1zai4ji4suan4ji1zhong1debiao3xian4                    │
│                                                                        │
│ However, the system will automatically convert these to the standard   │
│ Pinyin defined in the system, that is:                                 │
│     Hàn Yǔ Yǔ Yīn Zài Jì Suàn Jī Zhōng De Biǎo Xiàn                    │
└────────────────────────────────────────────────────────────────────────┘

Within the system, each Pinyin is represented by its own internal code. Before Hanzi
conversion takes place, the user's Pinyin input is first converted into the
corresponding internal representation.

As the example above shows, the system does not require the user to segment the Pinyin
input string. The user only needs to input the correct Pinyin, and the system performs
the segmentation. For example, the input "han4yu3yu3yin1" is segmented automatically
into "Hàn Yǔ Yǔ Yīn". To the system, each Pinyin is a single unit. For example, the
Pinyin for 汉 is not treated as the four characters "h", "a", "n", "4", but as the
single unit "Hàn". The input string is therefore segmented and displayed to the user in
a more readable form.
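
The segmentation described above can be illustrated with a small Python sketch. It is
not cWnn's input automaton: the syllable table is a hypothetical subset that covers only
the example phrase, the function names are invented for illustration, and greedy
longest-match is an assumed strategy that can mis-segment ambiguous strings when a full
syllable table is used.

```python
# Hypothetical mini syllable table; a real system would list every valid Pinyin.
SYLLABLES = {"han", "yu", "yin", "zai", "ji", "suan", "zhong", "de", "biao", "xian"}

# Tone-marked forms of each vowel, indexed by tone number minus one.
TONE_MARKS = {"a": "āáǎà", "e": "ēéěè", "i": "īíǐì", "o": "ōóǒò", "u": "ūúǔù"}


def add_tone_mark(syllable, tone):
    """Place the tone mark following the usual Pinyin spelling rules."""
    if tone == 0:                       # no tone digit given: leave unmarked
        return syllable
    if "a" in syllable:
        pos = syllable.index("a")
    elif "e" in syllable:
        pos = syllable.index("e")
    elif "ou" in syllable:
        pos = syllable.index("o")
    else:                               # otherwise the last vowel carries the mark
        pos = max(i for i, ch in enumerate(syllable) if ch in TONE_MARKS)
    marked = TONE_MARKS[syllable[pos]][tone - 1]
    return syllable[:pos] + marked + syllable[pos + 1:]


def segment(raw):
    """Split an unsegmented string such as 'han4yu3' into display units."""
    units = []
    longest = max(len(s) for s in SYLLABLES)
    i = 0
    while i < len(raw):
        # Greedy longest match against the syllable table.
        for size in range(min(longest, len(raw) - i), 0, -1):
            if raw[i:i + size] in SYLLABLES:
                break
        else:
            raise ValueError(f"no valid Pinyin at position {i}")
        syllable = raw[i:i + size]
        i += size
        tone = 0
        if i < len(raw) and raw[i] in "1234":   # optional tone digit
            tone = int(raw[i])
            i += 1
        units.append(add_tone_mark(syllable, tone).capitalize())
    return units
```

Calling segment("han4yu3yu3yin1") yields one unit per Pinyin, so that each unit can be
moved over or deleted as a whole rather than character by character.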

The Pinyin input interface is itself an editor. In addition to the input feature
described above, it provides facilities such as cursor movement, insertion and deletion
on the input string.


┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 4.3 CONCEPT ON PINYIN-HANZI CONVERSION ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

A good Pinyin input system should provide users with a good Pinyin input environment
and a highly accurate Pinyin-Hanzi conversion mechanism. Pinyin-Hanzi conversion refers
to converting Pinyin input into the intended Chinese characters (Hanzi).

There are three categories of conversion mechanism:
    (a) Conversion based on character
    (b) Conversion based on word
    (c) Conversion based on phrase or any arbitrary Pinyin string

(a) Conversion based on character
    This conversion mechanism accepts only one Pinyin at a time. The conversion result
    is a Chinese character (Hanzi) that has the same pronunciation as the Pinyin input.
    Note that many Hanzi share the same pronunciation, so a single Pinyin input
    corresponds to many Hanzi. To obtain the correct Hanzi, the user has to select it
    manually from among all the candidates. For example,

        The Pinyin "Zhōng" corresponds to the Hanzi 中, 钟, 仲, etc. Hence, if the
        user wants the word 中国, he has to select the Hanzi 中.

    This mechanism is time-consuming and inconvenient.

(b) Conversion based on word
    In this conversion mechanism, more than one Pinyin may be input at a time. The
    Pinyin input corresponds to the intended Chinese word, which may consist of more
    than one character. With word-based conversion, the number of candidates is
    therefore much smaller. For example,

        The word 中国 consists of the characters 中 and 国. If the user wants this
        word, he only needs to input "ZhōngGuó" and the conversion result will be 中国.

    The example shows that fewer candidate selections are needed. However, the user
    must think in terms of words during input, and only words that are registered in
    the system are valid. Hence, candidate selection is still needed.
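
The difference between (a) and (b) can be seen in a minimal Python sketch. The two
dictionaries below are hypothetical toy tables invented for this illustration (tones are
ignored, and the entries under "guo" are assumed examples); they are not cWnn's
dictionary format.

```python
# Hypothetical toy dictionaries, keyed by toneless Pinyin.
CHAR_DICT = {
    "zhong": ["中", "钟", "仲"],
    "guo": ["国", "果", "过"],
}
WORD_DICT = {
    ("zhong", "guo"): ["中国"],
}


def char_candidates(pinyin):
    """Character-based conversion: one Pinyin in, many Hanzi candidates out."""
    return CHAR_DICT.get(pinyin, [])


def word_candidates(pinyins):
    """Word-based conversion: only registered words are returned."""
    return WORD_DICT.get(tuple(pinyins), [])
```

With these tables, converting 中国 character by character requires a manual choice
among three candidates for each Pinyin, whereas the word lookup returns a single
candidate. A Pinyin sequence that is not registered as a word returns no candidates,
which is why candidate selection cannot be eliminated entirely.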

(c) Conversion based on phrase or any arbitrary Pinyin string
    With this conversion, the user may input a Pinyin string of arbitrary length. That
    is, the user terminates the Pinyin input string wherever he considers suitable. The
    system analyses the input string, performs the necessary grammatical analysis and
    word segmentation, and then produces a more accurate conversion output. Far fewer
    conversion operations are needed than in (b).

The cWnn system uses this conversion mechanism and hence provides a more flexible user
input interface. The diagram below shows the conversion process for the entire cWnn
system.

       ↓ Input           Output ↑
      ━┿━━━                   ━━┿━━         User Interface
  ┌────┼────────────────────────┼───────────────────────────┐
  │    │                        │                           │
  │    │                        ↓ Internal / External       │
  │    │                        │                           │
  │ ┏━━┷━━━━━━━━┓           ┏━━━┷━━━━┓      ┏━━━━━━━━━━━━┓  │
  │ ┃ Input     ┠─────↓─────┨ Editor ┠──────┨ Conversion ┃  │
  │ ┃ Automaton ┃ External/ ┗━━━━━━━━┛      ┃ Mechanism  ┃  │
  │ ┗━━┯━━━━━━━━┛ Internal                  ┗━━━━━┯━━━━━━┛  │
  │    │                                          │         │
  │ ┏━━┷━━━━━━━━━━┓                         ┏━━━━━┷━━━━━━━┓ │
  │ ┃ Input       ┃                         ┃ Conversion  ┃ │
  │ ┃ Environment ┃                         ┃ Environment ┃ │
  │ ┗━━━━━━━━━━━━━┛                         ┗━━━━━━━━━━━━━┛ │
  │                                                         │
  └─────────────────────────────────────────────────────────┘


┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 4.4 PINYIN-HANZI CONVERSION IN CWNN ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

The cWnn system provides two directions of conversion:
    (1) Forward conversion (正向变换)
    (2) Reverse conversion (逆向变换)
Forward conversion refers to Pinyin-Hanzi conversion, whereas reverse conversion refers
to Hanzi-Pinyin conversion. In Pinyin-Hanzi conversion, Pinyin is the input and the
conversion result is the corresponding Hanzi. Hanzi-Pinyin conversion works the other
way round.

As mentioned in Section 4.3, no conversion mechanism can achieve 100% accuracy. Hence,
besides providing a multi-phrase conversion mechanism, cWnn also provides facilities for
re-editing and re-conversion, and allows the user to segment words and phrases manually.

The rest of this section explains the Pinyin-Hanzi conversion mechanism and the
assessment formula for multi-phrase conversion in the cWnn system.


1. Conversion Mechanism
━━━━━━━━━━━━━━━━━━━━━━━
Pinyin-Hanzi conversion comprises the following five conversions. The first three are
the most commonly used. The last two are meant for system developers to check the
grammatical analysis.

(a) Multi-phrase conversion
    The concept of multi-phrase conversion was introduced in Section 4.3. In cWnn,
    once a Pinyin input string is sent for conversion, the system performs the
    conversion based on the current environment (refer to Chapter 5) and its
    conversion parameters. After conversion, the result appears on the input line,
    with the cursor positioned at the first word of the sentence. If re-conversion is
    required (by pressing the conversion 变换 key again), the conversion method
    described in (c) is performed.

(b) Word conversion
    The concept of word conversion was introduced in Section 4.3. In cWnn, the portion
    of the input string indicated by the cursor is treated as a word, and conversion
    is performed on this word. The candidate word with the highest assessment value
    is output as the result.
    For example, the Pinyin "ShiYong" corresponds to Hanzi such as 使用, 适用, 施用,
    实用, 食用 and 试用. Of these, 使用 has the highest assessment value in the
    system. Hence, 使用 will be the initial conversion result.

(c) Word candidates extraction
    Treat the portion of the input string indicated by the cursor as a word, perform
    word conversion, and output all the possible word candidates.
    In the example in (b), if 使用 is not the word you want, press the conversion
    变换 key again to list all the possible candidates, such as 适用, 施用, 实用,
    食用 and 试用, and select the one you want.
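
The relationship between word conversion (b) and word candidates extraction (c) can be
sketched in Python as follows. The assessment values are hypothetical numbers chosen
only so that 使用 ranks first, as in the example; they are not values computed by cWnn,
and the function names are invented for this illustration.

```python
# Hypothetical assessment values for the candidates of "ShiYong".
CANDIDATES = {"使用": 95, "适用": 80, "实用": 75, "试用": 60, "食用": 40, "施用": 20}


def word_conversion(candidates):
    """(b): return only the candidate with the highest assessment value."""
    return max(candidates, key=candidates.get)


def word_candidates_extraction(candidates):
    """(c): return every candidate, best first, for manual selection."""
    return sorted(candidates, key=candidates.get, reverse=True)
```

The first function models the initial conversion result; the second models the list
presented when the conversion key is pressed again.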

(d) Phrase conversion
    Treat the portion of the input string indicated by the cursor as a phrase and
    perform phrase conversion. Output the candidate phrase that has the highest
    assessment value as the result.

(e) Phrase candidates extraction
    Treat the portion of the input string indicated by the cursor as a phrase and
    perform phrase conversion. Output all the possible phrase candidates.


2. Manual Word Segmentation
━━━━━━━━━━━━━━━━━━━━━━━━━━━
Automatic segmentation of the input into individual Pinyin was described in Section
4.2. cWnn also performs word segmentation automatically. However, in Pinyin-Hanzi
conversion, automatic word segmentation may not be 100% accurate. When the conversion
result is incorrect, the user needs to segment the words manually using the
segmentation keys (^O or ^I). The word indicated by the cursor will be segmented. To
complete the manual segmentation process, press the conversion key again. For example,
┌────────────────────────────────────────────────────────────────────────┐
│ In the phrase 今天天气正好, 今天, 天气 and 正好 are treated as words.  │
│ Suppose 正好 is not the intended word. We then need to segment the     │
│ word 正好 into individual characters, then perform a re-conversion.    │
│ (1) First, move the cursor to 正好, then press ^I to separate the      │
│     word.                                                              │
│ (2) The word will be converted back to individual Pinyin. You may now  │
│     do a re-conversion by pressing the conversion 变换 key again.      │
└────────────────────────────────────────────────────────────────────────┘

Similarly, ^O may be used to combine characters into one unit. For example,
┌────────────────────────────────────────────────────────────────────────┐
│ 你好 is treated as two separate characters. To combine these two       │
│ characters into one unit, do the following:                            │
│ (1) Place the cursor at 你, then press ^O to combine 你 and 好.        │
│ (2) The characters will be converted back to Pinyin. You may now do a  │
│     re-conversion by pressing the conversion 变换 key again. If no     │
│     candidate for this word exists, a message will be displayed        │
│     " 侯补1个也没有 (怎么办)". This means that the word does not       │
│     exist in the current dictionaries.                                 │
└────────────────────────────────────────────────────────────────────────┘

NOTE: Another way to perform ^O is to use the 文节→ key.
      Another way to perform ^I is to use the 文节← key.
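
The two manual segmentation operations can be modelled with a short Python sketch. It
assumes, purely for illustration, that the input line is held as a list of segments and
that each segment is a list of Pinyin; the function names are hypothetical, and the
sketch models only the behaviour described in the two examples above, not cWnn's editor.

```python
def separate(segments, i):
    """Model of ^I as described above: split segment i into individual Pinyin."""
    return segments[:i] + [[p] for p in segments[i]] + segments[i + 1:]


def combine(segments, i):
    """Model of ^O as described above: merge segment i with the next segment."""
    if i + 1 >= len(segments):
        raise IndexError("no following segment to combine with")
    return segments[:i] + [segments[i] + segments[i + 1]] + segments[i + 2:]
```

For the first example, separating the third segment of
[["jin", "tian"], ["tian", "qi"], ["zheng", "hao"]] leaves "zheng" and "hao" as separate
units ready for re-conversion. For the second, combining the first segment of
[["ni"], ["hao"]] produces one unit that is then looked up as a word.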


3. Assessment Formula for Multi-Phrase Conversion
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Multi-phrase conversion plays a major role in Pinyin-Hanzi conversion, and its accuracy
directly affects the effectiveness of the system. Several factors affect the conversion
result, and the importance of each varies with the conditions. The parameters and
assessment formulae for multi-phrase conversion are given below. Users may change the
conversion parameters to obtain the most suitable conversion environment.

(a) Assessment parameters
    parameter (0)  Number of phrases "n"
                   During the assessment process, this is the maximum number of
                   phrases that can be assessed at one time.
                   The default "n" value in cWnn is "1".

    parameter (1)  Number of words "m"
                   During the assessment process, this is the maximum number of
                   words that can be in a phrase.
                   The default "m" value in cWnn is "5".

(b) Word assessment parameters
    parameter (2)  Usage frequency weight
                   Each word in the dictionary is given a usage frequency.
                   When a user uses a dictionary, the system creates and
                   manages a usage frequency file for that user. As the user
                   works with the system, the usage frequency of each word is
                   updated according to how often the user uses it. Hence,
                   each user has an individual usage frequency file.
                   The default value in cWnn is "2".

    parameter (3)  Word length weight
                   Word length refers to the number of characters in a word.
                   The default value in cWnn is "750".

    parameter (4)  Tone correctness weight
                   This gives higher assessment values to words entered with
                   the correct tones, although cWnn allows input with or
                   without tones. The default value in cWnn is "10".

    parameter (5)  Last used weight
                   Last used refers to the most recently used word for a Pinyin.
                   Increasing the weight of this parameter increases the
                   assessment value of the most recently used word.
                   The default value in cWnn is "80".

    parameter (6)  Dictionary priority weight
                   Each dictionary has a priority defined by the environment.
                   By changing this value, assessment values may be biased
                   towards certain dictionaries.
                   The default value in cWnn is "10".

(c) Phrase assessment parameters
    parameter (7)  Average word assessment value weight
                   A phrase consists of several words, and each word has its
                   own word assessment value as described above. The average
                   of these values is the average word assessment value.
                   The default value in cWnn is "5".

    parameter (8)  Phrase length weight
                   Phrase length refers to the number of characters in a phrase.
                   The default value in cWnn is "1000".

    parameter (9)  Number of words in phrase weight
                   This refers to the number of words in a phrase. A larger
                   number of words in a phrase indicates greater grammatical
                   certainty among the words, and hence higher reliability.
                   The default value in cWnn is "50".

(d) Other parameters
    Characters other than Hanzi that appear on the input line have their own
    individual weights. These parameters are as follows:

    parameter (10) Usage frequency of numerals
                   The default value in cWnn is "0".

    parameter (11) Usage frequency of alphabetic characters
                   The default value in cWnn is "-200".

    parameter (12) Usage frequency of symbols
                   The default value in cWnn is "0".

    parameter (13) Usage frequency of open parentheses
                   The default value in cWnn is "0".

    parameter (14) Usage frequency of close parentheses
                   The default value in cWnn is "0".

    parameter (16) Maximum number of candidates allowed during conversion
                   The default value in cWnn is "16".

(e) Assessment formula for multi-phrase conversion
    Pinyin-Hanzi conversion in cWnn is based on an assessment formula. As shown
    above, each parameter has a value. Increasing a parameter's value increases its
    weight in the conversion process.

    The formulae below determine the assessment value of a word, the assessment value
    of a phrase, and the total assessment value of each candidate for a phrase.

    Assessment value for word :
        f = (c1 x frequency) + (c2 x word length)
            + (c3 x tone correctness) + (c4 x last used)
            + (c5 x dictionary priority)

    Assessment value for phrase :
        F = (k1 x avg( f1, f2, ... fm )) + (k2 x phrase length)
            + (k3 x number of words in phrase)

    Total assessment value for candidates of a phrase :
        Vi = avg( Fi1, Fi2, ... Fin )

    Best assessment value for a phrase :
        MAX( V1, V2, ... Vk )


    NOTE:
        * c1 = parameter (2)
          c2 = parameter (3)
          c3 = parameter (4)
          c4 = parameter (5)
          c5 = parameter (6)

        * k1 = parameter (7)
          k2 = parameter (8)
          k3 = parameter (9)
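
The formulae above can be followed step by step in a minimal Python sketch. The weights
are the defaults quoted in this section. Everything else is an assumption made for
illustration: the class and function names, the numeric encoding of each word feature
(the manual does not give their scales), and the reading of phrase length as the sum of
the word lengths.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Weights:
    """Default weights quoted in this section (parameters 2 to 9)."""
    c1: int = 2      # usage frequency
    c2: int = 750    # word length
    c3: int = 10     # tone correctness
    c4: int = 80     # last used
    c5: int = 10     # dictionary priority
    k1: int = 5      # average word assessment value
    k2: int = 1000   # phrase length
    k3: int = 50     # number of words in phrase


@dataclass(frozen=True)
class Word:
    # Hypothetical feature encoding; the scales are assumptions.
    frequency: int
    length: int
    tone_correct: int
    last_used: int
    dict_priority: int


def word_value(w, wt=Weights()):
    """f: assessment value for a word."""
    return (wt.c1 * w.frequency + wt.c2 * w.length + wt.c3 * w.tone_correct
            + wt.c4 * w.last_used + wt.c5 * w.dict_priority)


def phrase_value(words, wt=Weights()):
    """F: assessment value for a phrase (a list of Word)."""
    phrase_length = sum(w.length for w in words)  # assumed: characters in phrase
    return (wt.k1 * mean(word_value(w, wt) for w in words)
            + wt.k2 * phrase_length + wt.k3 * len(words))


def candidate_value(phrases, wt=Weights()):
    """Vi: average of the phrase values in one candidate (a list of phrases)."""
    return mean(phrase_value(p, wt) for p in phrases)


def best_candidate(candidates, wt=Weights()):
    """Pick the candidate whose Vi is the maximum."""
    return max(candidates, key=lambda c: candidate_value(c, wt))
```

For instance, under these assumptions a two-character word with frequency 10, correct
tones, not last used and dictionary priority 1 scores 2 x 10 + 750 x 2 + 10 + 0 + 10 =
1540, which shows how strongly the default word length weight dominates the other terms.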


The parameter values given above are the default values set in cWnn. These default
values may be set in an environment file. Refer to Section 5.3.

The default values may also be changed dynamically by using the environment operation
functions. Refer to Section 5.2 for explanations.