view cWnn/manual/chap4 @ 7:6ab41ec6f895
fix dtoa crash when it encounters malformed entry.
author | Yoshiki Yazawa <yaz@cc.rim.or.jp> |
---|---|
date | Tue, 18 Dec 2007 23:25:17 +0900 |
parents | bbc77ca4def5 |
children |
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Chapter 4  PINYIN-HANZI CONVERSION ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

┏━━━━━━━━━━━━━━┓
┃ 4.1 OVERVIEW ┃
┗━━━━━━━━━━━━━━┛

Chapter 3 described Pinyin input. This chapter explains the Pinyin input
environment in greater detail, along with how the system processes the
input. It also covers the general concepts of Pinyin-Hanzi conversion and
the conversion methods used in cWnn.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 4.2 PINYIN INPUT ENVIRONMENT ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

As described in Chapter 3, Pinyin can be input by three methods: Quanpin,
Erpin and Sanpin. These methods are not implemented inside the system
itself. Instead, they are defined in an external environment of the system.
The function of the external environment that allows such definitions is
known as the Input Automaton. Refer to Chapter 7 for details.

The input automaton provides different input environments for different
users. For example, a user who needs to input Pinyin may use the
Pinyin-centred input environment explained in Chapter 3. In addition to the
input automaton, users may specify their own Pinyin input methods in
certain environment files.

Internal/External Representations of Chinese Pronunciations
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Pinyin is the external representation of Chinese pronunciation. When a user
enters Pinyin at the user interface, the input is first processed by the
input automaton, which converts it into the standard Pinyin defined in the
system. This standard Pinyin is known as the internal representation (音码).
For example:

┌──────────────────────────────────────────────────────────────────────────┐
│ The phrase 汉语语音在计算机中的表现 during user input will be:           │
│   han4yu3yu3yin1zai4ji4suan4ji1zhong1debiao3xian4                        │
│                                                                          │
│ However, the system automatically converts this to the standard          │
│ Pinyin defined in the system, that is:                                   │
│   Hàn Yǔ Yǔ Yīn Zài Jì Suàn Jī Zhōng De Biǎo Xiàn                        │
└──────────────────────────────────────────────────────────────────────────┘

Within the system, each Pinyin is represented by its own internal code.
Before Hanzi conversion takes place, the user's Pinyin input is first
converted into the corresponding internal representations.

As the example shows, the system does not require the user to segment the
Pinyin input string. The user only needs to input the correct Pinyin, and
the system performs the segmentation. For example, the input
"han4yu3yu3yin1" is automatically segmented into "Hàn Yǔ Yǔ Yīn".

To the system, one Pinyin is a single unit. For example, the Pinyin of 汉 is
not treated as the four characters "h", "a", "n", "4", but as the single
unit "Hàn". The input string is therefore segmented and displayed to the
user in a more readable form.

The Pinyin input interface is an editor in its own right. Besides the input
features mentioned above, it provides cursor movement as well as insertion
and deletion operations on the input string.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 4.3 CONCEPT OF PINYIN-HANZI CONVERSION ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

A good Pinyin input system should provide users with a good Pinyin input
environment and a highly accurate Pinyin-Hanzi conversion mechanism.
Pinyin-Hanzi conversion means converting the Pinyin input into the intended
Chinese characters (Hanzi).
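
Conversion of any kind relies on the automatic syllable segmentation
described in Section 4.2. The sketch below shows one way such segmentation
could work: a greedy longest-match over a syllable table, with an optional
tone digit after each syllable. The table SYLLABLES and the function
segment() are hypothetical and cover only the example string from Section
4.2. The manual does not say how the cWnn input automaton actually performs
this step, so this is an illustration of the idea, not the cWnn algorithm.

```python
# Hypothetical, deliberately tiny syllable table: only what the example needs.
SYLLABLES = {"han", "yu", "yin", "zai", "ji", "suan",
             "zhong", "de", "biao", "xian"}
MAX_LEN = max(len(s) for s in SYLLABLES)


def segment(text):
    """Split an unsegmented Pinyin string into (syllable, tone) pairs.

    tone is 1-4 when a tone digit follows the syllable, and 0 when the
    user typed no tone (as with "de" in the example).
    """
    result = []
    i = 0
    while i < len(text):
        # Try the longest possible syllable first (greedy longest match).
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in SYLLABLES:
                i += length
                tone = 0
                if i < len(text) and text[i] in "1234":
                    tone = int(text[i])
                    i += 1
                result.append((candidate, tone))
                break
        else:
            raise ValueError(f"no syllable matches at position {i}")
    return result


print(segment("han4yu3yu3yin1"))
```

Greedy matching is the simplest choice and is enough for this input. A real
syllable table contains ambiguous splits, which a production system would
have to resolve with backtracking or dictionary knowledge.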

There are three categories of conversion mechanism:

    (a) Conversion based on character
    (b) Conversion based on word
    (c) Conversion based on phrase or any arbitrary Pinyin string

(a) Conversion based on character

This mechanism accepts only one Pinyin at a time. The result is a Chinese
character (Hanzi) with the same pronunciation as the Pinyin input. Because
several Hanzi share the same pronunciation, one Pinyin corresponds to many
Hanzi, and the correct one has to be selected manually from the candidates.

For example, the Pinyin "Zhōng" corresponds to the Hanzi 中, 钟, 仲 and so
on. A user who wants the word 中国 therefore has to select the Hanzi 中.

This mechanism is time-consuming and inconvenient.

(b) Conversion based on word

This mechanism accepts more than one Pinyin at a time, and the input
corresponds to a Chinese word. A word may consist of more than one
character, so word-based conversion greatly reduces the number of
candidates.

For example, the word 中国 consists of the characters 中 and 国. A user who
wants this word only needs to input "Zhōng Guó", and the conversion result
is 中国.

The example shows that fewer candidate selections are needed. However, the
user must think in terms of words during input, and only words registered
in the system are valid. Candidate selection is therefore still necessary.

(c) Conversion based on phrase or any arbitrary Pinyin string

With this mechanism, the user can input a Pinyin string of any length. That
is, the user ends the Pinyin input string wherever seems suitable.
The system analyses the input string, performs the necessary grammatical
analysis and word segmentation, and produces a more accurate conversion
result. Far fewer conversions are needed than in (b).

cWnn uses this conversion mechanism and therefore provides a more flexible
user input interface. The diagram below shows the conversion process for
the entire cWnn system.

      ↓ Input           Output ↑
   ━━━┿━━━                  ━━━┿━━━  User Interface
┌─────┼────────────────────────┼────────────────────────────┐
│     │                        │                            │
│     │                        ↓ Internal / External        │
│     │                        │                            │
│  ┏━━┷━━━━━━━━┓           ┏━━━┷━━━━┓      ┏━━━━━━━━━━━━┓   │
│  ┃ Input     ┠─────↓─────┨ Editor ┠──────┨ Conversion ┃   │
│  ┃ Automaton ┃ External/ ┗━━━━━━━━┛      ┃ Mechanism  ┃   │
│  ┗━━┯━━━━━━━━┛ Internal                  ┗━━━━━┯━━━━━━┛   │
│     │                                          │          │
│ ┏━━━┷━━━━━━━━━┓                         ┏━━━━━━┷━━━━━━┓   │
│ ┃ Input       ┃                         ┃ Conversion  ┃   │
│ ┃ Environment ┃                         ┃ Environment ┃   │
│ ┗━━━━━━━━━━━━━┛                         ┗━━━━━━━━━━━━━┛   │
│                                                           │
└───────────────────────────────────────────────────────────┘

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 4.4 PINYIN-HANZI CONVERSION IN CWNN ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

cWnn supports two directions of conversion:

    (1) Forward conversion (正向变换)
    (2) Reverse conversion (逆向变换)

Forward conversion is Pinyin-Hanzi conversion: Pinyin is the input and the
corresponding Hanzi is the result. Reverse conversion is Hanzi-Pinyin
conversion, where Hanzi is the input and the corresponding Pinyin is the
result.

As mentioned in Section 4.3, no conversion mechanism can achieve 100%
accuracy. Hence, besides providing a multi-phrase conversion mechanism,
cWnn also provides facilities for re-editing and re-conversion, and allows
the user to segment words and phrases manually.

We now explain the Pinyin-Hanzi conversion mechanism and the assessment
formula for multi-phrase conversion in cWnn.

1. Conversion Mechanism
━━━━━━━━━━━━━━━━━━━━━━━

Pinyin-Hanzi conversion includes the following five conversions. The first
three listed below are the most commonly used.
The last two are meant for system developers, who use them to check the
grammatical analysis.

(a) Multi-phrase conversion

The concept of multi-phrase conversion was introduced in Section 4.3. In
cWnn, once a Pinyin input string is sent for conversion, the system
converts it according to the current environment (refer to Chapter 5) and
that environment's conversion parameters. After conversion, the result
appears on the input line with the cursor positioned at the first word of
the sentence. If a re-conversion is requested (by pressing the conversion
变换 key again), the conversion method in (c) is performed.

(b) Word conversion

The concept of word conversion was introduced in Section 4.3. In cWnn, the
portion of the input string indicated by the cursor is treated as a word
and converted as such. The candidate word with the highest assessment value
is output as the result.

For example, the Pinyin "Shi Yong" corresponds to Hanzi such as 使用, 适用,
施用, 实用, 食用 and 试用. Of these, 使用 has the highest assessment value
in the system, so 使用 is the initial conversion result.

(c) Word candidates extraction

The portion of the input string indicated by the cursor is treated as a
word and word conversion is performed, but all possible word candidates are
output. Continuing the example in (b), if 使用 is not the word you want,
press the conversion 变换 key again to get all the possible candidates,
such as 适用, 施用, 实用, 食用 and 试用, and select accordingly.

(d) Phrase conversion

The portion of the input string indicated by the cursor is treated as a
phrase and phrase conversion is performed. The candidate phrase with the
highest assessment value is output as the result.

(e) Phrase candidates extraction

The portion of the input string indicated by the cursor is treated as a
phrase and phrase conversion is performed. All possible phrase candidates
are output.

2. Manual Word Segmentation
━━━━━━━━━━━━━━━━━━━━━━━━━━━

Automatic character segmentation was described in Section 4.2. cWnn also
performs word segmentation automatically. However, in Pinyin-Hanzi
conversion, automatic word segmentation may not be 100% accurate. When the
conversion result is incorrect, the user needs to segment the words
manually with the segmentation keys (^O or ^I). The word indicated by the
cursor is segmented. To complete the manual segmentation, press the
conversion key again.

For example,

┌──────────────────────────────────────────────────────────────────────────┐
│ In the phrase 今天天气正好, 今天, 天气 and 正好 are treated as words.    │
│ Suppose 正好 is not the correct word. We then need to segment 正好       │
│ into individual characters and then perform a re-conversion.             │
│ (1) First, move the cursor to 正好, then press ^I to separate the        │
│     word.                                                                │
│ (2) The word will be converted back to individual Pinyin. You may now    │
│     do a re-conversion by pressing the conversion 变换 key again.        │
└──────────────────────────────────────────────────────────────────────────┘

Similarly, ^O may be used to combine characters into one unit. For example,

┌──────────────────────────────────────────────────────────────────────────┐
│ 你好 is treated as two separate characters. To make these two            │
│ characters one unit, do the following:                                   │
│ (1) Place the cursor at 你, then press ^O to combine 你 and 好.          │
│ (2) The characters will be converted back to Pinyin. You may now do a    │
│     re-conversion by pressing the conversion 变换 key again. If no       │
│     candidate for this word exists, the following message is shown:      │
│     "侯补1个也没有 (怎么办)"                                             │
│     This means that the word does not exist in the current               │
│     dictionaries.                                                        │
└──────────────────────────────────────────────────────────────────────────┘

NOTE: The 文节→ key is another way to perform ^O.
      The 文节← key is another way to perform ^I.

3. Assessment Formula for Multi-Phrase Conversion
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Multi-phrase conversion plays a major role in Pinyin-Hanzi conversion.
The accuracy of this conversion directly affects the effectiveness of the
system. Several factors affect the result of Pinyin-Hanzi conversion, and
their importance differs according to the conditions. The following are the
parameters of the assessment formula for multi-phrase conversion. Users can
change these conversion parameters to obtain the most suitable conversion
environment.

(a) Assessment parameters

    parameter (0)  Number of phrases "n"
        During assessment, this is the maximum number of phrases that can
        be assessed at one time.
        The default "n" value in cWnn is "1".

    parameter (1)  Number of words "m"
        During assessment, this is the maximum number of words that a
        phrase can contain.
        The default "m" value in cWnn is "5".

(b) Word assessment parameters

    parameter (2)  Usage frequency weight
        Each word in the dictionary has a usage frequency. When a user uses
        the dictionary, the system creates and manages a usage frequency
        file for that user. As the user works with the system, the usage
        frequency of each word is updated according to how often the user
        uses it. Each user therefore has an individual usage frequency
        file.
        The default value in cWnn is "2".

    parameter (3)  Word length weight
        Word length refers to the number of characters in a word.
        The default value in cWnn is "750".

    parameter (4)  Tone correctness weight
        cWnn accepts input with or without the four tones. This weight
        gives higher assessment values to words entered with the correct
        tones.
        The default value in cWnn is "10".

    parameter (5)  Last used weight
        Last used refers to the most recently used word for a Pinyin.
        Increasing this weight raises the assessment value of recently used
        words.
        The default value in cWnn is "80".

    parameter (6)  Dictionary priority weight
        Each dictionary has a priority defined by the environment.
        By changing this value, the assessment values can be biased towards
        certain dictionaries.
        The default value in cWnn is "10".

(c) Phrase assessment parameters

    parameter (7)  Average word assessment value weight
        A phrase consists of several words, and each word has its own word
        assessment value as described above. The average of these values is
        the average word assessment value.
        The default value in cWnn is "5".

    parameter (8)  Phrase length weight
        Phrase length refers to the number of characters in a phrase.
        The default value in cWnn is "1000".

    parameter (9)  Number of words in phrase weight
        This refers to the number of words in a phrase. A larger number of
        words in a phrase indicates greater grammatical certainty among the
        words, and hence higher reliability.
        The default value in cWnn is "50".

(d) Other parameters

    Characters other than Hanzi that appear on the input line have their
    own individual weights. The parameters are as follows:

    parameter       Meaning                                      Default
    ──────────────  ───────────────────────────────────────────  ───────
    parameter (10)  Usage frequency of numerals                  "0"
    parameter (11)  Usage frequency of alphabets                 "-200"
    parameter (12)  Usage frequency of symbols                   "0"
    parameter (13)  Usage frequency of open parentheses          "0"
    parameter (14)  Usage frequency of close parentheses         "0"
    parameter (16)  Maximum number of candidates allowed         "16"
                    during conversion

(e) Assessment formula for multi-phrase conversion

    Pinyin-Hanzi conversion in cWnn is based on an assessment formula. As
    shown above, each parameter has a value. Increasing a parameter's value
    increases its weight in the conversion process. The formulae shown
    below determine the assessment value for a word, the assessment value
    for a phrase, and the total assessment value for the candidates of a
    phrase.

    Assessment value for word :

        f = (c1 x frequency) + (c2 x word length)
          + (c3 x tone correctness) + (c4 x last used)
          + (c5 x dictionary priority)

    Assessment value for phrase :

        F = (k1 x avg( f1, f2, ..., fm )) + (k2 x phrase length)
          + (k3 x number of words in phrase)

    Total assessment value for candidates of a phrase :

        Vi = avg( Fi1, Fi2, ..., Fin )

    Best assessment value for a phrase :

        MAX( V1, V2, ..., Vk )

    NOTE:  * c1 = parameter (2)        * k1 = parameter (7)
             c2 = parameter (3)          k2 = parameter (8)
             c3 = parameter (4)          k3 = parameter (9)
             c4 = parameter (5)
             c5 = parameter (6)

The parameter values given above are the default values set in cWnn. These
defaults may be set in an environment file (refer to Section 5.3) and may
be changed dynamically with the environment operation functions (refer to
Section 5.2).
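
To make the arithmetic concrete, the sketch below evaluates the formulae
with the default weights listed in this chapter. The Word fields, their
scales (for example, tone correctness as 0 or 1) and the sample feature
values are assumptions made for illustration. The manual does not specify
how cWnn scales each feature, so the resulting numbers are not cWnn's
actual scores. Word length and phrase length are counted in characters, as
parameters (3) and (8) describe, and avg is read as the arithmetic mean.

```python
from dataclasses import dataclass
from statistics import mean

# Default weights from this chapter: parameters (2)-(6) and (7)-(9).
C1, C2, C3, C4, C5 = 2, 750, 10, 80, 10
K1, K2, K3 = 5, 1000, 50


@dataclass
class Word:
    text: str            # the Hanzi of the word
    frequency: int       # usage frequency (assumed scale)
    tone_correct: int    # 1 if tones were entered correctly, else 0 (assumed)
    last_used: int       # 1 if most recently used for its Pinyin, else 0 (assumed)
    dict_priority: int   # priority of the source dictionary (assumed scale)


def word_value(w):
    """f: assessment value for one word."""
    return (C1 * w.frequency + C2 * len(w.text) + C3 * w.tone_correct
            + C4 * w.last_used + C5 * w.dict_priority)


def phrase_value(words):
    """F: assessment value for one phrase (a list of Word)."""
    phrase_length = sum(len(w.text) for w in words)
    return (K1 * mean(word_value(w) for w in words)
            + K2 * phrase_length + K3 * len(words))


def candidate_value(phrases):
    """Vi: value of one candidate (a list of phrases)."""
    return mean(phrase_value(p) for p in phrases)


def best_candidate(candidates):
    """Return the candidate whose Vi is MAX( V1, ..., Vk )."""
    return max(candidates, key=candidate_value)


shiyong_a = Word("使用", frequency=3, tone_correct=1, last_used=1, dict_priority=1)
shiyong_b = Word("适用", frequency=1, tone_correct=1, last_used=0, dict_priority=1)

print(word_value(shiyong_a))      # 2*3 + 750*2 + 10 + 80 + 10 = 1606
print(phrase_value([shiyong_a]))  # 5*1606 + 1000*2 + 50*1 = 10080
best = best_candidate([[[shiyong_b]], [[shiyong_a]]])
print(best[0][0].text)
```

With these sample values the recently used, more frequent word wins, which
matches the behaviour described for word conversion in Section 4.4. Raising
one weight, such as the last used weight, shifts the ranking towards that
factor.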