        底岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩弧
         底岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩弧
         岱  ␃�€���� ㄣ    ␐␉␎␙␉␎-␈␁␎␚␉  ␃␏␎␖␅␒␓␉␏␎  岱
         彿岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩忽
        彿岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩忽


┏━━━━━━━━━━━━━━━━┓
┃ 4.1  OVERVIEW  ┃
┗━━━━━━━━━━━━━━━━┛

	Chapter 3 described Pinyin input.  This chapter explains the Pinyin input
environment in greater detail, and how the input is processed in the system.  General
concepts of Pinyin-Hanzi conversion are also explained, as well as the conversion
methods used in cWnn.


┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 4.2 PINYIN INPUT ENVIRONMENT ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

	As described in Chapter 3, Pinyin can be input via three methods: Quanpin, Erpin
and Sanpin.  These methods are not implemented internally; they are defined in an
external environment of the system.  The facility in the external environment that
allows such definitions is known as the Input Automaton.  Refer to Chapter 7 for details.
	The input automaton provides different input environments for different users.  For
example, a user who needs to input Pinyin may use the Pinyin-centred input environment,
as explained in Chapter 3.
	Besides using the input automaton, users may also specify their own Pinyin input
methods in certain environment files.


Internal/External Representations of Chinese Pronunciations
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Pinyin is the  external  representation of  Chinese pronunciation.  When a user inputs a
Pinyin at the user interface, the input  will first be processed in the input automaton.
Through the input automaton, it will be converted into the standard Pinyin as defined in
the system.  This standard Pinyin is known as the internal representation (音码).
For example:

	┌────────────────────────────────────────────────────────────────────────┐
	│  The phrase 汉语语音在计算机中的表现 during user input will be:        │
	│      han4yu3yu3yin1zai4ji4suan4ji1zhong1debiao3xian4                   │
	│                                                                        │
	│  However, the system will automatically convert these to the standard  │
	│  Pinyin defined in the system, that is:                                │
	│      Hàn Yǔ Yǔ Yīn Zài Jì Suàn Jī Zhōng De Biǎo Xiàn                   │
	└────────────────────────────────────────────────────────────────────────┘

Within the system, each Pinyin is represented  by an individual internal code as defined
in the system.  Before the process of Hanzi conversion, the user Pinyin input will first
be converted into its corresponding internal representations.

You may observe from the above that the system does not require the user to segment the
Pinyin input string.  The user only needs to input the correct Pinyin, and the system
will segment the input.  For example, the input "han4yu3yu3yin1" will be segmented to
"Hàn Yǔ Yǔ Yīn" automatically by the system.  To the system, one Pinyin is an individual
unit.  For example, 汉 is not treated as the four characters "h", "a", "n", "4", but as
the single unit "Hàn".  Hence, the input string is segmented and displayed to the user
in a more readable form.
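
The automatic segmentation described above can be illustrated with a minimal sketch.
It assumes a greedy longest-match strategy over a tiny, hypothetical syllable table
(SYLLABLES); the actual cWnn segmentation is defined through the input automaton and
is not reproduced here.

```python
# Hypothetical, deliberately tiny syllable table (an assumption for this sketch);
# a real system would list every valid Pinyin syllable.
SYLLABLES = {"han", "yu", "yin", "zai", "ji", "suan", "zhong", "de", "biao", "xian"}
MAX_LEN = max(len(s) for s in SYLLABLES)


def segment(text):
    """Split an unsegmented Pinyin string into (syllable, tone) units.

    Uses greedy longest match. The tone is the digit 1-4 that directly
    follows the syllable, or None when no tone digit was typed.
    """
    units = []
    i = 0
    while i < len(text):
        # Try the longest possible syllable first, then shorter ones.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in SYLLABLES:
                i += length
                tone = None
                if i < len(text) and text[i] in "1234":
                    tone = int(text[i])
                    i += 1
                units.append((candidate, tone))
                break
        else:
            raise ValueError(f"no known syllable at position {i}")
    return units


print(segment("han4yu3yu3yin1"))
```

Each (syllable, tone) pair corresponds to one unit in the sense used above.  Greedy
matching is only one possible strategy; it can pick the wrong split where a string has
more than one valid reading.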

The Pinyin input  interface is an  editor by itself.  Besides  having the  input feature
mentioned above, facilities such as  cursor movement, inserting and  deleting operations
on the input string are also provided.


┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 4.3 CONCEPT ON PINYIN-HANZI CONVERSION ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

A good Pinyin input system should provide users with a good Pinyin input environment
and a highly accurate Pinyin-Hanzi conversion mechanism.  Pinyin-Hanzi conversion
refers to converting the Pinyin input into the intended Chinese characters (Hanzi).

There are 3 categories of conversion mechanism:
	(a) Conversion based on character
	(b) Conversion based on word
	(c) Conversion based on phrase or any arbitrary Pinyin string

(a) Conversion based on character
    This conversion mechanism only allows one Pinyin to be input.  The conversion result
    is a Chinese character (Hanzi), which has the same pronunciation as the Pinyin input.
    We must  take note that there are  several  Hanzi that  have the  same pronunciation.
    This would  mean that the Pinyin that  has been input will correspond to  many Hanzi.
    In order to obtain the correct  Hanzi, it has to be selected manually among all the
    candidates. For example,

	The Pinyin "Zhōng" corresponds to Hanzi 中, 钟, 衷, etc.  Hence, if the
	user wants the word 中国, then he has to select the Hanzi 中.

    This  mechanism of  conversion is  time consuming and is not a convenient way of
    conversion.


(b) Conversion based on word
    In this  conversion mechanism,  more than  one Pinyin is allowed.  This Pinyin input
    will  correspond to the  expected  Chinese  word.  A word  may consist of  more than
    one  character.  Hence, by  having  word based  conversion mechanism, the  number of
    candidates is much reduced.  For example,

	The word 中国 consists of the characters 中 and 国.  If the user wants this word,
 	he only needs to input "Zhōng Guó" and the conversion result will be 中国.

    We can see from the above example that the number of candidate selections is reduced.
    However, the user must think in terms of words during input, and only words that are
    registered in the system are valid.  Hence, the need for candidate selection still
    exists.


(c) Conversion based on phrase or any arbitrary Pinyin string
    For this conversion, the user may input a Pinyin string of arbitrary length.  That
    is, the user terminates the Pinyin input string wherever he thinks suitable.  The
    system analyses the input string, performs the necessary grammatical analysis and
    word segmentation, and subsequently produces a more accurate conversion output.
    Far fewer conversion operations are needed than in (b).

    The cWnn system uses this conversion mechanism, and hence provides a more flexible
    user input interface.  The diagram below shows the conversion process for the entire
    cWnn system.


               ↓ Input           Output ↑
             ━━┿━━━                  ━━━┿━━  User Interface
    ┌──────────┼────────────────────────┼──────────────────────────────────┐
    │          │                        │                                  │
    │          │                        ↓ Internal / External              │
    │          │                        │                                  │
    │   ┏━━━━━━┷━━━━━━┓            ┏━━━━┷━━━┓          ┏━━━━━━━━━━━━━┓     │
    │   ┃    Input    ├─────↓──────┤ Editor ├──────────┤ Conversion  ┃     │
    │   ┃  Automaton  ┃ External/  ┗━━━━━━━━┛          ┃ Mechanism   ┃     │
    │   ┗━━━━━━┯━━━━━━┛ Internal                       ┗━━━━━━┯━━━━━━┛     │
    │          │                                              │            │
    │   ┏━━━━━━┷━━━━━━┓                                ┏━━━━━━┷━━━━━━┓     │
    │   ┃    Input    ┃                                ┃ Conversion  ┃     │
    │   ┃ Environment ┃                                ┃ Environment ┃     │
    │   ┗━━━━━━━━━━━━━┛                                ┗━━━━━━━━━━━━━┛     │
    │                                                                      │
    └──────────────────────────────────────────────────────────────────────┘


┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 4.4  PINYIN-HANZI CONVERSION IN CWNN ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

	In the cWnn system, there are two ways of conversion:
		(1) Forward conversion (正向变换)
		(2) Reverse conversion (逆向变换)
 	Forward conversion refers to Pinyin-Hanzi conversion, whereas reverse conversion
refers to Hanzi-Pinyin conversion.  In Pinyin-Hanzi conversion, Pinyin is the input and
the conversion result is the corresponding Hanzi.  The reverse holds for Hanzi-Pinyin
conversion.

	As mentioned in Section 4.3, no conversion mechanism can achieve 100% accuracy.
Hence, besides providing a multi-phrase conversion mechanism, cWnn also provides
facilities for re-editing and re-conversion, and allows the user to segment words and
phrases manually.

We will now explain the Pinyin-Hanzi conversion mechanism, as well as the assessment
formula for multi-phrase conversion in the cWnn system.


1. Conversion Mechanism
━━━━━━━━━━━━━━━━━━━━━━━
Pinyin-Hanzi conversion includes the following five conversions. The first three listed
below are most commonly used.  The last two conversions are meant for system developers
to check on grammatical analysis.

(a) Multi-phrase conversion
	The concept of  multi-phrase  conversion has  been mentioned in Section 4.3.  In
	cWnn, once a Pinyin input string is sent for conversion, the system will perform
	the conversion based on the current  environment (refer to Chapter 5) as well as
	the  conversion  parameters of  the current environment.  After  conversion, the
	result  will appear  on the input line,  with the cursor positioned at the first
	word of the  sentence.  If  a  re-conversion is  required (done by  pressing the
	conversion 变换 key again), the conversion method as in (c) will be performed.

(b) Word conversion
	The concept of word conversion has been mentioned in Section 4.3.  In cWnn, the
	portion of the  input  string indicated  by the cursor is treated as a word and
	conversion  is performed  based on this word.  The candidate word  that has the
	highest assessment value is output as the result.
	For example, Pinyin "Shi Yong" corresponds to Hanzi such as 使用, 适用, 食用,
	实用, 事用 and 试用, etc.  However, 使用 has the highest assessment value in
	the system.  Hence, 使用 will be the initial conversion result.

(c) Word candidates extraction
	Treat  the  portion  of  the input string indicated by the cursor as a word and
	perform word conversion.  Output all the possible word candidates.
	Continuing the example in (b), if 使用 is not the word that you want, press the
	conversion 变换 key again to get all the possible candidates, such as 适用, 食用,
	实用, 事用 and 试用, and select accordingly.

(d) Phrase conversion
	Treat the portion  of the input string  indicated by the cursor as a phrase and
	perform  phrase conversion.  Output the candidate  phrase that  has the highest
	assessment value as result.

(e) Phrase candidates extraction
	Treat the portion of the input string indicated by the cursor  as  a phrase and
	perform  phrase  conversion.  Output all the possible phrase candidates.
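
The difference between word conversion (b) and word candidates extraction (c) can be
sketched as follows.  This is only an illustration: the dictionary WORDS, its layout
and the assessment values in it are invented for the example and are not taken from
cWnn.

```python
# Hypothetical dictionary: toneless Pinyin key -> list of (word, assessment value).
# All assessment values here are made up for illustration.
WORDS = {
    ("shi", "yong"): [
        ("使用", 90),
        ("适用", 60),
        ("实用", 55),
        ("食用", 40),
        ("试用", 35),
    ],
}


def word_candidates(pinyin):
    """Candidates extraction: every matching word, best assessment value first."""
    entries = WORDS.get(tuple(pinyin), [])
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return [word for word, _ in ranked]


def word_conversion(pinyin):
    """Word conversion: only the candidate with the highest assessment value."""
    candidates = word_candidates(pinyin)
    return candidates[0] if candidates else None


print(word_conversion(["shi", "yong"]))
print(word_candidates(["shi", "yong"]))
```

Phrase conversion (d) and phrase candidates extraction (e) follow the same pattern,
with phrases in place of words.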


2. Manual Word Segmentation
━━━━━━━━━━━━━━━━━━━━━━━━━━━
Automatic  character segmentation  has been mentioned in Section 4.2.  cWnn also performs
word segmentation.  However, in Pinyin-Hanzi conversion, automatic  word segmentation may
not be 100% accurate.  Hence, when the  conversion result is incorrect, the user needs to
segment the words manually by using the segmentation keys (^O or ^I).  The word indicated
by the cursor will be segmented.  To complete the manual segmentation process, press the
conversion key again.  For example,
	┌─────────────────────────────────────────────────────────────────────────┐
	│ In the phrase 今天天气正好, 今天, 天气 and 正好 are treated as words.   │
	│ We know that 正好 is not the correct word.  Hence we need to segment    │
	│ the word 正好 into individual characters, then perform a re-conversion. │
	│ (1) First, move the cursor to 正好, then press ^I to separate the       │
	│     word.                                                               │
	│ (2) The word will be converted back to individual Pinyin.  You may now  │
	│     do a re-conversion by pressing the conversion 变换 key again.       │
	└─────────────────────────────────────────────────────────────────────────┘

Similarly, ^O may be used to combine characters into one unit.  For example,
	┌─────────────────────────────────────────────────────────────────────────┐
	│ 你好 is treated as two separate characters.  To make these two          │
	│ characters one unit, do the following:                                  │
	│ (1) Place the cursor at 你, then press ^O to combine 你 and 好.         │
	│ (2) The characters will be converted back to Pinyin.  You may now do a  │
	│     re-conversion by pressing the conversion 变换 key again.  If no     │
	│     candidate for this word exists, the message                         │
	│     "候补１个也没有 (怎么办)" is displayed.  This means that the word   │
	│     does not exist in the current dictionaries.                         │
	└─────────────────────────────────────────────────────────────────────────┘

NOTE:  Another way to perform ^O is by using the 文节→ key.
       Another way to perform ^I is by using the 文节← key.
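
The effect of the two segmentation keys can be pictured with a small sketch.  It
models the input line as a list of units, each unit being a list of Pinyin syllables.
The function names split_unit and merge_units are hypothetical, and the exact
behaviour of ^I and ^O in cWnn may differ in detail (for example, in how far a unit
is split per key press).

```python
def split_unit(units, cursor):
    """^I-like step: break the unit at `cursor` into one unit per syllable."""
    unit = units[cursor]
    return units[:cursor] + [[syllable] for syllable in unit] + units[cursor + 1:]


def merge_units(units, cursor):
    """^O-like step: join the unit at `cursor` with the unit that follows it."""
    if cursor + 1 >= len(units):
        return list(units)  # nothing to the right to combine with
    merged = units[cursor] + units[cursor + 1]
    return units[:cursor] + [merged] + units[cursor + 2:]


# "jin tian | tian qi | zheng hao": split the last word into single syllables.
line = [["jin", "tian"], ["tian", "qi"], ["zheng", "hao"]]
print(split_unit(line, 2))

# "ni | hao": combine the two single characters into one unit.
print(merge_units([["ni"], ["hao"]], 0))
```

After either step, the affected units are back in Pinyin form and would be passed to
conversion again, as in step (2) of the examples above.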


3. Assessment Formula for Multi-Phrase Conversion
岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩岩
Multi-phrase conversion plays a major role in Pinyin-Hanzi conversion.  Its accuracy
has a direct effect on the effectiveness of the system.  Several factors affect the
result of Pinyin-Hanzi conversion, and their importance differs according to the
conditions.  The following are the assessment parameters and formulae for multi-phrase
conversion.  Users are able to change the corresponding conversion parameters in order
to obtain the most suitable conversion environment.

(a) Assessment parameters
	parameter (0) 	Number of phrases "n"
	    		During the assessment process, this is the maximum number of
			phrases that can be assessed at one time.
			The default "n" value in cWnn is "1".

	parameter (1) 	Number of words "m"
	    		During the assessment process, this is the maximum number of
			words that can be in a phrase.
			The default "m" value in cWnn is "5".

(b) Word assessment parameters
	parameter (2) 	Usage frequency weight
	    		A usage frequency is given to each word in the dictionary.
			When  a user  uses the dictionary, the system will create as
			well as manage  a usage frequency file for the user.  As the
			user uses the  system,  the usage frequency of  each word in
			the  dictionary  will be updated according to how  often the
			user  uses  each  word.   Hence,  each  user  will  have his
			individual usage frequency file.
			The default value in cWnn is "2".

	parameter (3) 	Word length weight
			Word length refers to the number of characters in a word.
			The default value in cWnn is "750".

	parameter (4) 	Tone correctness weight
			This gives  higher assessment values to words  entered with
			correct  four tones, although  cWnn  allows  input  with or
			without four tones.  The default value in cWnn is "10".

	parameter (5) 	Last used weight
	    		Last used refers to the most recently used word for a Pinyin.
			By  increasing  the weight of this parameter, the assessment
			value of recently used word can be increased.
			The default value in cWnn is "80".


	parameter (6) 	Dictionary priority weight
	    		Each  dictionary has  a priority defined  by the environment.
			By  changing  this  value,  assessment  values may be biased
	    		towards certain dictionaries.
			The default value in cWnn is "10".

(c) Phrase assessment parameters
	parameter (7) 	Average word assessment value weight
	    		A phrase  consists  of several  words, and each word has its
			own  word  assessment value as described above.  The average
			of these values is the average word assessment value.
			The default value in cWnn is "5".

	parameter (8) 	Phrase length weight
	    		Phrase length refers to the number of characters in a phrase.
			The default value in cWnn is "1000".

	parameter (9) 	Number of words in phrase weight
	    		This refers to the number of words in a phrase.  A larger
			number of words in a phrase shows greater grammatical
			certainty among the words, and hence higher reliability.
			The default value in cWnn is "50".

(d) Other parameters
	Characters other than Hanzi that appear on the input line have their own
	individual weights.  The following are the parameters:

	parameter (10) 	Usage frequency of numerals
			The default value in cWnn is "0".

	parameter (11) 	Usage frequency of alphabets
			The default value in cWnn is "-200".

	parameter (12) 	Usage frequency of symbols
			The default value in cWnn is "0".

	parameter (13) 	Usage frequency of open parentheses
			The default value in cWnn is "0".

	parameter (14) 	Usage frequency of close parentheses
			The default value in cWnn is "0".

	parameter (16) 	Maximum number of candidates allowed during conversion
			The default value in cWnn is "16".


(e) Assessment formula for multi-phrase conversion
	Pinyin-Hanzi conversion in cWnn is based on an assessment formula.  We can see
	from the above that each parameter has its own value.  Increasing a parameter's
	value increases its weight in the conversion process.

	The formulae  shown below determine the assessment values for a word, a phrase
	and the total assessment value for candidates of a phrase.

		Assessment value for word :
			f = (c1 x frequency) + (c2 x word length)
			    + (c3 x tone correctness) + (c4 x last used)
			    + (c5 x dictionary priority)

		Assessment value for phrase :
			F = k1 x avg( f1, f2, ..fm ) + (k2 x phrase length)
		    	    + (k3 x number of words in phrase)

		Total assessment value for candidates of a phrase :
			Vi = avg( Fi1, Fi2, ... Fin )

		Best assessment value for a phrase :
			MAX( V1, V2, ... Vk )


	NOTE:
	     	*   c1 =  parameter (2)
		    c2 =  parameter (3)
		    c3 =  parameter (4)
		    c4 =  parameter (5)
	  	    c5 =  parameter (6)

		*   k1 =  parameter (7)
		    k2 =  parameter (8)
	  	    k3 =  parameter (9)


The parameter values mentioned above are the default values set in cWnn.  These
default values may be set in an environment file.  Refer to Section 5.3.

The default values may also be changed dynamically by using the environment operation
functions.  Refer to Section 5.2 for explanations.
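
As a worked illustration of the assessment formulae in (e), the sketch below plugs in
the default weights quoted for parameters (2) to (9).  It reads "avg" as the
arithmetic mean.  How cWnn scales the raw inputs (frequency, tone correctness, last
used, dictionary priority) is not stated in this chapter, so the sample input values
and the function names are assumptions made only for the example.

```python
from statistics import mean

# Default weights quoted in this chapter.
C = {"frequency": 2, "length": 750, "tone": 10, "last_used": 80, "dict_priority": 10}
K = {"avg_word": 5, "phrase_length": 1000, "word_count": 50}


def word_value(frequency, length, tone, last_used, dict_priority):
    """f = c1*frequency + c2*word length + c3*tone + c4*last used + c5*priority."""
    return (C["frequency"] * frequency
            + C["length"] * length
            + C["tone"] * tone
            + C["last_used"] * last_used
            + C["dict_priority"] * dict_priority)


def phrase_value(word_values, phrase_length):
    """F = k1*avg(f1..fm) + k2*phrase length + k3*number of words in phrase."""
    return (K["avg_word"] * mean(word_values)
            + K["phrase_length"] * phrase_length
            + K["word_count"] * len(word_values))


def candidate_value(phrase_values):
    """Vi = avg(Fi1..Fin) over the phrases of one candidate."""
    return mean(phrase_values)


def best_candidate(candidates):
    """Pick the candidate name whose total assessment value Vi is largest."""
    return max(candidates, key=lambda name: candidate_value(candidates[name]))


# Assumed raw inputs: a two-character word, tones typed correctly.
f = word_value(frequency=3, length=2, tone=1, last_used=0, dict_priority=1)
F = phrase_value([f, f], phrase_length=4)
print(f, F)
print(best_candidate({"A": [F], "B": [F - 100]}))
```

With these defaults the length terms (750 and 1000) dominate, so under the assumed
input scales longer words and phrases are favoured unless the other weights are raised.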


author	Yoshiki Yazawa <yaz@honeyplanet.jp>
date	Fri, 05 Mar 2010 20:46:36 +0900