diff cWnn/manual/chap8 @ 0:bbc77ca4def5

initial import
author Yoshiki Yazawa <yaz@cc.rim.or.jp>
date Thu, 13 Dec 2007 04:30:14 +0900
parents
children
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/cWnn/manual/chap8	Thu Dec 13 04:30:14 2007 +0900
@@ -0,0 +1,400 @@
+              ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+               ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ 
+               ┃  Chapter 8   CWNN  FILE  MANAGEMENT  ┃
+               ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
+              ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
+
+
+┏━━━━━━━━┓
+┃ 8.1  OVERVIEW  ┃
+┗━━━━━━━━┛
+
+	In cWnn system, the cserver plays an important role in managing the different 
+resources and files.
+
+	Resource files are read in during cserver startup.  If the files are not read,
+they will be read in by the cserver subsequently when  requested by certain front-end 
+processors during their startup.  
+
+There are three categories of files in cWnn, namely:	
+ 	
+	(1) Dictionary files
+	(2) Usage frequency files 
+	(3) Grammar files 
+
+We will now explain in details each of the three cWnn file types.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+					- 8-1 -
+┏━━━━━━━━━━━━┓
+┃ 8.2  DICTIONARY FILES  ┃
+┗━━━━━━━━━━━━┛
+
+	Dictionary is classified into two categories : (1) Text format
+						       (2) Binary format
+	Text format dictionary is readable, but binary format dictionary is not 
+readable.  The text format dictionary is converted to binary format using the "catod" 
+utility (refer to Section 6.7).  Only the binary format  dictionary is used  by cWnn 
+system.  The binary  format dictionary  may be converted back to text format via the 
+the  "cdtoa" utility (refer to Section 6.8).
+
+	The maximum number of words allowed in a dictionary is 70,000.
+
+
+1. Dictionary in Text Format
+━━━━━━━━━━━━━━
+The format of the text dictionary is shown below.  
+
+     The text format is as follows:
+     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  	┌───────────────────────────┐
+	│ \comment	<Comment> <CR>				│
+	│ \total	<Total_frequency> <CR>			│
+	│ \cixing      <Dict_cixing> <CR>			│
+	│ \Pinyin      <CR>					│
+	│        						│
+	│ pinyin	word	Cixing 		Frequency <CR>	│
+	│ pinyin	word	Cixing		Frequency <CR>	│
+	│ pinyin	word	Cixing		Frequency <CR>	│
+	│  :	         :	  :		    :		│
+	│  :	         :	  :		    :		│
+ 	│ (EOF)						│
+	└───────────────────────────┘
+
+     Description:
+	 - comment	   : These are comments in a dictionary.
+	 - total 	   : This is the total number of  times  a dictionary is  used
+			     for conversion, ie, the  usage frequency of a dictionary.
+	 - cixing	   : This specifies the part of speech used by THIS particular 
+			     dictionary ONLY.  The format  of the part of speech  here 
+			     is the same  as that in the  system standard  cixing file
+			     (cixing.data).  Refer to Section 8.4.  
+			     If the  part of speech is NOT specified here, the default
+			     file will be "/usr/local/lib/wnn/zh_CN/cixing.data".
+	 - Pinyin	   : This determines the type of dicionary. It can be "Zhuyin"
+			     or "Bixing", depending on the dictionary itself.
+
+
+					- 8-2 -
+	 - pinyin	   : For the  Pinyin-Hanzi conversion  dictionary, the Pinyin 
+			     here refers to the pronunciation for each character/word.
+			     For encoded input, the Pinyin refers to the code of each 
+			     character/word.  
+			     The maximum length for Pinyin is 256 characters.
+
+	 - word		   : This refers to the actual Chinese character/word.  Each 
+			     character or word should not exceed 256 characters.  
+			     If a space, carriage return or other special characters 
+			     are needed to be added to the character/word, it can be 
+			     done by appending them in octal after "\0".
+			     If characters other than "0" is appended after the "\", 
+	  		     it will refer to the character itself.
+			     For example, "\\" refers to "\" itself.
+	 - Cixing	   : This refers to the part of speech defined in the grammar 
+			     file, such as noun, pronoun etc.  For details, refer to 
+			     grammar files explained in 8.4.
+	 - Frequency	   : The usage frequency for each word.
+
+
+     Example of a text format dictionary:
+     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  	┌───────────────────────────┐
+	│ \comment  	This is a Pinyin dictionary         	│
+	│ \total	0   	          	   		│
+	│ \cixing           	   		  		│
+	│ \Pinyin             	   		  		│
+	│   	          	   		  		│
+	│ W幆幚		我	人称代 		1200    <CR>	│
+	│ R帵n幚  	人	单位量		10      <CR>	│
+	│ De幚		的	语气		30      <CR>	│
+	│  :	         :	  :		 :		│
+	│  :	         :	  :		 :		│
+	└───────────────────────────┘
+
+
+2. Dictionary in Binary Format
+━━━━━━━━━━━━━━━
+This is the binary format dictionary used by the cWnn system.  While reading in a file, 
+the cserver is able to determine whether the file is a dictionary via the binary format.  
+Once a  dictionary is accessed by the cserver, its contents may be changed.  During the  
+termination of cserver, the updated dictionary will be written back to the file.
+
+Each tuple (词条) in a dictionary has  a serial number.  The serial number is used for 
+matching the tuples in a dictionary with those in the usage frequency file.
+
+
+
+
+					- 8-3 -
+3. System Dictionary and User Dictionary
+━━━━━━━━━━━━━━━━━━━━
+System dictionary  refers to the dictionary provided by the  system itself.  There are 
+two types of  system dictionaries.   One consists of only characters,  while the other
+consists of  words.  For the Pinyin input and Zhuyin input environments, the following 
+dictionary files are used :
+
+	(1) level_1.dic  	- consists of only characters (单字).  These are the  
+				  Chinese characters that are more commonly used.
+	(2) level_2.dic 	- consists of  Chinese  characters that  are not  so 
+				  commonly used.
+	(3) basic.dic  		- This is a word dictionary ie. it consists of single 
+				  character word (单字词) and  multi-character words 
+				  (多字词).
+
+User dictionary  refers to  dictionary that is  created by the user.  This dictionary
+allows the  user to register or delete his own  words.  The  dictionary  structure is 
+similar to that of the system dictionary.  
+
+
+4. Assess of Dictionary Files
+━━━━━━━━━━━━━━━
+Both system  and user dictionaries can be added or removed through the settings of the 
+environment files.  
+
+It may be set via the  "setdic" command in the initialization file "cserverrc" (refer 
+to Section 5.3) or in the initialization file "wnnenvrc" (refer to Section 5.5).
+Similar settings need to be done for the reverse initialization file "wnnenvrc_R"
+(refer to Section 5.6).
+
+Default path for system dictionary : 	/usr/local/lib/wnn/zh_CN/dic/sys/
+".dic" is the default filename extension for dictionary.  For example, level_1.dic
+
+Default path for user dictionary : 	/usr/local/lib/wnn/zh_CN/dic/usr/@USR/
+"ud" is the default filename for user dictionary.
+
+
+5. Logical Dictionary and Dictionary Files
+━━━━━━━━━━━━━━━━━━━━━
+In the cWnn system, several front-end processors are connected to the cserver, and all 
+the resources managed by cserver are utilized by the different front-end processors.  
+Each dictionary file may combine with several different usage frequency files.  Hence, 
+each combination will form different dictionary logically. 
+
+A dictionary may also be used for both forward and reverse conversion, such as Pinyin-
+Hanzi conversion and  Hanzi-Pinyin  conversion.  Hence, they form two separate logical
+dictionaries.  For details, refer to "cwnnstat" in Section 6.4.
+
+NOTE:  ONE default dictionary may form several logical dictionaries. 
+					- 8-4 -
+┏━━━━━━━━━━━━━━┓
+┃ 8.3  USAGE FREQUENCY FILES ┃
+┗━━━━━━━━━━━━━━┛
+
+	Usage  frequency files are attached to a dictionary. In every dictionary, there 
+are information  on the usage frequency of each word.  This information  represents the 
+default usage  frequency for each word in the dictionary.  The default  usage frequency 
+is obtained from statistical results by analysing large amount of Chinese articles.  
+
+	Since the  usage frequency  information of each word is already included in the 
+text format dictionary, there is NO need for an explicit text format of usage frequency 
+file.  Refer to the example of text format dictionary in Section 8.2 above.
+
+
+1. System Usage Frequency File and User Usage Frequency File
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+Note that the default usage frequency  defined by the system may not be suitable for all 
+users.  Hence, besides the default usage frequency, the cserver will create a user usage  
+frequency file for  each user.  The  initial file is  a copy of the default file, and it 
+is created  when the user  starts the  front-end  processor for  the first time.  As the  
+system  is being  used by the  user, the usage  frequency of each  word will  be changed 
+according to  how often a  word is being  used.  Therefore, this  user frequency file is  
+accustomed to the individual user.  During the termination of the cserver, or during the  
+termination of those environments using the frequency file, the user usage frequency file  
+will be updated.  When the same user activates the front-end processor again, instead of 
+creating a new  user usage frequency file, the updated frequency file will be read in by 
+cserver.
+ 
+The  usage frequency of each word in the dictionary plays a part in the Hanzi conversion.  
+Hence, the weight for usage frequency of each word may be changed to adjust its impact on 
+the conversion process so as to obtain a more accurate conversion result.  
+
+In the conversion evaluation,  there is a  "last used" information which also resides in 
+the usage frequency file.
+
+
+2. Assess of Usage Frequency Files
+━━━━━━━━━━━━━━━━━
+Usage  Frequency Files is  specified in the initialization file  "wnnenvrc" (refer to
+Section 5.5) and "wnnenvrc_R" (refer to Section 5.6).
+
+Default path for usage frequency file:  /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
+".h" is the default filename extension for usage frequency file.  For example, basic.h,
+level_1.h, level_2.h.
+
+
+
+
+
+					- 8-5 -
+┏━━━━━━━━━━━━━━━━━━━┓
+┃ 8.4  GRAMMAR FILES AND CIXING FILES  ┃
+┗━━━━━━━━━━━━━━━━━━━┛
+
+	The definition of the grammar(词法) files and part of speech(词性) file are 
+dependent of the system.  Substantial knowledge on Chinese grammar and the Pinyin-Hanzi 
+conversion process of this system are required in order to understand them. We will now 
+only give some necessary and brief explanations on the grammar used in cWnn.
+
+NOTE:  We will now refer part of speech as Cixing (词性).
+
+
+1. Cixing File in Text Format
+━━━━━━━━━━━━━━━
+Cixing file  defines a  set of  grammatical attributes,  which is based upon to define 
+the Chinese grammar. The grammatical attributes of all the words in the dictionary must 
+be in this Cixing file.
+
+The content in the  Cixing file is  intepreted line by line.  Whatever that comes after 
+a semicolon  ";"  in a line is regarded as comments.  A backslash  "\" means it will be 
+continued on the following line.  Refer to cWnn default Cixing file for example.
+
+    The Cixing file is divided into three portions:
+    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+       (a) Tree structure     : During a word add operation via the front-end processor, 
+				the user needs to  choose the  appropriate  grammatical
+				attribute for the word to be added.
+			        This tree  structure will be searched accordingly until 
+				the user has chosen the required grammatical attribute.  
+			        For example :
+			  		普通名词/|普通名:人名—:事物名—
+
+			        This means that 普通名词 can be further classified into
+		 	        普通名, 人名 or 事物名.  Only the leaves are the actual
+			   	Cixing that can be attached to words.
+
+       (b) Cixing definitions : These are  Cixing that may include  Chinese characters, 
+				such as  普通名 and  单字.  
+			        "@"  refers to a  null Cixing, and  "@" may replace any 
+				new  Cixing  to  be  appended,  without  affecting  the 
+				compatibility  with the existing dictionary and grammar 
+				files. 
+
+       (c) Combined Cixing    : This  defines the combined  Cixing that contain  two of 
+				more  grammatical  definition   attributes.    Combined 
+				Cixing can be  assigned to  single word and they reduce 
+				the  number of tuples (词条) having  the  same  Chinese 
+				characters.  
+
+					- 8-6 -
+    Example of a text format Cixing file:
+    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+	  	┌──────────────────────────────┐
+		│ ;;;; 词性的树型结构: 				      │
+ 		│ 名词/|普通名词/:抽象名:时间名:处所名:方位名词/:表人特殊名/ │
+		│ 普通名词/|普通名:人名—:事物名—			      │
+		│ 方位名词/|单纯方位名:合成方位名		  	      │
+		│ 表人特殊名/|百家性—:称谓名—		 	      │
+		│     		:					      │
+		│ 		:					      │
+		│							      │
+		│ ;;;; 词性的定义: 					      │
+		│ 终止            ;;; 0  终止, 作为文节的终止		      │
+		│ 数字            ;;; 1  数字				      │
+		│ @               ;;; 11				      │
+		│ 单字            ;;; 13				      │
+		│ 普通名						      │
+		│ 人名—						      │
+		│     		:					      │
+		│ 		:					      │
+		│							      │
+		│ ;;;;    复合词性定义:				      │
+		│ 姓名词-$普通名:百家姓—				      │
+		│ 表人物量-$表人量:表物量				      │
+		│ 行为动词-$及物动—:不及物动—			      │
+		│     		:					      │
+		│ 		:					      │
+		└──────────────────────────────┘
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+					- 8-7 -
+2. Grammar Files in Text Format
+━━━━━━━━━━━━━━━━
+Based on the defined set of Cixing, a set of grammar rules for Chinese is defined in 
+the grammar file.  This grammar file is a database and is read during the startup of 
+the cserver.  
+
+The text format grammar files are as follow:
+	(1) con.master
+	(2) con.masterR
+	(3) con.attr
+	(4) con.jirattr
+	(5) con.jircon
+	(6) con.shuutan
+	(7) con.shuutanR
+These files may be found under the directory "/cdic" in the cWnn source.
+
+The binary format grammar file may be created using the  "catof" utility (refer to 
+Section 6.9).  This binary format grammar file will be used by the cserver. 
+In order to create the binary grammar file, the Cixing text file is also needed in 
+addition to the seven text format grammar files listed above.
+
+When cserver reads in the  grammar file, it is  able to  determine whether it is a 
+grammar file by analysing the  binary format.  Two  or  more  grammar files can be 
+managed by the cserver.  Different user environments may make use of different 
+grammar files.  A user is also able to change the grammar file dynamically via the 
+operation function (文法变更).  Refer to Section 5.2.
+
+
+3. Assess of Grammar File and Cixing File
+━━━━━━━━━━━━━━━━━━━━━
+Default path of Cixing text file:     /usr/local/lib/wnn/zh_CN/
+The default filename for the text format Cixing file in cWnn is "cixing.data"
+
+Default path of grammar binary file:    /usr/local/lib/wnn/zh_CN/dic/sys/
+The default filename for the binary format grammar file in cWnn is "full.con"  
+and "full.conR".
+
+
+
+
+
+
+
+
+
+
+
+
+
+					- 8-8 -