view cWnn/manual/chap8 @ 10:fc3022f61fc7

tiny clean up
author Yoshiki Yazawa <yaz@cc.rim.or.jp>
date Fri, 21 Dec 2007 17:23:36 +0900
parents bbc77ca4def5

              ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
               ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ 
               ┃  Chapter 8   CWNN  FILE  MANAGEMENT  ┃
               ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
              ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛


┏━━━━━━━━┓
┃ 8.1  OVERVIEW  ┃
┗━━━━━━━━┛

	In the cWnn system, the cserver plays the central role in managing the various
resources and files.

	Resource files are read in during cserver startup.  Any file that is not read
at that point is read in by the cserver later, when a front-end processor requests it
during its own startup.

There are three categories of files in cWnn, namely:	
 	
	(1) Dictionary files
	(2) Usage frequency files 
	(3) Grammar files 

Each of the three cWnn file types is explained in detail below.
























					- 8-1 -
┏━━━━━━━━━━━━┓
┃ 8.2  DICTIONARY FILES  ┃
┗━━━━━━━━━━━━┛

	A dictionary exists in one of two formats : (1) Text format
						    (2) Binary format
	A text format dictionary is human-readable; a binary format dictionary is
not.  A text format dictionary is converted to binary format using the "catod"
utility (refer to Section 6.7).  The cWnn system uses only the binary format
dictionary.  A binary format dictionary may be converted back to text format with
the "cdtoa" utility (refer to Section 6.8).

	The maximum number of words allowed in a dictionary is 70,000.


1. Dictionary in Text Format
━━━━━━━━━━━━━━
A text format dictionary consists of header lines followed by one entry per line.

     The text format is as follows:
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  	┌───────────────────────────┐
	│ \comment	<Comment> <CR>				│
	│ \total	<Total_frequency> <CR>			│
	│ \cixing      <Dict_cixing> <CR>			│
	│ \Pinyin      <CR>					│
	│        						│
	│ pinyin	word	Cixing 		Frequency <CR>	│
	│ pinyin	word	Cixing		Frequency <CR>	│
	│ pinyin	word	Cixing		Frequency <CR>	│
	│  :	         :	  :		    :		│
	│  :	         :	  :		    :		│
 	│ (EOF)						│
	└───────────────────────────┘

     Description:
	 - comment	   : A comment on the dictionary.
	 - total 	   : The total number of times the dictionary has been used
			     for conversion, i.e. the usage frequency of the dictionary.
	 - cixing	   : This specifies the part of speech definitions used by THIS
			     particular dictionary ONLY.  Their format is the same as
			     in the system standard cixing file (cixing.data).  Refer
			     to Section 8.4.
			     If nothing is specified here, the default file
			     "/usr/local/lib/wnn/zh_CN/cixing.data" is used.
	 - Pinyin	   : This line determines the type of dictionary.  It can
			     instead be "Zhuyin" or "Bixing", depending on the
			     dictionary itself.


	 - pinyin	   : For a Pinyin-Hanzi conversion dictionary, this is the
			     pronunciation of the character/word.
			     For encoded input, it is the code of the character/word.
			     The maximum length for Pinyin is 256 characters.

	 - word		   : The actual Chinese character/word.  Each character or
			     word should not exceed 256 characters.
			     A space, carriage return or other special character can
			     be included in the character/word by writing its octal
			     code after "\0".
			     If any character other than "0" follows the "\", the
			     pair stands for that character itself.
			     For example, "\\" stands for "\" itself.
	 - Cixing	   : The part of speech defined in the grammar files, such as
			     noun, pronoun etc.  For details, refer to the grammar
			     files explained in Section 8.4.
	 - Frequency	   : The usage frequency of the word.
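
     The escaping rule for the word field can be sketched in Python as follows.  
     This is only an illustration: decode_word is a hypothetical helper, not part 
     of cWnn, and it ASSUMES that "\0" introduces an octal code of at most three 
     digits (as in C), which the description above does not state.

```python
def decode_word(field: str) -> str:
    """Decode backslash escapes in a 'word' field (illustrative helper only)."""
    out = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch != "\\" or i + 1 >= len(field):
            # Ordinary character, or a lone backslash at the very end.
            out.append(ch)
            i += 1
        elif field[i + 1] == "0":
            # ASSUMPTION: "\0" starts an octal code of at most three digits.
            j = i + 1
            while j < len(field) and j < i + 4 and field[j] in "01234567":
                j += 1
            out.append(chr(int(field[i + 1:j], 8)))
            i = j
        else:
            # "\x" stands for x itself, so "\\" yields a single backslash.
            out.append(field[i + 1])
            i += 2
    return "".join(out)
```

     Under that assumption, a space inside a word would be written as "\040" and 
     a carriage return as "\015".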


     Example of a text format dictionary:
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  	┌───────────────────────────┐
	│ \comment  	This is a Pinyin dictionary         	│
	│ \total	0   	          	   		│
	│ \cixing           	   		  		│
	│ \Pinyin             	   		  		│
	│   	          	   		  		│
	│ Wǒ		我	人称代 		1200    <CR>	│
	│ Rén		人	单位量		10      <CR>	│
	│ De		的	语气		30      <CR>	│
	│  :	         :	  :		 :		│
	│  :	         :	  :		 :		│
	└───────────────────────────┘
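
     A minimal Python sketch of reading this text layout is given below.  It is 
     not the "catod" implementation.  It ASSUMES that fields are separated by 
     whitespace, that every header line starts with "\", and that any header other 
     than \comment, \total and \cixing names the dictionary type.  The names 
     TextDictionary and parse_text_dictionary are hypothetical, and the sample 
     text is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class TextDictionary:
    comment: str = ""
    total: int = 0
    cixing: str = ""   # empty means "use the default cixing.data"
    kind: str = ""     # "Pinyin", "Zhuyin" or "Bixing"
    entries: list = field(default_factory=list)


def parse_text_dictionary(text: str) -> TextDictionary:
    """Parse header lines and (pinyin, word, cixing, frequency) entries."""
    d = TextDictionary()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            parts = line[1:].split(None, 1)
            key = parts[0]
            value = parts[1] if len(parts) > 1 else ""
            if key == "comment":
                d.comment = value
            elif key == "total":
                d.total = int(value or 0)
            elif key == "cixing":
                d.cixing = value
            else:
                # ASSUMPTION: any other header is the dictionary type.
                d.kind = key
        else:
            pinyin, word, cixing, freq = line.split()
            d.entries.append((pinyin, word, cixing, int(freq)))
    return d


SAMPLE = """\\comment This is a Pinyin dictionary
\\total 0
\\cixing
\\Pinyin

Wǒ\t我\t人称代\t1200
Rén\t人\t单位量\t10
De\t的\t语气\t30
"""

sample_dict = parse_text_dictionary(SAMPLE)
```

     Splitting entries on whitespace is safe here only because spaces inside a 
     word must be written as escapes, as described above.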


2. Dictionary in Binary Format
━━━━━━━━━━━━━━━
This is the dictionary format actually used by the cWnn system.  When reading in a file,
the cserver can tell from the binary format whether the file is a dictionary.
Once a dictionary has been accessed by the cserver, its contents may change.  When the
cserver terminates, the updated dictionary is written back to the file.

Each tuple (词条) in a dictionary has  a serial number.  The serial number is used for 
matching the tuples in a dictionary with those in the usage frequency file.




					- 8-3 -
3. System Dictionary and User Dictionary
━━━━━━━━━━━━━━━━━━━━
System dictionary  refers to the dictionary provided by the  system itself.  There are 
two types of  system dictionaries.   One consists of only characters,  while the other
consists of  words.  For the Pinyin input and Zhuyin input environments, the following 
dictionary files are used :

	(1) level_1.dic  	- consists of only characters (单字).  These are the  
				  Chinese characters that are more commonly used.
	(2) level_2.dic 	- consists of  Chinese  characters that  are not  so 
				  commonly used.
	(3) basic.dic  		- This is a word dictionary, i.e. it consists of
				  single-character words (单字词) and multi-character
				  words (多字词).

A user dictionary is a dictionary created by the user.  It allows the user to register
and delete their own words.  Its structure is similar to that of the system dictionary.


4. Access to Dictionary Files
━━━━━━━━━━━━━━━
Both system and user dictionaries can be added or removed through the settings in the
environment files.

A dictionary is set via the "setdic" command in the initialization file "cserverrc"
(refer to Section 5.3) or in the initialization file "wnnenvrc" (refer to Section 5.5).
Similar settings are needed in the reverse initialization file "wnnenvrc_R"
(refer to Section 5.6).

Default path for system dictionaries : 	/usr/local/lib/wnn/zh_CN/dic/sys/
".dic" is the default filename extension for dictionaries, for example level_1.dic.

Default path for user dictionaries : 	/usr/local/lib/wnn/zh_CN/dic/usr/@USR/
"ud" is the default filename for the user dictionary.


5. Logical Dictionary and Dictionary Files
━━━━━━━━━━━━━━━━━━━━━
In the cWnn system, several front-end processors are connected to the cserver, and all
the resources managed by the cserver are shared among the different front-end processors.
Each dictionary file may be combined with several different usage frequency files, and
each such combination forms a distinct logical dictionary.

A dictionary may also be used for both forward and reverse conversion, such as
Pinyin-Hanzi conversion and Hanzi-Pinyin conversion.  The two uses form two separate
logical dictionaries.  For details, refer to "cwnnstat" in Section 6.4.

NOTE:  ONE default dictionary may form several logical dictionaries. 
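
The idea of a logical dictionary can be sketched in Python as follows.  This is an 
illustration only, not the cserver's internal data structure: LogicalDictionary is a 
hypothetical name, and the user names and frequency file paths are made up for the 
example.

```python
from typing import NamedTuple


class LogicalDictionary(NamedTuple):
    dict_file: str   # binary dictionary file
    freq_file: str   # usage frequency file combined with it
    reverse: bool    # False: Pinyin-Hanzi, True: Hanzi-Pinyin


def logical_dictionaries(bindings):
    """Return the distinct logical dictionaries among (dict, freq, reverse) bindings."""
    return sorted(set(LogicalDictionary(*b) for b in bindings))


# Hypothetical settings from several environments sharing one dictionary file.
bindings = [
    ("basic.dic", "usr/alice/basic.h", False),
    ("basic.dic", "usr/bob/basic.h", False),
    ("basic.dic", "usr/alice/basic.h", True),
    ("basic.dic", "usr/alice/basic.h", False),  # same combination used again
]
logical = logical_dictionaries(bindings)
```

Here the single file basic.dic yields three logical dictionaries: two forward ones 
with different frequency files and one reverse one.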
					- 8-4 -
┏━━━━━━━━━━━━━━┓
┃ 8.3  USAGE FREQUENCY FILES ┃
┗━━━━━━━━━━━━━━┛

	Usage frequency files are attached to a dictionary.  Every dictionary contains
information on the usage frequency of each word.  This information represents the
default usage frequency of each word in the dictionary.  The default usage frequency
is derived statistically from an analysis of a large number of Chinese articles.

	Since the usage frequency of each word is already included in the text format
dictionary, there is NO need for a separate text format for the usage frequency
file.  Refer to the example of a text format dictionary in Section 8.2 above.


1. System Usage Frequency File and User Usage Frequency File
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The default usage frequency defined by the system may not suit every user.  Hence,
besides the default usage frequency, the cserver creates a user usage frequency file
for each user.  The initial file is a copy of the default file, created when the user
starts the front-end processor for the first time.  As the user works with the system,
the usage frequency of each word changes according to how often the word is used, so
the user frequency file becomes tailored to the individual user.  The user usage
frequency file is updated when the cserver terminates, or when the environments using
the frequency file terminate.  When the same user starts the front-end processor again,
the cserver reads in the updated frequency file instead of creating a new one.
 
The usage frequency of each word in the dictionary plays a part in the Hanzi conversion.
Hence, the weight for the usage frequency of each word may be changed to adjust its
impact on the conversion process so as to obtain a more accurate conversion result.

The usage frequency file also holds the "last used" information that is taken into
account in the conversion evaluation.
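
The relation between the default and the per-user usage frequency can be sketched in 
Python as follows.  This is an in-memory illustration, not the format of the ".h" 
files: UserFrequency is a hypothetical class, entries are keyed by the dictionary 
serial number described in Section 8.2, and the update rule (add 1 per use) is an 
ASSUMPTION made for the example.

```python
class UserFrequency:
    """Per-user usage frequency, keyed by dictionary serial number."""

    def __init__(self, default_freq):
        # The initial user data is a copy of the system defaults.
        self.freq = dict(default_freq)
        self.last_used = None

    def record_use(self, serial):
        # ASSUMPTION: each use raises the frequency by one.
        self.freq[serial] += 1
        # Remember the most recently used entry ("last used" information).
        self.last_used = serial


# Default frequencies for three entries with serial numbers 0, 1 and 2.
defaults = {0: 1200, 1: 10, 2: 30}

alice = UserFrequency(defaults)
bob = UserFrequency(defaults)
alice.record_use(1)
alice.record_use(1)
```

After these calls only alice's copy has changed; bob's copy and the defaults keep 
their original values.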


2. Access to Usage Frequency Files
━━━━━━━━━━━━━━━━━
Usage frequency files are specified in the initialization files "wnnenvrc" (refer to
Section 5.5) and "wnnenvrc_R" (refer to Section 5.6).

Default path for usage frequency file:  /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
".h" is the default filename extension for usage frequency file.  For example, basic.h,
level_1.h, level_2.h.





					- 8-5 -
┏━━━━━━━━━━━━━━━━━━━┓
┃ 8.4  GRAMMAR FILES AND CIXING FILES  ┃
┗━━━━━━━━━━━━━━━━━━━┛

	The definitions in the grammar (词法) files and the part of speech (词性) file
are system dependent.  Understanding them requires substantial knowledge of Chinese
grammar and of the Pinyin-Hanzi conversion process of this system.  Only a brief
explanation of the essentials of the grammar used in cWnn is given here.

NOTE:  From here on, part of speech is referred to as Cixing (词性).


1. Cixing File in Text Format
━━━━━━━━━━━━━━━
The Cixing file defines a set of grammatical attributes, on which the definition of
the Chinese grammar is based.  The grammatical attributes of all the words in the
dictionary must appear in this Cixing file.

The content of the Cixing file is interpreted line by line.  Anything that follows a
semicolon ";" on a line is regarded as a comment.  A backslash "\" means that the line
is continued on the following line.  Refer to the cWnn default Cixing file for an example.
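
The line rules above can be sketched in Python as follows.  This is not the cWnn 
reader: logical_lines is a hypothetical helper, and it ASSUMES that the comment is 
removed before the continuation backslash is looked for, and that continued pieces 
are joined without any separator.

```python
def logical_lines(text: str):
    """Yield the non-empty logical lines of a Cixing-style text file."""
    pending = ""
    for raw in text.splitlines():
        # Everything after the first ";" is a comment.
        line = raw.split(";", 1)[0].strip()
        if line.endswith("\\"):
            # A trailing backslash continues the entry on the next line.
            pending += line[:-1].strip()
            continue
        line = pending + line
        pending = ""
        if line:
            yield line
    if pending:
        # The file ended right after a continuation backslash.
        yield pending


sample = "名词/|普通名词/:\\\n  抽象名 ; comment\n;;;; only a comment\n终止  ;;; 0\n"
lines = list(logical_lines(sample))
```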

    The Cixing file is divided into three portions:
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       (a) Tree structure     : During a word add operation via the front-end processor, 
				the user needs to  choose the  appropriate  grammatical
				attribute for the word to be added.
			        This tree  structure will be searched accordingly until 
				the user has chosen the required grammatical attribute.  
			        For example :
			  		普通名词/|普通名:人名—:事物名—

			        This means that 普通名词 can be further classified into
		 	        普通名, 人名 or 事物名.  Only the leaves are the actual
			   	Cixing that can be attached to words.

       (b) Cixing definitions : These are  Cixing that may include  Chinese characters, 
				such as  普通名 and  单字.  
			        "@"  refers to a  null Cixing, and  "@" may replace any 
				new  Cixing  to  be  appended,  without  affecting  the 
				compatibility  with the existing dictionary and grammar 
				files. 

       (c) Combined Cixing    : This defines the combined Cixing, each of which contains
				two or more grammatical attributes.  A combined Cixing
				can be assigned to a single word, which reduces the
				number of tuples (词条) having the same Chinese
				characters.

					- 8-6 -
    Example of a text format Cixing file:
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  	┌──────────────────────────────┐
		│ ;;;; 词性的树型结构: 				      │
 		│ 名词/|普通名词/:抽象名:时间名:处所名:方位名词/:表人特殊名/ │
		│ 普通名词/|普通名:人名—:事物名—			      │
		│ 方位名词/|单纯方位名:合成方位名		  	      │
		│ 表人特殊名/|百家姓—:称谓名—		 	      │
		│     		:					      │
		│ 		:					      │
		│							      │
		│ ;;;; 词性的定义: 					      │
		│ 终止            ;;; 0  终止, 作为文节的终止		      │
		│ 数字            ;;; 1  数字				      │
		│ @               ;;; 11				      │
		│ 单字            ;;; 13				      │
		│ 普通名						      │
		│ 人名—						      │
		│     		:					      │
		│ 		:					      │
		│							      │
		│ ;;;;    复合词性定义:				      │
		│ 姓名词-$普通名:百家姓—				      │
		│ 表人物量-$表人量:表物量				      │
		│ 行为动词-$及物动—:不及物动—			      │
		│     		:					      │
		│ 		:					      │
		└──────────────────────────────┘
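
    The tree portion and the combined portion can be read with a short Python sketch 
    like the one below.  The syntax is inferred from the example only: it ASSUMES 
    that "|" separates a node from its children, that ":" separates siblings, that a 
    name ending in "/" is an inner node, and that "$" separates a combined Cixing 
    from its components.  The helper names are hypothetical and the sample is a 
    shortened adaptation of the example above.

```python
def parse_tree(lines):
    """Map each inner node to its list of children."""
    tree = {}
    for line in lines:
        if "|" in line:
            parent, children = line.split("|", 1)
            tree[parent] = children.split(":")
    return tree


def parse_combined(lines):
    """Map each combined Cixing (as written before "$") to its components."""
    combined = {}
    for line in lines:
        if "$" in line:
            name, parts = line.split("$", 1)
            combined[name] = parts.split(":")
    return combined


def leaves(tree, root):
    """Leaf Cixing below root; ASSUMES names ending in "/" are inner nodes."""
    result = []
    for child in tree.get(root, []):
        if child.endswith("/"):
            result.extend(leaves(tree, child))
        else:
            result.append(child)
    return result


sample_lines = [
    "名词/|抽象名:时间名:处所名:方位名词/",
    "方位名词/|单纯方位名:合成方位名",
    "表人物量-$表人量:表物量",
]
tree = parse_tree(sample_lines)
combined = parse_combined(sample_lines)
noun_leaves = leaves(tree, "名词/")
```

    Only the names returned by leaves can be attached to words, which matches the 
    rule stated for the tree structure in (a).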





















					- 8-7 -
2. Grammar Files in Text Format
━━━━━━━━━━━━━━━━
Based on the defined set of Cixing, a set of grammar rules for Chinese is defined in 
the grammar file.  This grammar file is a database and is read during the startup of 
the cserver.  

The text format grammar files are as follows:
	(1) con.master
	(2) con.masterR
	(3) con.attr
	(4) con.jirattr
	(5) con.jircon
	(6) con.shuutan
	(7) con.shuutanR
These files may be found under the directory "/cdic" in the cWnn source.

The binary format grammar file may be created using the  "catof" utility (refer to 
Section 6.9).  This binary format grammar file will be used by the cserver. 
In order to create the binary grammar file, the Cixing text file is also needed in 
addition to the seven text format grammar files listed above.

When cserver reads in the  grammar file, it is  able to  determine whether it is a 
grammar file by analysing the  binary format.  Two  or  more  grammar files can be 
managed by the cserver.  Different user environments may make use of different 
grammar files.  A user is also able to change the grammar file dynamically via the 
operation function (文法变更).  Refer to Section 5.2.


3. Access to Grammar File and Cixing File
━━━━━━━━━━━━━━━━━━━━━
Default path of Cixing text file:     /usr/local/lib/wnn/zh_CN/
The default filename for the text format Cixing file in cWnn is "cixing.data".

Default path of grammar binary file:    /usr/local/lib/wnn/zh_CN/dic/sys/
The default filenames for the binary format grammar files in cWnn are "full.con"
and "full.conR".













					- 8-8 -