view cWnn/manual/chap8 @ 24:becc60787557
- added autogen.sh script
- removed generated files
author | Yoshiki Yazawa <yaz@honeyplanet.jp> |
---|---|
date | Fri, 05 Mar 2010 20:46:36 +0900 |
parents | bbc77ca4def5 |
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Chapter 8 CWNN FILE MANAGEMENT ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

┏━━━━━━━━━━━━━━┓
┃ 8.1 OVERVIEW ┃
┗━━━━━━━━━━━━━━┛

In the cWnn system, the cserver plays a central role in managing the
different resources and files. Resource files are read in during cserver
startup. Files that were not read at that point are read in by the
cserver later, when a front-end processor requests them during its own
startup.

There are three categories of files in cWnn:

    (1) Dictionary files
    (2) Usage frequency files
    (3) Grammar files

Each of the three cWnn file types is explained in detail below.

┏━━━━━━━━━━━━━━━━━━━━━━┓
┃ 8.2 DICTIONARY FILES ┃
┗━━━━━━━━━━━━━━━━━━━━━━┛

Dictionaries come in two formats:

    (1) Text format
    (2) Binary format

A text format dictionary is human-readable; a binary format dictionary
is not. The text format dictionary is converted to binary format using
the "catod" utility (refer to Section 6.7). Only the binary format
dictionary is used by the cWnn system. The binary format dictionary may
be converted back to text format via the "cdtoa" utility (refer to
Section 6.8).

The maximum number of words allowed in a dictionary is 70,000.

1. Dictionary in Text Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The text format is as follows:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
┌────────────────────────────────────────┐
│ \comment <Comment> <CR>                │
│ \total <Total_frequency> <CR>          │
│ \cixing <Dict_cixing> <CR>             │
│ \Pinyin <CR>                           │
│                                        │
│ pinyin  word  Cixing  Frequency <CR>   │
│ pinyin  word  Cixing  Frequency <CR>   │
│ pinyin  word  Cixing  Frequency <CR>   │
│ :       :     :       :                │
│ :       :     :       :                │
│ (EOF)                                  │
└────────────────────────────────────────┘

Description:

- comment : A comment on the dictionary.
- total : The total number of times the dictionary has been used for
  conversion, i.e., the usage frequency of the dictionary.
- cixing : This specifies the part of speech used by THIS particular
  dictionary ONLY. The format of the part of speech here is the same as
  that in the system standard cixing file (cixing.data). Refer to
  Section 8.4. If the part of speech is NOT specified here, the default
  file will be "/usr/local/lib/wnn/zh_CN/cixing.data".
- Pinyin : This determines the type of dictionary. "Zhuyin" or "Bixing"
  may appear here instead, depending on the dictionary itself.
- pinyin : For the Pinyin-Hanzi conversion dictionary, the Pinyin here
  refers to the pronunciation of each character/word. For encoded input,
  the Pinyin refers to the code of each character/word. The maximum
  length for Pinyin is 256 characters.
- word : This refers to the actual Chinese character/word. Each
  character or word should not exceed 256 characters. If a space, a
  carriage return or another special character needs to be included in
  the character/word, it can be written in octal after "\0". If a
  character other than "0" follows the "\", it stands for that character
  itself. For example, "\\" refers to "\" itself.
- Cixing : This refers to the part of speech defined in the grammar
  file, such as noun, pronoun etc. For details, refer to the grammar
  files explained in Section 8.4.
- Frequency : The usage frequency of each word.

Example of a text format dictionary:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
┌────────────────────────────────────────┐
│ \comment This is a Pinyin dictionary   │
│ \total 0                               │
│ \cixing                                │
│ \Pinyin                                │
│                                        │
│ Wǒ    我  人称代  1200 <CR>            │
│ Rén   人  单位量  10 <CR>              │
│ De    的  语气    30 <CR>              │
│ :     :   :       :                    │
│ :     :   :       :                    │
└────────────────────────────────────────┘

2. Dictionary in Binary Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

This is the binary format dictionary used by the cWnn system. While
reading in a file, the cserver determines from the binary format whether
the file is a dictionary.

Once a dictionary is accessed by the cserver, its contents may be
changed.
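As an illustration of the text format described in item 1 above, the
following Python sketch reads the header lines and the entry lines into
a simple structure. It is not part of cWnn: the names
parse_text_dictionary, TextDictionary and Entry are hypothetical, fields
are assumed to be separated by whitespace, any line starting with "\" is
assumed to be a header line, and "\0" escapes in the word field are left
undecoded.

```python
from dataclasses import dataclass, field
from typing import List

# Header keywords that name the dictionary type (see the description above).
DICT_TYPES = ("Pinyin", "Zhuyin", "Bixing")


@dataclass
class Entry:
    pinyin: str      # pronunciation, or input code for encoded input
    word: str        # Chinese character/word, escapes kept as written
    cixing: str      # part of speech
    frequency: int   # usage frequency


@dataclass
class TextDictionary:
    comment: str = ""
    total: int = 0
    cixing: str = ""      # empty: the default cixing.data applies
    dict_type: str = ""   # "Pinyin", "Zhuyin" or "Bixing"
    entries: List[Entry] = field(default_factory=list)


def parse_text_dictionary(text: str) -> TextDictionary:
    """Parse a text format dictionary (hypothetical helper, not a cWnn API)."""
    result = TextDictionary()
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # Assumption: header lines are exactly those starting with "\".
            parts = line[1:].split(None, 1)
            if not parts:
                raise ValueError(f"line {number}: empty header keyword")
            keyword = parts[0]
            value = parts[1] if len(parts) > 1 else ""
            if keyword == "comment":
                result.comment = value
            elif keyword == "total":
                result.total = int(value) if value else 0
            elif keyword == "cixing":
                result.cixing = value
            elif keyword in DICT_TYPES:
                result.dict_type = keyword
            else:
                raise ValueError(f"line {number}: unknown header \\{keyword}")
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {number}: expected 4 fields, got {len(fields)}")
        pinyin, word, cixing, frequency = fields
        result.entries.append(Entry(pinyin, word, cixing, int(frequency)))
    return result
```

Feeding the example dictionary shown in item 1 to parse_text_dictionary
should yield one Entry per word line, with the header values collected
on the returned TextDictionary.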
During the termination of cserver, the updated dictionary will be
written back to the file.

Each tuple (词条) in a dictionary has a serial number. The serial number
is used for matching the tuples in a dictionary with those in the usage
frequency file.

3. System Dictionary and User Dictionary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

System dictionary refers to a dictionary provided by the system itself.
There are two types of system dictionaries: one consists of only
characters, while the other consists of words.

For the Pinyin input and Zhuyin input environments, the following
dictionary files are used:

    (1) level_1.dic - consists of only characters (单字). These are the
                      more commonly used Chinese characters.
    (2) level_2.dic - consists of Chinese characters that are less
                      commonly used.
    (3) basic.dic   - a word dictionary, i.e., it consists of
                      single-character words (单字词) and
                      multi-character words (多字词).

User dictionary refers to a dictionary created by the user. This
dictionary allows the user to register or delete his own words. Its
structure is similar to that of the system dictionary.

4. Access to Dictionary Files
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Both system and user dictionaries can be added or removed through the
settings in the environment files. This is done via the "setdic" command
in the initialization file "cserverrc" (refer to Section 5.3) or in the
initialization file "wnnenvrc" (refer to Section 5.5). Similar settings
are needed in the reverse initialization file "wnnenvrc_R" (refer to
Section 5.6).

Default path for system dictionary :
    /usr/local/lib/wnn/zh_CN/dic/sys/
".dic" is the default filename extension for a dictionary, for example
level_1.dic.

Default path for user dictionary :
    /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
"ud" is the default filename for the user dictionary.

5. Logical Dictionary and Dictionary Files
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In the cWnn system, several front-end processors are connected to the
cserver, and all the resources managed by the cserver are shared by the
different front-end processors.

Each dictionary file may be combined with several different usage
frequency files. Each such combination forms a different logical
dictionary.

A dictionary may also be used for both forward and reverse conversion,
such as Pinyin-Hanzi conversion and Hanzi-Pinyin conversion. In that
case it forms two separate logical dictionaries. For details, refer to
"cwnnstat" in Section 6.4.

NOTE: ONE default dictionary may form several logical dictionaries.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 8.3 USAGE FREQUENCY FILES ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Usage frequency files are attached to a dictionary. Every dictionary
contains information on the usage frequency of each word. This
information represents the default usage frequency for each word in the
dictionary. The default usage frequency is obtained statistically by
analysing a large number of Chinese articles.

Since the usage frequency information of each word is already included
in the text format dictionary, there is NO need for a separate text
format for usage frequency files. Refer to the example of a text format
dictionary in Section 8.2 above.

1. System Usage Frequency File and User Usage Frequency File
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Note that the default usage frequency defined by the system may not be
suitable for all users. Hence, besides the default usage frequency, the
cserver creates a user usage frequency file for each user. The initial
file is a copy of the default file, and it is created when the user
starts the front-end processor for the first time. As the user works
with the system, the usage frequency of each word changes according to
how often the word is used. This user frequency file is therefore
customized for the individual user.
During the termination of the cserver, or during the termination of
those environments using the frequency file, the user usage frequency
file will be updated. When the same user activates the front-end
processor again, the cserver reads in the updated frequency file instead
of creating a new one.

The usage frequency of each word in the dictionary plays a part in the
Hanzi conversion. Hence, the weight given to the usage frequency may be
changed to adjust its impact on the conversion process so as to obtain a
more accurate conversion result. The conversion evaluation also uses
"last used" information, which resides in the usage frequency file as
well.

2. Access to Usage Frequency Files
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Usage frequency files are specified in the initialization files
"wnnenvrc" (refer to Section 5.5) and "wnnenvrc_R" (refer to
Section 5.6).

Default path for usage frequency file:
    /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
".h" is the default filename extension for a usage frequency file, for
example basic.h, level_1.h, level_2.h.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 8.4 GRAMMAR FILES AND CIXING FILES ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

The definitions of the grammar (词法) files and the part of speech
(词性) file are dependent on the system. Substantial knowledge of
Chinese grammar and of the Pinyin-Hanzi conversion process of this
system is required in order to understand them. Only a brief explanation
of the grammar used in cWnn is given here.

NOTE: From here on, part of speech is referred to as Cixing (词性).

1. Cixing File in Text Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The Cixing file defines a set of grammatical attributes on which the
definition of the Chinese grammar is based. The grammatical attributes
of all the words in the dictionary must appear in this Cixing file.

The content of the Cixing file is interpreted line by line. Whatever
comes after a semicolon ";" in a line is regarded as a comment. A
backslash "\" means that the line is continued on the following line.
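The line-level rules just described (comments after ";", continuation
with "\") can be sketched in Python as follows. This is only an
illustration, not the cWnn reader: the function name logical_lines is
hypothetical, the backslash is assumed to be the last character of the
line once the comment is removed, and a ";" is assumed never to occur
inside a Cixing name.

```python
from typing import Iterator


def logical_lines(text: str) -> Iterator[str]:
    """Yield Cixing file lines with comments removed and continuations joined.

    Hypothetical helper; the comment and continuation handling below is an
    assumption based on the description in this section.
    """
    pending = ""  # text collected from lines that ended with a backslash
    for raw in text.splitlines():
        # Everything after the first ";" is treated as a comment.
        content = raw.split(";", 1)[0].rstrip()
        if content.endswith("\\"):
            # Continuation: drop the backslash and keep collecting.
            pending += content[:-1]
            continue
        content = (pending + content).strip()
        pending = ""
        if content:
            yield content
    # A trailing continuation without a following line is still emitted.
    if pending.strip():
        yield pending.strip()
```

Lines that contain only a comment, such as the ";;;;" section titles in
the default Cixing file, produce no output with this sketch.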
Refer to the cWnn default Cixing file for an example.

The Cixing file is divided into three portions:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(a) Tree structure :
    During a word add operation via the front-end processor, the user
    needs to choose the appropriate grammatical attribute for the word
    to be added. This tree structure is searched step by step until the
    user has chosen the required grammatical attribute. For example :

        普通名词/|普通名:人名—:事物名—

    This means that 普通名词 can be further classified into 普通名, 人名
    or 事物名. Only the leaves are actual Cixing that can be attached
    to words.

(b) Cixing definitions :
    These are Cixing that may include Chinese characters, such as 普通名
    and 单字. "@" denotes a null Cixing. It acts as a placeholder: a new
    Cixing added later can take the place of an "@" without affecting
    compatibility with the existing dictionary and grammar files.

(c) Combined Cixing :
    This defines combined Cixing that contain two or more grammatical
    attributes. A combined Cixing can be assigned to a single word,
    which reduces the number of tuples (词条) having the same Chinese
    characters.

Example of a text format Cixing file:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
┌──────────────────────────────────────────────────────────────┐
│ ;;;; 词性的树型结构:                                         │
│ 名词/|普通名词/:抽象名:时间名:处所名:方位名词/:表人特殊名/   │
│ 普通名词/|普通名:人名—:事物名—                             │
│ 方位名词/|单纯方位名:合成方位名                              │
│ 表人特殊名/|百家姓—:称谓名—                                │
│ :                                                            │
│ :                                                            │
│                                                              │
│ ;;;; 词性的定义:                                             │
│ 终止      ;;; 0 终止, 作为文节的终止                         │
│ 数字      ;;; 1 数字                                         │
│ @         ;;; 11                                             │
│ 单字      ;;; 13                                             │
│ 普通名                                                       │
│ 人名—                                                       │
│ :                                                            │
│ :                                                            │
│                                                              │
│ ;;;; 复合词性定义:                                           │
│ 姓名词-$普通名:百家姓—                                      │
│ 表人物量-$表人量:表物量                                      │
│ 行为动词-$及物动—:不及物动—                                │
│ :                                                            │
│ :                                                            │
└──────────────────────────────────────────────────────────────┘

2. Grammar Files in Text Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Based on the defined set of Cixing, a set of grammar rules for Chinese
is defined in the grammar file. This grammar file is a database and is
read during the startup of the cserver.
The text format grammar files are as follows:

    (1) con.master
    (2) con.masterR
    (3) con.attr
    (4) con.jirattr
    (5) con.jircon
    (6) con.shuutan
    (7) con.shuutanR

These files may be found under the directory "/cdic" in the cWnn source.

The binary format grammar file is created using the "catof" utility
(refer to Section 6.9), and it is this binary file that the cserver
uses. To create it, the Cixing text file is needed in addition to the
seven text format grammar files listed above.

When the cserver reads in the grammar file, it determines whether the
file is a grammar file by analysing the binary format.

Two or more grammar files can be managed by the cserver. Different user
environments may make use of different grammar files. A user is also
able to change the grammar file dynamically via the operation function
(文法变更). Refer to Section 5.2.

3. Access to Grammar File and Cixing File
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Default path of Cixing text file:
    /usr/local/lib/wnn/zh_CN/
The default filename for the text format Cixing file in cWnn is
"cixing.data".

Default path of grammar binary file:
    /usr/local/lib/wnn/zh_CN/dic/sys/
The default filenames for the binary format grammar files in cWnn are
"full.con" and "full.conR".
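
For readers who want to inspect a text format Cixing file such as
"cixing.data", the three portions described in item 1 of this section
can be told apart with a few lines of Python. This is a rough sketch
under stated assumptions, not the logic used by "catof" or the cserver:
the function name classify_cixing_line is hypothetical, "|" is assumed
to separate a tree node from its children, "$" is assumed to separate a
combined Cixing from its components, and the input is assumed to be a
line with comments already removed.

```python
from typing import List, Tuple


def classify_cixing_line(line: str) -> Tuple[str, str, List[str]]:
    """Classify one comment-free Cixing line (hypothetical helper).

    Returns (kind, name, members) where kind is "tree", "combined" or
    "definition". Names are returned exactly as written, including any
    trailing "/" or "-" marker.
    """
    line = line.strip()
    if "|" in line:
        # Tree structure: parent|child:child:...
        parent, children = line.split("|", 1)
        return ("tree", parent, children.split(":"))
    if "$" in line:
        # Combined Cixing: name$component:component:...
        name, components = line.split("$", 1)
        return ("combined", name, components.split(":"))
    # Anything else is taken to be a plain Cixing definition.
    return ("definition", line, [])
```

Applied to each line of the example Cixing file in item 1, this
separates tree lines, plain definitions (including the "@" placeholder)
and combined Cixing definitions.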