diff cWnn/manual/chap8 @ 0:bbc77ca4def5
initial import
author | Yoshiki Yazawa <yaz@cc.rim.or.jp> |
---|---|
date | Thu, 13 Dec 2007 04:30:14 +0900 |
parents | |
children |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/cWnn/manual/chap8 Thu Dec 13 04:30:14 2007 +0900
@@ -0,0 +1,400 @@
+ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+ ┃ Chapter 8 CWNN FILE MANAGEMENT ┃
+ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
+ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
+
+
+┏━━━━━━━━┓
+┃ 8.1 OVERVIEW ┃
+┗━━━━━━━━┛
+
+  In the cWnn system, the cserver plays an important role in managing the different
+resources and files.
+
+  Resource files are read in during cserver startup. If a file is not read at that
+time, the cserver reads it in later, when a front-end processor requests it during
+its own startup.
+
+There are three categories of files in cWnn, namely:
+
+  (1) Dictionary files
+  (2) Usage frequency files
+  (3) Grammar files
+
+We will now explain each of the three cWnn file types in detail.
+
+
+                                   - 8-1 -
+┏━━━━━━━━━━━━┓
+┃ 8.2 DICTIONARY FILES ┃
+┗━━━━━━━━━━━━┛
+
+  Dictionaries are classified into two categories : (1) Text format
+                                                    (2) Binary format
+  A text format dictionary is human-readable; a binary format dictionary is not.
+A text format dictionary is converted to binary format using the "catod"
+utility (refer to Section 6.7). Only the binary format dictionary is used by the cWnn
+system. A binary format dictionary may be converted back to text format via the
+"cdtoa" utility (refer to Section 6.8).
+
+  The maximum number of words allowed in a dictionary is 70,000.
+
+
+1. Dictionary in Text Format
+━━━━━━━━━━━━━━
+The format of the text dictionary is shown below.
+
+  The text format is as follows:
+  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  ┌────────────────────────────────────────────┐
+  │ \comment <Comment> <CR>                    │
+  │ \total <Total_frequency> <CR>              │
+  │ \cixing <Dict_cixing> <CR>                 │
+  │ \Pinyin <CR>                               │
+  │                                            │
+  │ pinyin   word   Cixing   Frequency <CR>    │
+  │ pinyin   word   Cixing   Frequency <CR>    │
+  │ pinyin   word   Cixing   Frequency <CR>    │
+  │ :        :      :        :                 │
+  │ :        :      :        :                 │
+  │ (EOF)                                      │
+  └────────────────────────────────────────────┘
+
+  Description:
+   - comment   : Comments on the dictionary.
+   - total     : The total number of times the dictionary has been used
+                 for conversion, i.e., the usage frequency of the dictionary.
+   - cixing    : The part of speech definitions used by THIS particular
+                 dictionary ONLY. The format of the part of speech here
+                 is the same as that in the system standard cixing file
+                 (cixing.data). Refer to Section 8.4.
+                 If the part of speech is NOT specified here, the default
+                 file "/usr/local/lib/wnn/zh_CN/cixing.data" is used.
+   - Pinyin    : This determines the type of dictionary. It can also be
+                 "Zhuyin" or "Bixing", depending on the dictionary itself.
+
+
+                                   - 8-2 -
+   - pinyin    : For the Pinyin-Hanzi conversion dictionary, the Pinyin
+                 here is the pronunciation of each character/word.
+                 For encoded input, the Pinyin is the code of each
+                 character/word.
+                 The maximum length for Pinyin is 256 characters.
+
+   - word      : The actual Chinese character/word. Each character or
+                 word should not exceed 256 characters.
+                 If a space, carriage return or other special character
+                 must be included in the character/word, write its
+                 octal code after "\0".
+                 If a character other than "0" follows the "\", it
+                 stands for that character itself.
+                 For example, "\\" stands for "\" itself.
+   - Cixing    : The part of speech defined in the grammar file, such as
+                 noun, pronoun etc. For details, refer to the grammar
+                 files explained in Section 8.4.
+   - Frequency : The usage frequency of each word.
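The field rules above can be sketched as a small reader for the text format. This is only an illustration, not the "catod" implementation: the names `Entry`, `unescape` and `parse_dictionary` are hypothetical, fields are assumed to be separated by whitespace, and "\0" is assumed to be followed by one to three octal digits (for example "\040" for a space).

```python
import re
from dataclasses import dataclass


@dataclass
class Entry:
    pinyin: str
    word: str
    cixing: str
    frequency: int


# Assumption: "\0" is followed by one to three octal digits; any other
# character after "\" stands for itself (so "\\" is a literal backslash).
_ESCAPE = re.compile(r"\\(0[0-7]{1,3}|.)")


def unescape(field: str) -> str:
    """Decode the backslash escapes described for the word field."""
    def repl(match):
        body = match.group(1)
        if len(body) > 1:              # "\0" plus octal digits
            return chr(int(body, 8))
        return body                    # the character itself
    return _ESCAPE.sub(repl, field)


def parse_dictionary(text: str):
    """Split a text format dictionary into header keywords and entries."""
    header = {}
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # Header line such as "\total 0" or a bare "\Pinyin".
            key, _, value = line[1:].partition(" ")
            header[key] = value.strip()
            continue
        # Entry line: pinyin, word, Cixing, frequency (raises on other shapes).
        pinyin, word, cixing, frequency = line.split()
        entries.append(Entry(pinyin, unescape(word), cixing, int(frequency)))
    return header, entries
```

Calling `parse_dictionary` on the text of a dictionary returns the header keywords (`comment`, `total`, `cixing`, and the dictionary type) together with one `Entry` per word line; a line that does not have exactly four fields raises `ValueError`.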
+
+
+  Example of a text format dictionary:
+  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  ┌────────────────────────────────────────────┐
+  │ \comment This is a Pinyin dictionary       │
+  │ \total 0                                   │
+  │ \cixing                                    │
+  │ \Pinyin                                    │
+  │                                            │
+  │ Wǒ       我     人称代   1200 <CR>         │
+  │ Rén      人     单位量   10 <CR>           │
+  │ De       的     语气     30 <CR>           │
+  │ :        :      :        :                 │
+  │ :        :      :        :                 │
+  └────────────────────────────────────────────┘
+
+
+2. Dictionary in Binary Format
+━━━━━━━━━━━━━━━
+This is the dictionary format used by the cWnn system. When reading in a file, the
+cserver determines from the binary format whether the file is a dictionary.
+Once a dictionary is accessed by the cserver, its contents may be changed. When the
+cserver terminates, the updated dictionary is written back to the file.
+
+Each tuple (词条) in a dictionary has a serial number. The serial number is used for
+matching the tuples in a dictionary with those in the usage frequency file.
+
+
+                                   - 8-3 -
+3. System Dictionary and User Dictionary
+━━━━━━━━━━━━━━━━━━━━
+A system dictionary is a dictionary provided by the system itself. There are
+two types of system dictionaries. One consists of only characters, while the other
+consists of words. For the Pinyin input and Zhuyin input environments, the following
+dictionary files are used :
+
+  (1) level_1.dic - consists of only characters (单字). These are the
+                    Chinese characters that are more commonly used.
+  (2) level_2.dic - consists of Chinese characters that are not so
+                    commonly used.
+  (3) basic.dic   - This is a word dictionary, i.e., it consists of single
+                    character words (单字词) and multi-character words
+                    (多字词).
+
+A user dictionary is a dictionary created by the user. It allows the user to
+register or delete their own words. Its structure is similar to that of the
+system dictionary.
+
+
+4. Access to Dictionary Files
+━━━━━━━━━━━━━━━
+Both system and user dictionaries can be added or removed through the settings of the
+environment files.
+
+Dictionaries are set via the "setdic" command in the initialization file "cserverrc"
+(refer to Section 5.3) or in the initialization file "wnnenvrc" (refer to Section 5.5).
+Similar settings are needed in the reverse initialization file "wnnenvrc_R"
+(refer to Section 5.6).
+
+Default path for system dictionary : /usr/local/lib/wnn/zh_CN/dic/sys/
+".dic" is the default filename extension for a dictionary. For example, level_1.dic
+
+Default path for user dictionary : /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
+"ud" is the default filename for the user dictionary.
+
+
+5. Logical Dictionary and Dictionary Files
+━━━━━━━━━━━━━━━━━━━━━
+In the cWnn system, several front-end processors are connected to the cserver, and all
+the resources managed by the cserver are shared by the different front-end processors.
+Each dictionary file may be combined with several different usage frequency files. Hence,
+each combination forms a different logical dictionary.
+
+A dictionary may also be used for both forward and reverse conversion, such as
+Pinyin-Hanzi conversion and Hanzi-Pinyin conversion. In that case it forms two separate
+logical dictionaries. For details, refer to "cwnnstat" in Section 6.4.
+
+NOTE: ONE default dictionary may form several logical dictionaries.
+                                   - 8-4 -
+┏━━━━━━━━━━━━━━┓
+┃ 8.3 USAGE FREQUENCY FILES ┃
+┗━━━━━━━━━━━━━━┛
+
+  Usage frequency files are attached to a dictionary. Every dictionary contains
+information on the usage frequency of each word. This information represents the
+default usage frequency of each word in the dictionary. The default usage frequency
+is obtained from a statistical analysis of a large amount of Chinese articles.
+
+  Since the usage frequency information of each word is already included in the
+text format dictionary, there is NO need for an explicit text format of the usage
+frequency file. Refer to the example of a text format dictionary in Section 8.2 above.
+
+
+1. System Usage Frequency File and User Usage Frequency File
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+Note that the default usage frequency defined by the system may not be suitable for all
+users. Hence, besides the default usage frequency, the cserver creates a user usage
+frequency file for each user. The initial file is a copy of the default file, and it
+is created when the user starts the front-end processor for the first time. As the
+user works with the system, the usage frequency of each word changes according to how
+often the word is used. Therefore, this user frequency file is customized to the
+individual user. The user usage frequency file is updated when the cserver terminates,
+or when the environments using the frequency file terminate. When the same user
+activates the front-end processor again, the cserver reads in the updated frequency
+file instead of creating a new one.
+
+The usage frequency of each word in the dictionary plays a part in the Hanzi conversion.
+Hence, the weight for usage frequency of each word may be changed to adjust its impact on
+the conversion process so as to obtain a more accurate conversion result.
+
+The conversion evaluation also uses "last used" information, which likewise resides in
+the usage frequency file.
+
+
+2. Access to Usage Frequency Files
+━━━━━━━━━━━━━━━━━
+Usage frequency files are specified in the initialization files "wnnenvrc" (refer to
+Section 5.5) and "wnnenvrc_R" (refer to Section 5.6).
+
+Default path for usage frequency file: /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
+".h" is the default filename extension for a usage frequency file. For example, basic.h,
+level_1.h, level_2.h.
+
+
+                                   - 8-5 -
+┏━━━━━━━━━━━━━━━━━━━┓
+┃ 8.4 GRAMMAR FILES AND CIXING FILES ┃
+┗━━━━━━━━━━━━━━━━━━━┛
+
+  The definitions of the grammar (词法) files and the part of speech (词性) file
+depend on the system. Substantial knowledge of Chinese grammar and of the Pinyin-Hanzi
+conversion process of this system is required in order to understand them. Here we
+give only some necessary and brief explanations of the grammar used in cWnn.
+
+NOTE: From here on, we refer to part of speech as Cixing (词性).
+
+
+1. Cixing File in Text Format
+━━━━━━━━━━━━━━━
+The Cixing file defines a set of grammatical attributes, on which the definition of
+the Chinese grammar is based. The grammatical attributes of all the words in the
+dictionary must be defined in this Cixing file.
+
+The content of the Cixing file is interpreted line by line. Whatever comes after
+a semicolon ";" in a line is regarded as a comment. A backslash "\" means that the line
+is continued on the following line. Refer to the cWnn default Cixing file for an example.
+
+  The Cixing file is divided into three portions:
+  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  (a) Tree structure     : During a word add operation via the front-end processor,
+                           the user needs to choose the appropriate grammatical
+                           attribute for the word to be added.
+                           This tree structure is searched accordingly until
+                           the user has chosen the required grammatical attribute.
+                           For example :
+                             普通名词/|普通名:人名—:事物名—
+
+                           This means that 普通名词 can be further classified into
+                           普通名, 人名 or 事物名. Only the leaves are actual
+                           Cixing that can be attached to words.
+
+  (b) Cixing definitions : These are Cixing that may include Chinese characters,
+                           such as 普通名 and 单字.
+                           "@" denotes a null Cixing, a placeholder: a new
+                           Cixing can later be put in place of an "@" without
+                           affecting compatibility with the existing dictionary
+                           and grammar files.
+
+  (c) Combined Cixing    : This defines the combined Cixing, which contain two or
+                           more grammatical attributes. A combined Cixing can be
+                           assigned to a single word, which reduces the number
+                           of tuples (词条) having the same Chinese characters.
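The line conventions described above (comments, continuation lines, and the three kinds of entries) can be sketched as follows. This is a reading aid, not the cWnn parser: the function names are hypothetical, and the separators are inferred from the default Cixing file excerpt in this section, namely "|" between a tree node and its children, "$" between a combined Cixing and its components, ":" between list items, and a trailing "/" marking a non-leaf node. The sketch also assumes that a continuation backslash is the last character on its line.

```python
def logical_lines(text):
    """Yield comment-free logical lines, joining backslash continuations."""
    pending = ""
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].rstrip()   # drop everything after ";"
        if line.endswith("\\"):                # continued on the next line
            pending += line[:-1]
            continue
        line = (pending + line).strip()
        pending = ""
        if line:
            yield line


def classify(line):
    """Return (kind, name, items) for one logical line of a Cixing file."""
    if "|" in line:
        # Tree structure, e.g. 普通名词/|普通名:人名—:事物名—
        parent, children = line.split("|", 1)
        return ("tree", parent.rstrip("/"), children.split(":"))
    if "$" in line:
        # Combined Cixing, e.g. 姓名词-$普通名:百家姓—
        name, parts = line.split("$", 1)
        return ("combined", name, parts.split(":"))
    # Plain Cixing definition, including the "@" placeholder.
    return ("definition", line, [])
```

Children are returned as written, so an entry such as 普通名词/ keeps its trailing "/" and can be recognised as an inner node of the tree.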
+
+                                   - 8-6 -
+  Example of a text format Cixing file:
+  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  ┌──────────────────────────────────────────────────────────────┐
+  │ ;;;; 词性的树型结构:                                         │
+  │ 名词/|普通名词/:抽象名:时间名:处所名:方位名词/:表人特殊名/   │
+  │ 普通名词/|普通名:人名—:事物名—                             │
+  │ 方位名词/|单纯方位名:合成方位名                              │
+  │ 表人特殊名/|百家性—:称谓名—                                │
+  │ :                                                            │
+  │ :                                                            │
+  │                                                              │
+  │ ;;;; 词性的定义:                                             │
+  │ 终止 ;;; 0 终止, 作为文节的终止                              │
+  │ 数字 ;;; 1 数字                                              │
+  │ @ ;;; 11                                                     │
+  │ 单字 ;;; 13                                                  │
+  │ 普通名                                                       │
+  │ 人名—                                                       │
+  │ :                                                            │
+  │ :                                                            │
+  │                                                              │
+  │ ;;;; 复合词性定义:                                           │
+  │ 姓名词-$普通名:百家姓—                                      │
+  │ 表人物量-$表人量:表物量                                      │
+  │ 行为动词-$及物动—:不及物动—                                │
+  │ :                                                            │
+  │ :                                                            │
+  └──────────────────────────────────────────────────────────────┘
+
+
+                                   - 8-7 -
+2. Grammar Files in Text Format
+━━━━━━━━━━━━━━━━
+Based on the defined set of Cixing, a set of grammar rules for Chinese is defined in
+the grammar file. This grammar file is a database and is read during the startup of
+the cserver.
+
+The text format grammar files are as follows:
+  (1) con.master
+  (2) con.masterR
+  (3) con.attr
+  (4) con.jirattr
+  (5) con.jircon
+  (6) con.shuutan
+  (7) con.shuutanR
+These files may be found under the directory "/cdic" in the cWnn source.
+
+The binary format grammar file is created using the "catof" utility (refer to
+Section 6.9). This binary format grammar file is the one used by the cserver.
+To create the binary grammar file, the Cixing text file is needed in
+addition to the seven text format grammar files listed above.
+
+When the cserver reads in a file, it determines from the binary format whether the
+file is a grammar file. Two or more grammar files can be
+managed by the cserver. Different user environments may make use of different
+grammar files. A user is also able to change the grammar file dynamically via the
+operation function (文法变更). Refer to Section 5.2.
+
+
+3. Access to Grammar File and Cixing File
+━━━━━━━━━━━━━━━━━━━━━
+Default path of Cixing text file: /usr/local/lib/wnn/zh_CN/
+The default filename for the text format Cixing file in cWnn is "cixing.data".
+
+Default path of grammar binary file: /usr/local/lib/wnn/zh_CN/dic/sys/
+The default filenames for the binary format grammar files in cWnn are "full.con"
+and "full.conR".
+
+
+                                   - 8-8 -
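As a recap of the default locations named in this chapter, the sketch below collects them in one place. It is an illustration only: the function names are hypothetical, and it assumes that "@USR" in the user paths stands for the user's name and that a usage frequency file carries the same base name as its dictionary (basic.dic and basic.h), which the examples in this chapter suggest but do not state.

```python
from pathlib import PurePosixPath

# Default installation root used throughout this chapter.
CWNN_ROOT = PurePosixPath("/usr/local/lib/wnn/zh_CN")


def default_paths(user):
    """Return the default cWnn file locations for one user."""
    sys_dir = CWNN_ROOT / "dic" / "sys"
    usr_dir = CWNN_ROOT / "dic" / "usr" / user   # assumption: @USR -> user name
    return {
        "cixing_text": str(CWNN_ROOT / "cixing.data"),
        "grammar_binary": [str(sys_dir / "full.con"), str(sys_dir / "full.conR")],
        "system_dictionaries": [
            str(sys_dir / name)
            for name in ("level_1.dic", "level_2.dic", "basic.dic")
        ],
        "user_dictionary": str(usr_dir / "ud"),
    }


def frequency_file_for(dictionary_name, user):
    """Map a dictionary file name to the user's usage frequency file (.h)."""
    stem = PurePosixPath(dictionary_name).stem   # "basic.dic" -> "basic"
    return str(CWNN_ROOT / "dic" / "usr" / user / (stem + ".h"))
```

For a hypothetical user "alice", `frequency_file_for("basic.dic", "alice")` gives the path of basic.h under that user's directory.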