comparison cWnn/manual/chap8 @ 0:bbc77ca4def5
initial import
author | Yoshiki Yazawa <yaz@cc.rim.or.jp> |
---|---|
date | Thu, 13 Dec 2007 04:30:14 +0900 |
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Chapter 8 CWNN FILE MANAGEMENT ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛


┏━━━━━━━━━━━━━━┓
┃ 8.1 OVERVIEW ┃
┗━━━━━━━━━━━━━━┛

In the cWnn system, the cserver plays an important role in managing the different
resources and files.

Resource files are read in during cserver startup. Any files that are not read at
startup are read in by the cserver later, when front-end processors request them
during their own startup.

There are three categories of files in cWnn:

    (1) Dictionary files
    (2) Usage frequency files
    (3) Grammar files

Each of the three cWnn file types is explained in detail below.

┏━━━━━━━━━━━━━━━━━━━━━━┓
┃ 8.2 DICTIONARY FILES ┃
┗━━━━━━━━━━━━━━━━━━━━━━┛

Dictionaries come in two formats : (1) Text format
                                   (2) Binary format
A text format dictionary is human-readable; a binary format dictionary is not.
A text format dictionary is converted to binary format using the "catod" utility
(refer to Section 6.7). Only binary format dictionaries are used by the cWnn
system. A binary format dictionary may be converted back to text format via the
"cdtoa" utility (refer to Section 6.8).

The maximum number of words allowed in a dictionary is 70,000.


1. Dictionary in Text Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The text format is as follows:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
┌─────────────────────────────────────────┐
│ \comment <Comment> <CR>                 │
│ \total   <Total_frequency> <CR>         │
│ \cixing  <Dict_cixing> <CR>             │
│ \Pinyin  <CR>                           │
│                                         │
│ pinyin   word   Cixing   Frequency <CR> │
│ pinyin   word   Cixing   Frequency <CR> │
│ pinyin   word   Cixing   Frequency <CR> │
│   :       :       :         :           │
│   :       :       :         :           │
│ (EOF)                                   │
└─────────────────────────────────────────┘

Description:
  - comment   : A comment describing the dictionary.
  - total     : The total number of times the dictionary has been used
                for conversion, i.e., the usage frequency of the dictionary.
  - cixing    : This specifies the part of speech used by THIS particular
                dictionary ONLY. The format of the part of speech here
                is the same as that in the system standard cixing file
                (cixing.data). Refer to Section 8.4.
                If the part of speech is NOT specified here, the default
                file "/usr/local/lib/wnn/zh_CN/cixing.data" is used.
  - Pinyin    : This determines the type of dictionary. It can also be
                "Zhuyin" or "Bixing", depending on the dictionary itself.
  - pinyin    : For a Pinyin-Hanzi conversion dictionary, the pinyin
                is the pronunciation of each character/word.
                For encoded input, the pinyin is the code of each
                character/word.
                The maximum length of the pinyin is 256 characters.

  - word      : This is the actual Chinese character/word. Each
                character or word should not exceed 256 characters.
                A space, carriage return or other special character
                can be included in the character/word by writing its
                code in octal after "\0".
                If a character other than "0" follows the "\",
                it stands for that character itself.
                For example, "\\" stands for "\" itself.
  - Cixing    : This is the part of speech defined in the grammar
                file, such as noun, pronoun, etc. For details, refer to
                the grammar files explained in Section 8.4.
  - Frequency : The usage frequency of each word.
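
The escape rule for the word field can be sketched in Python as below. This is
only an illustration, not the cWnn implementation. The function name
decode_word is hypothetical, and two details are assumptions because the text
does not state them: the octal code is taken to be at most three digits
including the leading "0" (as in "\040" for a space), and a lone trailing "\"
is kept as it is.

```python
def decode_word(raw: str) -> str:
    """Expand backslash escapes in a dictionary word field."""
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(raw):
            # Assumption: a trailing backslash is kept literally.
            out.append("\\")
            break
        nxt = raw[i + 1]
        if nxt == "0":
            # Assumption: up to three octal digits, counting the leading 0.
            j = i + 1
            while j < len(raw) and j < i + 4 and raw[j] in "01234567":
                j += 1
            out.append(chr(int(raw[i + 1:j], 8)))
            i = j
        else:
            # Any other character after "\" stands for itself.
            out.append(nxt)
            i += 2
    return "".join(out)


print(repr(decode_word(r"a\040b")))
print(repr(decode_word("\\\\")))
```

The first call decodes an embedded space written in octal; the second shows
that a doubled backslash yields a single backslash.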


Example of a text format dictionary:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
┌─────────────────────────────────────────┐
│ \comment This is a Pinyin dictionary    │
│ \total   0                              │
│ \cixing                                 │
│ \Pinyin                                 │
│                                         │
│ Wǒ       我     人称代   1200 <CR>      │
│ Rén      人     单位量   10 <CR>        │
│ De       的     语气     30 <CR>        │
│   :       :       :         :           │
│   :       :       :         :           │
└─────────────────────────────────────────┘
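
To make the layout concrete, here is a minimal Python sketch that reads a text
format dictionary into memory. It is not the "catod" utility and makes several
assumptions: fields are separated by whitespace, <CR> is an ordinary line
break, and every line starting with "\" is a header line. The names
TextDictionary and parse_text_dictionary are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TextDictionary:
    comment: str = ""
    total: int = 0
    cixing: str = ""   # dictionary-specific Cixing, empty if not given
    kind: str = ""     # "Pinyin", "Zhuyin" or "Bixing"
    entries: list = field(default_factory=list)  # (pinyin, word, cixing, freq)


def parse_text_dictionary(text: str) -> TextDictionary:
    d = TextDictionary()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # Header line: "\keyword [value]".
            parts = line[1:].split(None, 1)
            if not parts:
                continue
            keyword = parts[0]
            rest = parts[1] if len(parts) > 1 else ""
            if keyword == "comment":
                d.comment = rest
            elif keyword == "total":
                d.total = int(rest) if rest else 0
            elif keyword == "cixing":
                d.cixing = rest
            else:
                # Remaining keyword names the dictionary type.
                d.kind = keyword
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        pinyin, word, cixing, freq = fields
        d.entries.append((pinyin, word, cixing, int(freq)))
    return d


sample = """\\comment This is a Pinyin dictionary
\\total 0
\\cixing
\\Pinyin

Wǒ 我 人称代 1200
Rén 人 单位量 10
De 的 语气 30
"""
parsed = parse_text_dictionary(sample)
print(parsed.kind, parsed.total, len(parsed.entries))
```

Words are kept exactly as written here; escapes such as "\0..." in the word
field would still need to be expanded separately.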


2. Dictionary in Binary Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
This is the binary format dictionary used by the cWnn system. When reading in a file,
the cserver determines from the binary format whether the file is a dictionary.
Once a dictionary has been loaded by the cserver, its contents may be changed. When
the cserver terminates, the updated dictionary is written back to the file.

Each tuple (词条) in a dictionary has a serial number. The serial number is used to
match the tuples in a dictionary with those in the usage frequency file.

3. System Dictionary and User Dictionary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
A system dictionary is a dictionary provided by the system itself. There are two
types of system dictionaries: one consists only of characters, while the other
consists of words. For the Pinyin input and Zhuyin input environments, the following
dictionary files are used :

    (1) level_1.dic - consists only of characters (单字). These are the
                      more commonly used Chinese characters.
    (2) level_2.dic - consists of the less commonly used Chinese
                      characters.
    (3) basic.dic   - a word dictionary, i.e., it consists of
                      single-character words (单字词) and multi-character
                      words (多字词).

A user dictionary is a dictionary created by the user. It allows the user to
register or delete his own words. Its structure is similar to that of the system
dictionary.


4. Access to Dictionary Files
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Both system and user dictionaries can be added or removed through the settings in the
environment files.

Dictionaries may be set via the "setdic" command in the initialization file
"cserverrc" (refer to Section 5.3) or in the initialization file "wnnenvrc" (refer to
Section 5.5). Similar settings are needed in the reverse initialization file
"wnnenvrc_R" (refer to Section 5.6).

Default path for system dictionaries : /usr/local/lib/wnn/zh_CN/dic/sys/
".dic" is the default filename extension for dictionaries. For example, level_1.dic

Default path for user dictionaries : /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
"ud" is the default filename for the user dictionary.


5. Logical Dictionary and Dictionary Files
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
In the cWnn system, several front-end processors are connected to the cserver, and all
the resources managed by the cserver are shared by the different front-end processors.
Each dictionary file may be combined with several different usage frequency files, and
each combination forms a logically distinct dictionary.

A dictionary may also be used for both forward and reverse conversion, such as
Pinyin-Hanzi conversion and Hanzi-Pinyin conversion. In that case it forms two
separate logical dictionaries. For details, refer to "cwnnstat" in Section 6.4.

NOTE: ONE dictionary file may form several logical dictionaries.
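
One way to picture a logical dictionary is as a combination of a dictionary
file, a usage frequency file and a conversion direction. The Python sketch
below only illustrates that idea; it says nothing about how the cserver stores
this information internally. The type LogicalDictionary, the function
logical_dictionaries and the per-user paths "alice/basic.h" and "bob/basic.h"
are all hypothetical.

```python
from typing import NamedTuple


class LogicalDictionary(NamedTuple):
    dict_file: str   # the dictionary file on disk
    freq_file: str   # the usage frequency file combined with it
    reverse: bool    # False: Pinyin-Hanzi, True: Hanzi-Pinyin


def logical_dictionaries(dict_file, freq_files, directions=(False, True)):
    """Enumerate every combination formed from one dictionary file."""
    return [
        LogicalDictionary(dict_file, freq_file, reverse)
        for freq_file in freq_files
        for reverse in directions
    ]


logical = logical_dictionaries("basic.dic", ["alice/basic.h", "bob/basic.h"])
for entry in logical:
    print(entry)
```

With two frequency files and two directions, the single file basic.dic yields
four logical dictionaries in this sketch.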

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 8.3 USAGE FREQUENCY FILES ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Usage frequency files are attached to a dictionary. Every dictionary contains
information on the usage frequency of each word. This information represents the
default usage frequency of each word in the dictionary. The default usage frequencies
were obtained statistically by analysing a large number of Chinese articles.

Since the usage frequency of each word is already included in the text format
dictionary, there is NO need for a separate text format for usage frequency files.
Refer to the example of a text format dictionary in Section 8.2 above.


1. System Usage Frequency File and User Usage Frequency File
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Note that the default usage frequencies defined by the system may not suit all
users. Hence, besides the default usage frequencies, the cserver creates a user usage
frequency file for each user. The initial file is a copy of the default file, and it
is created when the user starts the front-end processor for the first time. As the
user works with the system, the usage frequency of each word changes according to how
often the word is used. The user frequency file thus becomes customized to the
individual user. The user usage frequency file is updated when the cserver
terminates, or when the environments using the frequency file terminate. When the
same user activates the front-end processor again, the cserver reads in the updated
frequency file instead of creating a new one.
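
The life cycle described above can be sketched in Python as follows. This is
an in-memory illustration only, under several assumptions: frequencies are
held in a list indexed by the serial number of each dictionary tuple, each use
of a word adds one to its count, and "saving" is modelled by passing the list
back in. The real file layout and update rule are not described in this
chapter, and the class name UserFrequencyFile is hypothetical.

```python
class UserFrequencyFile:
    """Per-user usage frequencies, indexed by dictionary serial number."""

    def __init__(self, default_freqs, saved=None):
        if saved is None:
            # First start for this user: begin with a copy of the defaults.
            self.freqs = list(default_freqs)
        else:
            # Later starts: continue from the previously updated values.
            self.freqs = list(saved)

    def record_use(self, serial):
        # Assumption: one use of a word raises its frequency by one.
        self.freqs[serial] += 1


defaults = [1200, 10, 30]            # default frequencies from the dictionary
first = UserFrequencyFile(defaults)  # first session copies the defaults
first.record_use(1)

second = UserFrequencyFile(defaults, saved=first.freqs)  # next session
print(defaults, second.freqs)
```

The default frequencies stay unchanged, while the user's copy drifts toward
that user's own habits.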

The usage frequency of each word in the dictionary plays a part in Hanzi conversion.
Hence, the weight given to usage frequency may be changed to adjust its impact on
the conversion process so as to obtain more accurate conversion results.

The conversion evaluation also uses "last used" information, which likewise resides
in the usage frequency file.


2. Access to Usage Frequency Files
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Usage frequency files are specified in the initialization files "wnnenvrc" (refer to
Section 5.5) and "wnnenvrc_R" (refer to Section 5.6).

Default path for usage frequency files: /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
".h" is the default filename extension for usage frequency files. For example,
basic.h, level_1.h, level_2.h.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 8.4 GRAMMAR FILES AND CIXING FILES ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

The definitions in the grammar (词法) files and the part-of-speech (词性) file are
specific to this system. Substantial knowledge of Chinese grammar and of the
Pinyin-Hanzi conversion process of this system is required in order to understand
them. Only a brief explanation of the essentials of the grammar used in cWnn is
given here.

NOTE: From here on, part of speech is referred to as Cixing (词性).


1. Cixing File in Text Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The Cixing file defines a set of grammatical attributes, on which the definition of
the Chinese grammar is based. The grammatical attributes of all the words in the
dictionary must be defined in this Cixing file.

The content of the Cixing file is interpreted line by line. Whatever follows a
semicolon ";" in a line is regarded as a comment. A backslash "\" means that the
line continues on the following line. Refer to the cWnn default Cixing file for an
example.
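
The two line-level rules can be sketched in Python as below. This is an
illustration rather than the reader used by cWnn. It assumes that the
continuation backslash is the last character of the line once any comment has
been removed, and that empty lines are skipped. The function name
logical_lines is hypothetical.

```python
def logical_lines(text: str) -> list:
    """Strip ';' comments and join backslash-continued lines."""
    result = []
    pending = ""
    for raw in text.splitlines():
        # Everything after the first semicolon is a comment.
        line = raw.split(";", 1)[0].rstrip()
        if line.endswith("\\"):
            # Continuation: hold this part and read the next line.
            pending += line[:-1]
            continue
        line = (pending + line).strip()
        pending = ""
        if line:
            result.append(line)
    if pending.strip():
        # A continuation on the very last line has nothing to join.
        result.append(pending.strip())
    return result


cixing_text = "名词/|普通名词/:\\\n抽象名 ; comment\n;;;; heading\n终止 ;;; 0\n"
for line in logical_lines(cixing_text):
    print(line)
```

Each returned string is one complete entry, ready to be classified as a tree
line, a Cixing definition or a combined Cixing.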

The Cixing file is divided into three portions:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(a) Tree structure     : When a word is added via the front-end processor,
                         the user needs to choose the appropriate grammatical
                         attribute for the word to be added.
                         The tree structure is traversed step by step until
                         the user has chosen the required grammatical attribute.
                         For example :
                             普通名词/|普通名:人名—:事物名—

                         This means that 普通名词 can be further classified into
                         普通名, 人名 or 事物名. Only the leaves are actual
                         Cixing that can be attached to words.

(b) Cixing definitions : These are the Cixing themselves. Their names may include
                         Chinese characters, such as 普通名 and 单字.
                         "@" denotes a null Cixing. It serves as a placeholder
                         for a Cixing that may be added later, so that the
                         addition does not affect compatibility with the
                         existing dictionary and grammar files.

(c) Combined Cixing    : This defines the combined Cixing, which contain two or
                         more grammatical attributes. A combined Cixing can be
                         assigned to a single word, which reduces the number of
                         tuples (词条) having the same Chinese characters.

Example of a text format Cixing file:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
┌────────────────────────────────────────────────────────────┐
│ ;;;; 词性的树型结构:                                      │
│ 名词/|普通名词/:抽象名:时间名:处所名:方位名词/:表人特殊名/ │
│ 普通名词/|普通名:人名—:事物名—                             │
│ 方位名词/|单纯方位名:合成方位名                            │
│ 表人特殊名/|百家性—:称谓名—                                │
│     :                                                      │
│     :                                                      │
│                                                            │
│ ;;;; 词性的定义:                                          │
│ 终止      ;;; 0 终止, 作为文节的终止                       │
│ 数字      ;;; 1 数字                                       │
│ @         ;;; 11                                           │
│ 单字      ;;; 13                                           │
│ 普通名                                                     │
│ 人名—                                                      │
│     :                                                      │
│     :                                                      │
│                                                            │
│ ;;;; 复合词性定义:                                        │
│ 姓名词-$普通名:百家姓—                                     │
│ 表人物量-$表人量:表物量                                    │
│ 行为动词-$及物动—:不及物动—                                │
│     :                                                      │
│     :                                                      │
└────────────────────────────────────────────────────────────┘
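
The tree-structure lines of the example can be walked with a short Python
sketch like the one below. It is based only on what the example shows: the
part before "|" is taken as the parent, the parts separated by ":" as its
children, and a node counts as a leaf when no line defines children for it.
Reading a trailing "/" as the mark of a non-leaf node is an inference from the
example, and names are kept exactly as written, including any trailing "—".
The function names parse_tree and leaves are hypothetical, and the first
sample line is abbreviated.

```python
def parse_tree(lines):
    """Map each parent node to its list of children."""
    tree = {}
    for line in lines:
        parent, sep, children = line.partition("|")
        if sep:
            tree[parent] = children.split(":")
    return tree


def leaves(tree, node):
    """Return the leaf Cixing reachable from node, in file order."""
    if node not in tree:
        return [node]
    found = []
    for child in tree[node]:
        found.extend(leaves(tree, child))
    return found


tree_lines = [
    "名词/|普通名词/:抽象名:时间名:处所名:方位名词/",  # abbreviated
    "普通名词/|普通名:人名—:事物名—",
    "方位名词/|单纯方位名:合成方位名",
]
tree = parse_tree(tree_lines)
print(leaves(tree, "名词/"))
```

The leaves returned for a node are the choices a user would finally arrive at
when adding a word under that category.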

2. Grammar Files in Text Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Based on the defined set of Cixing, a set of grammar rules for Chinese is defined in
the grammar file. This grammar file is a database and is read during the startup of
the cserver.

The text format grammar files are as follows:
    (1) con.master
    (2) con.masterR
    (3) con.attr
    (4) con.jirattr
    (5) con.jircon
    (6) con.shuutan
    (7) con.shuutanR
These files may be found under the directory "/cdic" in the cWnn source.

The binary format grammar file is created using the "catof" utility (refer to
Section 6.9). This binary format grammar file is the one used by the cserver.
To create the binary grammar file, the Cixing text file is needed in addition to
the seven text format grammar files listed above.

When the cserver reads in a grammar file, it determines whether the file is a
grammar file by analysing the binary format. The cserver can manage two or more
grammar files, and different user environments may use different grammar files.
A user can also change the grammar file dynamically via the operation function
(文法变更). Refer to Section 5.2.


3. Access to Grammar File and Cixing File
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Default path of the Cixing text file: /usr/local/lib/wnn/zh_CN/
The default filename of the text format Cixing file in cWnn is "cixing.data".

Default path of the binary grammar files: /usr/local/lib/wnn/zh_CN/dic/sys/
The default filenames of the binary format grammar files in cWnn are "full.con"
and "full.conR".