comparison cWnn/manual/chap8 @ 0:bbc77ca4def5

initial import
author Yoshiki Yazawa <yaz@cc.rim.or.jp>
date Thu, 13 Dec 2007 04:30:14 +0900
1 ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
2 ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
3 ┃ Chapter 8 CWNN FILE MANAGEMENT ┃
4 ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
5 ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
6
7
8 ┏━━━━━━━━┓
9 ┃ 8.1 OVERVIEW ┃
10 ┗━━━━━━━━┛
11
12 In the cWnn system, the cserver plays an important role in managing the different
13 resources and files.
14
15 Resource files are read in during cserver startup. Any file that is not read at
16 that point is read in by the cserver later, when a front-end processor requests it
17 during its own startup.
18
19 There are three categories of files in cWnn, namely:
20
21 (1) Dictionary files
22 (2) Usage frequency files
23 (3) Grammar files
24
25 We will now explain each of the three cWnn file types in detail.
26
27
51 ┏━━━━━━━━━━━━┓
52 ┃ 8.2 DICTIONARY FILES ┃
53 ┗━━━━━━━━━━━━┛
54
55 Dictionaries are classified into two categories: (1) Text format
56 (2) Binary format
57 A text format dictionary is human-readable, whereas a binary format dictionary is
58 not. The text format dictionary is converted to binary format using the "catod"
59 utility (refer to Section 6.7). Only the binary format dictionary is used by the cWnn
60 system. The binary format dictionary may be converted back to text format via the
61 "cdtoa" utility (refer to Section 6.8).
62
63 The maximum number of words allowed in a dictionary is 70,000.
64
65
66 1. Dictionary in Text Format
67 ━━━━━━━━━━━━━━
68 The format of the text dictionary is shown below.
69
70 The text format is as follows:
71 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
72 ┌───────────────────────────┐
73 │ \comment <Comment> <CR> │
74 │ \total <Total_frequency> <CR> │
75 │ \cixing <Dict_cixing> <CR> │
76 │ \Pinyin <CR> │
77 │ │
78 │ pinyin word Cixing Frequency <CR> │
79 │ pinyin word Cixing Frequency <CR> │
80 │ pinyin word Cixing Frequency <CR> │
81 │ : : : : │
82 │ : : : : │
83 │ (EOF) │
84 └───────────────────────────┘
85
86 Description:
87 - comment : These are comments in a dictionary.
88 - total : This is the total number of times a dictionary is used
89 for conversion, ie, the usage frequency of a dictionary.
90 - cixing : This specifies the part of speech used by THIS particular
91 dictionary ONLY. The format of the part of speech here
92 is the same as that in the system standard cixing file
93 (cixing.data). Refer to Section 8.4.
94 If the part of speech is NOT specified here, the default
95 file will be "/usr/local/lib/wnn/zh_CN/cixing.data".
96 - Pinyin : This determines the type of dictionary. It can also be "Zhuyin"
97 or "Bixing", depending on the dictionary itself.
101 - pinyin : For the Pinyin-Hanzi conversion dictionary, the Pinyin
102 here refers to the pronunciation for each character/word.
103 For encoded input, the Pinyin refers to the code of each
104 character/word.
105 The maximum length for Pinyin is 256 characters.
106
107 - word : This refers to the actual Chinese character/word. Each
108 character or word should not exceed 256 characters.
109 If a space, carriage return or other special character
110 needs to be included in the character/word, write its
111 octal code after "\0".
112 If a character other than "0" follows the "\", the
113 sequence stands for that character itself.
114 For example, "\\" refers to "\" itself.
115 - Cixing : This refers to the part of speech defined in the grammar
116 file, such as noun, pronoun etc. For details, refer to
117 grammar files explained in 8.4.
118 - Frequency : The usage frequency for each word.
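
The escape rules for the "word" field can be sketched as follows. This is only an
illustration, not the cWnn implementation: the function name unescape_word is
hypothetical, and the assumption that at most three octal digits follow "\0" is ours,
since the manual does not state how many digits are read.

```python
def unescape_word(raw: str) -> str:
    """Decode backslash escapes in the 'word' field of a text dictionary entry."""
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        # Ordinary character, or a lone backslash at the very end: copy as-is.
        if ch != "\\" or i + 1 >= len(raw):
            out.append(ch)
            i += 1
            continue
        nxt = raw[i + 1]
        if nxt == "0":
            # Assumption: read up to three octal digits after "\0".
            j = i + 2
            while j < len(raw) and j < i + 5 and raw[j] in "01234567":
                j += 1
            digits = raw[i + 2:j]
            out.append(chr(int(digits, 8)) if digits else "\0")
            i = j
        else:
            # Any other character after "\" stands for itself, e.g. "\\" -> "\".
            out.append(nxt)
            i += 2
    return "".join(out)


if __name__ == "__main__":
    print(repr(unescape_word(r"a\040b")))  # octal 40 is the space character
    print(repr(unescape_word("\\\\")))     # a single backslash
```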
119
120
121 Example of a text format dictionary:
122 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
123 ┌───────────────────────────┐
124 │ \comment This is a Pinyin dictionary │
125 │ \total 0 │
126 │ \cixing │
127 │ \Pinyin │
128 │ │
129 │ Wǒ 我 人称代 1200 <CR> │
130 │ Rén 人 单位量 10 <CR> │
131 │ De 的 语气 30 <CR> │
132 │ : : : : │
133 │ : : : : │
134 └───────────────────────────┘
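
As an illustration of the layout above, the following sketch reads a text format
dictionary into memory. It is not the "catod" utility. The names TextDictionary,
Entry and parse_text_dictionary are hypothetical, fields are assumed to be separated
by whitespace, the 256-character limits are counted in Python characters, and escapes
in the word field are left undecoded.

```python
from dataclasses import dataclass, field

MAX_WORDS = 70_000   # maximum number of words in a dictionary (Section 8.2)
MAX_FIELD_LEN = 256  # maximum length of the pinyin and word fields


@dataclass
class Entry:
    pinyin: str
    word: str
    cixing: str
    frequency: int


@dataclass
class TextDictionary:
    comment: str = ""
    total: int = 0
    cixing_file: str = ""   # empty means: use the default cixing.data
    kind: str = ""          # "Pinyin", "Zhuyin" or "Bixing"
    entries: list = field(default_factory=list)


def parse_text_dictionary(text: str) -> TextDictionary:
    d = TextDictionary()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # Header line: "\name" optionally followed by a value.
            parts = line[1:].split(None, 1)
            name = parts[0] if parts else ""
            value = parts[1].strip() if len(parts) > 1 else ""
            if name == "comment":
                d.comment = value
            elif name == "total":
                d.total = int(value or 0)
            elif name == "cixing":
                d.cixing_file = value
            else:
                d.kind = name  # the dictionary type line, e.g. "\Pinyin"
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        pinyin, word, cixing, freq = fields
        if len(pinyin) > MAX_FIELD_LEN or len(word) > MAX_FIELD_LEN:
            raise ValueError(f"field longer than {MAX_FIELD_LEN} characters: {line!r}")
        if len(d.entries) >= MAX_WORDS:
            raise ValueError(f"more than {MAX_WORDS} words in one dictionary")
        d.entries.append(Entry(pinyin, word, cixing, int(freq)))
    return d


if __name__ == "__main__":
    sample = "\\comment This is a Pinyin dictionary\n\\total 0\n\\cixing\n\\Pinyin\n\nDe 的 语气 30\n"
    parsed = parse_text_dictionary(sample)
    print(parsed.kind, len(parsed.entries), parsed.entries[0].frequency)
```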
135
136
137 2. Dictionary in Binary Format
138 ━━━━━━━━━━━━━━━
139 This is the binary format dictionary used by the cWnn system. While reading in a file,
140 the cserver is able to determine whether the file is a dictionary via the binary format.
141 Once a dictionary is accessed by the cserver, its contents may be changed. During the
142 termination of cserver, the updated dictionary will be written back to the file.
143
144 Each tuple (词条) in a dictionary has a serial number. The serial number is used for
145 matching the tuples in a dictionary with those in the usage frequency file.
146
147
151 3. System Dictionary and User Dictionary
152 ━━━━━━━━━━━━━━━━━━━━
153 System dictionary refers to the dictionary provided by the system itself. There are
154 two types of system dictionaries. One consists of only characters, while the other
155 consists of words. For the Pinyin input and Zhuyin input environments, the following
156 dictionary files are used :
157
158 (1) level_1.dic - consists of only characters (单字). These are the
159 Chinese characters that are more commonly used.
160 (2) level_2.dic - consists of Chinese characters that are not so
161 commonly used.
162 (3) basic.dic - This is a word dictionary, i.e., it consists of
163 single-character words (单字词) and multi-character
164 words (多字词).
165
166 A user dictionary is a dictionary created by the user, in which the user can
167 register or delete their own words. The dictionary structure is
168 similar to that of the system dictionary.
169
170
171 4. Access to Dictionary Files
172 ━━━━━━━━━━━━━━━
173 Both system and user dictionaries can be added or removed through the settings of the
174 environment files.
175
176 Dictionaries are set via the "setdic" command in the initialization file "cserverrc" (refer
177 to Section 5.3) or in the initialization file "wnnenvrc" (refer to Section 5.5).
178 Similar settings are needed in the reverse initialization file "wnnenvrc_R"
179 (refer to Section 5.6).
180
181 Default path for system dictionary : /usr/local/lib/wnn/zh_CN/dic/sys/
182 ".dic" is the default filename extension for dictionary. For example, level_1.dic
183
184 Default path for user dictionary : /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
185 "ud" is the default filename for user dictionary.
186
187
188 5. Logical Dictionary and Dictionary Files
189 ━━━━━━━━━━━━━━━━━━━━━
190 In the cWnn system, several front-end processors are connected to the cserver, and all
191 the resources managed by cserver are utilized by the different front-end processors.
192 Each dictionary file may be combined with several different usage frequency files. Hence,
193 each combination forms a different logical dictionary.
194
195 A dictionary may also be used for both forward and reverse conversion, such as Pinyin-
196 Hanzi conversion and Hanzi-Pinyin conversion. Hence, they form two separate logical
197 dictionaries. For details, refer to "cwnnstat" in Section 6.4.
198
199 NOTE: ONE default dictionary may form several logical dictionaries.
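
The idea of a logical dictionary can be pictured with a small model. This is a sketch
only: the class LogicalDictionary, the function logical_dictionaries and the user
names in the example are hypothetical and do not reflect how the cserver stores this
information internally.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class LogicalDictionary:
    dictionary_file: str
    frequency_file: str
    reverse: bool  # False: Pinyin-Hanzi conversion, True: Hanzi-Pinyin conversion


def logical_dictionaries(dictionary_file, frequency_files, directions=(False, True)):
    """Enumerate every (frequency file, direction) pairing for one dictionary file."""
    return [
        LogicalDictionary(dictionary_file, freq, rev)
        for freq, rev in product(frequency_files, directions)
    ]


if __name__ == "__main__":
    # One dictionary file shared by two users, each with a private frequency file.
    for ld in logical_dictionaries("basic.dic", ["alice/basic.h", "bob/basic.h"]):
        print(ld)
```

With two frequency files and two conversion directions, the single file basic.dic
yields four logical dictionaries in this model.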
201 ┏━━━━━━━━━━━━━━┓
202 ┃ 8.3 USAGE FREQUENCY FILES ┃
203 ┗━━━━━━━━━━━━━━┛
204
205 Usage frequency files are attached to a dictionary. Every dictionary contains
206 information on the usage frequency of each word. This information represents the
207 default usage frequency for each word in the dictionary. The default usage frequency
208 is obtained statistically by analysing a large amount of Chinese articles.
209
210 Since the usage frequency information of each word is already included in the
211 text format dictionary, there is NO need for an explicit text format of usage frequency
212 file. Refer to the example of text format dictionary in Section 8.2 above.
213
214
215 1. System Usage Frequency File and User Usage Frequency File
216 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
217 Note that the default usage frequency defined by the system may not be suitable for all
218 users. Hence, besides the default usage frequency, the cserver will create a user usage
219 frequency file for each user. The initial file is a copy of the default file, and it
220 is created when the user starts the front-end processor for the first time. As the
221 system is being used by the user, the usage frequency of each word will be changed
222 according to how often a word is being used. Therefore, this user frequency file becomes
223 adapted to the individual user. During the termination of the cserver, or during the
224 termination of those environments using the frequency file, the user usage frequency file
225 will be updated. When the same user activates the front-end processor again, instead of
226 creating a new user usage frequency file, the updated frequency file will be read in by
227 cserver.
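
The life cycle described above (copy the defaults on first use, adjust while the user
works, write back on termination, reload next time) can be sketched as follows. This
is an illustration under stated assumptions: the class UserFrequency is hypothetical,
JSON stands in for the real binary file format, entries are keyed by the serial number
of each tuple, and a use is assumed to raise the frequency by one.

```python
import json
from pathlib import Path


class UserFrequency:
    def __init__(self, default_freq: dict, user_file):
        self.user_file = Path(user_file)
        if self.user_file.exists():
            # Later sessions: read the updated file instead of creating a new one.
            raw = json.loads(self.user_file.read_text(encoding="utf-8"))
            self.freq = {int(serial): value for serial, value in raw.items()}
        else:
            # First session: start from a copy of the default frequencies.
            self.freq = dict(default_freq)

    def record_use(self, serial: int) -> None:
        # Assumption: each use of a word increments its frequency by one.
        self.freq[serial] = self.freq.get(serial, 0) + 1

    def save(self) -> None:
        # Called on termination of the server or of the environment.
        self.user_file.parent.mkdir(parents=True, exist_ok=True)
        self.user_file.write_text(json.dumps(self.freq), encoding="utf-8")


if __name__ == "__main__":
    import tempfile

    defaults = {0: 1200, 1: 10, 2: 30}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "usr" / "alice" / "basic.h"
        session = UserFrequency(defaults, path)
        session.record_use(1)
        session.save()
        print(UserFrequency(defaults, path).freq)
```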
228
229 The usage frequency of each word in the dictionary plays a part in the Hanzi conversion.
230 Hence, the weight for usage frequency of each word may be changed to adjust its impact on
231 the conversion process so as to obtain a more accurate conversion result.
232
233 The conversion evaluation also uses "last used" information, which resides in
234 the usage frequency file as well.
235
236
237 2. Access to Usage Frequency Files
238 ━━━━━━━━━━━━━━━━━
239 Usage frequency files are specified in the initialization files "wnnenvrc" (refer to
240 Section 5.5) and "wnnenvrc_R" (refer to Section 5.6).
241
242 Default path for usage frequency file: /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
243 ".h" is the default filename extension for usage frequency file. For example, basic.h,
244 level_1.h, level_2.h.
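
The naming convention above can be illustrated with a small helper. Two assumptions
are made here that the manual does not state explicitly: that "@USR" in the default
path stands for the user name, and that a frequency file shares the base name of its
dictionary (as the examples basic.h, level_1.h and level_2.h suggest). The function
name frequency_path is hypothetical.

```python
from pathlib import PurePosixPath

USER_DIR_TEMPLATE = "/usr/local/lib/wnn/zh_CN/dic/usr/@USR/"


def frequency_path(dictionary_name: str, user: str) -> str:
    """Return the default usage frequency file path for a dictionary and a user."""
    # Assumption: "@USR" is a placeholder for the user name.
    directory = PurePosixPath(USER_DIR_TEMPLATE.replace("@USR", user))
    # Assumption: "basic.dic" pairs with "basic.h", and so on.
    file_name = PurePosixPath(dictionary_name).with_suffix(".h").name
    return str(directory / file_name)


if __name__ == "__main__":
    print(frequency_path("level_1.dic", "alice"))
```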
245
246
251 ┏━━━━━━━━━━━━━━━━━━━┓
252 ┃ 8.4 GRAMMAR FILES AND CIXING FILES ┃
253 ┗━━━━━━━━━━━━━━━━━━━┛
254
255 The definitions of the grammar (词法) files and the part of speech (词性) file are
256 system dependent. Substantial knowledge of Chinese grammar and of the Pinyin-Hanzi
257 conversion process of this system is required in order to understand them. Here we
258 give only a brief explanation of the essentials of the grammar used in cWnn.
259
260 NOTE: From here on, part of speech is referred to as Cixing (词性).
261
262
263 1. Cixing File in Text Format
264 ━━━━━━━━━━━━━━━
265 The Cixing file defines a set of grammatical attributes on which the definition of
266 the Chinese grammar is based. The grammatical attributes of all the words in the dictionary must
267 be in this Cixing file.
268
269 The content of the Cixing file is interpreted line by line. Whatever comes after
270 a semicolon ";" in a line is regarded as a comment. A backslash "\" means the line is
271 continued on the following line. Refer to the cWnn default Cixing file for an example.
272
273 The Cixing file is divided into three portions:
274 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
275 (a) Tree structure : During a word add operation via the front-end processor,
276 the user needs to choose the appropriate grammatical
277 attribute for the word to be added.
278 This tree structure will be searched accordingly until
279 the user has chosen the required grammatical attribute.
280 For example :
281 普通名词/|普通名:人名—:事物名—
282
283 This means that 普通名词 can be further classified into
284 普通名, 人名 or 事物名. Only the leaves are the actual
285 Cixing that can be attached to words.
286
287 (b) Cixing definitions : These are the Cixing themselves, whose names may include
288 Chinese characters, such as 普通名 and 单字.
289 "@" denotes a null Cixing. A new Cixing can later be
290 put in place of an "@" without affecting
291 compatibility with the existing dictionary and grammar
292 files.
293
294 (c) Combined Cixing : This defines the combined Cixing that contain two or
295 more grammatical attributes. A combined
296 Cixing can be assigned to a single word, which reduces
297 the number of tuples (词条) having the same Chinese
298 characters.
299
301 Example of a text format Cixing file:
302 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
303 ┌──────────────────────────────┐
304 │ ;;;; 词性的树型结构: │
305 │ 名词/|普通名词/:抽象名:时间名:处所名:方位名词/:表人特殊名/ │
306 │ 普通名词/|普通名:人名—:事物名— │
307 │ 方位名词/|单纯方位名:合成方位名 │
308 │ 表人特殊名/|百家姓—:称谓名— │
309 │ : │
310 │ : │
311 │ │
312 │ ;;;; 词性的定义: │
313 │ 终止 ;;; 0 终止, 作为文节的终止 │
314 │ 数字 ;;; 1 数字 │
315 │ @ ;;; 11 │
316 │ 单字 ;;; 13 │
317 │ 普通名 │
318 │ 人名— │
319 │ : │
320 │ : │
321 │ │
322 │ ;;;; 复合词性定义: │
323 │ 姓名词-$普通名:百家姓— │
324 │ 表人物量-$表人量:表物量 │
325 │ 行为动词-$及物动—:不及物动— │
326 │ : │
327 │ : │
328 └──────────────────────────────┘
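
The line-level rules of the Cixing file (comments after ";", continuation with "\")
and the shape of a tree line can be sketched as below. This is not the cWnn parser.
The function names are hypothetical, the delimiters are assumed to be the ASCII
characters ";", "\", "/", "|" and ":", the continuation backslash is assumed to be the
last character of a line, and the reading of a tree line as "parent/|child:child" is
inferred from the example rather than from a formal grammar.

```python
def logical_lines(text: str):
    """Yield comment-free logical lines, joining lines that end with a backslash."""
    pending = ""
    for raw in text.splitlines():
        # Everything after the first ";" is a comment.
        line = raw.split(";", 1)[0].rstrip()
        if line.endswith("\\"):
            # Continued on the following line: keep the part before the backslash.
            pending += line[:-1]
            continue
        line = (pending + line).strip()
        pending = ""
        if line:
            yield line
    if pending.strip():
        yield pending.strip()


def parse_tree_line(line: str):
    """Split 'parent/|child:child' into (parent, [children])."""
    parent, _, children = line.partition("|")
    return parent.rstrip("/"), [c for c in children.split(":") if c]


def is_leaf(name: str) -> bool:
    # Inferred from the example: names ending in "/" have children of their own.
    return not name.endswith("/")


if __name__ == "__main__":
    sample = (
        ";;;; tree structure\n"
        "名词/|普通名词/:抽象名 ; trailing comment\n"
        "普通名词/|普通名:\\\n"
        "人名:事物名\n"
    )
    for entry in logical_lines(sample):
        print(parse_tree_line(entry))
```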
329
330
351 2. Grammar Files in Text Format
352 ━━━━━━━━━━━━━━━━
353 Based on the defined set of Cixing, a set of grammar rules for Chinese is defined in
354 the grammar file. This grammar file is a database and is read during the startup of
355 the cserver.
356
357 The text format grammar files are as follows:
358 (1) con.master
359 (2) con.masterR
360 (3) con.attr
361 (4) con.jirattr
362 (5) con.jircon
363 (6) con.shuutan
364 (7) con.shuutanR
365 These files may be found under the directory "/cdic" in the cWnn source.
366
367 The binary format grammar file may be created using the "catof" utility (refer to
368 Section 6.9). This binary format grammar file will be used by the cserver.
369 In order to create the binary grammar file, the Cixing text file is also needed in
370 addition to the seven text format grammar files listed above.
371
372 When cserver reads in the grammar file, it is able to determine whether it is a
373 grammar file by analysing the binary format. Two or more grammar files can be
374 managed by the cserver. Different user environments may make use of different
375 grammar files. A user is also able to change the grammar file dynamically via the
376 operation function (文法变更). Refer to Section 5.2.
377
378
379 3. Assess of Grammar File and Cixing File
380 ━━━━━━━━━━━━━━━━━━━━━
381 Default path of Cixing text file: /usr/local/lib/wnn/zh_CN/
382 The default filename for the text format Cixing file in cWnn is "cixing.data"
383
384 Default path of grammar binary file: /usr/local/lib/wnn/zh_CN/dic/sys/
385 The default filenames for the binary format grammar files in cWnn are "full.con"
386 and "full.conR".