┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Chapter 8 CWNN FILE MANAGEMENT ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

┏━━━━━━━━━━━━━━┓
┃ 8.1 OVERVIEW ┃
┗━━━━━━━━━━━━━━┛

In the cWnn system, the cserver manages the system's resources and files.

Resource files are read in when the cserver starts up. Any file that is not read
at that point is read in later, when a front-end processor requests it during its
own startup.

There are three categories of files in cWnn:

    (1) Dictionary files
    (2) Usage frequency files
    (3) Grammar files

Each of the three cWnn file types is explained in detail in the following sections.

┏━━━━━━━━━━━━━━━━━━━━━━┓
┃ 8.2 DICTIONARY FILES ┃
┗━━━━━━━━━━━━━━━━━━━━━━┛

Dictionaries come in two formats :   (1) Text format
                                     (2) Binary format
A text format dictionary is human-readable; a binary format dictionary is not.
A text format dictionary is converted to binary format with the "catod"
utility (refer to Section 6.7). Only binary format dictionaries are used by the cWnn
system. A binary format dictionary may be converted back to text format with
the "cdtoa" utility (refer to Section 6.8).

The maximum number of words allowed in a dictionary is 70,000.

1. Dictionary in Text Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The text format is as follows:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
┌──────────────────────────────────────────────┐
│ \comment  <Comment>                   <CR>   │
│ \total    <Total_frequency>           <CR>   │
│ \cixing   <Dict_cixing>               <CR>   │
│ \Pinyin                               <CR>   │
│                                              │
│ pinyin    word    Cixing    Frequency <CR>   │
│ pinyin    word    Cixing    Frequency <CR>   │
│ pinyin    word    Cixing    Frequency <CR>   │
│   :        :        :          :             │
│   :        :        :          :             │
│ (EOF)                                        │
└──────────────────────────────────────────────┘

Description:
  - comment   : A comment describing the dictionary.
  - total     : The total number of times the dictionary has been used
                for conversion, i.e. the usage frequency of the dictionary.
  - cixing    : This specifies the part of speech definitions used by THIS
                particular dictionary ONLY. The format is the same as that of
                the system standard Cixing file (cixing.data). Refer to
                Section 8.4.
                If no part of speech is specified here, the default file
                "/usr/local/lib/wnn/zh_CN/cixing.data" is used.
  - Pinyin    : This line declares the type of the dictionary. Depending on
                the dictionary itself, it can instead be "Zhuyin" or "Bixing".
  - pinyin    : For a Pinyin-Hanzi conversion dictionary, this field is the
                pronunciation of each character/word.
                For encoded input, it is the code of each character/word.
                The maximum length of this field is 256 characters.
  - word      : The actual Chinese character/word. It should not exceed
                256 characters.
                To include a space, carriage return or other special
                character in the character/word, write its code in octal
                after "\0".
                If a character other than "0" follows the "\", it stands
                for that character itself.
                For example, "\\" stands for "\" itself.
  - Cixing    : The part of speech defined in the grammar file, such as
                noun, pronoun, etc. For details, refer to the grammar files
                explained in Section 8.4.
  - Frequency : The usage frequency of the word.

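The escape rule for the word field can be sketched as follows. This is only an illustration, not the code used by "catod": the function name decode_word is hypothetical, and since the digit count is not stated above, the sketch assumes that "\0" is followed by one to three further octal digits (for example "\040" for a space).

```python
import re

# Assumption: an octal escape is "\0" plus one to three more octal digits.
# Any other single character after "\" is taken literally.
_ESCAPE = re.compile(r"\\(0[0-7]{1,3}|.)", re.DOTALL)


def decode_word(field):
    """Expand backslash escapes in the word field of a dictionary entry."""

    def expand(match):
        body = match.group(1)
        if body[0] == "0" and len(body) > 1:
            # "\0" followed by octal digits: the character with that code.
            return chr(int(body, 8))
        # A character other than an octal escape stands for itself.
        return body

    return _ESCAPE.sub(expand, field)


print(repr(decode_word(r"a\040b")))  # a space between "a" and "b"
print(repr(decode_word(r"\\")))      # a single backslash
```

Text without a backslash passes through unchanged, so the function can be applied to every word field.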

Example of a text format dictionary:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
┌──────────────────────────────────────────────┐
│ \comment This is a Pinyin dictionary         │
│ \total 0                                     │
│ \cixing                                      │
│ \Pinyin                                      │
│                                              │
│ Wǒ      我    人称代      1200    <CR>       │
│ Rén     人    单位量      10      <CR>       │
│ De      的    语气        30      <CR>       │
│  :       :      :          :                 │
│  :       :      :          :                 │
└──────────────────────────────────────────────┘

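To make the layout concrete, here is a minimal sketch of a reader for the text format. It is not the parser used by "catod". The names Entry, TextDictionary and parse_text_dictionary are hypothetical, fields are assumed to be separated by whitespace, and the pinyin in the sample is written with ordinary tone-marked letters, which may differ from the encoding cWnn itself uses.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    pinyin: str
    word: str
    cixing: str
    frequency: int


@dataclass
class TextDictionary:
    comment: str = ""
    total: int = 0
    cixing: str = ""   # empty: fall back to the default cixing.data
    kind: str = ""     # "Pinyin", "Zhuyin" or "Bixing"
    entries: list = field(default_factory=list)


def parse_text_dictionary(text):
    """Read the header lines and the four-field entries of a text dictionary."""
    result = TextDictionary()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            parts = line[1:].split(None, 1)
            keyword = parts[0] if parts else ""
            value = parts[1] if len(parts) > 1 else ""
            if keyword == "comment":
                result.comment = value
            elif keyword == "total":
                result.total = int(value or 0)
            elif keyword == "cixing":
                result.cixing = value
            else:
                # The remaining header keyword names the dictionary type.
                result.kind = keyword
            continue
        # Assumption: exactly four whitespace-separated fields per entry.
        pinyin, word, cixing, frequency = line.split()
        result.entries.append(Entry(pinyin, word, cixing, int(frequency)))
    return result


SAMPLE = """\\comment This is a Pinyin dictionary
\\total 0
\\cixing
\\Pinyin

Wǒ 我 人称代 1200
Rén 人 单位量 10
De 的 语气 30
"""

dictionary = parse_text_dictionary(SAMPLE)
print(dictionary.kind, len(dictionary.entries))
```

A line with the wrong number of fields raises a ValueError here; a real tool would presumably report the offending line instead.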
2. Dictionary in Binary Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
This is the dictionary format used by the cWnn system. When reading in a file, the
cserver determines from the binary format whether the file is a dictionary.
Once a dictionary has been loaded by the cserver, its contents may change. When the
cserver terminates, the updated dictionary is written back to the file.

Each tuple (词条) in a dictionary has a serial number. The serial number is used to
match the tuples in a dictionary with those in the usage frequency file.

3. System Dictionary and User Dictionary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
A system dictionary is a dictionary provided by the system itself. There are
two types of system dictionaries: one consists only of characters, while the other
consists of words. For the Pinyin input and Zhuyin input environments, the following
dictionary files are used :

    (1) level_1.dic  - consists only of characters (单字). These are the
                       more commonly used Chinese characters.
    (2) level_2.dic  - consists of the less commonly used Chinese characters.
    (3) basic.dic    - a word dictionary, i.e. it consists of single-character
                       words (单字词) and multi-character words (多字词).

A user dictionary is a dictionary created by the user, in which the user can
register or delete his or her own words. Its structure is similar to that of the
system dictionary.

4. Access to Dictionary Files
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Both system and user dictionaries can be added or removed through the settings in the
environment files.

A dictionary is set via the "setdic" command in the initialization file "cserverrc"
(refer to Section 5.3) or in the initialization file "wnnenvrc" (refer to Section 5.5).
Similar settings are needed in the reverse initialization file "wnnenvrc_R"
(refer to Section 5.6).

Default path for system dictionaries : /usr/local/lib/wnn/zh_CN/dic/sys/
".dic" is the default filename extension for a dictionary, for example level_1.dic.

Default path for user dictionaries   : /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
"ud" is the default filename for a user dictionary.

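As a small illustration of the defaults above, the sketch below builds the full default file names. The function names are hypothetical, and it assumes that "@USR" is a placeholder that is replaced by the user's name; the user name "alice" is invented for the example.

```python
from pathlib import PurePosixPath

# Default locations quoted from this section.
SYSTEM_DIC_DIR = "/usr/local/lib/wnn/zh_CN/dic/sys/"
USER_DIC_DIR = "/usr/local/lib/wnn/zh_CN/dic/usr/@USR/"


def system_dictionary_path(name):
    """Default path of a system dictionary such as level_1 or basic."""
    return str(PurePosixPath(SYSTEM_DIC_DIR) / (name + ".dic"))


def user_dictionary_path(user, filename="ud"):
    """Default path of a user's dictionary ("ud" unless another name is given)."""
    # Assumption: "@USR" stands for the user's name.
    return str(PurePosixPath(USER_DIC_DIR.replace("@USR", user)) / filename)


print(system_dictionary_path("level_1"))
print(user_dictionary_path("alice"))
```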
5. Logical Dictionary and Dictionary Files
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
In the cWnn system, several front-end processors are connected to the cserver, and
the resources managed by the cserver are shared among them.
Each dictionary file may be combined with several different usage frequency files, and
each combination forms a logically distinct dictionary.

A dictionary may also be used for both forward and reverse conversion, such as
Pinyin-Hanzi conversion and Hanzi-Pinyin conversion. The two uses form two separate
logical dictionaries. For details, refer to "cwnnstat" in Section 6.4.

NOTE: ONE default dictionary may form several logical dictionaries.

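One way to picture a logical dictionary is as a combination of a dictionary file, a usage frequency file and a conversion direction. The sketch below is only a model of that idea, not the data structure used inside the cserver; the type LogicalDictionary, the function name and the user names are all hypothetical.

```python
from itertools import product
from typing import NamedTuple


class LogicalDictionary(NamedTuple):
    dictionary_file: str
    frequency_file: str
    direction: str  # "forward" (Pinyin-Hanzi) or "reverse" (Hanzi-Pinyin)


def logical_dictionaries(dictionary_file, frequency_files, directions):
    """List every logical dictionary that one dictionary file can form."""
    return [
        LogicalDictionary(dictionary_file, frequency_file, direction)
        for frequency_file, direction in product(frequency_files, directions)
    ]


combos = logical_dictionaries(
    "basic.dic",
    ["usr/alice/basic.h", "usr/bob/basic.h"],  # invented per-user files
    ["forward", "reverse"],
)
for combo in combos:
    print(combo)
```

With two frequency files and two directions, the single file basic.dic appears as four logical dictionaries.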
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 8.3 USAGE FREQUENCY FILES ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Usage frequency files are attached to a dictionary. Every dictionary contains
information on the usage frequency of each of its words. This information is the
default usage frequency of each word in the dictionary. The default usage frequencies
were obtained statistically by analysing a large amount of Chinese text.

Since the usage frequency of each word is already included in the text format
dictionary, there is NO need for a separate text format for usage frequency
files. Refer to the example of a text format dictionary in Section 8.2 above.

1. System Usage Frequency File and User Usage Frequency File
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The default usage frequencies defined by the system may not suit every user. Hence,
besides the default usage frequencies, the cserver creates a user usage frequency
file for each user. The initial file is a copy of the default file, and it is created
when the user starts the front-end processor for the first time. As the user works with
the system, the usage frequency of each word changes according to how often the word
is used, so the user usage frequency file adapts to the individual user. The file is
updated when the cserver terminates, or when the environments using it terminate.
When the same user starts the front-end processor again, the cserver reads in the
updated frequency file instead of creating a new one.

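The life cycle described above (copy the defaults once, then adapt them per user) can be sketched as below. This is a simplified model, not the cserver implementation: the class name UserFrequency is hypothetical, frequencies are keyed by the tuple serial number, and adding 1 per use is an assumed update rule.

```python
class UserFrequency:
    """Per-user usage frequencies, seeded from the dictionary defaults."""

    def __init__(self, defaults):
        # The initial user file is a copy of the default frequencies,
        # keyed here by the serial number of each tuple.
        self.counts = dict(defaults)

    def record_use(self, serial):
        # Assumption: each use adds 1; the actual rule is not given here.
        self.counts[serial] = self.counts.get(serial, 0) + 1


defaults = {0: 1200, 1: 10, 2: 30}
alice = UserFrequency(defaults)
alice.record_use(1)
alice.record_use(1)
print(alice.counts[1], defaults[1])
```

Because the user data is a copy, the default frequencies stay unchanged while each user's values drift with use.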
The usage frequency of each word in the dictionary plays a part in Hanzi conversion.
The weight given to usage frequency may therefore be changed to adjust its impact on
the conversion process and so obtain a more accurate conversion result.

The "last used" information, which is also taken into account when conversions are
evaluated, resides in the usage frequency file as well.

2. Access to Usage Frequency Files
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Usage frequency files are specified in the initialization files "wnnenvrc" (refer to
Section 5.5) and "wnnenvrc_R" (refer to Section 5.6).

Default path for usage frequency files : /usr/local/lib/wnn/zh_CN/dic/usr/@USR/
".h" is the default filename extension for a usage frequency file, for example basic.h,
level_1.h and level_2.h.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 8.4 GRAMMAR FILES AND CIXING FILES ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

The definitions in the grammar (词法) files and the part of speech (词性) file are
specific to this system. Understanding them requires substantial knowledge of Chinese
grammar and of the system's Pinyin-Hanzi conversion process. Only a brief explanation
of the essentials of the grammar used in cWnn is given here.

NOTE: From here on, part of speech is referred to as Cixing (词性).

1. Cixing File in Text Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The Cixing file defines the set of grammatical attributes on which the definition of
the Chinese grammar is based. The grammatical attributes of all the words in the
dictionary must be defined in this Cixing file.

The content of the Cixing file is interpreted line by line. Whatever follows a
semicolon ";" on a line is regarded as a comment. A backslash "\" means that the line
is continued on the following line. Refer to the cWnn default Cixing file for an example.

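The two line-level rules (";" starts a comment, "\" continues a line) can be sketched as a small preprocessing step. This is an illustration rather than the code cWnn uses: the function name logical_lines is hypothetical, and it assumes the comment is removed before the trailing backslash is checked.

```python
def logical_lines(text):
    """Strip ";" comments and join backslash-continued lines of a Cixing file."""
    result = []
    pending = ""
    for raw in text.splitlines():
        # Everything after the first ";" on a line is a comment.
        line = raw.split(";", 1)[0].rstrip()
        if line.endswith("\\"):
            # A trailing backslash continues the entry on the next line.
            pending += line[:-1]
            continue
        line = (pending + line).strip()
        pending = ""
        if line:
            result.append(line)
    if pending.strip():
        # A continuation on the very last line still yields an entry.
        result.append(pending.strip())
    return result


SAMPLE = (
    "方位名词/|单纯方位名:\\\n"
    "合成方位名 ;;; tree entry split over two lines\n"
    ";;;; a comment-only line\n"
    "数字 ;;; 1\n"
)
print(logical_lines(SAMPLE))
```

Comment-only and blank lines disappear, and the split entry comes back as a single line.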
The Cixing file is divided into three portions:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(a) Tree structure     : When adding a word via the front-end processor,
                         the user needs to choose the appropriate grammatical
                         attribute for the word.
                         The tree structure is traversed until the user has
                         chosen the required grammatical attribute.
                         For example :
                             普通名词/|普通名:人名—:事物名—

                         This means that 普通名词 can be further classified into
                         普通名, 人名 or 事物名. Only the leaves are actual
                         Cixing that can be attached to words.

(b) Cixing definitions : These are the Cixing themselves, whose names may include
                         Chinese characters, such as 普通名 and 单字.
                         "@" denotes a null Cixing. A "@" entry may later be
                         replaced by a new Cixing without affecting
                         compatibility with the existing dictionary and grammar
                         files.

(c) Combined Cixing    : This defines combined Cixing, which contain two or
                         more grammatical attributes. A combined Cixing can be
                         assigned to a single word, which reduces the number
                         of tuples (词条) having the same Chinese characters.

Example of a text format Cixing file:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
┌────────────────────────────────────────────────────────────┐
│ ;;;; 词性的树型结构:                                       │
│ 名词/|普通名词/:抽象名:时间名:处所名:方位名词/:表人特殊名/ │
│ 普通名词/|普通名:人名—:事物名—                             │
│ 方位名词/|单纯方位名:合成方位名                            │
│ 表人特殊名/|百家姓—:称谓名—                                │
│     :                                                      │
│     :                                                      │
│                                                            │
│ ;;;; 词性的定义:                                           │
│ 终止      ;;; 0 终止, 作为文节的终止                       │
│ 数字      ;;; 1 数字                                       │
│ @         ;;; 11                                           │
│ 单字      ;;; 13                                           │
│ 普通名                                                     │
│ 人名—                                                      │
│     :                                                      │
│     :                                                      │
│                                                            │
│ ;;;; 复合词性定义:                                         │
│ 姓名词-$普通名:百家姓—                                     │
│ 表人物量-$表人量:表物量                                    │
│ 行为动词-$及物动—:不及物动—                                │
│     :                                                      │
│     :                                                      │
└────────────────────────────────────────────────────────────┘

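The three kinds of lines in the example can be told apart with a short sketch. The syntax here is inferred from the example only and is an assumption, not the documented grammar of the file: "|" is taken to separate a tree node from its children, "$" to separate a combined Cixing from its parts, ":" to separate list items, and a trailing "/" to mark a node that is not a leaf. The function names are hypothetical.

```python
def parse_cixing_line(line):
    """Classify one line as a tree node, a combined Cixing or a definition."""
    if "|" in line:
        parent, children = line.split("|", 1)
        return ("tree", parent, children.split(":"))
    if "$" in line:
        name, parts = line.split("$", 1)
        return ("combined", name, parts.split(":"))
    return ("definition", line, [])


def leaves(tree_lines):
    """Collect the children that do not end in "/", i.e. the attachable Cixing."""
    found = []
    for line in tree_lines:
        kind, _, children = parse_cixing_line(line)
        if kind == "tree":
            found.extend(c for c in children if not c.endswith("/"))
    return found


print(parse_cixing_line("方位名词/|单纯方位名:合成方位名"))
print(parse_cixing_line("表人物量-$表人量:表物量"))
print(leaves(["名词/|普通名词/:抽象名:时间名", "方位名词/|单纯方位名:合成方位名"]))
```

The input lines are expected to be free of comments and continuations already.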
2. Grammar Files in Text Format
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Based on the defined set of Cixing, a set of grammar rules for Chinese is defined in
the grammar file. The grammar file serves as a database and is read when the cserver
starts up.

The text format grammar files are as follows:
    (1) con.master
    (2) con.masterR
    (3) con.attr
    (4) con.jirattr
    (5) con.jircon
    (6) con.shuutan
    (7) con.shuutanR
These files can be found under the directory "/cdic" in the cWnn source.

The binary format grammar file is created with the "catof" utility (refer to
Section 6.9). It is this binary format grammar file that is used by the cserver.
Creating the binary grammar file requires the Cixing text file in addition to the
seven text format grammar files listed above.

When the cserver reads in a grammar file, it determines from the binary format whether
the file is a grammar file. The cserver can manage two or more grammar files, and
different user environments may make use of different grammar files. A user can also
change the grammar file dynamically via the operation function (文法变更). Refer to
Section 5.2.

3. Access to Grammar File and Cixing File
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Default path of the Cixing text file    : /usr/local/lib/wnn/zh_CN/
The default filename of the text format Cixing file in cWnn is "cixing.data".

Default path of the binary grammar files : /usr/local/lib/wnn/zh_CN/dic/sys/
The default filenames of the binary format grammar files in cWnn are "full.con"
and "full.conR".