*************************************************
*        Chapter 7  CWNN FILE MANAGEMENT        *
*************************************************

7.1 OVERVIEW
=============

In the cWnn system, the server plays an important role in managing the
different resources and files. The following files are read in when cserver
starts up, or when a client requests them:

(1) Dictionary files
(2) Usage frequency files
(3) Grammar files
(4) Pinyin error tolerance files

The cWnn files listed above are explained in this chapter.

7.2 DICTIONARY FILES
=====================

Dictionaries come in two formats:

(1) Text format
(2) Binary format

A text-format dictionary is human-readable; a binary-format dictionary is not.
The text dictionary is converted to binary format with the "atod" utility
(please refer to Section 4.6). The cWnn system uses only the binary-format
dictionary. A binary dictionary can, however, be converted back to text form
with the "dtoa" utility (please refer to Section 4.7). A dictionary can hold
between 0 and 65535 words.

1. Dictionary in Text Format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The format of the text dictionary is shown below. The first two lines may be
omitted; a text dictionary produced by "dtoa", however, has the full format
shown here.

+-------------------------------------------------------+
| \comment  Comment                             <CR>    |
| \total    Total frequency                     <CR>    |
| Pinyin    Word     Cixing    Frequency        <CR>    |
| Pinyin    Word     Cixing    Frequency        <CR>    |
| Pinyin    Word     Cixing    Frequency        <CR>    |
|   :         :        :          :                     |
|   :         :        :          :                     |
|   :         :        :          :                     |
|                                                       |
| (EOF)                                                 |
+-------------------------------------------------------+

* Comment         : A comment that can be added to the dictionary.
* Total frequency : The total usage frequency over all conversions
                    performed with this dictionary.
* Pinyin          : For a Pinyin-Hanzi conversion dictionary, the
                    pronunciation of the word.
                    For a Bianma dictionary, the code of the word.
                    The maximum length of Pinyin is 256 characters.
* Word            : Each word must not exceed 256 characters.
                    A space, carriage return or other special character
                    can be included in a word by writing its octal code
                    after "\0".
                    If any character other than "0" follows the "\", it
                    stands for that character itself.
                    For example, "\\" stands for "\" itself.
* Cixing          : The part of speech defined in the grammar file, such
                    as noun or pronoun. For details, please refer to the
                    grammar files explained in Section 7.4.
* Frequency       : The usage frequency of the word.
                    In a text-format dictionary, the frequency ranges
                    from 0 to 2400.
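
The field rules above can be checked mechanically. The following is a minimal
Python sketch, not part of cWnn, that reads a text-format dictionary and checks
each entry against the limits given in this section. It assumes that fields are
separated by whitespace, and it leaves "\" escapes undecoded. The names Entry
and parse_text_dictionary, and the sample entries, are hypothetical.

```python
from dataclasses import dataclass

MAX_FIELD_LEN = 256   # limit stated in this section for Pinyin and for Word
MAX_FREQUENCY = 2400  # upper bound for Frequency in a text-format dictionary


@dataclass
class Entry:
    pinyin: str
    word: str
    cixing: str
    frequency: int


def parse_text_dictionary(text):
    """Return (comment, total, entries) for a text-format dictionary.

    Assumption: fields are separated by runs of whitespace.
    Escape sequences inside a word are kept exactly as written.
    """
    comment, total, entries = None, None, []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("\\comment"):
            comment = line[len("\\comment"):].strip()
            continue
        if line.startswith("\\total"):
            total = int(line[len("\\total"):].strip())
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {lineno}: expected 4 fields, got {len(fields)}")
        pinyin, word, cixing, freq_text = fields
        frequency = int(freq_text)
        if len(pinyin) > MAX_FIELD_LEN or len(word) > MAX_FIELD_LEN:
            raise ValueError(f"line {lineno}: Pinyin or Word exceeds {MAX_FIELD_LEN} characters")
        if not 0 <= frequency <= MAX_FREQUENCY:
            raise ValueError(f"line {lineno}: frequency {frequency} is out of range")
        entries.append(Entry(pinyin, word, cixing, frequency))
    return comment, total, entries


# Hypothetical sample; the Pinyin spelling and the Cixing name are made up.
sample = "\\comment demo\n\\total 30\nni3hao3 你好 noun 10\nzhong1guo2 中国 noun 20\n"
comment, total, entries = parse_text_dictionary(sample)
print(comment, total, len(entries))
```

Because the two header lines are optional, the sketch treats them as ordinary
lines that may or may not be present rather than requiring them at the top.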

2. Dictionary in Binary Format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is the dictionary format used by the cWnn system. When the server reads in
a file, it can tell from the binary format whether the file is a dictionary.
Once the server has accessed a dictionary, its contents may be changed. When
the server terminates, the updated dictionary is written back to the file.

Every word in the dictionary has a serial number. The serial numbers are used
to match the words in the dictionary with the entries in the usage frequency
file.

3. System Dictionary and User Dictionary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A system dictionary is a dictionary provided by the system itself. The Pinyin
and Zhuyin input environments use the following dictionary files:

    /usr/local/lib/wnn/zh_CN/dic/sys/level-1.dic
    /usr/local/lib/wnn/zh_CN/dic/sys/level-2.dic
    /usr/local/lib/wnn/zh_CN/dic/sys/basic-1.dic
    /usr/local/lib/wnn/zh_CN/dic/sys/basic-2.dic

A user dictionary is a dictionary created by the user, who can add words to it
and delete words from it. Its structure is similar to that of the system
dictionary. The standard path of the user dictionary is
"/usr/local/lib/wnn/zh_CN/dic/usr/@USR/ud", where "ud" is the standard filename
for a user dictionary.

Both system and user dictionaries can be added or removed through the settings
of the system environment. Please refer to Chapter 5 for details.

4. Logical Dictionary and Dictionary Files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the cWnn system, several clients are connected to one server, and the
resources managed by the server are shared among them. Each dictionary file may
have several different usage frequency files, so a single dictionary file can
appear as several logical dictionaries.

In addition, a dictionary can be used in both directions, for Pinyin-Hanzi
conversion and for the reverse Hanzi-Pinyin conversion. In that case the one
dictionary file counts as two logical dictionaries.
For details, please refer to "wnnstat" in Section 4.6.
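
The relation between one dictionary file and its logical dictionaries can be
pictured with a small Python sketch. It is only an illustration of the idea
described above, not cWnn code. It assumes that every usage frequency file can
be combined with either conversion direction, which this section does not state
explicitly. The class name, function name and file names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class LogicalDictionary:
    dictionary_file: str  # the physical dictionary file
    frequency_file: str   # the usage frequency file paired with it
    reverse: bool         # False: Pinyin-Hanzi, True: Hanzi-Pinyin


def logical_dictionaries(dictionary_file, frequency_files, directions=(False, True)):
    """List the logical dictionaries that one physical file can give rise to.

    Assumption: frequency files and directions combine freely.
    """
    return [
        LogicalDictionary(dictionary_file, freq, rev)
        for freq, rev in product(frequency_files, directions)
    ]


# Hypothetical file names for two users sharing one system dictionary.
for entry in logical_dictionaries("level-1.dic", ["user_a.h", "user_b.h"]):
    print(entry)
```

Under these assumptions, one dictionary file shared by two users and used in
both directions yields four logical dictionaries.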

7.3 USAGE FREQUENCY FILES
==========================

Usage frequency files are attached to dictionaries. Every dictionary holds
usage frequency information for each of its words. This information is the
standard usage frequency of each word, obtained statistically by analyzing a
large number of Chinese articles. Since the usage frequency information is
included in the text form of the dictionary, there is no explicit system usage
frequency file.

Note that the standard usage frequency defined by the system may not suit every
user (client). Hence, in addition to the standard frequency, the server creates
a user usage frequency file for each user. The initial file is a copy of the
standard usage frequency and is created when the user runs the client for the
first time. As the user works with the system, the usage frequency of each word
changes according to how often the word is used, so the file gradually adapts
to the individual user. The user usage frequency file is updated when the
server or the environment terminates. When the same user starts the client
again, the server reads this file in instead of creating a new one.
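
The life cycle described above (copy on first use, update during use, write
back on termination, reload on the next session) can be sketched in Python as
follows. This is an illustration only. The real usage frequency file is a
binary file whose layout is not described here, so JSON is used as a stand-in.
The increment of 1 per use and all function names are assumptions.

```python
import json
import os


def load_user_frequencies(standard, user_file):
    """Return the user's frequency table, indexed by word serial number.

    On first use the table is a copy of the standard frequencies;
    afterwards it is read back from the user's own file.
    """
    if os.path.exists(user_file):
        with open(user_file, encoding="utf-8") as f:
            return json.load(f)
    return list(standard)  # copy, so the standard values stay untouched


def record_use(frequencies, serial):
    """Raise the frequency of the word with this serial number (assumed +1)."""
    frequencies[serial] += 1


def save_user_frequencies(frequencies, user_file):
    """Write the table back, as happens when the server or environment ends."""
    with open(user_file, "w", encoding="utf-8") as f:
        json.dump(frequencies, f)
```

Indexing the table by serial number mirrors the way serial numbers are used to
match dictionary words with the usage frequency file.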

The usage frequency of each word in the dictionary is one of the factors in
Hanzi conversion. The weight given to usage frequency can therefore be changed
to adjust its impact on the conversion process and obtain a more accurate
conversion result. The conversion evaluation also uses "last used" information,
which likewise resides in the usage frequency file.

The standard path for the usage frequency file is
"/usr/local/lib/wnn/zh_CN/dic/usr/@USR/dictionary_name.h".
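
As an illustration of the two standard per-user paths given in this chapter,
the following Python sketch builds them from a user name. It assumes that
"@USR" stands for the user's name and that "dictionary_name" is the name of the
dictionary the frequency file belongs to. Neither point is spelled out here, and
the function names and sample values are hypothetical.

```python
import posixpath

# Directory part shared by both standard paths quoted in this chapter.
USER_DIR_TEMPLATE = "/usr/local/lib/wnn/zh_CN/dic/usr/@USR"


def user_dictionary_path(user):
    """Standard user dictionary path, assuming @USR is the user name."""
    return posixpath.join(USER_DIR_TEMPLATE.replace("@USR", user), "ud")


def user_frequency_path(user, dictionary_name):
    """Standard usage frequency path: <user directory>/<dictionary_name>.h"""
    return posixpath.join(USER_DIR_TEMPLATE.replace("@USR", user), dictionary_name + ".h")


# Hypothetical user and dictionary names.
print(user_dictionary_path("alice"))
print(user_frequency_path("alice", "level-1"))
```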

7.4 GRAMMAR FILES AND CIXING FILES
===================================

The definitions in the grammar files and the Cixing file depend on the system.
Understanding them requires substantial knowledge of Chinese grammar and of the
Pinyin-Hanzi conversion process used by this system. Only a brief explanation
of the essentials is given here.

1. Cixing File
~~~~~~~~~~~~~~
The Cixing file defines a set of grammatical attributes on which the definition
of Chinese grammar is based. The grammatical attributes of every word in the
dictionary must belong to this set.

Standard path : /usr/local/lib/wnn/zh_CN/dic/cixing.data

The Cixing file is interpreted line by line. Everything after a semicolon ";"
on a line is treated as a comment, and a backslash "\" means that the line
continues on the following line.
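
The two lexical rules above can be sketched in Python as a small preprocessing
step. This is not the cWnn reader itself. It assumes that the comment is removed
first and that the backslash must then be the last character on the line. The
function name and the sample content are hypothetical, since the syntax of the
actual entries is not described here.

```python
def logical_lines(text):
    """Strip ";" comments and join lines that end with a backslash."""
    result, pending = [], ""
    for raw in text.splitlines():
        # Assumption: everything from the first ";" onward is a comment.
        line = raw.split(";", 1)[0].rstrip()
        if line.endswith("\\"):
            pending += line[:-1]  # drop the backslash and keep collecting
            continue
        line = pending + line
        pending = ""
        if line.strip():
            result.append(line)
    if pending.strip():
        result.append(pending)  # a trailing continuation with no next line
    return result


# Hypothetical content: one commented line and one continued line.
sample = "noun ; a comment\nverb \\\n  adverb\n; only a comment\n"
for line in logical_lines(sample):
    print(line.split())
```

Lines that are empty once the comment has been removed are dropped, so the
caller sees only lines that carry definitions.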

The Cixing file is divided into three portions:

<Table-c-7.1>

2. Grammar Files
~~~~~~~~~~~~~~~~
Based on the defined set of Cixing, the grammar files define a grammar for
Chinese. The text format of a grammar file is human-readable; the binary
format is not. A readable grammar file is converted to binary format with the
"atoc" utility (refer to Section 4.8). The server can use only the binary form
of a grammar file.

Standard path : /usr/local/lib/wnn/zh_CN/dic/full.con

When the server reads the binary form of a grammar file, it can determine
whether the file is a grammar file. The server can manage two or more grammar
files. Different user environments can use different grammar files, and a user
can also change the grammar file dynamically.

For the text form of the grammar file, please refer to the "CWNN APPLICATION
DEVELOPMENT MANUAL".