comparison src/charset.h @ 35478:340a9e4aeb29
comment fixes
author | Dave Love <fx@gnu.org> |
---|---|
date | Mon, 22 Jan 2001 23:20:06 +0000 |
parents | 9a2cf1e0032c |
children | 6f52e8c1039f |
35477:e6bffd5c5287 | 35478:340a9e4aeb29 |
---|---|
27 /*** GENERAL NOTE on CHARACTER SET (CHARSET) *** | 27 /*** GENERAL NOTE on CHARACTER SET (CHARSET) *** |
28 | 28 |
29 A character set ("charset" hereafter) is a meaningful collection | 29 A character set ("charset" hereafter) is a meaningful collection |
30 (i.e. language, culture, functionality, etc) of characters. Emacs | 30 (i.e. language, culture, functionality, etc) of characters. Emacs |
31 handles multiple charsets at once. Each charset corresponds to one | 31 handles multiple charsets at once. Each charset corresponds to one |
32 of ISO charsets. Emacs identifies a charset by a unique | 32 of the ISO charsets. Emacs identifies a charset by a unique |
33 identification number, whereas ISO identifies a charset by a triplet | 33 identification number, whereas ISO identifies a charset by a triplet |
34 of DIMENSION, CHARS and FINAL-CHAR. So, hereafter, just saying | 34 of DIMENSION, CHARS and FINAL-CHAR. So, hereafter, just saying |
35 "charset" means an identification number (integer value). | 35 "charset" means an identification number (integer value). |
36 | 36 |
37 The value range of charset is 0x00, 0x81..0xFE. There are four | 37 The value range of charsets is 0x00, 0x81..0xFE. There are four |
38 kinds of charset depending on DIMENSION (1 or 2) and CHARS (94 or | 38 kinds of charset depending on DIMENSION (1 or 2) and CHARS (94 or |
39 96). For instance, a charset of DIMENSION2_CHARS94 contains 94x94 | 39 96). For instance, a charset of DIMENSION2_CHARS94 contains 94x94 |
40 characters. | 40 characters. |
41 | 41 |
42 Within Emacs Lisp, a charset is treated as a symbol which has a | 42 Within Emacs Lisp, a charset is treated as a symbol which has a |
43 property `charset'. The property value is a vector containing | 43 property `charset'. The property value is a vector containing |
44 various information about the charset. For readability of C codes, | 44 various information about the charset. For readability of C code, |
45 we use the following convention for C variable names: | 45 we use the following convention for C variable names: |
46 charset_symbol: Emacs Lisp symbol of a charset | 46 charset_symbol: Emacs Lisp symbol of a charset |
47 charset_id: Emacs Lisp integer of an identification number of a charset | 47 charset_id: Emacs Lisp integer of an identification number of a charset |
48 charset: C integer of an identification number of a charset | 48 charset: C integer of an identification number of a charset |
49 | 49 |
50 Each charset (except for ascii) is assigned a base leading-code | 50 Each charset (except for ascii) is assigned a base leading-code |
51 (range 0x80..0x9E). In addition, a charset of greater than 0xA0 | 51 (range 0x80..0x9E). In addition, a charset of greater than 0xA0 |
52 (whose base leading-code is 0x9A..0x9D) is assigned an extended | 52 (whose base leading-code is 0x9A..0x9D) is assigned an extended |
53 leading-code (range 0xA0..0xFE). In this case, each base | 53 leading-code (range 0xA0..0xFE). In this case, each base |
54 leading-code specify the allowable range of extended leading-code as | 54 leading-code specifies the allowable range of extended leading-code |
55 shown in the table below. A leading-code is used to represent a | 55 as shown in the table below. A leading-code is used to represent a |
56 character in Emacs' buffer and string. | 56 character in Emacs' buffer and string. |
57 | 57 |
58 We call a charset which has extended leading-code as "private | 58 We call a charset which has extended leading-code a "private |
59 charset" because those are mainly for a charset which is not yet | 59 charset" because those are mainly for a charset which is not yet |
60 registered by ISO. On the contrary, we call a charset which does | 60 registered by ISO. On the contrary, we call a charset which does |
61 not have extended leading-code as "official charset". | 61 not have extended leading-code an "official charset". |
62 | 62 |
63 --------------------------------------------------------------------------- | 63 --------------------------------------------------------------------------- |
64 charset dimension base leading-code extended leading-code | 64 charset dimension base leading-code extended leading-code |
65 --------------------------------------------------------------------------- | 65 --------------------------------------------------------------------------- |
66 0x00 official dim1 -- none -- -- none -- | 66 0x00 official dim1 -- none -- -- none -- |
134 multibyte buffer/string. So this macro name is not appropriate. */ | 134 multibyte buffer/string. So this macro name is not appropriate. */ |
135 #define CHAR_HEAD_P(ch) ((unsigned char) (ch) < 0xA0) | 135 #define CHAR_HEAD_P(ch) ((unsigned char) (ch) < 0xA0) |
136 | 136 |
137 /*** GENERAL NOTE on CHARACTER REPRESENTATION *** | 137 /*** GENERAL NOTE on CHARACTER REPRESENTATION *** |
138 | 138 |
139 At first, the term "character" or "char" is used for a multilingual | 139 Firstly, the term "character" or "char" is used for a multilingual |
140 character (of course, including ASCII character), not for a byte in | 140 character (of course, including ASCII characters), not for a byte in |
141 computer memory. We use the term "code" or "byte" for the latter | 141 computer memory. We use the term "code" or "byte" for the latter |
142 case. | 142 case. |
143 | 143 |
144 A character is identified by charset and one or two POSITION-CODEs. | 144 A character is identified by charset and one or two POSITION-CODEs. |
145 POSITION-CODE is the position of the character in the charset. A | 145 POSITION-CODE is the position of the character in the charset. A |
147 A character of DIMENSION2 charset has two POSITION-CODE: | 147 A character of DIMENSION2 charset has two POSITION-CODE: |
148 POSITION-CODE-1 and POSITION-CODE-2. The code range of | 148 POSITION-CODE-1 and POSITION-CODE-2. The code range of |
149 POSITION-CODE is 0x20..0x7F. | 149 POSITION-CODE is 0x20..0x7F. |
150 | 150 |
151 Emacs has two kinds of representation of a character: multi-byte | 151 Emacs has two kinds of representation of a character: multi-byte |
152 form (for buffer and string) and single-word form (for character | 152 form (for buffers and strings) and single-word form (for character |
153 object in Emacs Lisp). The latter is called "character code" here | 153 objects in Emacs Lisp). The latter is called "character code" |
154 after. Both representations encode the information of charset and | 154 hereafter. Both representations encode the information of charset |
155 POSITION-CODE but in a different way (for instance, MSB of | 155 and POSITION-CODE but in a different way (for instance, the MSB of |
156 POSITION-CODE is set in multi-byte form). | 156 POSITION-CODE is set in multi-byte form). |
157 | 157 |
158 For details of multi-byte form, see the section "2. Emacs internal | 158 For details of the multi-byte form, see the section "2. Emacs |
159 format handlers" of `coding.c'. | 159 internal format handlers" of `coding.c'. |
160 | 160 |
161 Emacs uses 19 bits for a character code. The bits are divided into | 161 Emacs uses 19 bits for a character code. The bits are divided into |
162 3 fields: FIELD1(5bits):FIELD2(7bits):FIELD3(7bits). | 162 3 fields: FIELD1(5bits):FIELD2(7bits):FIELD3(7bits). |
163 | 163 |
164 A character code of DIMENSION1 character uses FIELD2 to hold charset | 164 A character code of DIMENSION1 character uses FIELD2 to hold charset |
218 #define SINGLE_BYTE_CHAR_P(c) ((unsigned) (c) < 0x100) | 218 #define SINGLE_BYTE_CHAR_P(c) ((unsigned) (c) < 0x100) |
219 | 219 |
220 /* 1 if BYTE is an ASCII character in itself, in multibyte mode. */ | 220 /* 1 if BYTE is an ASCII character in itself, in multibyte mode. */ |
221 #define ASCII_BYTE_P(byte) ((byte) < 0x80) | 221 #define ASCII_BYTE_P(byte) ((byte) < 0x80) |
222 | 222 |
223 /* A char-table containing information of each character set. | 223 /* A char-table containing information on each character set. |
224 | 224 |
225 Unlike ordinary char-tables, this doesn't contain any nested table. | 225 Unlike ordinary char-tables, this doesn't contain any nested tables. |
226 Only the top level elements are used. Each element is a vector of | 226 Only the top level elements are used. Each element is a vector of |
227 the following information: | 227 the following information: |
228 CHARSET-ID, BYTES, DIMENSION, CHARS, WIDTH, DIRECTION, | 228 CHARSET-ID, BYTES, DIMENSION, CHARS, WIDTH, DIRECTION, |
229 LEADING-CODE-BASE, LEADING-CODE-EXT, | 229 LEADING-CODE-BASE, LEADING-CODE-EXT, |
230 ISO-FINAL-CHAR, ISO-GRAPHIC-PLANE, | 230 ISO-FINAL-CHAR, ISO-GRAPHIC-PLANE, |
231 REVERSE-CHARSET, SHORT-NAME, LONG-NAME, DESCRIPTION, | 231 REVERSE-CHARSET, SHORT-NAME, LONG-NAME, DESCRIPTION, |
232 PLIST. | 232 PLIST. |
233 | 233 |
234 CHARSET-ID (integer) is the identification number of the charset. | 234 CHARSET-ID (integer) is the identification number of the charset. |
235 | 235 |
236 BYTES (integer) is the length of multi-byte form of a character in | 236 BYTES (integer) is the length of the multi-byte form of a character |
237 the charset: one of 1, 2, 3, and 4. | 237 in the charset: one of 1, 2, 3, and 4. |
238 | 238 |
239 DIMENSION (integer) is the number of bytes to represent a character: 1 or 2. | 239 DIMENSION (integer) is the number of bytes to represent a character: 1 or 2. |
240 | 240 |
241 CHARS (integer) is the number of characters in a dimension: 94 or 96. | 241 CHARS (integer) is the number of characters in a dimension: 94 or 96. |
242 | 242 |
249 | 249 |
250 LEADING-CODE-BASE (integer) is the base leading-code for the | 250 LEADING-CODE-BASE (integer) is the base leading-code for the |
251 charset. | 251 charset. |
252 | 252 |
253 LEADING-CODE-EXT (integer) is the extended leading-code for the | 253 LEADING-CODE-EXT (integer) is the extended leading-code for the |
254 charset. All charsets of less than 0xA0 has the value 0. | 254 charset. All charsets of less than 0xA0 have the value 0. |
255 | 255 |
256 ISO-FINAL-CHAR (character) is the final character of the | 256 ISO-FINAL-CHAR (character) is the final character of the |
257 corresponding ISO 2022 charset. It is -1 for such a character | 257 corresponding ISO 2022 charset. It is -1 for such a character |
258 that is used only internally (e.g. `eight-bit-control'). | 258 that is used only internally (e.g. `eight-bit-control'). |
259 | 259 |
264 (e.g. `eight-bit-control'). | 264 (e.g. `eight-bit-control'). |
265 | 265 |
266 REVERSE-CHARSET (integer) is the charset which differs only in | 266 REVERSE-CHARSET (integer) is the charset which differs only in |
267 LEFT-TO-RIGHT value from the charset. If there's no such a | 267 LEFT-TO-RIGHT value from the charset. If there's no such a |
268 charset, the value is -1. | 268 charset, the value is -1. |
269 | 269 |
270 SHORT-NAME (string) is the short name to refer to the charset. | 270 SHORT-NAME (string) is the short name to refer to the charset. |
271 | 271 |
272 LONG-NAME (string) is the long name to refer to the charset. | 272 LONG-NAME (string) is the long name to refer to the charset. |
273 | 273 |
274 DESCRIPTION (string) is the description string of the charset. | 274 DESCRIPTION (string) is the description string of the charset. |
275 | 275 |
276 PLIST (property list) may contain any type of information a user | 276 PLIST (property list) may contain any type of information a user |
277 want to put and get by functions `put-charset-property' and | 277 wants to put and get by functions `put-charset-property' and |
278 `get-charset-property' respectively. */ | 278 `get-charset-property' respectively. */ |
279 extern Lisp_Object Vcharset_table; | 279 extern Lisp_Object Vcharset_table; |
280 | 280 |
281 /* Macros to access various information of CHARSET in Vcharset_table. | 281 /* Macros to access various information of CHARSET in Vcharset_table. |
282 We provide these macros for efficiency. No range check of CHARSET. */ | 282 We provide these macros for efficiency. No range check of CHARSET. */ |
513 (SINGLE_BYTE_CHAR_P (c) \ | 513 (SINGLE_BYTE_CHAR_P (c) \ |
514 ? ((ASCII_BYTE_P (c) || (c) >= 0xA0) ? 1 : 2) \ | 514 ? ((ASCII_BYTE_P (c) || (c) >= 0xA0) ? 1 : 2) \ |
515 : char_bytes (c)) | 515 : char_bytes (c)) |
516 | 516 |
517 /* The following two macros CHAR_STRING and STRING_CHAR are the main | 517 /* The following two macros CHAR_STRING and STRING_CHAR are the main |
518 entry points to convert between Emacs two types of character | 518 entry points to convert between Emacs's two types of character |
519 representations: multi-byte form and single-word form (character | 519 representations: multi-byte form and single-word form (character |
520 code). */ | 520 code). */ |
521 | 521 |
522 /* Store multi-byte form of the character C in STR. The caller should | 522 /* Store multi-byte form of the character C in STR. The caller should |
523 allocate at least MAX_MULTIBYTE_LENGTH bytes area at STR in | 523 allocate at least MAX_MULTIBYTE_LENGTH bytes area at STR in |
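
The comments in this comparison describe a 19-bit character code split into FIELD1(5bits):FIELD2(7bits):FIELD3(7bits), and the hunk at lines 513..515 picks a byte length for single-byte character codes. The following is a minimal sketch of both ideas, not code from `charset.h`. The three macros are copied from the lines shown above. Every `sketch_*` helper is hypothetical, and the bit layout assumes FIELD1 occupies the most significant bits, which the visible text does not state outright.

```c
#include <assert.h>

/* Macros copied verbatim from the header text in the comparison above. */
#define CHAR_HEAD_P(ch) ((unsigned char) (ch) < 0xA0)
#define SINGLE_BYTE_CHAR_P(c) ((unsigned) (c) < 0x100)
#define ASCII_BYTE_P(byte) ((byte) < 0x80)

/* Hypothetical helpers (not part of charset.h): split a 19-bit character
   code into FIELD1 (5 bits), FIELD2 (7 bits) and FIELD3 (7 bits).
   Assumption: FIELD1 is stored in the most significant bits.  */
static unsigned
sketch_field1 (unsigned c)
{
  return (c >> 14) & 0x1F;
}

static unsigned
sketch_field2 (unsigned c)
{
  return (c >> 7) & 0x7F;
}

static unsigned
sketch_field3 (unsigned c)
{
  return c & 0x7F;
}

/* Hypothetical inverse of the three accessors above.  Each argument is
   masked to its field width before being shifted into place.  */
static unsigned
sketch_make_code (unsigned f1, unsigned f2, unsigned f3)
{
  return ((f1 & 0x1F) << 14) | ((f2 & 0x7F) << 7) | (f3 & 0x7F);
}

/* Byte length of a single-byte character code, following the conditional
   shown at lines 513..515: 1 for ASCII and for 0xA0..0xFF, 2 for
   0x80..0x9F.  Returns 0 for any other code, where the real header
   calls char_bytes (c) instead.  */
static int
sketch_single_byte_char_bytes (unsigned c)
{
  if (!SINGLE_BYTE_CHAR_P (c))
    return 0;
  return (ASCII_BYTE_P (c) || c >= 0xA0) ? 1 : 2;
}

/* Small self-check: a code built from three fields splits back into
   the same three fields.  */
static void
sketch_selfcheck (void)
{
  unsigned c = sketch_make_code (3, 0x41, 0x7F);
  assert (sketch_field1 (c) == 3);
  assert (sketch_field2 (c) == 0x41);
  assert (sketch_field3 (c) == 0x7F);
}
```

Calling `sketch_selfcheck ()` from a test program exercises the round trip. The byte-length helper deliberately stops at single-byte codes, because `char_bytes` is only referenced, not defined, in the hunks shown here.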