comparison src/charset.h @ 35478:340a9e4aeb29

comment fixes
author Dave Love <fx@gnu.org>
date Mon, 22 Jan 2001 23:20:06 +0000
parents 9a2cf1e0032c
children 6f52e8c1039f
35477:e6bffd5c5287 35478:340a9e4aeb29
27 /*** GENERAL NOTE on CHARACTER SET (CHARSET) *** 27 /*** GENERAL NOTE on CHARACTER SET (CHARSET) ***
28 28
29 A character set ("charset" hereafter) is a meaningful collection 29 A character set ("charset" hereafter) is a meaningful collection
30 (i.e. language, culture, functionality, etc) of characters. Emacs 30 (i.e. language, culture, functionality, etc) of characters. Emacs
31 handles multiple charsets at once. Each charset corresponds to one 31 handles multiple charsets at once. Each charset corresponds to one
32 of ISO charsets. Emacs identifies a charset by a unique 32 of the ISO charsets. Emacs identifies a charset by a unique
33 identification number, whereas ISO identifies a charset by a triplet 33 identification number, whereas ISO identifies a charset by a triplet
34 of DIMENSION, CHARS and FINAL-CHAR. So, hereafter, just saying 34 of DIMENSION, CHARS and FINAL-CHAR. So, hereafter, just saying
35 "charset" means an identification number (integer value). 35 "charset" means an identification number (integer value).
36 36
37 The value range of charset is 0x00, 0x81..0xFE. There are four 37 The value range of charsets is 0x00, 0x81..0xFE. There are four
38 kinds of charset depending on DIMENSION (1 or 2) and CHARS (94 or 38 kinds of charset depending on DIMENSION (1 or 2) and CHARS (94 or
39 96). For instance, a charset of DIMENSION2_CHARS94 contains 94x94 39 96). For instance, a charset of DIMENSION2_CHARS94 contains 94x94
40 characters. 40 characters.
41 41
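As a reading aid for the note above, here is a minimal C sketch of the charset value range and of how DIMENSION and CHARS determine the number of positions in a charset. The helper names `sketch_charset_id_valid_p` and `sketch_charset_size` are hypothetical and do not appear in charset.h; only the numeric ranges come from the comment.

```c
#include <assert.h>

/* Hypothetical helper, not part of charset.h: nonzero if ID lies in
   the charset value range stated in the note (0x00, 0x81..0xFE).  */
static int
sketch_charset_id_valid_p (int id)
{
  return id == 0x00 || (id >= 0x81 && id <= 0xFE);
}

/* Hypothetical helper, not part of charset.h: number of positions in
   a charset of the given DIMENSION (1 or 2) and CHARS (94 or 96).
   A DIMENSION2_CHARS94 charset therefore has 94x94 positions.  */
static int
sketch_charset_size (int dimension, int chars)
{
  assert (dimension == 1 || dimension == 2);
  assert (chars == 94 || chars == 96);
  return dimension == 1 ? chars : chars * chars;
}
```

Under these assumptions, `sketch_charset_size (2, 94)` corresponds to the 94x94 case mentioned in the note.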
42 Within Emacs Lisp, a charset is treated as a symbol which has a 42 Within Emacs Lisp, a charset is treated as a symbol which has a
43 property `charset'. The property value is a vector containing 43 property `charset'. The property value is a vector containing
44 various information about the charset. For readability of C codes, 44 various information about the charset. For readability of C code,
45 we use the following convention for C variable names: 45 we use the following convention for C variable names:
46 charset_symbol: Emacs Lisp symbol of a charset 46 charset_symbol: Emacs Lisp symbol of a charset
47 charset_id: Emacs Lisp integer of an identification number of a charset 47 charset_id: Emacs Lisp integer of an identification number of a charset
48 charset: C integer of an identification number of a charset 48 charset: C integer of an identification number of a charset
49 49
50 Each charset (except for ascii) is assigned a base leading-code 50 Each charset (except for ascii) is assigned a base leading-code
51 (range 0x80..0x9E). In addition, a charset of greater than 0xA0 51 (range 0x80..0x9E). In addition, a charset of greater than 0xA0
52 (whose base leading-code is 0x9A..0x9D) is assigned an extended 52 (whose base leading-code is 0x9A..0x9D) is assigned an extended
53 leading-code (range 0xA0..0xFE). In this case, each base 53 leading-code (range 0xA0..0xFE). In this case, each base
54 leading-code specify the allowable range of extended leading-code as 54 leading-code specifies the allowable range of extended leading-code
55 shown in the table below. A leading-code is used to represent a 55 as shown in the table below. A leading-code is used to represent a
56 character in Emacs' buffer and string. 56 character in Emacs' buffer and string.
57 57
58 We call a charset which has extended leading-code as "private 58 We call a charset which has extended leading-code a "private
59 charset" because those are mainly for a charset which is not yet 59 charset" because those are mainly for a charset which is not yet
60 registered by ISO. On the contrary, we call a charset which does 60 registered by ISO. On the contrary, we call a charset which does
61 not have extended leading-code as "official charset". 61 not have extended leading-code an "official charset".
62 62
63 --------------------------------------------------------------------------- 63 ---------------------------------------------------------------------------
64 charset dimension base leading-code extended leading-code 64 charset dimension base leading-code extended leading-code
65 --------------------------------------------------------------------------- 65 ---------------------------------------------------------------------------
66 0x00 official dim1 -- none -- -- none -- 66 0x00 official dim1 -- none -- -- none --
134 multibyte buffer/string. So this macro name is not appropriate. */ 134 multibyte buffer/string. So this macro name is not appropriate. */
135 #define CHAR_HEAD_P(ch) ((unsigned char) (ch) < 0xA0) 135 #define CHAR_HEAD_P(ch) ((unsigned char) (ch) < 0xA0)
136 136
137 /*** GENERAL NOTE on CHARACTER REPRESENTATION *** 137 /*** GENERAL NOTE on CHARACTER REPRESENTATION ***
138 138
139 At first, the term "character" or "char" is used for a multilingual 139 Firstly, the term "character" or "char" is used for a multilingual
140 character (of course, including ASCII character), not for a byte in 140 character (of course, including ASCII characters), not for a byte in
141 computer memory. We use the term "code" or "byte" for the latter 141 computer memory. We use the term "code" or "byte" for the latter
142 case. 142 case.
143 143
144 A character is identified by charset and one or two POSITION-CODEs. 144 A character is identified by charset and one or two POSITION-CODEs.
145 POSITION-CODE is the position of the character in the charset. A 145 POSITION-CODE is the position of the character in the charset. A
147 A character of DIMENSION2 charset has two POSITION-CODE: 147 A character of DIMENSION2 charset has two POSITION-CODE:
148 POSITION-CODE-1 and POSITION-CODE-2. The code range of 148 POSITION-CODE-1 and POSITION-CODE-2. The code range of
149 POSITION-CODE is 0x20..0x7F. 149 POSITION-CODE is 0x20..0x7F.
150 150
151 Emacs has two kinds of representation of a character: multi-byte 151 Emacs has two kinds of representation of a character: multi-byte
152 form (for buffer and string) and single-word form (for character 152 form (for buffers and strings) and single-word form (for character
153 object in Emacs Lisp). The latter is called "character code" here 153 objects in Emacs Lisp). The latter is called "character code"
154 after. Both representations encode the information of charset and 154 hereafter. Both representations encode the information of charset
155 POSITION-CODE but in a different way (for instance, MSB of 155 and POSITION-CODE but in a different way (for instance, the MSB of
156 POSITION-CODE is set in multi-byte form). 156 POSITION-CODE is set in multi-byte form).
157 157
158 For details of multi-byte form, see the section "2. Emacs internal 158 For details of the multi-byte form, see the section "2. Emacs
159 format handlers" of `coding.c'. 159 internal format handlers" of `coding.c'.
160 160
161 Emacs uses 19 bits for a character code. The bits are divided into 161 Emacs uses 19 bits for a character code. The bits are divided into
162 3 fields: FIELD1(5bits):FIELD2(7bits):FIELD3(7bits). 162 3 fields: FIELD1(5bits):FIELD2(7bits):FIELD3(7bits).
163 163
164 A character code of DIMENSION1 character uses FIELD2 to hold charset 164 A character code of DIMENSION1 character uses FIELD2 to hold charset
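The 19-bit layout FIELD1(5bits):FIELD2(7bits):FIELD3(7bits) described above might be unpacked roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes that FIELD1 occupies the most significant bits, which the notation suggests but the lines shown here do not state outright.

```c
#include <assert.h>

/* Hypothetical accessors, not taken from charset.h.  They assume the
   order FIELD1:FIELD2:FIELD3 runs from most to least significant.  */
static unsigned
sketch_field1 (unsigned c)
{
  return (c >> 14) & 0x1F;	/* upper 5 bits */
}

static unsigned
sketch_field2 (unsigned c)
{
  return (c >> 7) & 0x7F;	/* middle 7 bits */
}

static unsigned
sketch_field3 (unsigned c)
{
  return c & 0x7F;		/* lower 7 bits */
}

/* Hypothetical inverse: pack three in-range fields into one code.  */
static unsigned
sketch_make_code (unsigned f1, unsigned f2, unsigned f3)
{
  assert (f1 <= 0x1F && f2 <= 0x7F && f3 <= 0x7F);
  return (f1 << 14) | (f2 << 7) | f3;
}
```

Packing and then unpacking with these helpers returns the original three fields, and the largest value they can produce is 0x7FFFF, which fits in 19 bits.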
218 #define SINGLE_BYTE_CHAR_P(c) ((unsigned) (c) < 0x100) 218 #define SINGLE_BYTE_CHAR_P(c) ((unsigned) (c) < 0x100)
219 219
220 /* 1 if BYTE is an ASCII character in itself, in multibyte mode. */ 220 /* 1 if BYTE is an ASCII character in itself, in multibyte mode. */
221 #define ASCII_BYTE_P(byte) ((byte) < 0x80) 221 #define ASCII_BYTE_P(byte) ((byte) < 0x80)
222 222
223 /* A char-table containing information of each character set. 223 /* A char-table containing information on each character set.
224 224
225 Unlike ordinary char-tables, this doesn't contain any nested table. 225 Unlike ordinary char-tables, this doesn't contain any nested tables.
226 Only the top level elements are used. Each element is a vector of 226 Only the top level elements are used. Each element is a vector of
227 the following information: 227 the following information:
228 CHARSET-ID, BYTES, DIMENSION, CHARS, WIDTH, DIRECTION, 228 CHARSET-ID, BYTES, DIMENSION, CHARS, WIDTH, DIRECTION,
229 LEADING-CODE-BASE, LEADING-CODE-EXT, 229 LEADING-CODE-BASE, LEADING-CODE-EXT,
230 ISO-FINAL-CHAR, ISO-GRAPHIC-PLANE, 230 ISO-FINAL-CHAR, ISO-GRAPHIC-PLANE,
231 REVERSE-CHARSET, SHORT-NAME, LONG-NAME, DESCRIPTION, 231 REVERSE-CHARSET, SHORT-NAME, LONG-NAME, DESCRIPTION,
232 PLIST. 232 PLIST.
233 233
234 CHARSET-ID (integer) is the identification number of the charset. 234 CHARSET-ID (integer) is the identification number of the charset.
235 235
236 BYTES (integer) is the length of multi-byte form of a character in 236 BYTES (integer) is the length of the multi-byte form of a character
237 the charset: one of 1, 2, 3, and 4. 237 in the charset: one of 1, 2, 3, and 4.
238 238
239 DIMENSION (integer) is the number of bytes to represent a character: 1 or 2. 239 DIMENSION (integer) is the number of bytes to represent a character: 1 or 2.
240 240
241 CHARS (integer) is the number of characters in a dimension: 94 or 96. 241 CHARS (integer) is the number of characters in a dimension: 94 or 96.
242 242
249 249
250 LEADING-CODE-BASE (integer) is the base leading-code for the 250 LEADING-CODE-BASE (integer) is the base leading-code for the
251 charset. 251 charset.
252 252
253 LEADING-CODE-EXT (integer) is the extended leading-code for the 253 LEADING-CODE-EXT (integer) is the extended leading-code for the
254 charset. All charsets of less than 0xA0 has the value 0. 254 charset. All charsets of less than 0xA0 have the value 0.
255 255
256 ISO-FINAL-CHAR (character) is the final character of the 256 ISO-FINAL-CHAR (character) is the final character of the
257 corresponding ISO 2022 charset. It is -1 for such a character 257 corresponding ISO 2022 charset. It is -1 for such a character
258 that is used only internally (e.g. `eight-bit-control'). 258 that is used only internally (e.g. `eight-bit-control').
259 259
264 (e.g. `eight-bit-control'). 264 (e.g. `eight-bit-control').
265 265
266 REVERSE-CHARSET (integer) is the charset which differs only in 266 REVERSE-CHARSET (integer) is the charset which differs only in
267 LEFT-TO-RIGHT value from the charset. If there's no such a 267 LEFT-TO-RIGHT value from the charset. If there's no such a
268 charset, the value is -1. 268 charset, the value is -1.
269 269
270 SHORT-NAME (string) is the short name to refer to the charset. 270 SHORT-NAME (string) is the short name to refer to the charset.
271 271
272 LONG-NAME (string) is the long name to refer to the charset. 272 LONG-NAME (string) is the long name to refer to the charset.
273 273
274 DESCRIPTION (string) is the description string of the charset. 274 DESCRIPTION (string) is the description string of the charset.
275 275
276 PLIST (property list) may contain any type of information a user 276 PLIST (property list) may contain any type of information a user
277 want to put and get by functions `put-charset-property' and 277 wants to put and get by functions `put-charset-property' and
278 `get-charset-property' respectively. */ 278 `get-charset-property' respectively. */
279 extern Lisp_Object Vcharset_table; 279 extern Lisp_Object Vcharset_table;
280 280
281 /* Macros to access various information of CHARSET in Vcharset_table. 281 /* Macros to access various information of CHARSET in Vcharset_table.
282 We provide these macros for efficiency. No range check of CHARSET. */ 282 We provide these macros for efficiency. No range check of CHARSET. */
513 (SINGLE_BYTE_CHAR_P (c) \ 513 (SINGLE_BYTE_CHAR_P (c) \
514 ? ((ASCII_BYTE_P (c) || (c) >= 0xA0) ? 1 : 2) \ 514 ? ((ASCII_BYTE_P (c) || (c) >= 0xA0) ? 1 : 2) \
515 : char_bytes (c)) 515 : char_bytes (c))
516 516
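The single-byte branch of the macro body shown above can be read in isolation. The sketch below copies the two predicate macros from this comparison and wraps that branch in a hypothetical function, `sketch_single_byte_char_bytes`, which is not part of charset.h. The multibyte branch calls `char_bytes`, whose definition is not shown here, so it is left out.

```c
#include <assert.h>

/* Copied from the definitions shown in this comparison.  */
#define SINGLE_BYTE_CHAR_P(c) ((unsigned) (c) < 0x100)
#define ASCII_BYTE_P(byte) ((byte) < 0x80)

/* Hypothetical helper: length in bytes of the multi-byte form of a
   single-byte character C, following the first branch of the macro
   body.  ASCII (below 0x80) and codes from 0xA0 up take 1 byte;
   codes in 0x80..0x9F take 2 bytes.  */
static int
sketch_single_byte_char_bytes (int c)
{
  assert (SINGLE_BYTE_CHAR_P (c));
  return (ASCII_BYTE_P (c) || c >= 0xA0) ? 1 : 2;
}
```

The function only accepts values for which `SINGLE_BYTE_CHAR_P` holds, so the codes in 0x80..0x9F are the only ones that yield 2.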
517 /* The following two macros CHAR_STRING and STRING_CHAR are the main 517 /* The following two macros CHAR_STRING and STRING_CHAR are the main
518 entry points to convert between Emacs two types of character 518 entry points to convert between Emacs's two types of character
519 representations: multi-byte form and single-word form (character 519 representations: multi-byte form and single-word form (character
520 code). */ 520 code). */
521 521
522 /* Store multi-byte form of the character C in STR. The caller should 522 /* Store multi-byte form of the character C in STR. The caller should
523 allocate at least MAX_MULTIBYTE_LENGTH bytes area at STR in 523 allocate at least MAX_MULTIBYTE_LENGTH bytes area at STR in