comparison src/coding.c @ 18766:ac2e7e21abb0

Comment changes.
author Richard M. Stallman <rms@gnu.org>
date Sun, 13 Jul 1997 20:43:31 +0000
parents 17039a6e64cf
children 954e6be0a757
18765:a407fb58d35f 18766:ac2e7e21abb0
42 coding system. 42 coding system.
43 43
44 0. Emacs' internal format (emacs-mule) 44 0. Emacs' internal format (emacs-mule)
45 45
46 Emacs itself holds a multi-lingual character in a buffer and a string 46 Emacs itself holds a multi-lingual character in a buffer and a string
47 in a special format. Details are described in the section 2. 47 in a special format. Details are described in section 2.
48 48
49 1. ISO2022 49 1. ISO2022
50 50
51 The most famous coding system for multiple character sets. X's 51 The most famous coding system for multiple character sets. X's
52 Compound Text, various EUCs (Extended Unix Code), and such coding 52 Compound Text, various EUCs (Extended Unix Code), and coding
53 systems used in Internet communication as ISO-2022-JP are all 53 systems used in Internet communication such as ISO-2022-JP are
54 variants of ISO2022. Details are described in the section 3. 54 all variants of ISO2022. Details are described in section 3.
55 55
56 2. SJIS (or Shift-JIS or MS-Kanji-Code) 56 2. SJIS (or Shift-JIS or MS-Kanji-Code)
57 57
58 A coding system to encode character sets: ASCII, JISX0201, and 58 A coding system to encode character sets: ASCII, JISX0201, and
59 JISX0208. Widely used for PC's in Japan. Details are described in 59 JISX0208. Widely used for PC's in Japan. Details are described in
60 the section 4. 60 section 4.
61 61
62 3. BIG5 62 3. BIG5
63 63
64 A coding system to encode character sets: ASCII and Big5. Widely 64 A coding system to encode character sets: ASCII and Big5. Widely
65 used by Chinese (mainly in Taiwan and Hong Kong). Details are 65 used by Chinese (mainly in Taiwan and Hong Kong). Details are
66 described in the section 4. In this file, when written as "BIG5" 66 described in section 4. In this file, when we write "BIG5"
67 (all uppercase), it means the coding system, and when written as 67 (all uppercase), we mean the coding system, and when we write
68 "Big5" (capitalized), it means the character set. 68 "Big5" (capitalized), we mean the character set.
69 69
70 4. Else 70 4. Other
71 71
72 If a user want to read/write a text encoded in a coding system not 72 If a user wants to read/write a text encoded in a coding system not
73 listed above, he can supply a decoder and an encoder for it in CCL 73 listed above, he can supply a decoder and an encoder for it in CCL
74 (Code Conversion Language) programs. Emacs executes the CCL program 74 (Code Conversion Language) programs. Emacs executes the CCL program
75 while reading/writing. 75 while reading/writing.
76 76
77 Emacs represent a coding-system by a Lisp symbol that has a property 77 Emacs represents a coding-system by a Lisp symbol that has a property
78 `coding-system'. But, before actually using the coding-system, the 78 `coding-system'. But, before actually using the coding-system, the
79 information about it is set in a structure of type `struct 79 information about it is set in a structure of type `struct
80 coding_system' for rapid processing. See the section 6 for more 80 coding_system' for rapid processing. See section 6 for more details.
81 detail.
82 81
83 */ 82 */
84 83
85 /*** GENERAL NOTES on END-OF-LINE FORMAT *** 84 /*** GENERAL NOTES on END-OF-LINE FORMAT ***
86 85
87 How end-of-line of a text is encoded depends on a system. For 86 How end-of-line of a text is encoded depends on a system. For
88 instance, Unix's format is just one byte of `line-feed' code, 87 instance, Unix's format is just one byte of `line-feed' code,
89 whereas DOS's format is two bytes sequence of `carriage-return' and 88 whereas DOS's format is two-byte sequence of `carriage-return' and
90 `line-feed' codes. MacOS's format is one byte of `carriage-return'. 89 `line-feed' codes. MacOS's format is one byte of `carriage-return'.
91 90
92 Since how characters in a text is encoded and how end-of-line is 91 Since text characters encoding and end-of-line encoding are
93 encoded is independent, any coding system described above can take 92 independent, any coding system described above can take
94 any format of end-of-line. So, Emacs has information of format of 93 any format of end-of-line. So, Emacs has information of format of
95 end-of-line in each coding-system. See the section 6 for more 94 end-of-line in each coding-system. See section 6 for more details.
96 detail.
97 95
98 */ 96 */
99 97
100 /*** GENERAL NOTES on `detect_coding_XXX ()' functions *** 98 /*** GENERAL NOTES on `detect_coding_XXX ()' functions ***
101 99
115 113
116 /*** GENERAL NOTES on `decode_coding_XXX ()' functions *** 114 /*** GENERAL NOTES on `decode_coding_XXX ()' functions ***
117 115
118 These functions decode SRC_BYTES length text at SOURCE encoded in 116 These functions decode SRC_BYTES length text at SOURCE encoded in
119 CODING to Emacs' internal format (emacs-mule). The resulting text 117 CODING to Emacs' internal format (emacs-mule). The resulting text
120 goes to a place pointed by DESTINATION, the length of which should 118 goes to a place pointed to by DESTINATION, the length of which should
121 not exceed DST_BYTES. The bytes actually processed is returned as 119 not exceed DST_BYTES. The number of bytes actually processed is
122 *CONSUMED. The return value is the length of the decoded text. 120 returned as *CONSUMED. The return value is the length of the decoded
123 Below is a template of these functions. */ 121 text. Below is a template of these functions. */
124 #if 0 122 #if 0
125 decode_coding_XXX (coding, source, destination, src_bytes, dst_bytes, consumed) 123 decode_coding_XXX (coding, source, destination, src_bytes, dst_bytes, consumed)
126 struct coding_system *coding; 124 struct coding_system *coding;
127 unsigned char *source, *destination; 125 unsigned char *source, *destination;
128 int src_bytes, dst_bytes; 126 int src_bytes, dst_bytes;
134 132
135 /*** GENERAL NOTES on `encode_coding_XXX ()' functions *** 133 /*** GENERAL NOTES on `encode_coding_XXX ()' functions ***
136 134
137 These functions encode SRC_BYTES length text at SOURCE of Emacs' 135 These functions encode SRC_BYTES length text at SOURCE of Emacs'
138 internal format (emacs-mule) to CODING. The resulting text goes to 136 internal format (emacs-mule) to CODING. The resulting text goes to
139 a place pointed by DESTINATION, the length of which should not 137 a place pointed to by DESTINATION, the length of which should not
140 exceed DST_BYTES. The bytes actually processed is returned as 138 exceed DST_BYTES. The number of bytes actually processed is
141 *CONSUMED. The return value is the length of the encoded text. 139 returned as *CONSUMED. The return value is the length of the
142 Below is a template of these functions. */ 140 encoded text. Below is a template of these functions. */
143 #if 0 141 #if 0
144 encode_coding_XXX (coding, source, destination, src_bytes, dst_bytes, consumed) 142 encode_coding_XXX (coding, source, destination, src_bytes, dst_bytes, consumed)
145 struct coding_system *coding; 143 struct coding_system *coding;
146 unsigned char *source, *destination; 144 unsigned char *source, *destination;
147 int src_bytes, dst_bytes; 145 int src_bytes, dst_bytes;
198 *dst++ = 0xA0, *dst++ = (c) | 0x80; \ 196 *dst++ = 0xA0, *dst++ = (c) | 0x80; \
199 else \ 197 else \
200 *dst++ = (c); \ 198 *dst++ = (c); \
201 } while (0) 199 } while (0)
202 200
203 /* Decode one DIMENSION1 character of which charset is CHARSET and 201 /* Decode one DIMENSION1 character whose charset is CHARSET and whose
204 position-code is C. */ 202 position-code is C. */
205 203
206 #define DECODE_CHARACTER_DIMENSION1(charset, c) \ 204 #define DECODE_CHARACTER_DIMENSION1(charset, c) \
207 do { \ 205 do { \
208 unsigned char leading_code = CHARSET_LEADING_CODE_BASE (charset); \ 206 unsigned char leading_code = CHARSET_LEADING_CODE_BASE (charset); \
213 if (leading_code = CHARSET_LEADING_CODE_EXT (charset)) \ 211 if (leading_code = CHARSET_LEADING_CODE_EXT (charset)) \
214 *dst++ = leading_code; \ 212 *dst++ = leading_code; \
215 *dst++ = (c) | 0x80; \ 213 *dst++ = (c) | 0x80; \
216 } while (0) 214 } while (0)
217 215
218 /* Decode one DIMENSION2 character of which charset is CHARSET and 216 /* Decode one DIMENSION2 character whose charset is CHARSET and whose
219 position-codes are C1 and C2. */ 217 position-codes are C1 and C2. */
220 218
221 #define DECODE_CHARACTER_DIMENSION2(charset, c1, c2) \ 219 #define DECODE_CHARACTER_DIMENSION2(charset, c1, c2) \
222 do { \ 220 do { \
223 DECODE_CHARACTER_DIMENSION1 (charset, c1); \ 221 DECODE_CHARACTER_DIMENSION1 (charset, c1); \
335 333
336 334
337 /*** 2. Emacs internal format (emacs-mule) handlers ***/ 335 /*** 2. Emacs internal format (emacs-mule) handlers ***/
338 336
339 /* Emacs' internal format for encoding multiple character sets is a 337 /* Emacs' internal format for encoding multiple character sets is a
340 kind of multi-byte encoding, i.e. encoding a character by a sequence 338 kind of multi-byte encoding, i.e. characters are encoded by
341 of one-byte codes of variable length. ASCII characters and control 339 variable-length sequences of one-byte codes. ASCII characters
342 characters (e.g. `tab', `newline') are represented by one-byte as 340 and control characters (e.g. `tab', `newline') are represented by
343 is. It takes the range 0x00 through 0x7F. The other characters 341 one-byte sequences which are their ASCII codes, in the range 0x00
344 are represented by a sequence of `base leading-code', optional 342 through 0x7F. The other characters are represented by a sequence
345 `extended leading-code', and one or two `position-code's. Length 343 of `base leading-code', optional `extended leading-code', and one
346 of the sequence is decided by the base leading-code. Leading-code 344 or two `position-code's. The length of the sequence is determined
347 takes the range 0x80 through 0x9F, whereas extended leading-code 345 by the base leading-code. Leading-code takes the range 0x80
348 and position-code take the range 0xA0 through 0xFF. See the 346 through 0x9F, whereas extended leading-code and position-code take
349 document of `charset.h' for more detail about leading-code and 347 the range 0xA0 through 0xFF. See `charset.h' for more details
350 position-code. 348 about leading-code and position-code.
351 349
352 There's one exception in this rule. Special leading-code 350 There's one exception to this rule. Special leading-code
353 `leading-code-composition' denotes that the following several 351 `leading-code-composition' denotes that the following several
354 characters should be composed into one character. Leading-codes of 352 characters should be composed into one character. Leading-codes of
355 components (except for ASCII) are added 0x20. An ASCII character 353 components (except for ASCII) are added 0x20. An ASCII character
356 component is represented by a 2-byte sequence of `0xA0' and 354 component is represented by a 2-byte sequence of `0xA0' and
357 `ASCII-code + 0x80'. See also the document in `charset.h' for the 355 `ASCII-code + 0x80'. See also the comments in `charset.h' for the
358 detail of composite character. Hence, we can summarize the code 356 details of composite character. Hence, we can summarize the code
359 range as follows: 357 range as follows:
360 358
361 --- CODE RANGE of Emacs' internal format --- 359 --- CODE RANGE of Emacs' internal format ---
362 (character set) (range) 360 (character set) (range)
363 ASCII 0x00 .. 0x7F 361 ASCII 0x00 .. 0x7F
445 443
446 444
447 /*** 3. ISO2022 handlers ***/ 445 /*** 3. ISO2022 handlers ***/
448 446
449 /* The following note describes the coding system ISO2022 briefly. 447 /* The following note describes the coding system ISO2022 briefly.
450 Since the intension of this note is to help understanding of the 448 Since the intention of this note is to help in understanding of
451 programs in this file, some parts are NOT ACCURATE or OVERLY 449 the programs in this file, some parts are NOT ACCURATE or OVERLY
452 SIMPLIFIED. For the thorough understanding, please refer to the 450 SIMPLIFIED. For the thorough understanding, please refer to the
453 original document of ISO2022. 451 original document of ISO2022.
454 452
455 ISO2022 provides many mechanisms to encode several character sets 453 ISO2022 provides many mechanisms to encode several character sets
456 in 7-bit and 8-bit environment. If one choose 7-bite environment, 454 in 7-bit and 8-bit environment. If one chooses 7-bite environment,
457 all text is encoded by codes of less than 128. This may make the 455 all text is encoded by codes of less than 128. This may make the
458 encoded text a little bit longer, but the text get more stability 456 encoded text a little bit longer, but the text gets more stability
459 to pass through several gateways (some of them split MSB off). 457 to pass through several gateways (some of them strip off the MSB).
460 458
461 There are two kind of character set: control character set and 459 There are two kinds of character set: control character set and
462 graphic character set. The former contains control characters such 460 graphic character set. The former contains control characters such
463 as `newline' and `escape' to provide control functions (control 461 as `newline' and `escape' to provide control functions (control
464 functions are provided also by escape sequence). The latter 462 functions are provided also by escape sequences). The latter
465 contains graphic characters such as ' A' and '-'. Emacs recognizes 463 contains graphic characters such as ' A' and '-'. Emacs recognizes
466 two control character sets and many graphic character sets. 464 two control character sets and many graphic character sets.
467 465
468 Graphic character sets are classified into one of the following 466 Graphic character sets are classified into one of the following
469 four classes, DIMENSION1_CHARS94, DIMENSION1_CHARS96, 467 four classes, DIMENSION1_CHARS94, DIMENSION1_CHARS96,
563 561
564 Note (**): If <F> is '@', 'A', or 'B', the intermediate character 562 Note (**): If <F> is '@', 'A', or 'B', the intermediate character
565 '(' can be omitted. We call this as "short-form" here after. 563 '(' can be omitted. We call this as "short-form" here after.
566 564
567 Now you may notice that there are a lot of ways for encoding the 565 Now you may notice that there are a lot of ways for encoding the
568 same multilingual text in ISO2022. Actually, there exist many 566 same multilingual text in ISO2022. Actually, there exists many
569 coding systems such as Compound Text (used in X's inter client 567 coding systems such as Compound Text (used in X's inter client
570 communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR 568 communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR
571 (used in Korean Internet), EUC (Extended UNIX Code, used in Asian 569 (used in Korean Internet), EUC (Extended UNIX Code, used in Asian
572 localized platforms), and all of these are variants of ISO2022. 570 localized platforms), and all of these are variants of ISO2022.
573 571
1016 } 1014 }
1017 *consumed = src - source; 1015 *consumed = src - source;
1018 return dst - destination; 1016 return dst - destination;
1019 } 1017 }
1020 1018
1021 /* ISO2022 encoding staffs. */ 1019 /* ISO2022 encoding stuff. */
1022 1020
1023 /* 1021 /*
1024 It is not enough to say just "ISO2022" on encoding, but we have to 1022 It is not enough to say just "ISO2022" on encoding, we have to
1025 specify more details. In Emacs, each coding-system of ISO2022 1023 specify more details. In Emacs, each coding-system of ISO2022
1026 variant has the following specifications: 1024 variant has the following specifications:
1027 1. Initial designation to G0 thru G3. 1025 1. Initial designation to G0 thru G3.
1028 2. Allows short-form designation? 1026 2. Allows short-form designation?
1029 3. ASCII should be designated to G0 before control characters? 1027 3. ASCII should be designated to G0 before control characters?
1034 And the following two are only for Japanese: 1032 And the following two are only for Japanese:
1035 8. Use ASCII in place of JIS0201-1976-Roman? 1033 8. Use ASCII in place of JIS0201-1976-Roman?
1036 9. Use JISX0208-1983 in place of JISX0208-1978? 1034 9. Use JISX0208-1983 in place of JISX0208-1978?
1037 These specifications are encoded in `coding->flags' as flag bits 1035 These specifications are encoded in `coding->flags' as flag bits
1038 defined by macros CODING_FLAG_ISO_XXX. See `coding.h' for more 1036 defined by macros CODING_FLAG_ISO_XXX. See `coding.h' for more
1039 detail. 1037 details.
1040 */ 1038 */
1041 1039
1042 /* Produce codes (escape sequence) for designating CHARSET to graphic 1040 /* Produce codes (escape sequence) for designating CHARSET to graphic
1043 register REG. If <final-char> of CHARSET is '@', 'A', or 'B' and 1041 register REG. If <final-char> of CHARSET is '@', 'A', or 'B' and
1044 the coding system CODING allows, produce designation sequence of 1042 the coding system CODING allows, produce designation sequence of
1130 do { \ 1128 do { \
1131 *dst++ = ISO_CODE_ESC, *dst++ = 'o'; \ 1129 *dst++ = ISO_CODE_ESC, *dst++ = 'o'; \
1132 CODING_SPEC_ISO_INVOCATION (coding, 0) = 3; \ 1130 CODING_SPEC_ISO_INVOCATION (coding, 0) = 3; \
1133 } while (0) 1131 } while (0)
1134 1132
1135 /* Produce codes for a DIMENSION1 character of which character set is 1133 /* Produce codes for a DIMENSION1 character whose character set is
1136 CHARSET and position-code is C1. Designation and invocation 1134 CHARSET and whose position-code is C1. Designation and invocation
1137 sequences are also produced in advance if necessary. */ 1135 sequences are also produced in advance if necessary. */
1138 1136
1139 1137
1140 #define ENCODE_ISO_CHARACTER_DIMENSION1(charset, c1) \ 1138 #define ENCODE_ISO_CHARACTER_DIMENSION1(charset, c1) \
1141 do { \ 1139 do { \
1164 register. Then repeat the loop to actually produce the \ 1162 register. Then repeat the loop to actually produce the \
1165 character. */ \ 1163 character. */ \
1166 dst = encode_invocation_designation (charset, coding, dst); \ 1164 dst = encode_invocation_designation (charset, coding, dst); \
1167 } while (1) 1165 } while (1)
1168 1166
1169 /* Produce codes for a DIMENSION2 character of which character set is 1167 /* Produce codes for a DIMENSION2 character whose character set is
1170 CHARSET and position-codes are C1 and C2. Designation and 1168 CHARSET and whose position-codes are C1 and C2. Designation and
1171 invocation codes are also produced in advance if necessary. */ 1169 invocation codes are also produced in advance if necessary. */
1172 1170
1173 #define ENCODE_ISO_CHARACTER_DIMENSION2(charset, c1, c2) \ 1171 #define ENCODE_ISO_CHARACTER_DIMENSION2(charset, c1, c2) \
1174 do { \ 1172 do { \
1175 if (CODING_SPEC_ISO_SINGLE_SHIFTING (coding)) \ 1173 if (CODING_SPEC_ISO_SINGLE_SHIFTING (coding)) \
1550 } 1548 }
1551 1549
1552 1550
1553 /*** 4. SJIS and BIG5 handlers ***/ 1551 /*** 4. SJIS and BIG5 handlers ***/
1554 1552
1555 /* Although SJIS and BIG5 are not ISO's coding system, They are used 1553 /* Although SJIS and BIG5 are not ISO's coding system, they are used
1556 quite widely. So, for the moment, Emacs supports them in the bare 1554 quite widely. So, for the moment, Emacs supports them in the bare
1557 C code. But, in the future, they may be supported only by CCL. */ 1555 C code. But, in the future, they may be supported only by CCL. */
1558 1556
1559 /* SJIS is a coding system encoding three character sets: ASCII, right 1557 /* SJIS is a coding system encoding three character sets: ASCII, right
1560 half of JISX0201-Kana, and JISX0208. An ASCII character is encoded 1558 half of JISX0201-Kana, and JISX0208. An ASCII character is encoded
2165 Lisp_Object coding_system; 2163 Lisp_Object coding_system;
2166 struct coding_system *coding; 2164 struct coding_system *coding;
2167 { 2165 {
2168 Lisp_Object type, eol_type; 2166 Lisp_Object type, eol_type;
2169 2167
2170 /* At first, set several fields default values. */ 2168 /* At first, set several fields to default values. */
2171 coding->require_flushing = 0; 2169 coding->require_flushing = 0;
2172 coding->last_block = 0; 2170 coding->last_block = 0;
2173 coding->selective = 0; 2171 coding->selective = 0;
2174 coding->composing = 0; 2172 coding->composing = 0;
2175 coding->direction = 0; 2173 coding->direction = 0;
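
The byte ranges given in the section 2 comment above (ASCII in 0x00 through 0x7F, leading-codes in 0x80 through 0x9F, extended leading-codes and position-codes in 0xA0 through 0xFF) can be illustrated with a minimal sketch. The names `mule_byte_kind`, `classify_mule_byte`, and `count_mule_sequence_starts` are hypothetical and do not appear in `coding.c`; the sketch relies only on the ranges stated in that comment and is not the way Emacs itself scans buffers.

```c
#include <stddef.h>

/* Hypothetical helper types and functions for illustration only;
   none of these names exist in coding.c or charset.h.  */
enum mule_byte_kind
{
  MULE_ASCII,         /* 0x00 .. 0x7F: one-byte ASCII or control character */
  MULE_LEADING_CODE,  /* 0x80 .. 0x9F: base leading-code */
  MULE_TRAILING_CODE  /* 0xA0 .. 0xFF: extended leading-code or position-code */
};

/* Classify one byte using only the ranges quoted in the comment.  */
static enum mule_byte_kind
classify_mule_byte (unsigned char b)
{
  if (b < 0x80)
    return MULE_ASCII;
  if (b < 0xA0)
    return MULE_LEADING_CODE;
  return MULE_TRAILING_CODE;
}

/* Count the bytes that start a sequence, i.e. bytes below 0xA0.
   Assumption: the input is well-formed.  Components of a composite
   character have their leading-codes shifted into 0xA0 and above,
   so a whole composite sequence counts as one start.  */
static size_t
count_mule_sequence_starts (const unsigned char *p, size_t n)
{
  size_t count = 0;
  size_t i;

  for (i = 0; i < n; i++)
    if (classify_mule_byte (p[i]) != MULE_TRAILING_CODE)
      count++;
  return count;
}
```

For example, the four bytes 0x41, 0x81, 0xC1, 0x0A would be counted as three sequence starts under these assumptions: an ASCII byte, one leading-code followed by one position-code, and a newline.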