comparison src/coding.c @ 88771:75c78754826d
author | Dave Love <fx@gnu.org> |
---|---|
date | Sun, 16 Jun 2002 19:57:54 +0000 |
parents | 7f284ac55b07 |
children | 64b8f6168269 |
88770:7df1e731d256 | 88771:75c78754826d |
---|---|
92 section 8. | 92 section 8. |
93 | 93 |
94 o BIG5 | 94 o BIG5 |
95 | 95 |
96 A coding system to encode character sets: ASCII and Big5. Widely | 96 A coding system to encode character sets: ASCII and Big5. Widely |
97 used by Chinese (mainly in Taiwan and Hong Kong). Details are | 97 used for Chinese (mainly in Taiwan and Hong Kong). Details are |
98 described in section 8. In this file, when we write "big5" (all | 98 described in section 8. In this file, when we write "big5" (all |
99 lowercase), we mean the coding system, and when we write "Big5" | 99 lowercase), we mean the coding system, and when we write "Big5" |
100 (capitalized), we mean the character set. | 100 (capitalized), we mean the character set. |
101 | 101 |
102 o CCL | 102 o CCL |
106 CCL (Code Conversion Language) programs. Emacs executes the CCL | 106 CCL (Code Conversion Language) programs. Emacs executes the CCL |
107 program while decoding/encoding. | 107 program while decoding/encoding. |
108 | 108 |
109 o Raw-text | 109 o Raw-text |
110 | 110 |
111 A coding system for a text containing raw eight-bit data. Emacs | 111 A coding system for text containing raw eight-bit data. Emacs |
112 treats each byte of source text as a character (except for | 112 treats each byte of source text as a character (except for |
113 end-of-line conversion). | 113 end-of-line conversion). |
114 | 114 |
115 o No-conversion | 115 o No-conversion |
116 | 116 |
585 AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_encoder) | 585 AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_encoder) |
586 #define CODING_CCL_VALIDS(coding) \ | 586 #define CODING_CCL_VALIDS(coding) \ |
587 (XSTRING (AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_valids)) \ | 587 (XSTRING (AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_valids)) \ |
588 ->data) | 588 ->data) |
589 | 589 |
590 /* Index for each coding category in `coding_category_table' */ | 590 /* Index for each coding category in `coding_categories' */ |
591 | 591 |
592 enum coding_category | 592 enum coding_category |
593 { | 593 { |
594 coding_category_iso_7, | 594 coding_category_iso_7, |
595 coding_category_iso_7_tight, | 595 coding_category_iso_7_tight, |
2047 | 2047 |
2048 /*** 7. ISO2022 handlers ***/ | 2048 /*** 7. ISO2022 handlers ***/ |
2049 | 2049 |
2050 /* The following note describes the coding system ISO2022 briefly. | 2050 /* The following note describes the coding system ISO2022 briefly. |
2051 Since the intention of this note is to help understand the | 2051 Since the intention of this note is to help understand the |
2052 functions in this file, some parts are NOT ACCURATE or OVERLY | 2052 functions in this file, some parts are NOT ACCURATE or are OVERLY |
2053 SIMPLIFIED. For thorough understanding, please refer to the | 2053 SIMPLIFIED. For thorough understanding, please refer to the |
2054 original document of ISO2022. | 2054 original document of ISO2022. This is equivalent to the standard |
2055 ECMA-35, obtainable from <URL:http://www.ecma.ch/> (*). | |
2055 | 2056 |
2056 ISO2022 provides many mechanisms to encode several character sets | 2057 ISO2022 provides many mechanisms to encode several character sets |
2057 in 7-bit and 8-bit environments. For 7-bite environments, all text | 2058 in 7-bit and 8-bit environments. For 7-bit environments, all text |
2058 is encoded using bytes less than 128. This may make the encoded | 2059 is encoded using bytes less than 128. This may make the encoded |
2059 text a little bit longer, but the text passes more easily through | 2060 text a little bit longer, but the text passes more easily through |
2060 several gateways, some of which strip off MSB (Most Signigant Bit). | 2061 several types of gateway, some of which strip off the MSB (Most |
2061 | 2062 Significant Bit). |
2062 There are two kinds of character sets: control character set and | 2063 |
2063 graphic character set. The former contains control characters such | 2064 There are two kinds of character sets: control character sets and |
2065 graphic character sets. The former contain control characters such | |
2064 as `newline' and `escape' to provide control functions (control | 2066 as `newline' and `escape' to provide control functions (control |
2065 functions are also provided by escape sequences). The latter | 2067 functions are also provided by escape sequences). The latter |
2066 contains graphic characters such as 'A' and '-'. Emacs recognizes | 2068 contain graphic characters such as 'A' and '-'. Emacs recognizes |
2067 two control character sets and many graphic character sets. | 2069 two control character sets and many graphic character sets. |
2068 | 2070 |
2069 Graphic character sets are classified into one of the following | 2071 Graphic character sets are classified into one of the following |
2070 four classes, according to the number of bytes (DIMENSION) and | 2072 four classes, according to the number of bytes (DIMENSION) and |
2071 number of characters in one dimension (CHARS) of the set: | 2073 number of characters in one dimension (CHARS) of the set: |
2073 - DIMENSION1_CHARS96 | 2075 - DIMENSION1_CHARS96 |
2074 - DIMENSION2_CHARS94 | 2076 - DIMENSION2_CHARS94 |
2075 - DIMENSION2_CHARS96 | 2077 - DIMENSION2_CHARS96 |
2076 | 2078 |
2077 In addition, each character set is assigned an identification tag, | 2079 In addition, each character set is assigned an identification tag, |
2078 unique for each set, called "final character" (denoted as <F> | 2080 unique for each set, called the "final character" (denoted as <F> |
2079 hereafter). The <F> of each character set is decided by ECMA(*) | 2081 hereafter). The <F> of each character set is decided by ECMA(*) |
2080 when it is registered in ISO. The code range of <F> is 0x30..0x7F | 2082 when it is registered in ISO. The code range of <F> is 0x30..0x7F |
2081 (0x30..0x3F are for private use only). | 2083 (0x30..0x3F are for private use only). |
2082 | 2084 |
2083 Note (*): ECMA = European Computer Manufacturers Association | 2085 Note (*): ECMA = European Computer Manufacturers Association |
2084 | 2086 |
2085 Here are examples of graphic character set [NAME(<F>)]: | 2087 Here are examples of graphic character sets [NAME(<F>)]: |
2086 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... | 2088 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... |
2087 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... | 2089 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... |
2088 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... | 2090 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... |
2089 o DIMENSION2_CHARS96 -- none for the moment | 2091 o DIMENSION2_CHARS96 -- none for the moment |
2090 | 2092 |
2173 7-bit environment, non-locking-shift, and non-single-shift. | 2175 7-bit environment, non-locking-shift, and non-single-shift. |
2174 | 2176 |
2175 Note (**): If <F> is '@', 'A', or 'B', the intermediate character | 2177 Note (**): If <F> is '@', 'A', or 'B', the intermediate character |
2176 '(' must be omitted. We refer to this as "short-form" hereafter. | 2178 '(' must be omitted. We refer to this as "short-form" hereafter. |
2177 | 2179 |
2178 Now you may notice that there are a lot of ways for encoding the | 2180 Now you may notice that there are a lot of ways of encoding the |
2179 same multilingual text in ISO2022. Actually, there exist many | 2181 same multilingual text in ISO2022. Actually, there exist many |
2180 coding systems such as Compound Text (used in X11's inter client | 2182 coding systems such as Compound Text (used in X11's inter client |
2181 communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR | 2183 communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR |
2182 (used in Korean internet), EUC (Extended UNIX Code, used in Asian | 2184 (used in Korean Internet), EUC (Extended UNIX Code, used in Asian |
2183 localized platforms), and all of these are variants of ISO2022. | 2185 localized platforms), and all of these are variants of ISO2022. |
2184 | 2186 |
2185 In addition to the above, Emacs handles two more kinds of escape | 2187 In addition to the above, Emacs handles two more kinds of escape |
2186 sequences: ISO6429's direction specification and Emacs' private | 2188 sequences: ISO6429's direction specification and Emacs' private |
2187 sequence for specifying character composition. | 2189 sequence for specifying character composition. |
2199 o ESC '1' -- end composition | 2201 o ESC '1' -- end composition |
2200 o ESC '2' -- start rule-base composition (*) | 2202 o ESC '2' -- start rule-base composition (*) |
2201 o ESC '3' -- start relative composition with alternate chars (**) | 2203 o ESC '3' -- start relative composition with alternate chars (**) |
2202 o ESC '4' -- start rule-base composition with alternate chars (**) | 2204 o ESC '4' -- start rule-base composition with alternate chars (**) |
2203 Since these are not standard escape sequences of any ISO standard, | 2205 Since these are not standard escape sequences of any ISO standard, |
2204 the use of them for these meaning is restricted to Emacs only. | 2206 the use of them with these meanings is restricted to Emacs only. |
2205 | 2207 |
2206 (*) This form is used only in Emacs 20.5 and the older versions, | 2208 (*) This form is used only in Emacs 20.7 and older versions, |
2207 but the newer versions can safely decode it. | 2209 but newer versions can safely decode it. |
2208 (**) This form is used only in Emacs 21.1 and the newer versions, | 2210 (**) This form is used only in Emacs 21.1 and newer versions, |
2209 and the older versions can't decode it. | 2211 and older versions can't decode it. |
2210 | 2212 |
2211 Here's a list of examples usages of these composition escape | 2213 Here's a list of example usages of these composition escape |
2212 sequences (categorized by `enum composition_method'). | 2214 sequences (categorized by `enum composition_method'). |
2213 | 2215 |
2214 COMPOSITION_RELATIVE: | 2216 COMPOSITION_RELATIVE: |
2215 ESC 0 CHAR [ CHAR ] ESC 1 | 2217 ESC 0 CHAR [ CHAR ] ESC 1 |
2216 COMPOSITOIN_WITH_RULE: | 2218 COMPOSITION_WITH_RULE: |
2217 ESC 2 CHAR [ RULE CHAR ] ESC 1 | 2219 ESC 2 CHAR [ RULE CHAR ] ESC 1 |
2218 COMPOSITION_WITH_ALTCHARS: | 2220 COMPOSITION_WITH_ALTCHARS: |
2219 ESC 3 ALTCHAR [ ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 | 2221 ESC 3 ALTCHAR [ ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 |
2220 COMPOSITION_WITH_RULE_ALTCHARS: | 2222 COMPOSITION_WITH_RULE_ALTCHARS: |
2221 ESC 4 ALTCHAR [ RULE ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 */ | 2223 ESC 4 ALTCHAR [ RULE ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 */ |
4533 } | 4535 } |
4534 | 4536 |
4535 | 4537 |
4536 /*** 7. C library functions ***/ | 4538 /*** 7. C library functions ***/ |
4537 | 4539 |
4538 /* In Emacs Lisp, coding system is represented by a Lisp symbol which | |
4539 has a property `coding-system'. The value of this property is a | |
4540 vector of length 5 (called as coding-vector). Among elements of | |
4541 this vector, the first (element[0]) and the fifth (element[4]) | |
4542 carry important information for decoding/encoding. Before | |
4543 decoding/encoding, this information should be set in fields of a | |
4544 structure of type `coding_system'. | |
4545 | |
4546 A value of property `coding-system' can be a symbol of another | |
4547 subsidiary coding-system. In that case, Emacs gets coding-vector | |
4548 from that symbol. | |
4549 | |
4550 `element[0]' contains information to be set in `coding->type'. The | |
4551 value and its meaning is as follows: | |
4552 | |
4553 0 -- coding_type_emacs_mule | |
4554 1 -- coding_type_sjis | |
4555 2 -- coding_type_iso_2022 | |
4556 3 -- coding_type_big5 | |
4557 4 -- coding_type_ccl encoder/decoder written in CCL | |
4558 nil -- coding_type_no_conversion | |
4559 t -- coding_type_undecided (automatic conversion on decoding, | |
4560 no-conversion on encoding) | |
4561 | |
4562 `element[4]' contains information to be set in `coding->flags' and | |
4563 `coding->spec'. The meaning varies by `coding->type'. | |
4564 | |
4565 If `coding->type' is `coding_type_iso_2022', element[4] is a vector | |
4566 of length 32 (of which the first 13 sub-elements are used now). | |
4567 Meanings of these sub-elements are: | |
4568 | |
4569 sub-element[N] where N is 0 through 3: to be set in `coding->spec.iso_2022' | |
4570 If the value is an integer of valid charset, the charset is | |
4571 assumed to be designated to graphic register N initially. | |
4572 | |
4573 If the value is minus, it is a minus value of charset which | |
4574 reserves graphic register N, which means that the charset is | |
4575 not designated initially but should be designated to graphic | |
4576 register N just before encoding a character in that charset. | |
4577 | |
4578 If the value is nil, graphic register N is never used on | |
4579 encoding. | |
4580 | |
4581 sub-element[N] where N is 4 through 11: to be set in `coding->flags' | |
4582 Each value takes t or nil. See the section ISO2022 of | |
4583 `coding.h' for more information. | |
4584 | |
4585 If `coding->type' is `coding_type_big5', element[4] is t to denote | |
4586 BIG5-ETen or nil to denote BIG5-HKU. | |
4587 | |
4588 If `coding->type' takes the other value, element[4] is ignored. | |
4589 | |
4590 Emacs Lisp's coding system also carries information about format of | |
4591 end-of-line in a value of property `eol-type'. If the value is | |
4592 integer, 0 means eol_lf, 1 means eol_crlf, and 2 means eol_cr. If | |
4593 it is not integer, it should be a vector of subsidiary coding | |
4594 systems of which property `eol-type' has one of above values. | |
4595 | |
4596 */ | |
4597 | |
4598 /* Setup coding context CODING from information about CODING_SYSTEM. | 4540 /* Setup coding context CODING from information about CODING_SYSTEM. |
4599 If CODING_SYSTEM is nil, `no-conversion' is assumed. If | 4541 If CODING_SYSTEM is nil, `no-conversion' is assumed. If |
4600 CODING_SYSTEM is invalid, signal an error. */ | 4542 CODING_SYSTEM is invalid, signal an error. */ |
4601 | 4543 |
4602 void | 4544 void |