comparison src/coding.c @ 88771:75c78754826d

comments
author Dave Love <fx@gnu.org>
date Sun, 16 Jun 2002 19:57:54 +0000
parents 7f284ac55b07
children 64b8f6168269
88770:7df1e731d256 88771:75c78754826d
92 section 8. 92 section 8.
93 93
94 o BIG5 94 o BIG5
95 95
96 A coding system to encode character sets: ASCII and Big5. Widely 96 A coding system to encode character sets: ASCII and Big5. Widely
97 used by Chinese (mainly in Taiwan and Hong Kong). Details are 97 used for Chinese (mainly in Taiwan and Hong Kong). Details are
98 described in section 8. In this file, when we write "big5" (all 98 described in section 8. In this file, when we write "big5" (all
99 lowercase), we mean the coding system, and when we write "Big5" 99 lowercase), we mean the coding system, and when we write "Big5"
100 (capitalized), we mean the character set. 100 (capitalized), we mean the character set.
101 101
102 o CCL 102 o CCL
106 CCL (Code Conversion Language) programs. Emacs executes the CCL 106 CCL (Code Conversion Language) programs. Emacs executes the CCL
107 program while decoding/encoding. 107 program while decoding/encoding.
108 108
109 o Raw-text 109 o Raw-text
110 110
111 A coding system for a text containing raw eight-bit data. Emacs 111 A coding system for text containing raw eight-bit data. Emacs
112 treats each byte of source text as a character (except for 112 treats each byte of source text as a character (except for
113 end-of-line conversion). 113 end-of-line conversion).
114 114
115 o No-conversion 115 o No-conversion
116 116
585 AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_encoder) 585 AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_encoder)
586 #define CODING_CCL_VALIDS(coding) \ 586 #define CODING_CCL_VALIDS(coding) \
587 (XSTRING (AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_valids)) \ 587 (XSTRING (AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_valids)) \
588 ->data) 588 ->data)
589 589
590 /* Index for each coding category in `coding_category_table' */ 590 /* Index for each coding category in `coding_categories' */
591 591
592 enum coding_category 592 enum coding_category
593 { 593 {
594 coding_category_iso_7, 594 coding_category_iso_7,
595 coding_category_iso_7_tight, 595 coding_category_iso_7_tight,
2047 2047
2048 /*** 7. ISO2022 handlers ***/ 2048 /*** 7. ISO2022 handlers ***/
2049 2049
2050 /* The following note describes the coding system ISO2022 briefly. 2050 /* The following note describes the coding system ISO2022 briefly.
2051 Since the intention of this note is to help understand the 2051 Since the intention of this note is to help understand the
2052 functions in this file, some parts are NOT ACCURATE or OVERLY 2052 functions in this file, some parts are NOT ACCURATE or are OVERLY
2053 SIMPLIFIED. For thorough understanding, please refer to the 2053 SIMPLIFIED. For thorough understanding, please refer to the
2054 original document of ISO2022. 2054 original document of ISO2022. This is equivalent to the standard
2055 ECMA-35, obtainable from <URL:http://www.ecma.ch/> (*).
2055 2056
2056 ISO2022 provides many mechanisms to encode several character sets 2057 ISO2022 provides many mechanisms to encode several character sets
2057 in 7-bit and 8-bit environments. For 7-bite environments, all text 2058 in 7-bit and 8-bit environments. For 7-bit environments, all text
2058 is encoded using bytes less than 128. This may make the encoded 2059 is encoded using bytes less than 128. This may make the encoded
2059 text a little bit longer, but the text passes more easily through 2060 text a little bit longer, but the text passes more easily through
2060 several gateways, some of which strip off MSB (Most Signigant Bit). 2061 several types of gateway, some of which strip off the MSB (Most
2061 2062 Significant Bit).
2062 There are two kinds of character sets: control character set and 2063
2063 graphic character set. The former contains control characters such 2064 There are two kinds of character sets: control character sets and
2065 graphic character sets. The former contain control characters such
2064 as `newline' and `escape' to provide control functions (control 2066 as `newline' and `escape' to provide control functions (control
2065 functions are also provided by escape sequences). The latter 2067 functions are also provided by escape sequences). The latter
2066 contains graphic characters such as 'A' and '-'. Emacs recognizes 2068 contain graphic characters such as 'A' and '-'. Emacs recognizes
2067 two control character sets and many graphic character sets. 2069 two control character sets and many graphic character sets.
2068 2070
2069 Graphic character sets are classified into one of the following 2071 Graphic character sets are classified into one of the following
2070 four classes, according to the number of bytes (DIMENSION) and 2072 four classes, according to the number of bytes (DIMENSION) and
2071 number of characters in one dimension (CHARS) of the set: 2073 number of characters in one dimension (CHARS) of the set:
2073 - DIMENSION1_CHARS96 2075 - DIMENSION1_CHARS96
2074 - DIMENSION2_CHARS94 2076 - DIMENSION2_CHARS94
2075 - DIMENSION2_CHARS96 2077 - DIMENSION2_CHARS96
2076 2078
2077 In addition, each character set is assigned an identification tag, 2079 In addition, each character set is assigned an identification tag,
2078 unique for each set, called "final character" (denoted as <F> 2080 unique for each set, called the "final character" (denoted as <F>
2079 hereafter). The <F> of each character set is decided by ECMA(*) 2081 hereafter). The <F> of each character set is decided by ECMA(*)
2080 when it is registered in ISO. The code range of <F> is 0x30..0x7F 2082 when it is registered in ISO. The code range of <F> is 0x30..0x7F
2081 (0x30..0x3F are for private use only). 2083 (0x30..0x3F are for private use only).
2082 2084
2083 Note (*): ECMA = European Computer Manufacturers Association 2085 Note (*): ECMA = European Computer Manufacturers Association
2084 2086
2085 Here are examples of graphic character set [NAME(<F>)]: 2087 Here are examples of graphic character sets [NAME(<F>)]:
2086 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... 2088 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ...
2087 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... 2089 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ...
2088 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... 2090 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ...
2089 o DIMENSION2_CHARS96 -- none for the moment 2091 o DIMENSION2_CHARS96 -- none for the moment
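
The classification above can be sketched as a small lookup table. This is only an illustration: the struct, the table, and both function names are hypothetical and are not taken from coding.c. The entries are the examples listed above. Those examples reuse 'A' and 'B' across classes, so the sketch keys the lookup on DIMENSION, CHARS, and <F> together rather than on <F> alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for a graphic character set.  This is an
   illustration, not the representation coding.c uses.  */
struct graphic_charset
{
  const char *name;
  int dimension;    /* bytes per character: 1 or 2 */
  int chars;        /* characters in one dimension: 94 or 96 */
  char final_char;  /* the identification tag <F> */
};

/* The example sets named in the note above.  */
static const struct graphic_charset example_charsets[] = {
  { "ASCII", 1, 94, 'B' },
  { "right-half-of-JISX0201", 1, 94, 'I' },
  { "right-half-of-ISO8859-1", 1, 96, 'A' },
  { "GB2312", 2, 94, 'A' },
  { "JISX0208", 2, 94, 'B' },
};

/* Return the name of the example set in the given class with the
   given <F>, or NULL if the table has no such entry.  */
static const char *
lookup_charset_name (int dimension, int chars, char final_char)
{
  size_t i;
  size_t n = sizeof example_charsets / sizeof example_charsets[0];

  assert (dimension == 1 || dimension == 2);
  assert (chars == 94 || chars == 96);
  for (i = 0; i < n; i++)
    if (example_charsets[i].dimension == dimension
        && example_charsets[i].chars == chars
        && example_charsets[i].final_char == final_char)
      return example_charsets[i].name;
  return NULL;
}

/* Nonzero if F lies in 0x30..0x3F, the range the note marks as
   private use.  */
static int
final_char_is_private (int f)
{
  return f >= 0x30 && f <= 0x3F;
}
```

With this table, 'B' resolves to ASCII in the DIMENSION1_CHARS94 class and to JISX0208 in the DIMENSION2_CHARS94 class, and any lookup in DIMENSION2_CHARS96 finds nothing.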
2090 2092
2173 7-bit environment, non-locking-shift, and non-single-shift. 2175 7-bit environment, non-locking-shift, and non-single-shift.
2174 2176
2175 Note (**): If <F> is '@', 'A', or 'B', the intermediate character 2177 Note (**): If <F> is '@', 'A', or 'B', the intermediate character
2176 '(' must be omitted. We refer to this as "short-form" hereafter. 2178 '(' must be omitted. We refer to this as "short-form" hereafter.
2177 2179
2178 Now you may notice that there are a lot of ways for encoding the 2180 Now you may notice that there are a lot of ways of encoding the
2179 same multilingual text in ISO2022. Actually, there exist many 2181 same multilingual text in ISO2022. Actually, there exist many
2180 coding systems such as Compound Text (used in X11's inter client 2182 coding systems such as Compound Text (used in X11's inter client
2181 communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR 2183 communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR
2182 (used in Korean internet), EUC (Extended UNIX Code, used in Asian 2184 (used in Korean Internet), EUC (Extended UNIX Code, used in Asian
2183 localized platforms), and all of these are variants of ISO2022. 2185 localized platforms), and all of these are variants of ISO2022.
2184 2186
2185 In addition to the above, Emacs handles two more kinds of escape 2187 In addition to the above, Emacs handles two more kinds of escape
2186 sequences: ISO6429's direction specification and Emacs' private 2188 sequences: ISO6429's direction specification and Emacs' private
2187 sequence for specifying character composition. 2189 sequence for specifying character composition.
2199 o ESC '1' -- end composition 2201 o ESC '1' -- end composition
2200 o ESC '2' -- start rule-base composition (*) 2202 o ESC '2' -- start rule-base composition (*)
2201 o ESC '3' -- start relative composition with alternate chars (**) 2203 o ESC '3' -- start relative composition with alternate chars (**)
2202 o ESC '4' -- start rule-base composition with alternate chars (**) 2204 o ESC '4' -- start rule-base composition with alternate chars (**)
2203 Since these are not standard escape sequences of any ISO standard, 2205 Since these are not standard escape sequences of any ISO standard,
2204 the use of them for these meaning is restricted to Emacs only. 2206 the use of them with these meanings is restricted to Emacs only.
2205 2207
2206 (*) This form is used only in Emacs 20.5 and the older versions, 2208 (*) This form is used only in Emacs 20.7 and older versions,
2207 but the newer versions can safely decode it. 2209 but newer versions can safely decode it.
2208 (**) This form is used only in Emacs 21.1 and the newer versions, 2210 (**) This form is used only in Emacs 21.1 and newer versions,
2209 and the older versions can't decode it. 2211 and older versions can't decode it.
2210 2212
2211 Here's a list of examples usages of these composition escape 2213 Here's a list of example usages of these composition escape
2212 sequences (categorized by `enum composition_method'). 2214 sequences (categorized by `enum composition_method').
2213 2215
2214 COMPOSITION_RELATIVE: 2216 COMPOSITION_RELATIVE:
2215 ESC 0 CHAR [ CHAR ] ESC 1 2217 ESC 0 CHAR [ CHAR ] ESC 1
2216 COMPOSITOIN_WITH_RULE: 2218 COMPOSITION_WITH_RULE:
2217 ESC 2 CHAR [ RULE CHAR ] ESC 1 2219 ESC 2 CHAR [ RULE CHAR ] ESC 1
2218 COMPOSITION_WITH_ALTCHARS: 2220 COMPOSITION_WITH_ALTCHARS:
2219 ESC 3 ALTCHAR [ ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 2221 ESC 3 ALTCHAR [ ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1
2220 COMPOSITION_WITH_RULE_ALTCHARS: 2222 COMPOSITION_WITH_RULE_ALTCHARS:
2221 ESC 4 ALTCHAR [ RULE ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 */ 2223 ESC 4 ALTCHAR [ RULE ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 */
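
As a rough sketch of the COMPOSITION_RELATIVE and COMPOSITION_WITH_ALTCHARS layouts listed above, the helpers below wrap already-encoded component bytes in the composition escape sequences. The function names and the buffer-based interface are assumptions made for illustration. They do not reproduce the encoder in coding.c, and the sketch treats each CHAR and ALTCHAR as an opaque run of bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ESC 0x1B

/* Hypothetical helper: write ESC '0' CHARS... ESC '1' into DST.
   Returns the number of bytes written, or 0 if DST is too small.  */
static size_t
wrap_relative_composition (unsigned char *dst, size_t dst_size,
                           const unsigned char *chars, size_t nchars)
{
  size_t needed = nchars + 4;   /* two 2-byte escape sequences */

  if (dst_size < needed)
    return 0;
  dst[0] = ESC;
  dst[1] = '0';                 /* start relative composition */
  memcpy (dst + 2, chars, nchars);
  dst[2 + nchars] = ESC;
  dst[3 + nchars] = '1';        /* end composition */
  return needed;
}

/* Hypothetical helper: write ESC '3' ALTCHARS... followed by the
   relative form ESC '0' CHARS... ESC '1'.  Returns the number of
   bytes written, or 0 if DST is too small.  */
static size_t
wrap_altchars_composition (unsigned char *dst, size_t dst_size,
                           const unsigned char *altchars, size_t naltchars,
                           const unsigned char *chars, size_t nchars)
{
  size_t head = naltchars + 2;
  size_t tail;

  if (dst_size < head)
    return 0;
  dst[0] = ESC;
  dst[1] = '3';                 /* start composition with alternate chars */
  memcpy (dst + 2, altchars, naltchars);
  tail = wrap_relative_composition (dst + head, dst_size - head,
                                    chars, nchars);
  return tail ? head + tail : 0;
}
```

The rule-based forms (ESC '2' and ESC '4') would interleave RULE bytes between the components in the same way. They are left out here because the encoding of RULE is not described in this note.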
4533 } 4535 }
4534 4536
4535 4537
4536 /*** 7. C library functions ***/ 4538 /*** 7. C library functions ***/
4537 4539
4538 /* In Emacs Lisp, coding system is represented by a Lisp symbol which
4539 has a property `coding-system'. The value of this property is a
4540 vector of length 5 (called as coding-vector). Among elements of
4541 this vector, the first (element[0]) and the fifth (element[4])
4542 carry important information for decoding/encoding. Before
4543 decoding/encoding, this information should be set in fields of a
4544 structure of type `coding_system'.
4545
4546 A value of property `coding-system' can be a symbol of another
4547 subsidiary coding-system. In that case, Emacs gets coding-vector
4548 from that symbol.
4549
4550 `element[0]' contains information to be set in `coding->type'. The
4551 value and its meaning is as follows:
4552
4553 0 -- coding_type_emacs_mule
4554 1 -- coding_type_sjis
4555 2 -- coding_type_iso_2022
4556 3 -- coding_type_big5
4557 4 -- coding_type_ccl encoder/decoder written in CCL
4558 nil -- coding_type_no_conversion
4559 t -- coding_type_undecided (automatic conversion on decoding,
4560 no-conversion on encoding)
4561
4562 `element[4]' contains information to be set in `coding->flags' and
4563 `coding->spec'. The meaning varies by `coding->type'.
4564
4565 If `coding->type' is `coding_type_iso_2022', element[4] is a vector
4566 of length 32 (of which the first 13 sub-elements are used now).
4567 Meanings of these sub-elements are:
4568
4569 sub-element[N] where N is 0 through 3: to be set in `coding->spec.iso_2022'
4570 If the value is an integer of valid charset, the charset is
4571 assumed to be designated to graphic register N initially.
4572
4573 If the value is minus, it is a minus value of charset which
4574 reserves graphic register N, which means that the charset is
4575 not designated initially but should be designated to graphic
4576 register N just before encoding a character in that charset.
4577
4578 If the value is nil, graphic register N is never used on
4579 encoding.
4580
4581 sub-element[N] where N is 4 through 11: to be set in `coding->flags'
4582 Each value takes t or nil. See the section ISO2022 of
4583 `coding.h' for more information.
4584
4585 If `coding->type' is `coding_type_big5', element[4] is t to denote
4586 BIG5-ETen or nil to denote BIG5-HKU.
4587
4588 If `coding->type' takes the other value, element[4] is ignored.
4589
4590 Emacs Lisp's coding system also carries information about format of
4591 end-of-line in a value of property `eol-type'. If the value is
4592 integer, 0 means eol_lf, 1 means eol_crlf, and 2 means eol_cr. If
4593 it is not integer, it should be a vector of subsidiary coding
4594 systems of which property `eol-type' has one of above values.
4595
4596 */
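
The integer `eol-type' values described in the removed comment above can be pictured with the following minimal sketch. The two function names are hypothetical. The strings "eol_lf", "eol_crlf", and "eol_cr" simply echo the names used in that comment, and the terminator bytes are the conventional LF, CRLF, and CR sequences. The vector form of `eol-type' is not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: map an integer `eol-type' value to the name
   the removed comment gives it.  Returns NULL for any other value.  */
static const char *
eol_type_name (int eol_type)
{
  switch (eol_type)
    {
    case 0: return "eol_lf";
    case 1: return "eol_crlf";
    case 2: return "eol_cr";
    default: return NULL;
    }
}

/* Hypothetical helper: map an integer `eol-type' value to the bytes
   that end a line in that format.  Returns NULL for any other value.  */
static const char *
eol_type_terminator (int eol_type)
{
  switch (eol_type)
    {
    case 0: return "\n";      /* LF */
    case 1: return "\r\n";    /* CR LF */
    case 2: return "\r";      /* CR */
    default: return NULL;
    }
}
```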
4597
4598 /* Setup coding context CODING from information about CODING_SYSTEM. 4540 /* Setup coding context CODING from information about CODING_SYSTEM.
4599 If CODING_SYSTEM is nil, `no-conversion' is assumed. If 4541 If CODING_SYSTEM is nil, `no-conversion' is assumed. If
4600 CODING_SYSTEM is invalid, signal an error. */ 4542 CODING_SYSTEM is invalid, signal an error. */
4601 4543
4602 void 4544 void