changeset 35053:e3e1ff3616fa
Commentary changes.
(detect_eol_type_in_2_octet_form): Declare arg big_endian_p.
author | Dave Love <fx@gnu.org> |
---|---|
date | Thu, 04 Jan 2001 17:35:26 +0000 |
parents | 07b5f5fdb0ce |
children | 3efcaab7433e |
files | src/coding.c |
diffstat | 1 files changed, 75 insertions(+), 71 deletions(-) |
--- a/src/coding.c Thu Jan 04 15:39:04 2001 +0000
+++ b/src/coding.c Thu Jan 04 17:35:26 2001 +0000
@@ -37,18 +37,18 @@
 
 /*** 0. General comments ***/
 
-/*** GENERAL NOTE on CODING SYSTEM ***
-
-  Coding system is an encoding mechanism of one or more character
+/*** GENERAL NOTE on CODING SYSTEMS ***
+
+  A coding system is an encoding mechanism for one or more character
   sets. Here's a list of coding systems which Emacs can handle. When
   we say "decode", it means converting some other coding system to
-  Emacs' internal format (emacs-internal), and when we say "encode",
+  Emacs' internal format (emacs-mule), and when we say "encode",
   it means converting the coding system emacs-mule to some other
   coding system.
 
   0. Emacs' internal format (emacs-mule)
 
-  Emacs itself holds a multi-lingual character in a buffer and a string
+  Emacs itself holds a multi-lingual character in buffers and strings
   in a special format. Details are described in section 2.
 
   1. ISO2022
@@ -66,21 +66,21 @@
 
   3. BIG5
 
-  A coding system to encode character sets: ASCII and Big5. Widely
-  used by Chinese (mainly in Taiwan and Hong Kong). Details are
+  A coding system to encode the character sets ASCII and Big5. Widely
+  used for Chinese (mainly in Taiwan and Hong Kong). Details are
   described in section 4. In this file, when we write "BIG5"
   (all uppercase), we mean the coding system, and when we write
   "Big5" (capitalized), we mean the character set.
 
   4. Raw text
 
-  A coding system for a text containing random 8-bit code. Emacs does
-  no code conversion on such a text except for end-of-line format.
+  A coding system for text containing random 8-bit code. Emacs does
+  no code conversion on such text except for end-of-line format.
 
   5. Other
 
-  If a user wants to read/write a text encoded in a coding system not
-  listed above, he can supply a decoder and an encoder for it in CCL
+  If a user wants to read/write text encoded in a coding system not
+  listed above, he can supply a decoder and an encoder for it as CCL
   (Code Conversion Language) programs. Emacs executes the CCL program
   while reading/writing.
 
@@ -93,16 +93,16 @@
 
 /*** GENERAL NOTES on END-OF-LINE FORMAT ***
 
-  How end-of-line of a text is encoded depends on a system. For
-  instance, Unix's format is just one byte of `line-feed' code,
+  How end-of-line of text is encoded depends on the operating system.
+  For instance, Unix's format is just one byte of `line-feed' code,
   whereas DOS's format is two-byte sequence of `carriage-return' and
   `line-feed' codes. MacOS's format is usually one byte of
   `carriage-return'.
 
-  Since text characters encoding and end-of-line encoding are
-  independent, any coding system described above can take
-  any format of end-of-line. So, Emacs has information of format of
-  end-of-line in each coding-system. See section 6 for more details.
+  Since text character encoding and end-of-line encoding are
+  independent, any coding system described above can have any
+  end-of-line format. So Emacs has information about end-of-line
+  format in each coding-system. See section 6 for more details.
 
 */
 
@@ -110,9 +110,9 @@
 
   These functions check if a text between SRC and SRC_END is encoded
   in the coding system category XXX. Each returns an integer value in
-  which appropriate flag bits for the category XXX is set. The flag
+  which appropriate flag bits for the category XXX are set. The flag
   bits are defined in macros CODING_CATEGORY_MASK_XXX. Below is the
-  template of these functions. If MULTIBYTEP is nonzero, 8-bit codes
+  template for these functions. If MULTIBYTEP is nonzero, 8-bit codes
   of the range 0x80..0x9F are in multibyte form. */
 #if 0
 int
@@ -131,16 +131,17 @@
   multibyte text goes to a place pointed to by DESTINATION, the length
   of which should not exceed DST_BYTES.
 
-  These functions set the information of original and decoded texts in
-  the members produced, produced_char, consumed, and consumed_char of
-  the structure *CODING. They also set the member result to one of
-  CODING_FINISH_XXX indicating how the decoding finished.
-
-  DST_BYTES zero means that source area and destination area are
+  These functions set the information about original and decoded texts
+  in the members `produced', `produced_char', `consumed', and
+  `consumed_char' of the structure *CODING. They also set the member
+  `result' to one of CODING_FINISH_XXX indicating how the decoding
+  finished.
+
+  DST_BYTES zero means that the source area and destination area are
   overlapped, which means that we can produce a decoded text until it
-  reaches at the head of not-yet-decoded source text.
-
-  Below is a template of these functions. */
+  reaches the head of the not-yet-decoded source text.
+
+  Below is a template for these functions. */
 #if 0
 static void
 decode_coding_XXX (coding, source, destination, src_bytes, dst_bytes)
@@ -154,21 +155,22 @@
 
 /*** GENERAL NOTES on `encode_coding_XXX ()' functions ***
 
-  These functions encode SRC_BYTES length text at SOURCE of Emacs'
+  These functions encode SRC_BYTES length text at SOURCE from Emacs'
   internal multibyte format to CODING. The resulting unibyte text
   goes to a place pointed to by DESTINATION, the length of which
   should not exceed DST_BYTES.
 
-  These functions set the information of original and encoded texts in
-  the members produced, produced_char, consumed, and consumed_char of
-  the structure *CODING. They also set the member result to one of
-  CODING_FINISH_XXX indicating how the encoding finished.
-
-  DST_BYTES zero means that source area and destination area are
-  overlapped, which means that we can produce a encoded text until it
-  reaches at the head of not-yet-encoded source text.
-
-  Below is a template of these functions. */
+  These functions set the information about original and encoded texts
+  in the members `produced', `produced_char', `consumed', and
+  `consumed_char' of the structure *CODING. They also set the member
+  `result' to one of CODING_FINISH_XXX indicating how the encoding
+  finished.
+
+  DST_BYTES zero means that the source area and destination area are
+  overlapped, which means that we can produce encoded text until it
+  reaches at the head of the not-yet-encoded source text.
+
+  Below is a template for these functions. */
 #if 0
 static void
 encode_coding_XXX (coding, source, destination, src_bytes, dst_bytes)
@@ -260,7 +262,7 @@
 /* Produce a multibyte form of characater C to `dst'. Jump to
    `label_end_of_loop' if there's not enough space at `dst'.
 
-   If we are now in the middle of composition sequence, the decoded
+   If we are now in the middle of a composition sequence, the decoded
    character may be ALTCHAR (for the current composition). In that
    case, the character goes to coding->cmp_data->data instead of
    `dst'.
@@ -1125,21 +1127,23 @@
 
 /* The following note describes the coding system ISO2022 briefly.
    Since the intention of this note is to help understand the
-   functions in this file, some parts are NOT ACCURATE or OVERLY
+   functions in this file, some parts are NOT ACCURATE or are OVERLY
    SIMPLIFIED. For thorough understanding, please refer to the
-   original document of ISO2022.
+   original document of ISO2022. This is equivalent to the standard
+   ECMA-35, obtainable from <URL:http://www.ecma.ch/> (*).
 
    ISO2022 provides many mechanisms to encode several character sets
-   in 7-bit and 8-bit environments. For 7-bite environments, all text
+   in 7-bit and 8-bit environments. For 7-bit environments, all text
    is encoded using bytes less than 128. This may make the encoded
    text a little bit longer, but the text passes more easily through
-   several gateways, some of which strip off MSB (Most Signigant Bit).
-
-   There are two kinds of character sets: control character set and
-   graphic character set. The former contains control characters such
+   several types of gateway, some of which strip off the MSB (Most
+   Signigant Bit).
+
+   There are two kinds of character sets: control character sets and
+   graphic character sets. The former contain control characters such
    as `newline' and `escape' to provide control functions (control
    functions are also provided by escape sequences). The latter
-   contains graphic characters such as 'A' and '-'. Emacs recognizes
+   contain graphic characters such as 'A' and '-'. Emacs recognizes
    two control character sets and many graphic character sets.
 
    Graphic character sets are classified into one of the following
@@ -1151,14 +1155,14 @@
    - DIMENSION2_CHARS96
 
    In addition, each character set is assigned an identification tag,
-   unique for each set, called "final character" (denoted as <F>
+   unique for each set, called the "final character" (denoted as <F>
    hereafter). The <F> of each character set is decided by ECMA(*)
    when it is registered in ISO. The code range of <F> is 0x30..0x7F
    (0x30..0x3F are for private use only).
 
    Note (*): ECMA = European Computer Manufacturers Association
 
-   Here are examples of graphic character set [NAME(<F>)]:
+   Here are examples of graphic character sets [NAME(<F>)]:
         o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ...
         o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ...
         o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ...
@@ -1251,7 +1255,7 @@
    Note (**): If <F> is '@', 'A', or 'B', the intermediate character
    '(' can be omitted. We refer to this as "short-form" hereafter.
 
-   Now you may notice that there are a lot of ways for encoding the
+   Now you may notice that there are a lot of ways of encoding the
    same multilingual text in ISO2022. Actually, there exist many
    coding systems such as Compound Text (used in X11's inter client
    communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR
@@ -1277,14 +1281,14 @@
         o ESC '3' -- start relative composition with alternate chars (**)
         o ESC '4' -- start rule-base composition with alternate chars (**)
    Since these are not standard escape sequences of any ISO standard,
-   the use of them for these meaning is restricted to Emacs only.
-
-   (*) This form is used only in Emacs 20.5 and the older versions,
+   the use of them with these meanings is restricted to Emacs only.
+
+   (*) This form is used only in Emacs 20.5 and older versions,
    but the newer versions can safely decode it.
-   (**) This form is used only in Emacs 21.1 and the newer versions,
+   (**) This form is used only in Emacs 21.1 and newer versions,
    and the older versions can't decode it.
 
-   Here's a list of examples usages of these composition escape
+   Here's a list of example usages of these composition escape
    sequences (categorized by `enum composition_method').
 
    COMPOSITION_RELATIVE:
@@ -1311,7 +1315,7 @@
    (CODING_SPEC_ISO_INITIAL_DESIGNATION (coding_system_table[idx], 1) >= 0)
 
 /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions".
-   Check if a text is encoded in ISO2022. If it is, returns an
+   Check if a text is encoded in ISO2022. If it is, return an
    integer in which appropriate flag bits any of:
         CODING_CATEGORY_MASK_ISO_7
         CODING_CATEGORY_MASK_ISO_7_TIGHT
@@ -2040,7 +2044,7 @@
 
 
 /* It is not enough to say just "ISO2022" on encoding, we have to
-   specify more details. In Emacs, each coding system of ISO2022
+   specify more details. In Emacs, each ISO2022 coding system
    variant has the following specifications:
         1. Initial designation to G0 thru G3.
         2. Allows short-form designation?
@@ -2635,7 +2639,7 @@
 
 /*** 4. SJIS and BIG5 handlers ***/
 
-/* Although SJIS and BIG5 are not ISO's coding system, they are used
+/* Although SJIS and BIG5 are not ISO coding systems, they are used
    quite widely. So, for the moment, Emacs supports them in the bare
    C code. But, in the future, they may be supported only by CCL. */
 
@@ -2644,7 +2648,7 @@
    as is. A character of charset katakana-jisx0201 is encoded by
    "position-code + 0x80". A character of charset japanese-jisx0208
    is encoded in 2-byte but two position-codes are divided and shifted
-   so that it fit in the range below.
+   so that it fits in the range below.
 
    --- CODE RANGE of SJIS ---
    (character set) (range)
@@ -2658,7 +2662,7 @@
 
 /* BIG5 is a coding system encoding two character sets: ASCII and
    Big5. An ASCII character is encoded as is. Big5 is a two-byte
-   character set and is encoded in two-byte.
+   character set and is encoded in two bytes.
 
    --- CODE RANGE of BIG5 ---
    (character set) (range)
@@ -3310,15 +3314,15 @@
 
 /*** 7. C library functions ***/
 
-/* In Emacs Lisp, coding system is represented by a Lisp symbol which
+/* In Emacs Lisp, a coding system is represented by a Lisp symbol which
    has a property `coding-system'. The value of this property is a
-   vector of length 5 (called as coding-vector). Among elements of
+   vector of length 5 (called the coding-vector). Among elements of
    this vector, the first (element[0]) and the fifth (element[4])
    carry important information for decoding/encoding. Before
    decoding/encoding, this information should be set in fields of a
    structure of type `coding_system'.
 
-   A value of property `coding-system' can be a symbol of another
+   The value of the property `coding-system' can be a symbol of another
    subsidiary coding-system. In that case, Emacs gets coding-vector
    from that symbol.
 
@@ -3362,12 +3366,12 @@
 
    If `coding->type' takes the other value, element[4] is ignored.
 
-   Emacs Lisp's coding system also carries information about format of
+   Emacs Lisp's coding systems also carry information about format of
    end-of-line in a value of property `eol-type'. If the value is
    integer, 0 means CODING_EOL_LF, 1 means CODING_EOL_CRLF, and 2
    means CODING_EOL_CR. If it is not integer, it should be a vector
    of subsidiary coding systems of which property `eol-type' has one
-   of above values.
+   of the above values.
 
 */
 
@@ -3895,10 +3899,10 @@
    `no-conversion' by default.
 
    Each of them is a Lisp symbol and the value is an actual
-   `coding-system's (this is also a Lisp symbol) assigned by a user.
+   `coding-system' (this is also a Lisp symbol) assigned by a user.
    What Emacs does actually is to detect a category of coding system.
    Then, it uses a `coding-system' assigned to it. If Emacs can't
-   decide only one possible category, it selects a category of the
+   decide a single possible category, it selects a category of the
    highest priority. Priorities of categories are also specified by a
    user in a Lisp variable `coding-category-list'.
 
@@ -4188,7 +4192,7 @@
 static int
 detect_eol_type_in_2_octet_form (source, src_bytes, skip, big_endian_p)
      unsigned char *source;
-     int src_bytes, *skip;
+     int src_bytes, *skip, big_endian_p;
 {
   unsigned char *src = source, *src_end = src + src_bytes;
   unsigned int c1, c2;
@@ -6406,7 +6410,7 @@
 
 DEFUN ("decode-coding-region", Fdecode_coding_region, Sdecode_coding_region,
        3, 3, "r\nzCoding system: ",
-  "Decode the current region by specified coding system.\n\
+  "Decode the current region from the specified coding system.\n\
 When called from a program, takes three arguments:\n\
 START, END, and CODING-SYSTEM. START and END are buffer positions.\n\
 This function sets `last-coding-system-used' to the precise coding system\n\
@@ -6421,7 +6425,7 @@
 
 DEFUN ("encode-coding-region", Fencode_coding_region, Sencode_coding_region,
        3, 3, "r\nzCoding system: ",
-  "Encode the current region by specified coding system.\n\
+  "Encode the current region into the specified coding system.\n\
 When called from a program, takes three arguments:\n\
 START, END, and CODING-SYSTEM. START and END are buffer positions.\n\
 This function sets `last-coding-system-used' to the precise coding system\n\
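
The only code change in this changeset declares `big_endian_p` in the old-style parameter list of `detect_eol_type_in_2_octet_form`. In a pre-ANSI (K&R) definition an undeclared parameter is implicitly `int`, so the change most likely makes the type explicit without changing behaviour. The body of that function is not shown in the diff, so the following is only a minimal sketch of how end-of-line detection over 2-octet code units might work, written with ANSI prototypes. Every name prefixed `sketch_` or `SKETCH_` is hypothetical and not taken from coding.c; the integer values follow the `eol-type' commentary above (0 = LF, 1 = CRLF, 2 = CR).

```c
#include <stddef.h>

/* Hypothetical names, not from coding.c.  The values mirror the
   `eol-type' integers described in the commentary.  */
enum sketch_eol_type
{
  SKETCH_EOL_UNDECIDED = -1,
  SKETCH_EOL_LF = 0,
  SKETCH_EOL_CRLF = 1,
  SKETCH_EOL_CR = 2
};

/* Read one 16-bit code unit at P in the requested byte order.  */
static unsigned int
sketch_read_unit (const unsigned char *p, int big_endian_p)
{
  return big_endian_p ? ((unsigned int) p[0] << 8) | p[1]
                      : ((unsigned int) p[1] << 8) | p[0];
}

/* Return the end-of-line type of the first line break found in the
   SRC_BYTES bytes at SOURCE, read as 2-octet code units.  A trailing
   odd byte is ignored.  */
static enum sketch_eol_type
sketch_detect_eol_2_octet (const unsigned char *source, size_t src_bytes,
                           int big_endian_p)
{
  size_t i;

  for (i = 0; i + 1 < src_bytes; i += 2)
    {
      unsigned int c = sketch_read_unit (source + i, big_endian_p);

      if (c == 0x0A)
        return SKETCH_EOL_LF;
      if (c == 0x0D)
        {
          /* CR directly followed by LF is CRLF; a CR at the end of
             the input or before any other unit counts as CR.  */
          if (i + 3 < src_bytes
              && sketch_read_unit (source + i + 2, big_endian_p) == 0x0A)
            return SKETCH_EOL_CRLF;
          return SKETCH_EOL_CR;
        }
    }
  return SKETCH_EOL_UNDECIDED;
}
```

Reading the same bytes with the wrong byte order turns 0x000A into 0x0A00, so no line break is recognised. This is presumably why the real function takes the byte order as an argument.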