comparison doc/lispref/nonascii.texi @ 106711:b87d77f96245
Consistently use hex notation to represent character codes.
* nonascii.texi (Text Representations, Character Codes)
(Converting Representations, Explicit Encoding)
(Translation of Characters): Use hex notation consistently.
(Character Sets): Fix map-charset-chars doc (Bug#5197).
author | Chong Yidong <cyd@stupidchicken.com> |
---|---|
date | Sat, 02 Jan 2010 13:55:19 -0500 |
parents | 810bd90737d5 |
children | 1d1d5d9bd884 |
106710:a96887ed3368 | 106711:b87d77f96245 |
---|---|
44 @cindex Unicode | 44 @cindex Unicode |
45 To support this multitude of characters and scripts, Emacs closely | 45 To support this multitude of characters and scripts, Emacs closely |
46 follows the @dfn{Unicode Standard}. The Unicode Standard assigns a | 46 follows the @dfn{Unicode Standard}. The Unicode Standard assigns a |
47 unique number, called a @dfn{codepoint}, to each and every character. | 47 unique number, called a @dfn{codepoint}, to each and every character. |
48 The range of codepoints defined by Unicode, or the Unicode | 48 The range of codepoints defined by Unicode, or the Unicode |
49 @dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive. Emacs | 49 @dfn{codespace}, is @code{0..#x10FFFF} (in hexadecimal notation), |
50 extends this range with codepoints in the range @code{110000..3FFFFF}, | 50 inclusive. Emacs extends this range with codepoints in the range |
51 which it uses for representing characters that are not unified with | 51 @code{#x110000..#x3FFFFF}, which it uses for representing characters |
52 Unicode and raw 8-bit bytes that cannot be interpreted as characters | 52 that are not unified with Unicode and @dfn{raw 8-bit bytes} that |
53 (the latter occupy the range @code{3FFF80..3FFFFF}). Thus, a | 53 cannot be interpreted as characters. Thus, a character codepoint in |
54 character codepoint in Emacs is a 22-bit integer number. | 54 Emacs is a 22-bit integer number. |
55 | 55 |
56 @cindex internal representation of characters | 56 @cindex internal representation of characters |
57 @cindex characters, representation in buffers and strings | 57 @cindex characters, representation in buffers and strings |
58 @cindex multibyte text | 58 @cindex multibyte text |
59 To conserve memory, Emacs does not hold fixed-length 22-bit numbers | 59 To conserve memory, Emacs does not hold fixed-length 22-bit numbers |
187 This function returns a multibyte string containing the same sequence | 187 This function returns a multibyte string containing the same sequence |
188 of characters as @var{string}. If @var{string} is a multibyte string, | 188 of characters as @var{string}. If @var{string} is a multibyte string, |
189 it is returned unchanged. The function assumes that @var{string} | 189 it is returned unchanged. The function assumes that @var{string} |
190 includes only @acronym{ASCII} characters and raw 8-bit bytes; the | 190 includes only @acronym{ASCII} characters and raw 8-bit bytes; the |
191 latter are converted to their multibyte representation corresponding | 191 latter are converted to their multibyte representation corresponding |
192 to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text | 192 to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive |
193 Representations, codepoints}). | 193 (@pxref{Text Representations, codepoints}). |
194 @end defun | 194 @end defun |
195 | 195 |
196 @defun string-to-unibyte string | 196 @defun string-to-unibyte string |
197 This function returns a unibyte string containing the same sequence of | 197 This function returns a unibyte string containing the same sequence of |
198 characters as @var{string}. It signals an error if @var{string} | 198 characters as @var{string}. It signals an error if @var{string} |
269 @section Character Codes | 269 @section Character Codes |
270 @cindex character codes | 270 @cindex character codes |
271 | 271 |
272 The unibyte and multibyte text representations use different | 272 The unibyte and multibyte text representations use different |
273 character codes. The valid character codes for unibyte representation | 273 character codes. The valid character codes for unibyte representation |
274 range from 0 to 255---the values that can fit in one byte. The valid | 274 range from 0 to @code{#xFF} (255)---the values that can fit in one |
275 character codes for multibyte representation range from 0 to 4194303 | 275 byte. The valid character codes for multibyte representation range |
276 (#x3FFFFF). In this code space, values 0 through 127 are for | 276 from 0 to @code{#x3FFFFF}. In this code space, values 0 through |
277 @acronym{ASCII} characters, and values 128 through 4194175 (#x3FFF7F) | 277 @code{#x7F} (127) are for @acronym{ASCII} characters, and values |
278 are for non-@acronym{ASCII} characters. Values 0 through 1114111 | 278 @code{#x80} (128) through @code{#x3FFF7F} (4194175) are for |
279 (#10FFFF) correspond to Unicode characters of the same codepoint; | 279 non-@acronym{ASCII} characters. |
280 values 1114112 (#110000) through 4194175 (#x3FFF7F) represent | 280 |
281 characters that are not unified with Unicode; and values 4194176 | 281 Emacs character codes are a superset of the Unicode standard. |
282 (#x3FFF80) through 4194303 (#x3FFFFF) represent eight-bit raw bytes. | 282 Values 0 through @code{#x10FFFF} (1114111) correspond to Unicode |
283 characters of the same codepoint; values @code{#x110000} (1114112) | |
284 through @code{#x3FFF7F} (4194175) represent characters that are not | |
285 unified with Unicode; and values @code{#x3FFF80} (4194176) through | |
286 @code{#x3FFFFF} (4194303) represent eight-bit raw bytes. | |
283 | 287 |
284 @defun characterp charcode | 288 @defun characterp charcode |
285 This returns @code{t} if @var{charcode} is a valid character, and | 289 This returns @code{t} if @var{charcode} is a valid character, and |
286 @code{nil} otherwise. | 290 @code{nil} otherwise. |
287 | 291 |
538 @cindex @code{emacs}, a charset | 542 @cindex @code{emacs}, a charset |
539 @cindex @code{unicode}, a charset | 543 @cindex @code{unicode}, a charset |
540 @cindex @code{eight-bit}, a charset | 544 @cindex @code{eight-bit}, a charset |
541 Emacs defines several special character sets. The character set | 545 Emacs defines several special character sets. The character set |
542 @code{unicode} includes all the characters whose Emacs code points are | 546 @code{unicode} includes all the characters whose Emacs code points are |
543 in the range @code{0..10FFFF}. The character set @code{emacs} | 547 in the range @code{0..#x10FFFF}. The character set @code{emacs} |
544 includes all @acronym{ASCII} and non-@acronym{ASCII} characters. | 548 includes all @acronym{ASCII} and non-@acronym{ASCII} characters. |
545 Finally, the @code{eight-bit} charset includes the 8-bit raw bytes; | 549 Finally, the @code{eight-bit} charset includes the 8-bit raw bytes; |
546 Emacs uses it to represent raw bytes encountered in text. | 550 Emacs uses it to represent raw bytes encountered in text. |
547 | 551 |
548 @defun charsetp object | 552 @defun charsetp object |
626 @end defun | 630 @end defun |
627 | 631 |
628 The following function comes in handy for applying a certain | 632 The following function comes in handy for applying a certain |
629 function to all or part of the characters in a charset: | 633 function to all or part of the characters in a charset: |
630 | 634 |
631 @defun map-charset-chars function charset &optional arg from to | 635 @defun map-charset-chars function charset &optional arg from-code to-code |
632 Call @var{function} for characters in @var{charset}. @var{function} | 636 Call @var{function} for characters in @var{charset}. @var{function} |
633 is called with two arguments. The first one is a cons cell | 637 is called with two arguments. The first one is a cons cell |
634 @code{(@var{from} . @var{to})}, where @var{from} and @var{to} | 638 @code{(@var{from} . @var{to})}, where @var{from} and @var{to} |
635 indicate a range of characters contained in charset. The second | 639 indicate a range of characters contained in charset. The second |
636 argument is the optional argument @var{arg}. | 640 argument passed to @var{function} is @var{arg}. |
637 | 641 |
638 By default, the range of codepoints passed to @var{function} includes | 642 By default, the range of codepoints passed to @var{function} includes |
639 all the characters in @var{charset}, but optional arguments | 643 all the characters in @var{charset}, but optional arguments |
640 @var{from-code} and @var{to-code} limit that to the range of | 644 @var{from-code} and @var{to-code} limit that to the range of |
641 characters between these two codepoints of @var{charset}. If either | 645 characters between these two codepoints of @var{charset}. If either |
749 This variable automatically becomes buffer-local when set. | 753 This variable automatically becomes buffer-local when set. |
750 @end defvar | 754 @end defvar |
751 | 755 |
752 @defun make-translation-table-from-vector vec | 756 @defun make-translation-table-from-vector vec |
753 This function returns a translation table made from @var{vec} that is | 757 This function returns a translation table made from @var{vec} that is |
754 an array of 256 elements to map byte values 0 through 255 to | 758 an array of 256 elements to map bytes (values 0 through #xFF) to |
755 characters. Elements may be @code{nil} for untranslated bytes. The | 759 characters. Elements may be @code{nil} for untranslated bytes. The |
756 returned table has a translation table for reverse mapping in the | 760 returned table has a translation table for reverse mapping in the |
757 first extra slot, and the value @code{1} in the second extra slot. | 761 first extra slot, and the value @code{1} in the second extra slot. |
758 | 762 |
759 This function provides an easy way to make a private coding system | 763 This function provides an easy way to make a private coding system |
1560 | 1564 |
1561 The result of encoding, and the input to decoding, are not ordinary | 1565 The result of encoding, and the input to decoding, are not ordinary |
1562 text. They logically consist of a series of byte values; that is, a | 1566 text. They logically consist of a series of byte values; that is, a |
1563 series of @acronym{ASCII} and eight-bit characters. In unibyte | 1567 series of @acronym{ASCII} and eight-bit characters. In unibyte |
1564 buffers and strings, these characters have codes in the range 0 | 1568 buffers and strings, these characters have codes in the range 0 |
1565 through 255. In a multibyte buffer or string, eight-bit characters | 1569 through #xFF (255). In a multibyte buffer or string, eight-bit |
1566 have character codes higher than 255 (@pxref{Text Representations}), | 1570 characters have character codes higher than #xFF (@pxref{Text |
1567 but Emacs transparently converts them to their single-byte values when | 1571 Representations}), but Emacs transparently converts them to their |
1568 you encode or decode such text. | 1572 single-byte values when you encode or decode such text. |
1569 | 1573 |
1570 The usual way to read a file into a buffer as a sequence of bytes, so | 1574 The usual way to read a file into a buffer as a sequence of bytes, so |
1571 you can decode the contents explicitly, is with | 1575 you can decode the contents explicitly, is with |
1572 @code{insert-file-contents-literally} (@pxref{Reading from Files}); | 1576 @code{insert-file-contents-literally} (@pxref{Reading from Files}); |
1573 alternatively, specify a non-@code{nil} @var{rawfile} argument when | 1577 alternatively, specify a non-@code{nil} @var{rawfile} argument when |