comparison doc/lispref/nonascii.texi @ 106711:b87d77f96245

Consistently use hex notation to represent character codes. * nonascii.texi (Text Representations, Character Codes) (Converting Representations, Explicit Encoding) (Translation of Characters): Use hex notation consistently. (Character Sets): Fix map-charset-chars doc (Bug#5197).
author Chong Yidong <cyd@stupidchicken.com>
date Sat, 02 Jan 2010 13:55:19 -0500
parents 810bd90737d5
children 1d1d5d9bd884
106710:a96887ed3368 106711:b87d77f96245
44 @cindex Unicode 44 @cindex Unicode
45 To support this multitude of characters and scripts, Emacs closely 45 To support this multitude of characters and scripts, Emacs closely
46 follows the @dfn{Unicode Standard}. The Unicode Standard assigns a 46 follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
47 unique number, called a @dfn{codepoint}, to each and every character. 47 unique number, called a @dfn{codepoint}, to each and every character.
48 The range of codepoints defined by Unicode, or the Unicode 48 The range of codepoints defined by Unicode, or the Unicode
49 @dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive. Emacs 49 @dfn{codespace}, is @code{0..#x10FFFF} (in hexadecimal notation),
50 extends this range with codepoints in the range @code{110000..3FFFFF}, 50 inclusive. Emacs extends this range with codepoints in the range
51 which it uses for representing characters that are not unified with 51 @code{#x110000..#x3FFFFF}, which it uses for representing characters
52 Unicode and raw 8-bit bytes that cannot be interpreted as characters 52 that are not unified with Unicode and @dfn{raw 8-bit bytes} that
53 (the latter occupy the range @code{3FFF80..3FFFFF}). Thus, a 53 cannot be interpreted as characters. Thus, a character codepoint in
54 character codepoint in Emacs is a 22-bit integer number. 54 Emacs is a 22-bit integer number.
55 55
56 @cindex internal representation of characters 56 @cindex internal representation of characters
57 @cindex characters, representation in buffers and strings 57 @cindex characters, representation in buffers and strings
58 @cindex multibyte text 58 @cindex multibyte text
59 To conserve memory, Emacs does not hold fixed-length 22-bit numbers 59 To conserve memory, Emacs does not hold fixed-length 22-bit numbers
187 This function returns a multibyte string containing the same sequence 187 This function returns a multibyte string containing the same sequence
188 of characters as @var{string}. If @var{string} is a multibyte string, 188 of characters as @var{string}. If @var{string} is a multibyte string,
189 it is returned unchanged. The function assumes that @var{string} 189 it is returned unchanged. The function assumes that @var{string}
190 includes only @acronym{ASCII} characters and raw 8-bit bytes; the 190 includes only @acronym{ASCII} characters and raw 8-bit bytes; the
191 latter are converted to their multibyte representation corresponding 191 latter are converted to their multibyte representation corresponding
192 to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text 192 to the codepoints @code{#x3FFF80} through @code{#x3FFFFF}, inclusive
193 Representations, codepoints}). 193 (@pxref{Text Representations, codepoints}).
194 @end defun 194 @end defun
195 195
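As an illustration of the raw-byte mapping described for `string-to-multibyte`, the following is a minimal Python sketch, not Emacs code. The function name `raw_byte_to_codepoint` and the constant `RAW_BYTE_BASE` are hypothetical. The sketch assumes the mapping preserves order, so that byte `#x80` lands on `#x3FFF80` and byte `#xFF` on `#x3FFFFF`; the text above states only the two endpoints of the target range.

```python
# Assumed offset: chosen so that 0x80 maps to 0x3FFF80 and 0xFF to 0x3FFFFF.
RAW_BYTE_BASE = 0x3FFF00


def raw_byte_to_codepoint(byte):
    """Return the multibyte codepoint for a single unibyte value.

    ASCII values (0..0x7F) keep their code; raw 8-bit bytes (0x80..0xFF)
    are shifted into the 0x3FFF80..0x3FFFFF area.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value: %r" % (byte,))
    if byte < 0x80:
        return byte
    return RAW_BYTE_BASE + byte
```

Under this assumption, the 128 raw byte values fill the range `#x3FFF80..#x3FFFFF` exactly, with no gaps.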
196 @defun string-to-unibyte string 196 @defun string-to-unibyte string
197 This function returns a unibyte string containing the same sequence of 197 This function returns a unibyte string containing the same sequence of
198 characters as @var{string}. It signals an error if @var{string} 198 characters as @var{string}. It signals an error if @var{string}
269 @section Character Codes 269 @section Character Codes
270 @cindex character codes 270 @cindex character codes
271 271
272 The unibyte and multibyte text representations use different 272 The unibyte and multibyte text representations use different
273 character codes. The valid character codes for unibyte representation 273 character codes. The valid character codes for unibyte representation
274 range from 0 to 255---the values that can fit in one byte. The valid 274 range from 0 to @code{#xFF} (255)---the values that can fit in one
275 character codes for multibyte representation range from 0 to 4194303 275 byte. The valid character codes for multibyte representation range
276 (#x3FFFFF). In this code space, values 0 through 127 are for 276 from 0 to @code{#x3FFFFF}. In this code space, values 0 through
277 @acronym{ASCII} characters, and values 128 through 4194175 (#x3FFF7F) 277 @code{#x7F} (127) are for @acronym{ASCII} characters, and values
278 are for non-@acronym{ASCII} characters. Values 0 through 1114111 278 @code{#x80} (128) through @code{#x3FFF7F} (4194175) are for
279 (#10FFFF) correspond to Unicode characters of the same codepoint; 279 non-@acronym{ASCII} characters.
280 values 1114112 (#110000) through 4194175 (#x3FFF7F) represent 280
281 characters that are not unified with Unicode; and values 4194176 281 Emacs character codes are a superset of the Unicode standard.
282 (#x3FFF80) through 4194303 (#x3FFFFF) represent eight-bit raw bytes. 282 Values 0 through @code{#x10FFFF} (1114111) correspond to Unicode
283 characters of the same codepoint; values @code{#x110000} (1114112)
284 through @code{#x3FFF7F} (4194175) represent characters that are not
285 unified with Unicode; and values @code{#x3FFF80} (4194176) through
286 @code{#x3FFFFF} (4194303) represent eight-bit raw bytes.
283 287
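The ranges listed in the Character Codes text above can be sketched as a small classifier. This is an illustrative Python sketch rather than anything taken from Emacs; the name `classify_char_code` and the category labels are hypothetical, and only the numeric boundaries come from the text.

```python
MAX_CHAR = 0x3FFFFF  # largest multibyte character code; equals 2**22 - 1


def classify_char_code(code):
    """Return a label for the range that CODE falls in."""
    if not 0 <= code <= MAX_CHAR:
        return "invalid"
    if code <= 0x7F:
        return "ascii"
    if code <= 0x10FFFF:
        return "unicode"       # same codepoint as in Unicode
    if code <= 0x3FFF7F:
        return "non-unified"   # characters not unified with Unicode
    return "raw-byte"          # 0x3FFF80..0x3FFFFF, eight-bit raw bytes
```

Because `MAX_CHAR` is `2**22 - 1`, every valid code fits in 22 bits, which matches the statement that a character codepoint in Emacs is a 22-bit integer.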
284 @defun characterp charcode 288 @defun characterp charcode
285 This returns @code{t} if @var{charcode} is a valid character, and 289 This returns @code{t} if @var{charcode} is a valid character, and
286 @code{nil} otherwise. 290 @code{nil} otherwise.
287 291
538 @cindex @code{emacs}, a charset 542 @cindex @code{emacs}, a charset
539 @cindex @code{unicode}, a charset 543 @cindex @code{unicode}, a charset
540 @cindex @code{eight-bit}, a charset 544 @cindex @code{eight-bit}, a charset
541 Emacs defines several special character sets. The character set 545 Emacs defines several special character sets. The character set
542 @code{unicode} includes all the characters whose Emacs code points are 546 @code{unicode} includes all the characters whose Emacs code points are
543 in the range @code{0..10FFFF}. The character set @code{emacs} 547 in the range @code{0..#x10FFFF}. The character set @code{emacs}
544 includes all @acronym{ASCII} and non-@acronym{ASCII} characters. 548 includes all @acronym{ASCII} and non-@acronym{ASCII} characters.
545 Finally, the @code{eight-bit} charset includes the 8-bit raw bytes; 549 Finally, the @code{eight-bit} charset includes the 8-bit raw bytes;
546 Emacs uses it to represent raw bytes encountered in text. 550 Emacs uses it to represent raw bytes encountered in text.
547 551
548 @defun charsetp object 552 @defun charsetp object
626 @end defun 630 @end defun
627 631
628 The following function comes in handy for applying a certain 632 The following function comes in handy for applying a certain
629 function to all or part of the characters in a charset: 633 function to all or part of the characters in a charset:
630 634
631 @defun map-charset-chars function charset &optional arg from to 635 @defun map-charset-chars function charset &optional arg from-code to-code
632 Call @var{function} for characters in @var{charset}. @var{function} 636 Call @var{function} for characters in @var{charset}. @var{function}
633 is called with two arguments. The first one is a cons cell 637 is called with two arguments. The first one is a cons cell
634 @code{(@var{from} . @var{to})}, where @var{from} and @var{to} 638 @code{(@var{from} . @var{to})}, where @var{from} and @var{to}
635 indicate a range of characters contained in charset. The second 639 indicate a range of characters contained in charset. The second
636 argument is the optional argument @var{arg}. 640 argument passed to @var{function} is @var{arg}.
637 641
638 By default, the range of codepoints passed to @var{function} includes 642 By default, the range of codepoints passed to @var{function} includes
639 all the characters in @var{charset}, but optional arguments 643 all the characters in @var{charset}, but optional arguments
640 @var{from-code} and @var{to-code} limit that to the range of 644 @var{from-code} and @var{to-code} limit that to the range of
641 characters between these two codepoints of @var{charset}. If either 645 characters between these two codepoints of @var{charset}. If either
749 This variable automatically becomes buffer-local when set. 753 This variable automatically becomes buffer-local when set.
750 @end defvar 754 @end defvar
751 755
752 @defun make-translation-table-from-vector vec 756 @defun make-translation-table-from-vector vec
753 This function returns a translation table made from @var{vec} that is 757 This function returns a translation table made from @var{vec} that is
754 an array of 256 elements to map byte values 0 through 255 to 758 an array of 256 elements to map bytes (values 0 through #xFF) to
755 characters. Elements may be @code{nil} for untranslated bytes. The 759 characters. Elements may be @code{nil} for untranslated bytes. The
756 returned table has a translation table for reverse mapping in the 760 returned table has a translation table for reverse mapping in the
757 first extra slot, and the value @code{1} in the second extra slot. 761 first extra slot, and the value @code{1} in the second extra slot.
758 762
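The idea behind `make-translation-table-from-vector` (a 256-element vector, `nil` for untranslated bytes, plus a reverse mapping) can be sketched in Python as follows. This only models the data flow and does not build an Emacs char-table. The name `make_translation_maps` is hypothetical, `None` stands in for `nil`, and keeping the first byte when two bytes map to the same character is an assumption the text does not settle.

```python
def make_translation_maps(vec):
    """Build forward (byte -> char) and reverse (char -> byte) mappings.

    VEC must hold 256 entries; an entry of None leaves that byte untranslated.
    """
    if len(vec) != 256:
        raise ValueError("expected 256 elements, got %d" % len(vec))
    forward = {}
    reverse = {}
    for byte, char in enumerate(vec):
        if char is None:
            continue  # untranslated byte
        forward[byte] = char
        # Assumption: on duplicates, the lowest byte value wins.
        reverse.setdefault(char, byte)
    return forward, reverse
```

For example, a vector that is `None` everywhere except for `0x20AC` at index `0xA4` yields a forward map from byte `0xA4` to the euro sign and a reverse map back to `0xA4`.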
759 This function provides an easy way to make a private coding system 763 This function provides an easy way to make a private coding system
1560 1564
1561 The result of encoding, and the input to decoding, are not ordinary 1565 The result of encoding, and the input to decoding, are not ordinary
1562 text. They logically consist of a series of byte values; that is, a 1566 text. They logically consist of a series of byte values; that is, a
1563 series of @acronym{ASCII} and eight-bit characters. In unibyte 1567 series of @acronym{ASCII} and eight-bit characters. In unibyte
1564 buffers and strings, these characters have codes in the range 0 1568 buffers and strings, these characters have codes in the range 0
1565 through 255. In a multibyte buffer or string, eight-bit characters 1569 through #xFF (255). In a multibyte buffer or string, eight-bit
1566 have character codes higher than 255 (@pxref{Text Representations}), 1570 characters have character codes higher than #xFF (@pxref{Text
1567 but Emacs transparently converts them to their single-byte values when 1571 Representations}), but Emacs transparently converts them to their
1568 you encode or decode such text. 1572 single-byte values when you encode or decode such text.
1569 1573
1570 The usual way to read a file into a buffer as a sequence of bytes, so 1574 The usual way to read a file into a buffer as a sequence of bytes, so
1571 you can decode the contents explicitly, is with 1575 you can decode the contents explicitly, is with
1572 @code{insert-file-contents-literally} (@pxref{Reading from Files}); 1576 @code{insert-file-contents-literally} (@pxref{Reading from Files});
1573 alternatively, specify a non-@code{nil} @var{rawfile} argument when 1577 alternatively, specify a non-@code{nil} @var{rawfile} argument when