comparison doc/lispref/nonascii.texi @ 99313:175420e76f65
(Text Representations): Rewrite to make consistent with Emacs 23
internal representation of characters. Document `unibyte-string'.
author | Eli Zaretskii <eliz@gnu.org> |
---|---|
date | Sat, 01 Nov 2008 16:31:47 +0000 |
parents | 7c989edf1f9f |
children | 512ddf0d1748 |
comparison
99312:90b4d44d8513 | 99313:175420e76f65 |
---|---|
8 @chapter Non-@acronym{ASCII} Characters | 8 @chapter Non-@acronym{ASCII} Characters |
9 @cindex multibyte characters | 9 @cindex multibyte characters |
10 @cindex characters, multi-byte | 10 @cindex characters, multi-byte |
11 @cindex non-@acronym{ASCII} characters | 11 @cindex non-@acronym{ASCII} characters |
12 | 12 |
13 This chapter covers the special issues relating to non-@acronym{ASCII} | 13 This chapter covers the special issues relating to characters and |
14 characters and how they are stored in strings and buffers. | 14 how they are stored in strings and buffers. |
15 | 15 |
16 @menu | 16 @menu |
17 * Text Representations:: Unibyte and multibyte representations | 17 * Text Representations:: How Emacs represents text. |
18 * Converting Representations:: Converting unibyte to multibyte and vice versa. | 18 * Converting Representations:: Converting unibyte to multibyte and vice versa. |
19 * Selecting a Representation:: Treating a byte sequence as unibyte or multi. | 19 * Selecting a Representation:: Treating a byte sequence as unibyte or multi. |
20 * Character Codes:: How unibyte and multibyte relate to | 20 * Character Codes:: How unibyte and multibyte relate to |
21 codes of individual characters. | 21 codes of individual characters. |
22 * Character Sets:: The space of possible character codes | 22 * Character Sets:: The space of possible character codes |
31 * Locales:: Interacting with the POSIX locale. | 31 * Locales:: Interacting with the POSIX locale. |
32 @end menu | 32 @end menu |
33 | 33 |
34 @node Text Representations | 34 @node Text Representations |
35 @section Text Representations | 35 @section Text Representations |
36 @cindex text representations | 36 @cindex text representation |
37 | 37 |
38 Emacs has two @dfn{text representations}---two ways to represent text | 38 Emacs buffers and strings support a large repertoire of characters |
39 in a string or buffer. These are called @dfn{unibyte} and | 39 from many different scripts. This is so users could type and display |
40 @dfn{multibyte}. Each string, and each buffer, uses one of these two | 40 text in most any known written language. |
41 representations. For most purposes, you can ignore the issue of | 41 |
42 representations, because Emacs converts text between them as | 42 @cindex character codepoint |
43 appropriate. Occasionally in Lisp programming you will need to pay | 43 @cindex codespace |
44 attention to the difference. | 44 @cindex Unicode |
| | 45 To support this multitude of characters and scripts, Emacs closely |
| | 46 follows the @dfn{Unicode Standard}. The Unicode Standard assigns a |
| | 47 unique number, called a @dfn{codepoint}, to each and every character. |
| | 48 The range of codepoints defined by Unicode, or the Unicode |
| | 49 @dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs |
| | 50 extends this range with codepoints in the range @code{3FFF80..3FFFFF}, |
| | 51 which it uses for representing raw 8-bit bytes that cannot be |
| | 52 interpreted as characters. Thus, a character codepoint in Emacs is a |
| | 53 22-bit integer number. |
| | 54 |
| | 55 @cindex internal representation of characters |
| | 56 @cindex characters, representation in buffers and strings |
| | 57 @cindex multibyte text |
| | 58 To conserve memory, Emacs does not hold fixed-length 22-bit numbers |
| | 59 that are codepoints of text characters within buffers and strings. |
| | 60 Rather, Emacs uses a variable-length internal representation of |
| | 61 characters, that stores each character as a sequence of 1 to 5 8-bit |
| | 62 bytes, depending on the magnitude of its codepoint@footnote{ |
| | 63 This internal representation is based on one of the encodings defined |
| | 64 by the Unicode Standard, called @dfn{UTF-8}, for representing any |
| | 65 Unicode codepoint, but Emacs extends UTF-8 to represent the additional |
| | 66 codepoints it uses for raw 8-bit bytes.}. |
| | 67 For example, any @acronym{ASCII} character takes up only 1 byte, a |
| | 68 Latin-1 character takes up 2 bytes, etc. We call this representation |
| | 69 of text @dfn{multibyte}, because it uses several bytes for each |
| | 70 character. |
| | 71 |
| | 72 Outside Emacs, characters can be represented in many different |
| | 73 encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts |
| | 74 between these external encodings and the internal representation, as |
| | 75 appropriate, when it reads text into a buffer or a string, or when it |
| | 76 writes text to a disk file or passes it to some other process. |
| | 77 |
| | 78 Occasionally, Emacs needs to hold and manipulate encoded text or |
| | 79 binary non-text data in its buffer or string. For example, when Emacs |
| | 80 visits a file, it first reads the file's text verbatim into a buffer, |
| | 81 and only then converts it to the internal representation. Before the |
| | 82 conversion, the buffer holds encoded text. |
45 | 83 |
46 @cindex unibyte text | 84 @cindex unibyte text |
47 In unibyte representation, each character occupies one byte and | 85 Encoded text is not really text, as far as Emacs is concerned, but |
48 therefore the possible character codes range from 0 to 255. Codes 0 | 86 rather a sequence of raw 8-bit bytes. We call buffers and strings |
49 through 127 are @acronym{ASCII} characters; the codes from 128 through 255 | 87 that hold encoded text @dfn{unibyte} buffers and strings, because |
50 are used for one non-@acronym{ASCII} character set (you can choose which | 88 Emacs treats them as a sequence of individual bytes. In particular, |
51 character set by setting the variable @code{nonascii-insert-offset}). | 89 Emacs usually displays unibyte buffers and strings as octal codes such |
52 | 90 as @code{\237}. We recommend that you never use unibyte buffers and |
53 @cindex leading code | 91 strings except for manipulating encoded text or binary non-text data. |
54 @cindex multibyte text | |
55 @cindex trailing codes | |
56 In multibyte representation, a character may occupy more than one | |
57 byte, and as a result, the full range of Emacs character codes can be | |
58 stored. The first byte of a multibyte character is always in the range | |
59 128 through 159 (octal 0200 through 0237). These values are called | |
60 @dfn{leading codes}. The second and subsequent bytes of a multibyte | |
61 character are always in the range 160 through 255 (octal 0240 through | |
62 0377); these values are @dfn{trailing codes}. | |
63 | |
64 Some sequences of bytes are not valid in multibyte text: for example, | |
65 a single isolated byte in the range 128 through 159 is not allowed. But | |
66 character codes 128 through 159 can appear in multibyte text, | |
67 represented as two-byte sequences. All the character codes 128 through | |
68 255 are possible (though slightly abnormal) in multibyte text; they | |
69 appear in multibyte buffers and strings when you do explicit encoding | |
70 and decoding (@pxref{Explicit Encoding}). | |
71 | 92 |
72 In a buffer, the buffer-local value of the variable | 93 In a buffer, the buffer-local value of the variable |
73 @code{enable-multibyte-characters} specifies the representation used. | 94 @code{enable-multibyte-characters} specifies the representation used. |
74 The representation for a string is determined and recorded in the string | 95 The representation for a string is determined and recorded in the string |
75 when the string is constructed. | 96 when the string is constructed. |
76 | 97 |
77 @defvar enable-multibyte-characters | 98 @defvar enable-multibyte-characters |
78 This variable specifies the current buffer's text representation. | 99 This variable specifies the current buffer's text representation. |
79 If it is non-@code{nil}, the buffer contains multibyte text; otherwise, | 100 If it is non-@code{nil}, the buffer contains multibyte text; otherwise, |
80 it contains unibyte text. | 101 it contains unibyte encoded text or binary non-text data. |
81 | 102 |
82 You cannot set this variable directly; instead, use the function | 103 You cannot set this variable directly; instead, use the function |
83 @code{set-buffer-multibyte} to change a buffer's representation. | 104 @code{set-buffer-multibyte} to change a buffer's representation. |
84 @end defvar | 105 @end defvar |
85 | 106 |
94 The @samp{--unibyte} command line option does its job by setting the | 115 The @samp{--unibyte} command line option does its job by setting the |
95 default value to @code{nil} early in startup. | 116 default value to @code{nil} early in startup. |
96 @end defvar | 117 @end defvar |
97 | 118 |
98 @defun position-bytes position | 119 @defun position-bytes position |
99 Return the byte-position corresponding to buffer position | 120 Buffer positions are measured in character units. This function |
| | 121 returns the byte-position corresponding to buffer position |
100 @var{position} in the current buffer. This is 1 at the start of the | 122 @var{position} in the current buffer. This is 1 at the start of the |
101 buffer, and counts upward in bytes. If @var{position} is out of | 123 buffer, and counts upward in bytes. If @var{position} is out of |
102 range, the value is @code{nil}. | 124 range, the value is @code{nil}. |
103 @end defun | 125 @end defun |
104 | 126 |
105 @defun byte-to-position byte-position | 127 @defun byte-to-position byte-position |
106 Return the buffer position corresponding to byte-position | 128 Return the buffer position, in character units, corresponding to |
107 @var{byte-position} in the current buffer. If @var{byte-position} is | 129 byte-position @var{byte-position} in the current buffer. If |
108 out of range, the value is @code{nil}. | 130 @var{byte-position} is out of range, the value is @code{nil}. |
109 @end defun | 131 @end defun |
110 | 132 |
111 @defun multibyte-string-p string | 133 @defun multibyte-string-p string |
112 Return @code{t} if @var{string} is a multibyte string. | 134 Return @code{t} if @var{string} is a multibyte string, @code{nil} |
| | 135 otherwise. |
113 @end defun | 136 @end defun |
114 | 137 |
115 @defun string-bytes string | 138 @defun string-bytes string |
116 @cindex string, number of bytes | 139 @cindex string, number of bytes |
117 This function returns the number of bytes in @var{string}. | 140 This function returns the number of bytes in @var{string}. |
118 If @var{string} is a multibyte string, this can be greater than | 141 If @var{string} is a multibyte string, this can be greater than |
119 @code{(length @var{string})}. | 142 @code{(length @var{string})}. |
| | 143 @end defun |
| | 144 |
| | 145 @defun unibyte-string &rest bytes |
| | 146 This function concatenates all its argument @var{bytes} and makes the |
| | 147 result a unibyte string. |
120 @end defun | 148 @end defun |
121 | 149 |
122 @node Converting Representations | 150 @node Converting Representations |
123 @section Converting Text Representations | 151 @section Converting Text Representations |
124 | 152 |