comparison doc/lispref/nonascii.texi @ 99313:175420e76f65

(Text Representations): Rewrite to make consistent with Emacs 23 internal representation of characters. Document `unibyte-string'.
author Eli Zaretskii <eliz@gnu.org>
date Sat, 01 Nov 2008 16:31:47 +0000
parents 7c989edf1f9f
children 512ddf0d1748
99312:90b4d44d8513 99313:175420e76f65
8 @chapter Non-@acronym{ASCII} Characters 8 @chapter Non-@acronym{ASCII} Characters
9 @cindex multibyte characters 9 @cindex multibyte characters
10 @cindex characters, multi-byte 10 @cindex characters, multi-byte
11 @cindex non-@acronym{ASCII} characters 11 @cindex non-@acronym{ASCII} characters
12 12
13 This chapter covers the special issues relating to non-@acronym{ASCII} 13 This chapter covers the special issues relating to characters and
14 characters and how they are stored in strings and buffers. 14 how they are stored in strings and buffers.
15 15
16 @menu 16 @menu
17 * Text Representations:: Unibyte and multibyte representations 17 * Text Representations:: How Emacs represents text.
18 * Converting Representations:: Converting unibyte to multibyte and vice versa. 18 * Converting Representations:: Converting unibyte to multibyte and vice versa.
19 * Selecting a Representation:: Treating a byte sequence as unibyte or multi. 19 * Selecting a Representation:: Treating a byte sequence as unibyte or multi.
20 * Character Codes:: How unibyte and multibyte relate to 20 * Character Codes:: How unibyte and multibyte relate to
21 codes of individual characters. 21 codes of individual characters.
22 * Character Sets:: The space of possible character codes 22 * Character Sets:: The space of possible character codes
31 * Locales:: Interacting with the POSIX locale. 31 * Locales:: Interacting with the POSIX locale.
32 @end menu 32 @end menu
33 33
34 @node Text Representations 34 @node Text Representations
35 @section Text Representations 35 @section Text Representations
36 @cindex text representations 36 @cindex text representation
37 37
38 Emacs has two @dfn{text representations}---two ways to represent text 38 Emacs buffers and strings support a large repertoire of characters
39 in a string or buffer. These are called @dfn{unibyte} and 39 from many different scripts. This is so users could type and display
40 @dfn{multibyte}. Each string, and each buffer, uses one of these two 40 text in most any known written language.
41 representations. For most purposes, you can ignore the issue of 41
42 representations, because Emacs converts text between them as 42 @cindex character codepoint
43 appropriate. Occasionally in Lisp programming you will need to pay 43 @cindex codespace
44 attention to the difference. 44 @cindex Unicode
45 To support this multitude of characters and scripts, Emacs closely
46 follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
47 unique number, called a @dfn{codepoint}, to each and every character.
48 The range of codepoints defined by Unicode, or the Unicode
49 @dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs
50 extends this range with codepoints in the range @code{3FFF80..3FFFFF},
51 which it uses for representing raw 8-bit bytes that cannot be
52 interpreted as characters. Thus, a character codepoint in Emacs is a
53 22-bit integer number.
54
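Reviewer note: the codespace described in the new text can be probed from Lisp. The sketch below is illustrative only; `emacs-char-range-demo' is a hypothetical name that does not appear in the changeset, and the values in the comments assume an Emacs 23 or later build.

```elisp
;; Sketch: probe the character codespace described above.
;; `emacs-char-range-demo' is a hypothetical helper, not part of Emacs.
(defun emacs-char-range-demo ()
  "Return a list of facts about the Emacs character codespace."
  (list
   ;; Largest valid character code; expected to be #x3FFFFF.
   (max-char)
   ;; That maximum fits in 22 bits.
   (< (max-char) (expt 2 22))
   ;; One past the maximum is not a valid character.
   (characterp (1+ (max-char)))
   ;; A raw 8-bit byte maps into the 3FFF80..3FFFFF range.
   (unibyte-char-to-multibyte #x80)))
```

Calling `(emacs-char-range-demo)` should return a list whose first element is the top of the extended range and whose last element is the codepoint Emacs uses for the raw byte #x80.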
55 @cindex internal representation of characters
56 @cindex characters, representation in buffers and strings
57 @cindex multibyte text
58 To conserve memory, Emacs does not hold fixed-length 22-bit numbers
59 that are codepoints of text characters within buffers and strings.
60 Rather, Emacs uses a variable-length internal representation of
61 characters, that stores each character as a sequence of 1 to 5 8-bit
62 bytes, depending on the magnitude of its codepoint@footnote{
63 This internal representation is based on one of the encodings defined
64 by the Unicode Standard, called @dfn{UTF-8}, for representing any
65 Unicode codepoint, but Emacs extends UTF-8 to represent the additional
66 codepoints it uses for raw 8-bit bytes.}.
67 For example, any @acronym{ASCII} character takes up only 1 byte, a
68 Latin-1 character takes up 2 bytes, etc. We call this representation
69 of text @dfn{multibyte}, because it uses several bytes for each
70 character.
71
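Reviewer note: the variable-length representation can be observed with `string-bytes'. This is a minimal sketch; `char-byte-length-demo' is a hypothetical helper introduced only for illustration, and characters are given as hex codepoints so the example does not depend on the encoding of the source file.

```elisp
;; Sketch: count the bytes one character occupies in a Lisp string.
;; `char-byte-length-demo' is a hypothetical helper, not part of Emacs.
(defun char-byte-length-demo (char)
  "Return the number of bytes CHAR occupies in a one-character string."
  (string-bytes (string char)))

;; ASCII `a' should take 1 byte.
(char-byte-length-demo ?a)
;; Latin-1 e-acute (U+00E9) should take 2 bytes.
(char-byte-length-demo #xe9)
;; Euro sign (U+20AC) should take 3 bytes.
(char-byte-length-demo #x20ac)
;; A character outside the Basic Multilingual Plane (U+1F600) should take 4.
(char-byte-length-demo #x1f600)
```

The byte counts follow the UTF-8 length rules for Unicode codepoints, which is what the footnote in the new text says the internal representation is based on.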
72 Outside Emacs, characters can be represented in many different
73 encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
74 between these external encodings and the internal representation, as
75 appropriate, when it reads text into a buffer or a string, or when it
76 writes text to a disk file or passes it to some other process.
77
78 Occasionally, Emacs needs to hold and manipulate encoded text or
79 binary non-text data in its buffer or string. For example, when Emacs
80 visits a file, it first reads the file's text verbatim into a buffer,
81 and only then converts it to the internal representation. Before the
82 conversion, the buffer holds encoded text.
45 83
46 @cindex unibyte text 84 @cindex unibyte text
47 In unibyte representation, each character occupies one byte and 85 Encoded text is not really text, as far as Emacs is concerned, but
48 therefore the possible character codes range from 0 to 255. Codes 0 86 rather a sequence of raw 8-bit bytes. We call buffers and strings
49 through 127 are @acronym{ASCII} characters; the codes from 128 through 255 87 that hold encoded text @dfn{unibyte} buffers and strings, because
50 are used for one non-@acronym{ASCII} character set (you can choose which 88 Emacs treats them as a sequence of individual bytes. In particular,
51 character set by setting the variable @code{nonascii-insert-offset}). 89 Emacs usually displays unibyte buffers and strings as octal codes such
52 90 as @code{\237}. We recommend that you never use unibyte buffers and
53 @cindex leading code 91 strings except for manipulating encoded text or binary non-text data.
54 @cindex multibyte text
55 @cindex trailing codes
56 In multibyte representation, a character may occupy more than one
57 byte, and as a result, the full range of Emacs character codes can be
58 stored. The first byte of a multibyte character is always in the range
59 128 through 159 (octal 0200 through 0237). These values are called
60 @dfn{leading codes}. The second and subsequent bytes of a multibyte
61 character are always in the range 160 through 255 (octal 0240 through
62 0377); these values are @dfn{trailing codes}.
63
64 Some sequences of bytes are not valid in multibyte text: for example,
65 a single isolated byte in the range 128 through 159 is not allowed. But
66 character codes 128 through 159 can appear in multibyte text,
67 represented as two-byte sequences. All the character codes 128 through
68 255 are possible (though slightly abnormal) in multibyte text; they
69 appear in multibyte buffers and strings when you do explicit encoding
70 and decoding (@pxref{Explicit Encoding}).
71 92
72 In a buffer, the buffer-local value of the variable 93 In a buffer, the buffer-local value of the variable
73 @code{enable-multibyte-characters} specifies the representation used. 94 @code{enable-multibyte-characters} specifies the representation used.
74 The representation for a string is determined and recorded in the string 95 The representation for a string is determined and recorded in the string
75 when the string is constructed. 96 when the string is constructed.
76 97
77 @defvar enable-multibyte-characters 98 @defvar enable-multibyte-characters
78 This variable specifies the current buffer's text representation. 99 This variable specifies the current buffer's text representation.
79 If it is non-@code{nil}, the buffer contains multibyte text; otherwise, 100 If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
80 it contains unibyte text. 101 it contains unibyte encoded text or binary non-text data.
81 102
82 You cannot set this variable directly; instead, use the function 103 You cannot set this variable directly; instead, use the function
83 @code{set-buffer-multibyte} to change a buffer's representation. 104 @code{set-buffer-multibyte} to change a buffer's representation.
84 @end defvar 105 @end defvar
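Reviewer note: a short sketch of how the variable and `set-buffer-multibyte' interact, run in a temporary buffer so no user buffer is affected. `buffer-representation-demo' is a hypothetical name used only here.

```elisp
;; Sketch: read `enable-multibyte-characters' while switching a
;; temporary buffer between representations.
;; `buffer-representation-demo' is a hypothetical helper.
(defun buffer-representation-demo ()
  "Return the variable's value in multibyte and then unibyte state."
  (with-temp-buffer
    ;; Force a known starting state, whatever the default is.
    (set-buffer-multibyte t)
    (let ((as-multibyte enable-multibyte-characters))
      ;; The variable is changed through the function, not with `setq'.
      (set-buffer-multibyte nil)
      (list as-multibyte enable-multibyte-characters))))
```

The returned list holds the value seen while the buffer was multibyte, followed by the value after it was made unibyte.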
85 106
94 The @samp{--unibyte} command line option does its job by setting the 115 The @samp{--unibyte} command line option does its job by setting the
95 default value to @code{nil} early in startup. 116 default value to @code{nil} early in startup.
96 @end defvar 117 @end defvar
97 118
98 @defun position-bytes position 119 @defun position-bytes position
99 Return the byte-position corresponding to buffer position 120 Buffer positions are measured in character units. This function
121 returns the byte-position corresponding to buffer position
100 @var{position} in the current buffer. This is 1 at the start of the 122 @var{position} in the current buffer. This is 1 at the start of the
101 buffer, and counts upward in bytes. If @var{position} is out of 123 buffer, and counts upward in bytes. If @var{position} is out of
102 range, the value is @code{nil}. 124 range, the value is @code{nil}.
103 @end defun 125 @end defun
104 126
105 @defun byte-to-position byte-position 127 @defun byte-to-position byte-position
106 Return the buffer position corresponding to byte-position 128 Return the buffer position, in character units, corresponding to
107 @var{byte-position} in the current buffer. If @var{byte-position} is 129 byte-position @var{byte-position} in the current buffer. If
108 out of range, the value is @code{nil}. 130 @var{byte-position} is out of range, the value is @code{nil}.
109 @end defun 131 @end defun
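Reviewer note: the difference between character positions and byte positions shows up as soon as a buffer holds a non-ASCII character. The sketch below assumes a multibyte temporary buffer holding e-acute (2 bytes) followed by `a' (1 byte); `position-bytes-demo' is a hypothetical name.

```elisp
;; Sketch: compare character positions with byte positions.
;; `position-bytes-demo' is a hypothetical helper, not part of Emacs.
(defun position-bytes-demo ()
  "Return byte positions for a two-character, three-byte buffer."
  (with-temp-buffer
    (set-buffer-multibyte t)
    ;; U+00E9 occupies 2 bytes internally, `a' occupies 1.
    (insert (string #xe9 ?a))
    (list (position-bytes 1)     ; start of buffer
          (position-bytes 2)     ; after the 2-byte character
          (position-bytes 3)     ; end of buffer
          (position-bytes 4)     ; out of range
          (byte-to-position 3)))) ; byte 3 starts the second character
```

Under these assumptions the result should be `(1 3 4 nil 2)`: positions advance by one per character, while byte positions advance by the width of each character.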
110 132
111 @defun multibyte-string-p string 133 @defun multibyte-string-p string
112 Return @code{t} if @var{string} is a multibyte string. 134 Return @code{t} if @var{string} is a multibyte string, @code{nil}
135 otherwise.
113 @end defun 136 @end defun
114 137
115 @defun string-bytes string 138 @defun string-bytes string
116 @cindex string, number of bytes 139 @cindex string, number of bytes
117 This function returns the number of bytes in @var{string}. 140 This function returns the number of bytes in @var{string}.
118 If @var{string} is a multibyte string, this can be greater than 141 If @var{string} is a multibyte string, this can be greater than
119 @code{(length @var{string})}. 142 @code{(length @var{string})}.
143 @end defun
144
145 @defun unibyte-string &rest bytes
146 This function concatenates all its argument @var{bytes} and makes the
147 result a unibyte string.
120 @end defun 148 @end defun
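Reviewer note: a minimal sketch of the newly documented `unibyte-string', combined with the predicates described just above it. `unibyte-string-demo' is a hypothetical name; the byte values are arbitrary.

```elisp
;; Sketch: build a unibyte string from raw bytes and inspect it.
;; `unibyte-string-demo' is a hypothetical helper, not part of Emacs.
(defun unibyte-string-demo ()
  "Return properties of a three-byte unibyte string."
  (let ((s (unibyte-string #x48 #x69 #xe9)))
    (list (multibyte-string-p s) ; a unibyte string is not multibyte
          (length s)             ; one element per byte
          (string-bytes s)       ; same as the length for unibyte strings
          (aref s 2))))          ; the raw byte value, not a character code
```

For a unibyte string the length and the byte count coincide, in contrast with the multibyte case where `string-bytes' can exceed `length'.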
121 149
122 @node Converting Representations 150 @node Converting Representations
123 @section Converting Text Representations 151 @section Converting Text Representations
124 152