comparison lispref/nonascii.texi @ 52978:1a5c50faf357
Replace @sc{foo} with @acronym{FOO}.
author | Eli Zaretskii <eliz@gnu.org> |
---|---|
date | Sun, 02 Nov 2003 06:29:59 +0000 |
parents | 814620b1c1af |
children | 04d2bf306bd2 |
52977:8af8c70252c1 | 52978:1a5c50faf357 |
---|---|
2 @c This is part of the GNU Emacs Lisp Reference Manual. | 2 @c This is part of the GNU Emacs Lisp Reference Manual. |
3 @c Copyright (C) 1998, 1999 Free Software Foundation, Inc. | 3 @c Copyright (C) 1998, 1999 Free Software Foundation, Inc. |
4 @c See the file elisp.texi for copying conditions. | 4 @c See the file elisp.texi for copying conditions. |
5 @setfilename ../info/characters | 5 @setfilename ../info/characters |
6 @node Non-ASCII Characters, Searching and Matching, Text, Top | 6 @node Non-ASCII Characters, Searching and Matching, Text, Top |
7 @chapter Non-@sc{ascii} Characters | 7 @chapter Non-@acronym{ASCII} Characters |
8 @cindex multibyte characters | 8 @cindex multibyte characters |
9 @cindex non-@sc{ascii} characters | 9 @cindex non-@acronym{ASCII} characters |
10 | 10 |
11 This chapter covers the special issues relating to non-@sc{ascii} | 11 This chapter covers the special issues relating to non-@acronym{ASCII} |
12 characters and how they are stored in strings and buffers. | 12 characters and how they are stored in strings and buffers. |
13 | 13 |
14 @menu | 14 @menu |
15 * Text Representations:: Unibyte and multibyte representations | 15 * Text Representations:: Unibyte and multibyte representations |
16 * Converting Representations:: Converting unibyte to multibyte and vice versa. | 16 * Converting Representations:: Converting unibyte to multibyte and vice versa. |
42 attention to the difference. | 42 attention to the difference. |
43 | 43 |
44 @cindex unibyte text | 44 @cindex unibyte text |
45 In unibyte representation, each character occupies one byte and | 45 In unibyte representation, each character occupies one byte and |
46 therefore the possible character codes range from 0 to 255. Codes 0 | 46 therefore the possible character codes range from 0 to 255. Codes 0 |
47 through 127 are @sc{ascii} characters; the codes from 128 through 255 | 47 through 127 are @acronym{ASCII} characters; the codes from 128 through 255 |
48 are used for one non-@sc{ascii} character set (you can choose which | 48 are used for one non-@acronym{ASCII} character set (you can choose which |
49 character set by setting the variable @code{nonascii-insert-offset}). | 49 character set by setting the variable @code{nonascii-insert-offset}). |
50 | 50 |
51 @cindex leading code | 51 @cindex leading code |
52 @cindex multibyte text | 52 @cindex multibyte text |
53 @cindex trailing codes | 53 @cindex trailing codes |
132 the characters that might be in the multibyte text. The other natural | 132 the characters that might be in the multibyte text. The other natural |
133 alternative, to convert the buffer contents to multibyte, is not | 133 alternative, to convert the buffer contents to multibyte, is not |
134 acceptable because the buffer's representation is a choice made by the | 134 acceptable because the buffer's representation is a choice made by the |
135 user that cannot be overridden automatically. | 135 user that cannot be overridden automatically. |
136 | 136 |
137 Converting unibyte text to multibyte text leaves @sc{ascii} characters | 137 Converting unibyte text to multibyte text leaves @acronym{ASCII} characters |
138 unchanged, and likewise character codes 128 through 159. It converts | 138 unchanged, and likewise character codes 128 through 159. It converts |
139 the non-@sc{ascii} codes 160 through 255 by adding the value | 139 the non-@acronym{ASCII} codes 160 through 255 by adding the value |
140 @code{nonascii-insert-offset} to each character code. By setting this | 140 @code{nonascii-insert-offset} to each character code. By setting this |
141 variable, you specify which character set the unibyte characters | 141 variable, you specify which character set the unibyte characters |
142 correspond to (@pxref{Character Sets}). For example, if | 142 correspond to (@pxref{Character Sets}). For example, if |
143 @code{nonascii-insert-offset} is 2048, which is @code{(- (make-char | 143 @code{nonascii-insert-offset} is 2048, which is @code{(- (make-char |
144 'latin-iso8859-1) 128)}, then the unibyte non-@sc{ascii} characters | 144 'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters |
145 correspond to Latin 1. If it is 2688, which is @code{(- (make-char | 145 correspond to Latin 1. If it is 2688, which is @code{(- (make-char |
146 'greek-iso8859-7) 128)}, then they correspond to Greek letters. | 146 'greek-iso8859-7) 128)}, then they correspond to Greek letters. |
147 | 147 |
148 Converting multibyte text to unibyte is simpler: it discards all but | 148 Converting multibyte text to unibyte is simpler: it discards all but |
149 the low 8 bits of each character code. If @code{nonascii-insert-offset} | 149 the low 8 bits of each character code. If @code{nonascii-insert-offset} |
151 set, this conversion is the inverse of the other: converting unibyte | 151 set, this conversion is the inverse of the other: converting unibyte |
152 text to multibyte and back to unibyte reproduces the original unibyte | 152 text to multibyte and back to unibyte reproduces the original unibyte |
153 text. | 153 text. |
154 | 154 |
155 @defvar nonascii-insert-offset | 155 @defvar nonascii-insert-offset |
156 This variable specifies the amount to add to a non-@sc{ascii} character | 156 This variable specifies the amount to add to a non-@acronym{ASCII} character |
157 when converting unibyte text to multibyte. It also applies when | 157 when converting unibyte text to multibyte. It also applies when |
158 @code{self-insert-command} inserts a character in the unibyte | 158 @code{self-insert-command} inserts a character in the unibyte |
159 non-@sc{ascii} range, 128 through 255. However, the functions | 159 non-@acronym{ASCII} range, 128 through 255. However, the functions |
160 @code{insert} and @code{insert-char} do not perform this conversion. | 160 @code{insert} and @code{insert-char} do not perform this conversion. |
161 | 161 |
162 The right value to use to select character set @var{cs} is @code{(- | 162 The right value to use to select character set @var{cs} is @code{(- |
163 (make-char @var{cs}) 128)}. If the value of | 163 (make-char @var{cs}) 128)}. If the value of |
164 @code{nonascii-insert-offset} is zero, then conversion actually uses the | 164 @code{nonascii-insert-offset} is zero, then conversion actually uses the |
261 0 to 255---the values that can fit in one byte. The valid character | 261 0 to 255---the values that can fit in one byte. The valid character |
262 codes for multibyte representation range from 0 to 524287, but not all | 262 codes for multibyte representation range from 0 to 524287, but not all |
263 values in that range are valid. The values 128 through 255 are not | 263 values in that range are valid. The values 128 through 255 are not |
264 entirely proper in multibyte text, but they can occur if you do explicit | 264 entirely proper in multibyte text, but they can occur if you do explicit |
265 encoding and decoding (@pxref{Explicit Encoding}). Some other character | 265 encoding and decoding (@pxref{Explicit Encoding}). Some other character |
266 codes cannot occur at all in multibyte text. Only the @sc{ascii} codes | 266 codes cannot occur at all in multibyte text. Only the @acronym{ASCII} codes |
267 0 through 127 are completely legitimate in both representations. | 267 0 through 127 are completely legitimate in both representations. |
268 | 268 |
269 @defun char-valid-p charcode &optional genericp | 269 @defun char-valid-p charcode &optional genericp |
270 This returns @code{t} if @var{charcode} is valid for either one of the two | 270 This returns @code{t} if @var{charcode} is valid for either one of the two |
271 text representations. | 271 text representations. |
299 cases, characters that would logically be grouped together are split | 299 cases, characters that would logically be grouped together are split |
300 into several character sets. For example, one set of Chinese | 300 into several character sets. For example, one set of Chinese |
301 characters, generally known as Big 5, is divided into two Emacs | 301 characters, generally known as Big 5, is divided into two Emacs |
302 character sets, @code{chinese-big5-1} and @code{chinese-big5-2}. | 302 character sets, @code{chinese-big5-1} and @code{chinese-big5-2}. |
303 | 303 |
304 @sc{ascii} characters are in character set @code{ascii}. The | 304 @acronym{ASCII} characters are in character set @code{ascii}. The |
305 non-@sc{ascii} characters 128 through 159 are in character set | 305 non-@acronym{ASCII} characters 128 through 159 are in character set |
306 @code{eight-bit-control}, and codes 160 through 255 are in character set | 306 @code{eight-bit-control}, and codes 160 through 255 are in character set |
307 @code{eight-bit-graphic}. | 307 @code{eight-bit-graphic}. |
308 | 308 |
309 @defun charsetp object | 309 @defun charsetp object |
310 Returns @code{t} if @var{object} is a symbol that names a character set, | 310 Returns @code{t} if @var{object} is a symbol that names a character set, |
334 | 334 |
335 @cindex introduction sequence | 335 @cindex introduction sequence |
336 @cindex dimension (of character set) | 336 @cindex dimension (of character set) |
337 In multibyte representation, each character occupies one or more | 337 In multibyte representation, each character occupies one or more |
338 bytes. Each character set has an @dfn{introduction sequence}, which is | 338 bytes. Each character set has an @dfn{introduction sequence}, which is |
339 normally one or two bytes long. (Exception: the @sc{ascii} character | 339 normally one or two bytes long. (Exception: the @code{ascii} character |
340 set and the @sc{eight-bit-graphic} character set have a zero-length | 340 set and the @code{eight-bit-graphic} character set have a zero-length |
341 introduction sequence.) The introduction sequence is the beginning of | 341 introduction sequence.) The introduction sequence is the beginning of |
342 the byte sequence for any character in the character set. The rest of | 342 the byte sequence for any character in the character set. The rest of |
343 the character's bytes distinguish it from the other characters in the | 343 the character's bytes distinguish it from the other characters in the |
344 same character set. Depending on the character set, there are either | 344 same character set. Depending on the character set, there are either |
345 one or two distinguishing bytes; the number of such bytes is called the | 345 one or two distinguishing bytes; the number of such bytes is called the |
424 @result{} t | 424 @result{} t |
425 (split-char 2176) | 425 (split-char 2176) |
426 @result{} (latin-iso8859-1 0) | 426 @result{} (latin-iso8859-1 0) |
427 @end example | 427 @end example |
428 | 428 |
429 The character sets @sc{ascii}, @sc{eight-bit-control}, and | 429 The character sets @code{ascii}, @code{eight-bit-control}, and |
430 @sc{eight-bit-graphic} don't have corresponding generic characters. If | 430 @code{eight-bit-graphic} don't have corresponding generic characters. If |
431 @var{charset} is one of them and you don't supply @var{code1}, | 431 @var{charset} is one of them and you don't supply @var{code1}, |
432 @code{make-char} returns the character code corresponding to the | 432 @code{make-char} returns the character code corresponding to the |
433 smallest code in @var{charset}. | 433 smallest code in @var{charset}. |
434 | 434 |
435 @node Scanning Charsets | 435 @node Scanning Charsets |
742 handle decoding the text that was scanned. They are listed in order of | 742 handle decoding the text that was scanned. They are listed in order of |
743 decreasing priority. But if @var{highest} is non-@code{nil}, then the | 743 decreasing priority. But if @var{highest} is non-@code{nil}, then the |
744 return value is just one coding system, the one that is highest in | 744 return value is just one coding system, the one that is highest in |
745 priority. | 745 priority. |
746 | 746 |
747 If the region contains only @sc{ascii} characters, the value | 747 If the region contains only @acronym{ASCII} characters, the value |
748 is @code{undecided} or @code{(undecided)}. | 748 is @code{undecided} or @code{(undecided)}. |
749 @end defun | 749 @end defun |
750 | 750 |
751 @defun detect-coding-string string highest | 751 @defun detect-coding-string string highest |
752 This function is like @code{detect-coding-region} except that it | 752 This function is like @code{detect-coding-region} except that it |
844 reading and writing particular files. Each element has the form | 844 reading and writing particular files. Each element has the form |
845 @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular | 845 @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular |
846 expression that matches certain file names. The element applies to file | 846 expression that matches certain file names. The element applies to file |
847 names that match @var{pattern}. | 847 names that match @var{pattern}. |
848 | 848 |
849 The @sc{cdr} of the element, @var{coding}, should be either a coding | 849 The @acronym{CDR} of the element, @var{coding}, should be either a coding |
850 system, a cons cell containing two coding systems, or a function name (a | 850 system, a cons cell containing two coding systems, or a function name (a |
851 symbol with a function definition). If @var{coding} is a coding system, | 851 symbol with a function definition). If @var{coding} is a coding system, |
852 that coding system is used for both reading the file and writing it. If | 852 that coding system is used for both reading the file and writing it. If |
853 @var{coding} is a cons cell containing two coding systems, its @sc{car} | 853 @var{coding} is a cons cell containing two coding systems, its @acronym{CAR} |
854 specifies the coding system for decoding, and its @sc{cdr} specifies the | 854 specifies the coding system for decoding, and its @acronym{cdr} specifies the |
855 coding system for encoding. | 855 coding system for encoding. |
856 | 856 |
857 If @var{coding} is a function name, the function must return a coding | 857 If @var{coding} is a function name, the function must return a coding |
858 system or a cons cell containing two coding systems. This value is used | 858 system or a cons cell containing two coding systems. This value is used |
859 as described above. | 859 as described above. |
973 you should not globally set it to any other value. Here is an example | 973 you should not globally set it to any other value. Here is an example |
974 of the right way to use the variable: | 974 of the right way to use the variable: |
975 | 975 |
976 @example | 976 @example |
977 ;; @r{Read the file with no character code conversion.} | 977 ;; @r{Read the file with no character code conversion.} |
978 ;; @r{Assume @sc{crlf} represents end-of-line.} | 978 ;; @r{Assume @acronym{crlf} represents end-of-line.} |
979 (let ((coding-system-for-write 'emacs-mule-dos)) | 979 (let ((coding-system-for-write 'emacs-mule-dos)) |
980 (insert-file-contents filename)) | 980 (insert-file-contents filename)) |
981 @end example | 981 @end example |
982 | 982 |
983 When its value is non-@code{nil}, @code{coding-system-for-read} takes | 983 When its value is non-@code{nil}, @code{coding-system-for-read} takes |
1173 | 1173 |
1174 @node Input Methods | 1174 @node Input Methods |
1175 @section Input Methods | 1175 @section Input Methods |
1176 @cindex input methods | 1176 @cindex input methods |
1177 | 1177 |
1178 @dfn{Input methods} provide convenient ways of entering non-@sc{ascii} | 1178 @dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII} |
1179 characters from the keyboard. Unlike coding systems, which translate | 1179 characters from the keyboard. Unlike coding systems, which translate |
1180 non-@sc{ascii} characters to and from encodings meant to be read by | 1180 non-@acronym{ASCII} characters to and from encodings meant to be read by |
1181 programs, input methods provide human-friendly commands. (@xref{Input | 1181 programs, input methods provide human-friendly commands. (@xref{Input |
1182 Methods,,, emacs, The GNU Emacs Manual}, for information on how users | 1182 Methods,,, emacs, The GNU Emacs Manual}, for information on how users |
1183 use input methods to enter text.) How to define input methods is not | 1183 use input methods to enter text.) How to define input methods is not |
1184 yet documented in this manual, but here we describe how to use them. | 1184 yet documented in this manual, but here we describe how to use them. |
1185 | 1185 |
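
Note on the conversion rule in the hunk at lines 137-153: the arithmetic it describes can be sketched outside Emacs. This is a minimal illustration in Python, not Emacs code. The function names `unibyte_to_multibyte` and `multibyte_to_unibyte` are hypothetical, and the offset 2048 is the Latin 1 value quoted in the text. The sketch only models the rule as stated there: codes 0 through 159 stay unchanged, codes 160 through 255 are shifted by the offset, and the reverse conversion keeps the low 8 bits.

```python
# Illustrative model of the conversion arithmetic described in the text.
# Assumption: the offset is the Latin 1 value given there (2048).
LATIN1_OFFSET = 2048


def unibyte_to_multibyte(code, offset=LATIN1_OFFSET):
    """Map a unibyte code (0..255) to a multibyte character code."""
    if not 0 <= code <= 255:
        raise ValueError("unibyte codes range from 0 to 255")
    if code < 160:
        # ASCII (0..127) and codes 128..159 are left unchanged.
        return code
    # Codes 160..255 are shifted by the offset.
    return code + offset


def multibyte_to_unibyte(code):
    """Discard all but the low 8 bits of a character code."""
    return code & 0xFF


if __name__ == "__main__":
    for c in (65, 130, 160, 255):
        m = unibyte_to_multibyte(c)
        print(c, "->", m, "->", multibyte_to_unibyte(m))
```

With the Latin 1 offset, code 160 becomes 2208 and converts back to 160, so the round trip reproduces the original unibyte codes, as the text says it should when the offset is set sensibly.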