comparison lispref/nonascii.texi @ 52978:1a5c50faf357

Replace @sc{foo} with @acronym{FOO}.
author Eli Zaretskii <eliz@gnu.org>
date Sun, 02 Nov 2003 06:29:59 +0000
parents 814620b1c1af
children 04d2bf306bd2
52977:8af8c70252c1 52978:1a5c50faf357
2 @c This is part of the GNU Emacs Lisp Reference Manual. 2 @c This is part of the GNU Emacs Lisp Reference Manual.
3 @c Copyright (C) 1998, 1999 Free Software Foundation, Inc. 3 @c Copyright (C) 1998, 1999 Free Software Foundation, Inc.
4 @c See the file elisp.texi for copying conditions. 4 @c See the file elisp.texi for copying conditions.
5 @setfilename ../info/characters 5 @setfilename ../info/characters
6 @node Non-ASCII Characters, Searching and Matching, Text, Top 6 @node Non-ASCII Characters, Searching and Matching, Text, Top
7 @chapter Non-@sc{ascii} Characters 7 @chapter Non-@acronym{ASCII} Characters
8 @cindex multibyte characters 8 @cindex multibyte characters
9 @cindex non-@sc{ascii} characters 9 @cindex non-@acronym{ASCII} characters
10 10
11 This chapter covers the special issues relating to non-@sc{ascii} 11 This chapter covers the special issues relating to non-@acronym{ASCII}
12 characters and how they are stored in strings and buffers. 12 characters and how they are stored in strings and buffers.
13 13
14 @menu 14 @menu
15 * Text Representations:: Unibyte and multibyte representations 15 * Text Representations:: Unibyte and multibyte representations
16 * Converting Representations:: Converting unibyte to multibyte and vice versa. 16 * Converting Representations:: Converting unibyte to multibyte and vice versa.
42 attention to the difference. 42 attention to the difference.
43 43
44 @cindex unibyte text 44 @cindex unibyte text
45 In unibyte representation, each character occupies one byte and 45 In unibyte representation, each character occupies one byte and
46 therefore the possible character codes range from 0 to 255. Codes 0 46 therefore the possible character codes range from 0 to 255. Codes 0
47 through 127 are @sc{ascii} characters; the codes from 128 through 255 47 through 127 are @acronym{ASCII} characters; the codes from 128 through 255
48 are used for one non-@sc{ascii} character set (you can choose which 48 are used for one non-@acronym{ASCII} character set (you can choose which
49 character set by setting the variable @code{nonascii-insert-offset}). 49 character set by setting the variable @code{nonascii-insert-offset}).
50 50
51 @cindex leading code 51 @cindex leading code
52 @cindex multibyte text 52 @cindex multibyte text
53 @cindex trailing codes 53 @cindex trailing codes
132 the characters that might be in the multibyte text. The other natural 132 the characters that might be in the multibyte text. The other natural
133 alternative, to convert the buffer contents to multibyte, is not 133 alternative, to convert the buffer contents to multibyte, is not
134 acceptable because the buffer's representation is a choice made by the 134 acceptable because the buffer's representation is a choice made by the
135 user that cannot be overridden automatically. 135 user that cannot be overridden automatically.
136 136
137 Converting unibyte text to multibyte text leaves @sc{ascii} characters 137 Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
138 unchanged, and likewise character codes 128 through 159. It converts 138 unchanged, and likewise character codes 128 through 159. It converts
139 the non-@sc{ascii} codes 160 through 255 by adding the value 139 the non-@acronym{ASCII} codes 160 through 255 by adding the value
140 @code{nonascii-insert-offset} to each character code. By setting this 140 @code{nonascii-insert-offset} to each character code. By setting this
141 variable, you specify which character set the unibyte characters 141 variable, you specify which character set the unibyte characters
142 correspond to (@pxref{Character Sets}). For example, if 142 correspond to (@pxref{Character Sets}). For example, if
143 @code{nonascii-insert-offset} is 2048, which is @code{(- (make-char 143 @code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
144 'latin-iso8859-1) 128)}, then the unibyte non-@sc{ascii} characters 144 'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters
145 correspond to Latin 1. If it is 2688, which is @code{(- (make-char 145 correspond to Latin 1. If it is 2688, which is @code{(- (make-char
146 'greek-iso8859-7) 128)}, then they correspond to Greek letters. 146 'greek-iso8859-7) 128)}, then they correspond to Greek letters.
147 147
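The unibyte-to-multibyte rule in the paragraph above can be modelled with plain integer arithmetic. The following is a minimal Python sketch and not Emacs code: the function name and constant names are hypothetical, and only the numbers 2048, 2688, and the 128/160 boundaries come from the text.

```python
# Illustrative model of the conversion rule described above; not Emacs source.
LATIN_1_OFFSET = 2048  # (- (make-char 'latin-iso8859-1) 128), per the text
GREEK_OFFSET = 2688    # (- (make-char 'greek-iso8859-7) 128), per the text


def unibyte_to_multibyte(code, offset):
    """Map one unibyte code (0-255) to a multibyte character code.

    Codes 0-127 (ASCII) and 128-159 stay unchanged; codes 160-255
    have `offset` added, as the paragraph states.
    """
    if not 0 <= code <= 255:
        raise ValueError("unibyte codes range from 0 to 255")
    if code < 160:
        return code
    return code + offset


latin_1_first = unibyte_to_multibyte(160, LATIN_1_OFFSET)
greek_first = unibyte_to_multibyte(160, GREEK_OFFSET)
```

Under this model, the same unibyte code 160 lands in a different character set depending only on the offset chosen.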
148 Converting multibyte text to unibyte is simpler: it discards all but 148 Converting multibyte text to unibyte is simpler: it discards all but
149 the low 8 bits of each character code. If @code{nonascii-insert-offset} 149 the low 8 bits of each character code. If @code{nonascii-insert-offset}
150 has a reasonable value, corresponding to the beginning of some character 150 has a reasonable value, corresponding to the beginning of some character
151 set, this conversion is the inverse of the other: converting unibyte 151 set, this conversion is the inverse of the other: converting unibyte
152 text to multibyte and back to unibyte reproduces the original unibyte 152 text to multibyte and back to unibyte reproduces the original unibyte
153 text. 153 text.
154 154
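The round trip described above can be checked numerically for the Latin 1 offset of 2048 quoted earlier in this diff. This is a small Python sketch under stated assumptions: both function names are hypothetical, the code models only the rules as worded here (add the offset to codes 160 through 255; keep the low 8 bits on the way back), and other offsets are not examined.

```python
# Illustrative model only; not Emacs source.
LATIN_1_OFFSET = 2048  # value quoted in the text for Latin 1


def unibyte_to_multibyte(code, offset=LATIN_1_OFFSET):
    """Add `offset` to codes 160-255; leave codes 0-159 unchanged."""
    return code + offset if 160 <= code <= 255 else code


def multibyte_to_unibyte(code):
    """Discard all but the low 8 bits of the character code."""
    return code & 0xFF


# Convert every unibyte code to multibyte and back again.
round_trip_ok = all(
    multibyte_to_unibyte(unibyte_to_multibyte(c)) == c for c in range(256)
)
```

Because 2048 is a multiple of 256, adding it never disturbs the low 8 bits, which is why masking recovers the original code in this case.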
155 @defvar nonascii-insert-offset 155 @defvar nonascii-insert-offset
156 This variable specifies the amount to add to a non-@sc{ascii} character 156 This variable specifies the amount to add to a non-@acronym{ASCII} character
157 when converting unibyte text to multibyte. It also applies when 157 when converting unibyte text to multibyte. It also applies when
158 @code{self-insert-command} inserts a character in the unibyte 158 @code{self-insert-command} inserts a character in the unibyte
159 non-@sc{ascii} range, 128 through 255. However, the functions 159 non-@acronym{ASCII} range, 128 through 255. However, the functions
160 @code{insert} and @code{insert-char} do not perform this conversion. 160 @code{insert} and @code{insert-char} do not perform this conversion.
161 161
162 The right value to use to select character set @var{cs} is @code{(- 162 The right value to use to select character set @var{cs} is @code{(-
163 (make-char @var{cs}) 128)}. If the value of 163 (make-char @var{cs}) 128)}. If the value of
164 @code{nonascii-insert-offset} is zero, then conversion actually uses the 164 @code{nonascii-insert-offset} is zero, then conversion actually uses the
261 0 to 255---the values that can fit in one byte. The valid character 261 0 to 255---the values that can fit in one byte. The valid character
262 codes for multibyte representation range from 0 to 524287, but not all 262 codes for multibyte representation range from 0 to 524287, but not all
263 values in that range are valid. The values 128 through 255 are not 263 values in that range are valid. The values 128 through 255 are not
264 entirely proper in multibyte text, but they can occur if you do explicit 264 entirely proper in multibyte text, but they can occur if you do explicit
265 encoding and decoding (@pxref{Explicit Encoding}). Some other character 265 encoding and decoding (@pxref{Explicit Encoding}). Some other character
266 codes cannot occur at all in multibyte text. Only the @sc{ascii} codes 266 codes cannot occur at all in multibyte text. Only the @acronym{ASCII} codes
267 0 through 127 are completely legitimate in both representations. 267 0 through 127 are completely legitimate in both representations.
268 268
269 @defun char-valid-p charcode &optional genericp 269 @defun char-valid-p charcode &optional genericp
270 This returns @code{t} if @var{charcode} is valid for either one of the two 270 This returns @code{t} if @var{charcode} is valid for either one of the two
271 text representations. 271 text representations.
299 cases, characters that would logically be grouped together are split 299 cases, characters that would logically be grouped together are split
300 into several character sets. For example, one set of Chinese 300 into several character sets. For example, one set of Chinese
301 characters, generally known as Big 5, is divided into two Emacs 301 characters, generally known as Big 5, is divided into two Emacs
302 character sets, @code{chinese-big5-1} and @code{chinese-big5-2}. 302 character sets, @code{chinese-big5-1} and @code{chinese-big5-2}.
303 303
304 @sc{ascii} characters are in character set @code{ascii}. The 304 @acronym{ASCII} characters are in character set @code{ascii}. The
305 non-@sc{ascii} characters 128 through 159 are in character set 305 non-@acronym{ASCII} characters 128 through 159 are in character set
306 @code{eight-bit-control}, and codes 160 through 255 are in character set 306 @code{eight-bit-control}, and codes 160 through 255 are in character set
307 @code{eight-bit-graphic}. 307 @code{eight-bit-graphic}.
308 308
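The three code ranges named in the paragraph above amount to a simple classification. A minimal Python sketch follows; the function name is hypothetical, and it covers only codes 0 through 255, since the text here says nothing about how larger codes map to character sets.

```python
def charset_of_small_code(code):
    """Return the character set name for a code in 0-255, per the text."""
    if 0 <= code <= 127:
        return "ascii"
    if 128 <= code <= 159:
        return "eight-bit-control"
    if 160 <= code <= 255:
        return "eight-bit-graphic"
    raise ValueError("this sketch only covers codes 0 through 255")
```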
309 @defun charsetp object 309 @defun charsetp object
310 Returns @code{t} if @var{object} is a symbol that names a character set, 310 Returns @code{t} if @var{object} is a symbol that names a character set,
334 334
335 @cindex introduction sequence 335 @cindex introduction sequence
336 @cindex dimension (of character set) 336 @cindex dimension (of character set)
337 In multibyte representation, each character occupies one or more 337 In multibyte representation, each character occupies one or more
338 bytes. Each character set has an @dfn{introduction sequence}, which is 338 bytes. Each character set has an @dfn{introduction sequence}, which is
339 normally one or two bytes long. (Exception: the @sc{ascii} character 339 normally one or two bytes long. (Exception: the @code{ascii} character
340 set and the @sc{eight-bit-graphic} character set have a zero-length 340 set and the @code{eight-bit-graphic} character set have a zero-length
341 introduction sequence.) The introduction sequence is the beginning of 341 introduction sequence.) The introduction sequence is the beginning of
342 the byte sequence for any character in the character set. The rest of 342 the byte sequence for any character in the character set. The rest of
343 the character's bytes distinguish it from the other characters in the 343 the character's bytes distinguish it from the other characters in the
344 same character set. Depending on the character set, there are either 344 same character set. Depending on the character set, there are either
345 one or two distinguishing bytes; the number of such bytes is called the 345 one or two distinguishing bytes; the number of such bytes is called the
424 @result{} t 424 @result{} t
425 (split-char 2176) 425 (split-char 2176)
426 @result{} (latin-iso8859-1 0) 426 @result{} (latin-iso8859-1 0)
427 @end example 427 @end example
428 428
429 The character sets @sc{ascii}, @sc{eight-bit-control}, and 429 The character sets @code{ascii}, @code{eight-bit-control}, and
430 @sc{eight-bit-graphic} don't have corresponding generic characters. If 430 @code{eight-bit-graphic} don't have corresponding generic characters. If
431 @var{charset} is one of them and you don't supply @var{code1}, 431 @var{charset} is one of them and you don't supply @var{code1},
432 @code{make-char} returns the character code corresponding to the 432 @code{make-char} returns the character code corresponding to the
433 smallest code in @var{charset}. 433 smallest code in @var{charset}.
434 434
435 @node Scanning Charsets 435 @node Scanning Charsets
742 handle decoding the text that was scanned. They are listed in order of 742 handle decoding the text that was scanned. They are listed in order of
743 decreasing priority. But if @var{highest} is non-@code{nil}, then the 743 decreasing priority. But if @var{highest} is non-@code{nil}, then the
744 return value is just one coding system, the one that is highest in 744 return value is just one coding system, the one that is highest in
745 priority. 745 priority.
746 746
747 If the region contains only @sc{ascii} characters, the value 747 If the region contains only @acronym{ASCII} characters, the value
748 is @code{undecided} or @code{(undecided)}. 748 is @code{undecided} or @code{(undecided)}.
749 @end defun 749 @end defun
750 750
751 @defun detect-coding-string string highest 751 @defun detect-coding-string string highest
752 This function is like @code{detect-coding-region} except that it 752 This function is like @code{detect-coding-region} except that it
844 reading and writing particular files. Each element has the form 844 reading and writing particular files. Each element has the form
845 @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular 845 @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
846 expression that matches certain file names. The element applies to file 846 expression that matches certain file names. The element applies to file
847 names that match @var{pattern}. 847 names that match @var{pattern}.
848 848
849 The @sc{cdr} of the element, @var{coding}, should be either a coding 849 The @acronym{CDR} of the element, @var{coding}, should be either a coding
850 system, a cons cell containing two coding systems, or a function name (a 850 system, a cons cell containing two coding systems, or a function name (a
851 symbol with a function definition). If @var{coding} is a coding system, 851 symbol with a function definition). If @var{coding} is a coding system,
852 that coding system is used for both reading the file and writing it. If 852 that coding system is used for both reading the file and writing it. If
853 @var{coding} is a cons cell containing two coding systems, its @sc{car} 853 @var{coding} is a cons cell containing two coding systems, its @acronym{CAR}
854 specifies the coding system for decoding, and its @sc{cdr} specifies the 854 specifies the coding system for decoding, and its @acronym{CDR} specifies the
855 coding system for encoding. 855 coding system for encoding.
856 856
857 If @var{coding} is a function name, the function must return a coding 857 If @var{coding} is a function name, the function must return a coding
858 system or a cons cell containing two coding systems. This value is used 858 system or a cons cell containing two coding systems. This value is used
859 as described above. 859 as described above.
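The three shapes that @var{coding} may take can be illustrated with a small lookup routine. This is a Python sketch, not the Emacs implementation. The function name and the sample entries are hypothetical, and Python regular expressions stand in for Emacs ones. Two further points are assumptions of the sketch and are not stated in the text: the first matching element wins, and the function is called with the file name.

```python
import re


def resolve_coding(filename, alist):
    """Return (decoding, encoding) for `filename`, or None if nothing matches.

    `alist` is a list of (pattern, coding) pairs, where `coding` is a
    coding system name, a (decoding, encoding) pair, or a function.
    """
    for pattern, coding in alist:
        if re.search(pattern, filename):
            if callable(coding):
                # The function must itself return a name or a pair.
                coding = coding(filename)
            if isinstance(coding, tuple):
                decoding, encoding = coding
                return decoding, encoding
            # A single coding system serves for both reading and writing.
            return coding, coding
    return None


# Sample entries, invented for illustration.
example_alist = [
    (r"\.txt$", "iso-latin-1"),
    (r"\.log$", ("undecided", "emacs-mule")),
    (r"\.dat$", lambda name: "no-conversion"),
]
```

For instance, `resolve_coding("notes.txt", example_alist)` yields the same coding system for both directions, while the `.log` entry yields different systems for decoding and encoding.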
973 you should not globally set it to any other value. Here is an example 973 you should not globally set it to any other value. Here is an example
974 of the right way to use the variable: 974 of the right way to use the variable:
975 975
976 @example 976 @example
977 ;; @r{Read the file with no character code conversion.} 977 ;; @r{Read the file with no character code conversion.}
978 ;; @r{Assume @sc{crlf} represents end-of-line.} 978 ;; @r{Assume @acronym{CRLF} represents end-of-line.}
979 (let ((coding-system-for-read 'emacs-mule-dos)) 979 (let ((coding-system-for-read 'emacs-mule-dos))
980 (insert-file-contents filename)) 980 (insert-file-contents filename))
981 @end example 981 @end example
982 982
983 When its value is non-@code{nil}, @code{coding-system-for-read} takes 983 When its value is non-@code{nil}, @code{coding-system-for-read} takes
1173 1173
1174 @node Input Methods 1174 @node Input Methods
1175 @section Input Methods 1175 @section Input Methods
1176 @cindex input methods 1176 @cindex input methods
1177 1177
1178 @dfn{Input methods} provide convenient ways of entering non-@sc{ascii} 1178 @dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII}
1179 characters from the keyboard. Unlike coding systems, which translate 1179 characters from the keyboard. Unlike coding systems, which translate
1180 non-@sc{ascii} characters to and from encodings meant to be read by 1180 non-@acronym{ASCII} characters to and from encodings meant to be read by
1181 programs, input methods provide human-friendly commands. (@xref{Input 1181 programs, input methods provide human-friendly commands. (@xref{Input
1182 Methods,,, emacs, The GNU Emacs Manual}, for information on how users 1182 Methods,,, emacs, The GNU Emacs Manual}, for information on how users
1183 use input methods to enter text.) How to define input methods is not 1183 use input methods to enter text.) How to define input methods is not
1184 yet documented in this manual, but here we describe how to use them. 1184 yet documented in this manual, but here we describe how to use them.
1185 1185