annotate doc/lispref/nonascii.texi @ 97366:d2c211c8ceda
(w32_list_system_processes, w32_system_process_attributes): Add prototypes.
(Qeuid, Qegid, Qcomm, Qstate, Qppid, Qpgrp, Qsess, Qttname)
(Qminflt, Qmajflt, Qcminflt, Qcmajflt, Qutime, Qstime, Qcutime)
(Qpri, Qnice, Qthcount, Qstart, Qvsize, Qrss, Qargs, Quser, Qgroup)
(Qetime, Qpcpu, Qpmem, Qtpgid, Qcstime): Add extern declarations.
author | Eli Zaretskii <eliz@gnu.org> |
---|---|
date | Sat, 09 Aug 2008 17:53:30 +0000 |
parents | 0fd94280462b |
children | df0ee162b492 |
rev | line source |
---|---|
84090 | 1 @c -*-texinfo-*- |
2 @c This is part of the GNU Emacs Lisp Reference Manual. | |
3 @c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004, | |
87649 | 4 @c 2005, 2006, 2007, 2008 Free Software Foundation, Inc. |
84090 | 5 @c See the file elisp.texi for copying conditions. |
84116 | 6 @setfilename ../../info/characters |
84090 | 7 @node Non-ASCII Characters, Searching and Matching, Text, Top |
8 @chapter Non-@acronym{ASCII} Characters | |
9 @cindex multibyte characters | |
10 @cindex characters, multi-byte | |
11 @cindex non-@acronym{ASCII} characters | |
12 | |
13 This chapter covers the special issues relating to non-@acronym{ASCII} | |
14 characters and how they are stored in strings and buffers. | |
15 | |
16 @menu | |
17 * Text Representations:: Unibyte and multibyte representations. | |
18 * Converting Representations:: Converting unibyte to multibyte and vice versa. | |
19 * Selecting a Representation:: Treating a byte sequence as unibyte or multi. | |
20 * Character Codes:: How unibyte and multibyte relate to | |
21 codes of individual characters. | |
22 * Character Sets:: The space of possible character codes | |
23 is divided into various character sets. | |
24 * Chars and Bytes:: More information about multibyte encodings. | |
25 * Splitting Characters:: Converting a character to its byte sequence. | |
26 * Scanning Charsets:: Which character sets are used in a buffer? | |
27 * Translation of Characters:: Translation tables are used for conversion. | |
28 * Coding Systems:: Coding systems are conversions for saving files. | |
29 * Input Methods:: Input methods allow users to enter various | |
30 non-ASCII characters without special keyboards. | |
31 * Locales:: Interacting with the POSIX locale. | |
32 @end menu | |
33 | |
34 @node Text Representations | |
35 @section Text Representations | |
36 @cindex text representations | |
37 | |
38 Emacs has two @dfn{text representations}---two ways to represent text | |
39 in a string or buffer. These are called @dfn{unibyte} and | |
40 @dfn{multibyte}. Each string, and each buffer, uses one of these two | |
41 representations. For most purposes, you can ignore the issue of | |
42 representations, because Emacs converts text between them as | |
43 appropriate. Occasionally in Lisp programming you will need to pay | |
44 attention to the difference. | |
45 | |
46 @cindex unibyte text | |
47 In unibyte representation, each character occupies one byte and | |
48 therefore the possible character codes range from 0 to 255. Codes 0 | |
49 through 127 are @acronym{ASCII} characters; the codes from 128 through 255 | |
50 are used for one non-@acronym{ASCII} character set (you can choose which | |
51 character set by setting the variable @code{nonascii-insert-offset}). | |
52 | |
53 @cindex leading code | |
54 @cindex multibyte text | |
55 @cindex trailing codes | |
56 In multibyte representation, a character may occupy more than one | |
57 byte, and as a result, the full range of Emacs character codes can be | |
58 stored. The first byte of a multibyte character is always in the range | |
59 128 through 159 (octal 0200 through 0237). These values are called | |
60 @dfn{leading codes}. The second and subsequent bytes of a multibyte | |
61 character are always in the range 160 through 255 (octal 0240 through | |
62 0377); these values are @dfn{trailing codes}. | |
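The byte ranges given above can be restated as a small classifier. This is only a sketch of the ranges as this section describes them: @code{my-classify-multibyte-byte} is a hypothetical helper, not an Emacs function, and it looks at a single byte value rather than at any real buffer or string.

```elisp
;; Hypothetical helper (not part of Emacs): classify one byte value
;; according to the ranges this section gives for multibyte text.
(defun my-classify-multibyte-byte (byte)
  "Return a symbol describing the role BYTE can play in multibyte text."
  (cond ((or (< byte 0) (> byte 255)) (error "Not a byte: %S" byte))
        ((< byte 128) 'ascii)            ; 0 through 127
        ((< byte 160) 'leading-code)     ; 128 through 159 (octal 0200-0237)
        (t 'trailing-code)))             ; 160 through 255 (octal 0240-0377)

(mapcar #'my-classify-multibyte-byte '(65 128 159 160 255))
;; => (ascii leading-code leading-code trailing-code trailing-code)
```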
63 | |
64 Some sequences of bytes are not valid in multibyte text: for example, | |
65 a single isolated byte in the range 128 through 159 is not allowed. But | |
66 character codes 128 through 159 can appear in multibyte text, | |
67 represented as two-byte sequences. All the character codes 128 through | |
68 255 are possible (though slightly abnormal) in multibyte text; they | |
69 appear in multibyte buffers and strings when you do explicit encoding | |
70 and decoding (@pxref{Explicit Encoding}). | |
71 | |
72 In a buffer, the buffer-local value of the variable | |
73 @code{enable-multibyte-characters} specifies the representation used. | |
74 The representation for a string is determined and recorded in the string | |
75 when the string is constructed. | |
76 | |
77 @defvar enable-multibyte-characters | |
78 This variable specifies the current buffer's text representation. | |
79 If it is non-@code{nil}, the buffer contains multibyte text; otherwise, | |
80 it contains unibyte text. | |
81 | |
82 You cannot set this variable directly; instead, use the function | |
83 @code{set-buffer-multibyte} to change a buffer's representation. | |
84 @end defvar | |
85 | |
86 @defvar default-enable-multibyte-characters | |
87 This variable's value is entirely equivalent to @code{(default-value | |
88 'enable-multibyte-characters)}, and setting this variable changes that | |
89 default value. Setting the local binding of | |
90 @code{enable-multibyte-characters} in a specific buffer is not allowed, | |
91 but changing the default value is supported, and it is a reasonable | |
92 thing to do, because it has no effect on existing buffers. | |
93 | |
94 The @samp{--unibyte} command line option does its job by setting the | |
95 default value to @code{nil} early in startup. | |
96 @end defvar | |
97 | |
98 @defun position-bytes position | |
99 Return the byte-position corresponding to buffer position | |
100 @var{position} in the current buffer. This is 1 at the start of the | |
101 buffer, and counts upward in bytes. If @var{position} is out of | |
102 range, the value is @code{nil}. | |
103 @end defun | |
104 | |
105 @defun byte-to-position byte-position | |
106 Return the buffer position corresponding to byte-position | |
107 @var{byte-position} in the current buffer. If @var{byte-position} is | |
108 out of range, the value is @code{nil}. | |
109 @end defun | |
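As a rough illustration of @code{position-bytes} and @code{byte-to-position}, the sketch below uses a temporary buffer that holds only @acronym{ASCII} text, so character positions and byte positions coincide. The buffer contents are arbitrary example data.

```elisp
(with-temp-buffer
  (insert "abc")                          ; three ASCII characters
  (list (position-bytes (point-min))      ; byte position of the buffer start
        ;; Converting to a byte position and back yields the same position.
        (byte-to-position (position-bytes (point-max)))
        ;; A position past the end of the buffer is out of range.
        (position-bytes (+ (point-max) 10))))
;; => (1 4 nil)
```

With non-@acronym{ASCII} text in a multibyte buffer, the byte position can be larger than the character position.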
110 | |
111 @defun multibyte-string-p string | |
112 Return @code{t} if @var{string} is a multibyte string, @code{nil} otherwise. | |
113 @end defun | |
114 | |
115 @defun string-bytes string | |
116 @cindex string, number of bytes | |
117 This function returns the number of bytes in @var{string}. | |
118 If @var{string} is a multibyte string, this can be greater than | |
119 @code{(length @var{string})}. | |
120 @end defun | |
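The following sketch combines @code{multibyte-string-p}, @code{length} and @code{string-bytes} to summarize how a string is stored. @code{my-describe-string} is a hypothetical helper introduced only for this example. It relies on the assumption that a string literal containing only @acronym{ASCII} characters is read as a unibyte string.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-describe-string (string)
  "Return a plist summarizing the text representation of STRING."
  (list :multibyte (multibyte-string-p string)
        :chars (length string)
        :bytes (string-bytes string)))

(my-describe-string "abc")
;; => (:multibyte nil :chars 3 :bytes 3)

;; `string-to-multibyte' always returns a multibyte string; for ASCII
;; text the byte count still equals the character count.
(my-describe-string (string-to-multibyte "abc"))
;; => (:multibyte t :chars 3 :bytes 3)
```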
121 | |
122 @node Converting Representations | |
123 @section Converting Text Representations | |
124 | |
125 Emacs can convert unibyte text to multibyte; it can also convert | |
126 multibyte text to unibyte, though this conversion loses information. In | |
127 general these conversions happen when inserting text into a buffer, or | |
128 when putting text from several strings together in one string. You can | |
129 also explicitly convert a string's contents to either representation. | |
130 | |
131 Emacs chooses the representation for a string based on the text that | |
132 it is constructed from. The general rule is to convert unibyte text to | |
133 multibyte text when combining it with other multibyte text, because the | |
134 multibyte representation is more general and can hold whatever | |
135 characters the unibyte text has. | |
136 | |
137 When inserting text into a buffer, Emacs converts the text to the | |
138 buffer's representation, as specified by | |
139 @code{enable-multibyte-characters} in that buffer. In particular, when | |
140 you insert multibyte text into a unibyte buffer, Emacs converts the text | |
141 to unibyte, even though this conversion cannot in general preserve all | |
142 the characters that might be in the multibyte text. The other natural | |
143 alternative, to convert the buffer contents to multibyte, is not | |
144 acceptable because the buffer's representation is a choice made by the | |
145 user that cannot be overridden automatically. | |
146 | |
147 Converting unibyte text to multibyte text leaves @acronym{ASCII} characters | |
148 unchanged, and likewise character codes 128 through 159. It converts | |
149 the non-@acronym{ASCII} codes 160 through 255 by adding the value | |
150 @code{nonascii-insert-offset} to each character code. By setting this | |
151 variable, you specify which character set the unibyte characters | |
152 correspond to (@pxref{Character Sets}). For example, if | |
153 @code{nonascii-insert-offset} is 2048, which is @code{(- (make-char | |
154 'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters | |
155 correspond to Latin 1. If it is 2688, which is @code{(- (make-char | |
156 'greek-iso8859-7) 128)}, then they correspond to Greek letters. | |
157 | |
158 Converting multibyte text to unibyte is simpler: it discards all but | |
159 the low 8 bits of each character code. If @code{nonascii-insert-offset} | |
160 has a reasonable value, corresponding to the beginning of some character | |
161 set, this conversion is the inverse of the other: converting unibyte | |
162 text to multibyte and back to unibyte reproduces the original unibyte | |
163 text. | |
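The arithmetic described in the two paragraphs above can be modeled directly on integers. The sketch below is not how Emacs implements the conversion; the names @code{my-latin-1-offset}, @code{my-unibyte-to-multibyte-code} and @code{my-multibyte-to-unibyte-code} are hypothetical, and the model is exercised only with the Latin 1 offset 2048 quoted above.

```elisp
;; Hypothetical constant: the offset this section gives for Latin 1.
(defconst my-latin-1-offset 2048)

(defun my-unibyte-to-multibyte-code (code offset)
  "Model the unibyte-to-multibyte rule for character CODE using OFFSET.
Codes 0 through 159 are left unchanged; codes 160 through 255 get
OFFSET added."
  (if (and (>= code 160) (<= code 255))
      (+ code offset)
    code))

(defun my-multibyte-to-unibyte-code (code)
  "Model the multibyte-to-unibyte rule: keep the low 8 bits of CODE."
  (logand code 255))

(my-unibyte-to-multibyte-code 200 my-latin-1-offset)   ; => 2248
(my-multibyte-to-unibyte-code 2248)                    ; => 200
```

Because 2048 is a multiple of 256, adding it never disturbs the low 8 bits, which is why the round trip reproduces the original code in this model.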
164 | |
165 @defvar nonascii-insert-offset | |
166 This variable specifies the amount to add to a non-@acronym{ASCII} character | |
167 when converting unibyte text to multibyte. It also applies when | |
168 @code{self-insert-command} inserts a character in the unibyte | |
169 non-@acronym{ASCII} range, 128 through 255. However, the functions | |
170 @code{insert} and @code{insert-char} do not perform this conversion. | |
171 | |
172 The right value to use to select character set @var{cs} is @code{(- | |
173 (make-char @var{cs}) 128)}. If the value of | |
174 @code{nonascii-insert-offset} is zero, then conversion actually uses the | |
175 value for the Latin 1 character set, rather than zero. | |
176 @end defvar | |
177 | |
178 @defvar nonascii-translation-table | |
179 This variable provides a more general alternative to | |
180 @code{nonascii-insert-offset}. You can use it to specify independently | |
181 how to translate each code in the range of 128 through 255 into a | |
182 multibyte character. The value should be a char-table, or @code{nil}. | |
183 If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}. | |
184 @end defvar | |
185 | |
186 The next three functions either return the argument @var{string}, or a | |
187 newly created string with no text properties. | |
188 | |
189 @defun string-make-unibyte string | |
190 This function converts the text of @var{string} to unibyte | |
191 representation, if it isn't already, and returns the result. If | |
192 @var{string} is a unibyte string, it is returned unchanged. Multibyte | |
193 character codes are converted to unibyte according to | |
194 @code{nonascii-translation-table} or, if that is @code{nil}, using | |
195 @code{nonascii-insert-offset}. If the lookup in the translation table | |
196 fails, this function takes just the low 8 bits of each character. | |
197 @end defun | |
198 | |
199 @defun string-make-multibyte string | |
200 This function converts the text of @var{string} to multibyte | |
201 representation, if it isn't already, and returns the result. If | |
202 @var{string} is a multibyte string or consists entirely of | |
203 @acronym{ASCII} characters, it is returned unchanged. In particular, | |
204 if @var{string} is unibyte and entirely @acronym{ASCII}, the returned | |
205 string is unibyte. (When the characters are all @acronym{ASCII}, | |
206 Emacs primitives will treat the string the same way whether it is | |
207 unibyte or multibyte.) If @var{string} is unibyte and contains | |
208 non-@acronym{ASCII} characters, the function | |
209 @code{unibyte-char-to-multibyte} is used to convert each unibyte | |
210 character to a multibyte character. | |
211 @end defun | |
212 | |
213 @defun string-to-multibyte string | |
214 This function returns a multibyte string containing the same sequence | |
215 of character codes as @var{string}. Unlike | |
216 @code{string-make-multibyte}, this function unconditionally returns a | |
217 multibyte string. If @var{string} is a multibyte string, it is | |
218 returned unchanged. | |
219 @end defun | |
220 | |
221 @defun multibyte-char-to-unibyte char | |
222 This converts the multibyte character @var{char} to a unibyte | |
223 character, based on @code{nonascii-translation-table} and | |
224 @code{nonascii-insert-offset}. | |
225 @end defun | |
226 | |
227 @defun unibyte-char-to-multibyte char | |
228 This converts the unibyte character @var{char} to a multibyte | |
229 character, based on @code{nonascii-translation-table} and | |
230 @code{nonascii-insert-offset}. | |
231 @end defun | |
232 | |
233 @node Selecting a Representation | |
234 @section Selecting a Representation | |
235 | |
236 Sometimes it is useful to examine an existing buffer or string as | |
237 multibyte when it was unibyte, or vice versa. | |
238 | |
239 @defun set-buffer-multibyte multibyte | |
240 Set the representation type of the current buffer. If @var{multibyte} | |
241 is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte} | |
242 is @code{nil}, the buffer becomes unibyte. | |
243 | |
244 This function leaves the buffer contents unchanged when viewed as a | |
245 sequence of bytes. As a consequence, it can change the contents viewed | |
246 as characters; a sequence of two bytes which is treated as one character | |
247 in multibyte representation will count as two characters in unibyte | |
248 representation. Character codes 128 through 159 are an exception. They | |
249 are represented by one byte in a unibyte buffer, but when the buffer is | |
250 set to multibyte, they are converted to two-byte sequences, and vice | |
251 versa. | |
252 | |
253 This function sets @code{enable-multibyte-characters} to record which | |
254 representation is in use. It also adjusts various data in the buffer | |
255 (including overlays, text properties and markers) so that they cover the | |
256 same text as they did before. | |
257 | |
258 You cannot use @code{set-buffer-multibyte} on an indirect buffer, | |
259 because indirect buffers always inherit the representation of the | |
260 base buffer. | |
261 @end defun | |
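A minimal sketch of @code{set-buffer-multibyte}, using a temporary buffer so that no user buffer is affected. The buffer is first forced to multibyte so the example starts from a known state; the text is @acronym{ASCII} only, so the byte sequence and the characters are the same in both representations.

```elisp
(with-temp-buffer
  (set-buffer-multibyte t)              ; start from a known state
  (insert "abc")
  (let ((before enable-multibyte-characters))
    (set-buffer-multibyte nil)          ; view the same bytes as unibyte
    (list before enable-multibyte-characters (buffer-string))))
;; => (t nil "abc")
```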
262 | |
263 @defun string-as-unibyte string | |
264 This function returns a string with the same bytes as @var{string} but | |
265 treating each byte as a character. This means that the value may have | |
266 more characters than @var{string} has. | |
267 | |
268 If @var{string} is already a unibyte string, then the value is | |
269 @var{string} itself. Otherwise it is a newly created string, with no | |
270 text properties. If @var{string} is multibyte, any characters it | |
271 contains of charset @code{eight-bit-control} or @code{eight-bit-graphic} | |
272 are converted to the corresponding single byte. | |
273 @end defun | |
274 | |
275 @defun string-as-multibyte string | |
276 This function returns a string with the same bytes as @var{string} but | |
277 treating each multibyte sequence as one character. This means that the | |
278 value may have fewer characters than @var{string} has. | |
279 | |
280 If @var{string} is already a multibyte string, then the value is | |
281 @var{string} itself. Otherwise it is a newly created string, with no | |
282 text properties. If @var{string} is unibyte and contains any individual | |
283 8-bit bytes (i.e.@: not part of a multibyte form), they are converted to | |
284 the corresponding multibyte character of charset @code{eight-bit-control} | |
285 or @code{eight-bit-graphic}. | |
286 @end defun | |
287 | |
288 @node Character Codes | |
289 @section Character Codes | |
290 @cindex character codes | |
291 | |
292 The unibyte and multibyte text representations use different character | |
293 codes. The valid character codes for unibyte representation range from | |
294 0 to 255---the values that can fit in one byte. Character codes | |
295 for multibyte representation lie between 0 and 524287, but not all | |
296 values in that range are valid. The values 128 through 255 are not | |
297 entirely proper in multibyte text, but they can occur if you do explicit | |
298 encoding and decoding (@pxref{Explicit Encoding}). Some other character | |
299 codes cannot occur at all in multibyte text. Only the @acronym{ASCII} codes | |
300 0 through 127 are completely legitimate in both representations. | |
301 | |
302 @defun char-valid-p charcode &optional genericp | |
303 This returns @code{t} if @var{charcode} is valid (either for unibyte | |
304 text or for multibyte text). | |
305 | |
306 @example | |
307 (char-valid-p 65) | |
308 @result{} t | |
309 (char-valid-p 256) | |
310 @result{} nil | |
311 (char-valid-p 2248) | |
312 @result{} t | |
313 @end example | |
314 | |
315 If the optional argument @var{genericp} is non-@code{nil}, this | |
316 function also returns @code{t} if @var{charcode} is a generic | |
317 character (@pxref{Splitting Characters}). | |
318 @end defun | |
319 | |
320 @node Character Sets | |
321 @section Character Sets | |
322 @cindex character sets | |
323 | |
324 Emacs classifies characters into various @dfn{character sets}, each of | |
325 which has a name which is a symbol. Each character belongs to one and | |
326 only one character set. | |
327 | |
328 In general, there is one character set for each distinct script. For | |
329 example, @code{latin-iso8859-1} is one character set, | |
330 @code{greek-iso8859-7} is another, and @code{ascii} is another. An | |
331 Emacs character set can hold at most 9025 characters; therefore, in some | |
332 cases, characters that would logically be grouped together are split | |
333 into several character sets. For example, one set of Chinese | |
334 characters, generally known as Big 5, is divided into two Emacs | |
335 character sets, @code{chinese-big5-1} and @code{chinese-big5-2}. | |
336 | |
337 @acronym{ASCII} characters are in character set @code{ascii}. The | |
338 non-@acronym{ASCII} characters 128 through 159 are in character set | |
339 @code{eight-bit-control}, and codes 160 through 255 are in character set | |
340 @code{eight-bit-graphic}. | |
341 | |
342 @defun charsetp object | |
343 Returns @code{t} if @var{object} is a symbol that names a character set, | |
344 @code{nil} otherwise. | |
345 @end defun | |
346 | |
347 @defvar charset-list | |
348 The value is a list of all defined character set names. | |
349 @end defvar | |
350 | |
351 @defun charset-list | |
352 This function returns the value of @code{charset-list}. It is only | |
353 provided for backward compatibility. | |
354 @end defun | |
355 | |
356 @defun char-charset character | |
357 This function returns the name of the character set that @var{character} | |
358 belongs to, or the symbol @code{unknown} if @var{character} is not a | |
359 valid character. | |
360 @end defun | |
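As a short illustration of @code{charsetp} and @code{char-charset}, the sketch below checks a known charset name, a made-up symbol, and an @acronym{ASCII} character. The symbol @code{my-no-such-charset} is hypothetical and is assumed not to name any character set.

```elisp
(list (charsetp 'ascii)                 ; names a character set
      (charsetp 'my-no-such-charset)    ; hypothetical, undefined name
      (char-charset ?A))                ; ASCII characters are in `ascii'
;; => (t nil ascii)
```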
361 | |
362 @defun charset-plist charset | |
363 This function returns the charset property list of the character set | |
364 @var{charset}. Although @var{charset} is a symbol, this is not the same | |
365 as the property list of that symbol. Charset properties are used for | |
366 special purposes within Emacs. | |
367 @end defun | |
368 | |
369 @deffn Command list-charset-chars charset | |
370 This command displays a list of characters in the character set | |
371 @var{charset}. | |
372 @end deffn | |
373 | |
374 @node Chars and Bytes | |
375 @section Characters and Bytes | |
376 @cindex bytes and characters | |
377 | |
378 @cindex introduction sequence (of character) | |
379 @cindex dimension (of character set) | |
380 In multibyte representation, each character occupies one or more | |
381 bytes. Each character set has an @dfn{introduction sequence}, which is | |
382 normally one or two bytes long. (Exception: the @code{ascii} character | |
383 set and the @code{eight-bit-graphic} character set have a zero-length | |
384 introduction sequence.) The introduction sequence is the beginning of | |
385 the byte sequence for any character in the character set. The rest of | |
386 the character's bytes distinguish it from the other characters in the | |
387 same character set. Depending on the character set, there are either | |
388 one or two distinguishing bytes; the number of such bytes is called the | |
389 @dfn{dimension} of the character set. | |
390 | |
391 @defun charset-dimension charset | |
392 This function returns the dimension of @var{charset}; at present, the | |
393 dimension is always 1 or 2. | |
394 @end defun | |
395 | |
396 @defun charset-bytes charset | |
397 This function returns the number of bytes used to represent a character | |
398 in character set @var{charset}. | |
399 @end defun | |
400 | |
401 This is the simplest way to determine the byte length of a character | |
402 set's introduction sequence: | |
403 | |
404 @example | |
405 (- (charset-bytes @var{charset}) | |
406 (charset-dimension @var{charset})) | |
407 @end example | |
408 | |
409 @node Splitting Characters | |
410 @section Splitting Characters | |
411 @cindex character as bytes | |
412 | |
413 The functions in this section convert between characters and the byte | |
414 values used to represent them. For most purposes, there is no need to | |
415 be concerned with the sequence of bytes used to represent a character, | |
416 because Emacs translates automatically when necessary. | |
417 | |
418 @defun split-char character | |
419 Return a list containing the name of the character set of | |
420 @var{character}, followed by one or two byte values (integers) which | |
421 identify @var{character} within that character set. The number of byte | |
422 values is the character set's dimension. | |
423 | |
424 If @var{character} is invalid as a character code, @code{split-char} | |
425 returns a list consisting of the symbol @code{unknown} and @var{character}. | |
426 | |
427 @example | |
428 (split-char 2248) | |
429 @result{} (latin-iso8859-1 72) | |
430 (split-char 65) | |
431 @result{} (ascii 65) | |
432 (split-char 128) | |
433 @result{} (eight-bit-control 128) | |
434 @end example | |
435 @end defun | |
436 | |
437 @cindex generate characters in charsets | |
438 @defun make-char charset &optional code1 code2 | |
439 This function returns the character in character set @var{charset} whose | |
440 position codes are @var{code1} and @var{code2}. This is roughly the | |
441 inverse of @code{split-char}. Normally, you should specify either one | |
442 or both of @var{code1} and @var{code2} according to the dimension of | |
443 @var{charset}. For example, | |
444 | |
445 @example | |
446 (make-char 'latin-iso8859-1 72) | |
447 @result{} 2248 | |
448 @end example | |
449 | |
450 Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed | |
451 before they are used to index @var{charset}. Thus you may use, for | |
452 instance, an ISO 8859 character code rather than subtracting 128, as | |
453 is necessary to index the corresponding Emacs charset. | |
454 @end defun | |
455 | |
456 @cindex generic characters | |
457 If you call @code{make-char} without @var{code1} and @var{code2}, the result is | |
458 a @dfn{generic character} which stands for @var{charset}. A generic | |
459 character is an integer, but it is @emph{not} valid for insertion in the | |
460 buffer as a character. It can be used in @code{char-table-range} to | |
461 refer to the whole character set (@pxref{Char-Tables}). | |
462 @code{char-valid-p} returns @code{nil} for generic characters. | |
463 For example: | |
464 | |
465 @example | |
466 (make-char 'latin-iso8859-1) | |
467 @result{} 2176 | |
468 (char-valid-p 2176) | |
469 @result{} nil | |
470 (char-valid-p 2176 t) | |
471 @result{} t | |
472 (split-char 2176) | |
473 @result{} (latin-iso8859-1 0) | |
474 @end example | |
475 | |
476 The character sets @code{ascii}, @code{eight-bit-control}, and | |
477 @code{eight-bit-graphic} don't have corresponding generic characters. If | |
478 @var{charset} is one of them and you don't supply @var{code1}, | |
479 @code{make-char} returns the character code corresponding to the | |
480 smallest code in @var{charset}. | |
481 | |
482 @node Scanning Charsets | |
483 @section Scanning for Character Sets | |
484 | |
485 Sometimes it is useful to find out which character sets appear in a | |
486 part of a buffer or a string. One use for this is in determining which | |
487 coding systems (@pxref{Coding Systems}) are capable of representing all | |
488 of the text in question. | |
489 | |
490 @defun charset-after &optional pos | |
491 This function returns the charset of the character in the current buffer | |
492 at position @var{pos}. If @var{pos} is omitted or @code{nil}, it | |
493 defaults to the current value of point. If @var{pos} is out of range, | |
494 the value is @code{nil}. | |
495 @end defun | |
496 | |
497 @defun find-charset-region beg end &optional translation | |
498 This function returns a list of the character sets that appear in the | |
499 current buffer between positions @var{beg} and @var{end}. | |
500 | |
501 The optional argument @var{translation} specifies a translation table to | |
502 be used in scanning the text (@pxref{Translation of Characters}). If it | |
503 is non-@code{nil}, then each character in the region is translated | |
504 through this table, and the value returned describes the translated | |
505 characters instead of the characters actually in the buffer. | |
506 @end defun | |
507 | |
508 @defun find-charset-string string &optional translation | |
509 This function returns a list of the character sets that appear in the | |
510 string @var{string}. It is just like @code{find-charset-region}, except | |
511 that it applies to the contents of @var{string} instead of part of the | |
512 current buffer. | |
513 @end defun | |
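One possible use of @code{find-charset-string} is to test whether a string needs anything beyond @acronym{ASCII}. The predicate @code{my-ascii-only-p} below is a hypothetical helper, sketched under the assumption that a string with no non-@acronym{ASCII} characters yields either an empty list or the list @code{(ascii)}.

```elisp
;; Hypothetical helper (not part of Emacs).
(defun my-ascii-only-p (string)
  "Return non-nil if STRING contains no non-ASCII character sets."
  (let ((charsets (find-charset-string string)))
    (or (null charsets)
        (equal charsets '(ascii)))))

(find-charset-string "abc")   ; => (ascii)
(my-ascii-only-p "abc")       ; => t
```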
514 | |
515 @node Translation of Characters | |
516 @section Translation of Characters | |
517 @cindex character translation tables | |
518 @cindex translation tables | |
519 | |
520 A @dfn{translation table} is a char-table that specifies a mapping | |
521 of characters into characters. These tables are used in encoding and | |
522 decoding, and for other purposes. Some coding systems specify their | |
523 own particular translation tables; there are also default translation | |
524 tables which apply to all other coding systems. | |
525 | |
526 For instance, the coding-system @code{utf-8} has a translation table | |
527 that maps characters of various charsets (e.g., | |
528 @code{latin-iso8859-@var{x}}) into Unicode character sets. This way, | |
529 it can encode Latin-2 characters into UTF-8. Meanwhile, | |
530 @code{unify-8859-on-decoding-mode} operates by specifying | |
531 @code{standard-translation-table-for-decode} to translate | |
532 Latin-@var{x} characters into corresponding Unicode characters. | |
533 | |
534 @defun make-translation-table &rest translations | |
535 This function returns a translation table based on the argument | |
536 @var{translations}. Each element of @var{translations} should be a | |
537 list of elements of the form @code{(@var{from} . @var{to})}; this says | |
538 to translate the character @var{from} into @var{to}. | |
539 | |
540 The arguments and the forms in each argument are processed in order, | |
541 and if a previous form already translates @var{to} to some other | |
542 character, say @var{to-alt}, @var{from} is also translated to | |
543 @var{to-alt}. | |
544 | |
545 You can also map one whole character set into another character set with | |
546 the same dimension. To do this, you specify a generic character (which | |
547 designates a character set) for @var{from} (@pxref{Splitting Characters}). | |
548 In this case, if @var{to} is also a generic character, its character | |
549 set should have the same dimension as @var{from}'s. Then the | |
550 translation table translates each character of @var{from}'s character | |
551 set into the corresponding character of @var{to}'s character set. If | |
552 @var{from} is a generic character and @var{to} is an ordinary | |
553 character, then the translation table translates every character of | |
554 @var{from}'s character set into @var{to}. | |
555 @end defun | |
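The ordering rule for @code{make-translation-table} can be seen in a small sketch. The character pairs are arbitrary example data. Because the first argument already maps @samp{b} to @samp{c}, the later request to map @samp{a} to @samp{b} ends up mapping @samp{a} to @samp{c} as well.

```elisp
(let ((table (make-translation-table
              '((?b . ?c))              ; processed first: b => c
              '((?a . ?b)))))           ; a => b, and b already => c
  ;; A translation table is a char-table, so `aref' looks up a mapping.
  (list (aref table ?b) (aref table ?a)))
;; => (99 99)   ; 99 is the character code of `c'
```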
556 | |
557 In decoding, the translation table's translations are applied to the | |
558 characters that result from ordinary decoding. If a coding system has | |
559 property @code{translation-table-for-decode}, that specifies the | |
560 translation table to use. (This is a property of the coding system, | |
561 as returned by @code{coding-system-get}, not a property of the symbol | |
562 that is the coding system's name. @xref{Coding System Basics,, Basic | |
563 Concepts of Coding Systems}.) Otherwise, if | |
564 @code{standard-translation-table-for-decode} is non-@code{nil}, | |
565 decoding uses that table. | |
566 | |
567 In encoding, the translation table's translations are applied to the | |
568 characters in the buffer, and the result of translation is actually | |
569 encoded. If a coding system has property | |
570 @code{translation-table-for-encode}, that specifies the translation | |
571 table to use. Otherwise the variable | |
572 @code{standard-translation-table-for-encode} specifies the translation | |
573 table. | |
574 | |
575 @defvar standard-translation-table-for-decode | |
576 This is the default translation table for decoding, for | |
577 coding systems that don't specify any other translation table. | |
578 @end defvar | |
579 | |
580 @defvar standard-translation-table-for-encode | |
581 This is the default translation table for encoding, for | |
582 coding systems that don't specify any other translation table. | |
583 @end defvar | |
584 | |
585 @node Coding Systems | |
586 @section Coding Systems | |
587 | |
588 @cindex coding system | |
589 When Emacs reads or writes a file, and when Emacs sends text to a | |
590 subprocess or receives text from a subprocess, it normally performs | |
591 character code conversion and end-of-line conversion as specified | |
592 by a particular @dfn{coding system}. | |
593 | |
594 How to define a coding system is an arcane matter, and is not | |
595 documented here. | |
596 | |
597 @menu | |
598 * Coding System Basics:: Basic concepts. | |
599 * Encoding and I/O:: How file I/O functions handle coding systems. | |
600 * Lisp and Coding Systems:: Functions to operate on coding system names. | |
601 * User-Chosen Coding Systems:: Asking the user to choose a coding system. | |
602 * Default Coding Systems:: Controlling the default choices. | |
603 * Specifying Coding Systems:: Requesting a particular coding system | |
604 for a single file operation. | |
605 * Explicit Encoding:: Encoding or decoding text without doing I/O. | |
606 * Terminal I/O Encoding:: Use of encoding for terminal I/O. | |
607 * MS-DOS File Types:: How DOS "text" and "binary" files | |
608 relate to coding systems. | |
609 @end menu | |
610 | |
611 @node Coding System Basics | |
612 @subsection Basic Concepts of Coding Systems | |
613 | |
614 @cindex character code conversion | |
615 @dfn{Character code conversion} involves conversion between the encoding | |
616 used inside Emacs and some other encoding. Emacs supports many | |
617 different encodings, in that it can convert to and from them. For | |
618 example, it can convert text to or from encodings such as Latin 1, Latin | |
619 2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022. In some | |
620 cases, Emacs supports several alternative encodings for the same | |
621 characters; for example, there are three coding systems for the Cyrillic | |
622 (Russian) alphabet: ISO, Alternativnyj, and KOI8. | |
623 | |
624 Most coding systems specify a particular character code for | |
625 conversion, but some of them leave the choice unspecified---to be chosen | |
626 heuristically for each file, based on the data. | |
627 | |
628 In general, a coding system doesn't guarantee roundtrip identity: | |
629 decoding a byte sequence using a coding system, then encoding the | |
630 resulting text in the same coding system, can produce a different byte | |
631 sequence. However, the following coding systems do guarantee that the | |
632 byte sequence will be the same as what you originally decoded: | |
633 | |
634 @quotation | |
635 chinese-big5 chinese-iso-8bit cyrillic-iso-8bit emacs-mule | |
636 greek-iso-8bit hebrew-iso-8bit iso-latin-1 iso-latin-2 iso-latin-3 | |
637 iso-latin-4 iso-latin-5 iso-latin-8 iso-latin-9 iso-safe | |
638 japanese-iso-8bit japanese-shift-jis korean-iso-8bit raw-text | |
639 @end quotation | |
640 | |
641 Encoding buffer text and then decoding the result can also fail to | |
642 reproduce the original text. For instance, if you encode Latin-2 | |
643 characters with @code{utf-8} and decode the result using the same | |
644 coding system, you'll get Unicode characters (of charset | |
645 @code{mule-unicode-0100-24ff}). If you encode Unicode characters with | |
646 @code{iso-latin-2} and decode the result with the same coding system, | |
647 you'll get Latin-2 characters. | |
648 | |
649 @cindex EOL conversion | |
650 @cindex end-of-line conversion | |
651 @cindex line end conversion | |
652 @dfn{End of line conversion} handles three different conventions used | |
653 on various systems for representing end of line in files. The Unix | |
654 convention is to use the linefeed character (also called newline). The | |
655 DOS convention is to use a carriage-return and a linefeed at the end of | |
656 a line. The Mac convention is to use just carriage-return. | |
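
As a small illustration of these conventions, the following sketch
uses the explicit conversion functions (@pxref{Explicit Encoding}) with
variant coding systems.  It assumes that @code{inhibit-eol-conversion}
is @code{nil}; the sample strings are made up for this example.

@example
;; @r{Decoding with a @samp{-dos} variant turns CR LF into newline.}
(decode-coding-string "one\r\ntwo" 'latin-1-dos)
     @result{} "one\ntwo"
;; @r{Encoding with a @samp{-mac} variant turns newline into CR.}
(encode-coding-string "one\ntwo" 'latin-1-mac)
     @result{} "one\rtwo"
@end example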
657 | |
658 @cindex base coding system | |
659 @cindex variant coding system | |
660 @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line | |
661 conversion unspecified, to be chosen based on the data. @dfn{Variant | |
662 coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and | |
663 @code{latin-1-mac} specify the end-of-line conversion explicitly as | |
664 well. Most base coding systems have three corresponding variants whose | |
665 names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}. | |
666 | |
667 The coding system @code{raw-text} is special in that it prevents | |
668 character code conversion, and causes the buffer visited with that | |
669 coding system to be a unibyte buffer. It does not specify the | |
670 end-of-line conversion, allowing that to be determined as usual by the | |
671 data, and has the usual three variants which specify the end-of-line | |
672 conversion. @code{no-conversion} is equivalent to @code{raw-text-unix}: | |
673 it specifies no conversion of either character codes or end-of-line. | |
674 | |
675 The coding system @code{emacs-mule} specifies that the data is | |
676 represented in the internal Emacs encoding. This is like | |
677 @code{raw-text} in that no code conversion happens, but different in | |
678 that the result is multibyte data. | |
679 | |
680 @defun coding-system-get coding-system property | |
681 This function returns the specified property of the coding system | |
682 @var{coding-system}. Most coding system properties exist for internal | |
683 purposes, but one that you might find useful is @code{mime-charset}. | |
684 That property's value is the name used in MIME for the character coding | |
685 which this coding system can read and write. Examples: | |
686 | |
687 @example | |
688 (coding-system-get 'iso-latin-1 'mime-charset) | |
689 @result{} iso-8859-1 | |
690 (coding-system-get 'iso-2022-cn 'mime-charset) | |
691 @result{} iso-2022-cn | |
692 (coding-system-get 'cyrillic-koi8 'mime-charset) | |
693 @result{} koi8-r | |
694 @end example | |
695 | |
696 The value of the @code{mime-charset} property is also defined | |
697 as an alias for the coding system. | |
698 @end defun | |
699 | |
700 @node Encoding and I/O | |
701 @subsection Encoding and I/O | |
702 | |
703 The principal purpose of coding systems is for use in reading and | |
704 writing files. The function @code{insert-file-contents} uses | |
705 a coding system for decoding the file data, and @code{write-region} | |
706 uses one to encode the buffer contents. | |
707 | |
708 You can specify the coding system to use either explicitly | |
709 (@pxref{Specifying Coding Systems}), or implicitly using a default | |
710 mechanism (@pxref{Default Coding Systems}). But these methods may not | |
711 completely specify what to do. For example, they may choose a coding | |
712 system such as @code{undecided} which leaves the character code | |
713 conversion to be determined from the data. In these cases, the I/O | |
714 operation finishes the job of choosing a coding system. Very often | |
715 you will want to find out afterwards which coding system was chosen. | |
716 | |
717 @defvar buffer-file-coding-system | |
87276
c9e81d5cb2e7
(Encoding and I/O): Reword to avoid saying
Martin Rudalics <rudalics@gmx.at>
parents:
84116
diff
changeset
|
718 This buffer-local variable records the coding system used for saving the | |
719 buffer and for writing part of the buffer with @code{write-region}. If | |
720 the text to be written cannot be safely encoded using the coding system | |
721 specified by this variable, these operations select an alternative | |
722 encoding by calling the function @code{select-safe-coding-system} | |
723 (@pxref{User-Chosen Coding Systems}). If selecting a different encoding | |
724 requires asking the user to specify a coding system, | |
725 @code{buffer-file-coding-system} is updated to the newly selected coding | |
726 system. | |
727 | |
728 @code{buffer-file-coding-system} does @emph{not} affect sending text | |
729 to a subprocess. | |
730 @end defvar | |
731 | |
732 @defvar save-buffer-coding-system | |
733 This variable specifies the coding system for saving the buffer (by | |
734 overriding @code{buffer-file-coding-system}). Note that it is not used | |
735 for @code{write-region}. | |
736 | |
737 When a command to save the buffer starts out to use | |
738 @code{buffer-file-coding-system} (or @code{save-buffer-coding-system}), | |
739 and that coding system cannot handle | |
740 the actual text in the buffer, the command asks the user to choose | |
741 another coding system (by calling @code{select-safe-coding-system}). | |
742 After that happens, the command also updates | |
743 @code{buffer-file-coding-system} to represent the coding system that | |
744 the user specified. | |
745 @end defvar | |
746 | |
747 @defvar last-coding-system-used | |
748 I/O operations for files and subprocesses set this variable to the | |
749 coding system name that was used. The explicit encoding and decoding | |
750 functions (@pxref{Explicit Encoding}) set it too. | |
751 | |
752 @strong{Warning:} Since receiving subprocess output sets this variable, | |
753 it can change whenever Emacs waits; therefore, you should copy the | |
754 value shortly after the function call that stores the value you are | |
755 interested in. | |
756 @end defvar | |
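
The warning above suggests copying the value immediately after the I/O
call.  Here is a minimal sketch of that pattern; the function name
@code{my-file-coding-system} is hypothetical and not part of Emacs.

@example
(defun my-file-coding-system (file)
  "Return the coding system Emacs used to decode FILE."
  (with-temp-buffer
    (insert-file-contents file)
    ;; @r{Return the value right away, before Emacs has a chance}
    ;; @r{to wait for subprocess output and overwrite it.}
    last-coding-system-used))
@end example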
757 | |
758 The variable @code{selection-coding-system} specifies how to encode | |
759 selections for the window system. @xref{Window System Selections}. | |
760 | |
761 @defvar file-name-coding-system | |
762 The variable @code{file-name-coding-system} specifies the coding | |
763 system to use for encoding file names. Emacs encodes file names using | |
764 that coding system for all file operations. If | |
765 @code{file-name-coding-system} is @code{nil}, Emacs uses a default | |
766 coding system determined by the selected language environment. In the | |
767 default language environment, any non-@acronym{ASCII} characters in | |
768 file names are not encoded specially; they appear in the file system | |
769 using the internal Emacs representation. | |
770 @end defvar | |
771 | |
772 @strong{Warning:} if you change @code{file-name-coding-system} (or | |
773 the language environment) in the middle of an Emacs session, problems | |
774 can result if you have already visited files whose names were encoded | |
775 using the earlier coding system and are handled differently under the | |
776 new coding system. If you try to save one of these buffers under the | |
777 visited file name, saving may use the wrong file name, or it may get | |
778 an error. If such a problem happens, use @kbd{C-x C-w} to specify a | |
779 new file name for that buffer. | |
780 | |
781 @node Lisp and Coding Systems | |
782 @subsection Coding Systems in Lisp | |
783 | |
784 Here are the Lisp facilities for working with coding systems: | |
785 | |
786 @defun coding-system-list &optional base-only | |
787 This function returns a list of all coding system names (symbols). If | |
788 @var{base-only} is non-@code{nil}, the value includes only the | |
789 base coding systems. Otherwise, it includes alias and variant coding | |
790 systems as well. | |
791 @end defun | |
792 | |
793 @defun coding-system-p object | |
794 This function returns @code{t} if @var{object} is either a coding | |
795 system name or @code{nil}. | |
796 @end defun | |
797 | |
798 @defun check-coding-system coding-system | |
799 This function checks the validity of @var{coding-system}. | |
800 If that is valid, it returns @var{coding-system}. | |
801 Otherwise it signals an error with condition @code{coding-system-error}. | |
802 @end defun | |
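
For instance, a program might validate a coding system name before
using it.  In this sketch the symbol @code{no-such-coding-system} is a
made-up name that is assumed not to be defined:

@example
(coding-system-p 'utf-8)
     @result{} t
(coding-system-p 'no-such-coding-system)
     @result{} nil
;; @r{Catch the error that @code{check-coding-system} signals.}
(condition-case nil
    (check-coding-system 'no-such-coding-system)
  (coding-system-error 'invalid))
     @result{} invalid
@end example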
803 | |
804 @defun coding-system-eol-type coding-system | |
805 This function returns the type of end-of-line (a.k.a.@: @dfn{eol}) | |
806 conversion used by @var{coding-system}. If @var{coding-system} | |
807 specifies a certain eol conversion, the return value is an integer 0, | |
808 1, or 2, standing for @code{unix}, @code{dos}, and @code{mac}, | |
809 respectively. If @var{coding-system} doesn't specify eol conversion | |
810 explicitly, the return value is a vector of coding systems, each one | |
811 with one of the possible eol conversion types, like this: | |
812 | |
813 @lisp | |
814 (coding-system-eol-type 'latin-1) | |
815 @result{} [latin-1-unix latin-1-dos latin-1-mac] | |
816 @end lisp | |
817 | |
818 @noindent | |
819 If this function returns a vector, Emacs will decide, as part of the | |
820 text encoding or decoding process, what eol conversion to use. For | |
821 decoding, the end-of-line format of the text is auto-detected, and the | |
822 eol conversion is set to match it (e.g., DOS-style CRLF format will | |
823 imply @code{dos} eol conversion). For encoding, the eol conversion is | |
824 taken from the appropriate default coding system (e.g., | |
825 @code{default-buffer-file-coding-system} for | |
826 @code{buffer-file-coding-system}), or from the default eol conversion | |
827 appropriate for the underlying platform. | |
828 @end defun | |
829 | |
830 @defun coding-system-change-eol-conversion coding-system eol-type | |
831 This function returns a coding system which is like @var{coding-system} | |
832 except for its eol conversion, which is specified by @var{eol-type}. | |
833 @var{eol-type} should be @code{unix}, @code{dos}, @code{mac}, or | |
834 @code{nil}. If it is @code{nil}, the returned coding system determines | |
835 the end-of-line conversion from the data. | |
836 | |
837 @var{eol-type} may also be 0, 1 or 2, standing for @code{unix}, | |
838 @code{dos} and @code{mac}, respectively. | |
839 @end defun | |
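
As an illustration, assuming the usual @code{utf-8} coding system and
its three variants are defined, calls like the following should
produce the corresponding variants:

@example
(coding-system-change-eol-conversion 'utf-8 'dos)
     @result{} utf-8-dos
(coding-system-change-eol-conversion 'utf-8-dos 'unix)
     @result{} utf-8-unix
(coding-system-eol-type 'utf-8-dos)
     @result{} 1
@end example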
840 | |
841 @defun coding-system-change-text-conversion eol-coding text-coding | |
842 This function returns a coding system which uses the end-of-line | |
843 conversion of @var{eol-coding}, and the text conversion of | |
844 @var{text-coding}. If @var{text-coding} is @code{nil}, it returns | |
845 @code{undecided}, or one of its variants according to @var{eol-coding}. | |
846 @end defun | |
847 | |
848 @defun find-coding-systems-region from to | |
849 This function returns a list of coding systems that could be used to | |
850 encode the text between @var{from} and @var{to}. All coding systems in | |
851 the list can safely encode any multibyte characters in that portion of | |
852 the text. | |
853 | |
854 If the text contains no multibyte characters, the function returns the | |
855 list @code{(undecided)}. | |
856 @end defun | |
857 | |
858 @defun find-coding-systems-string string | |
859 This function returns a list of coding systems that could be used to | |
860 encode the text of @var{string}. All coding systems in the list can | |
861 safely encode any multibyte characters in @var{string}. If the text | |
862 contains no multibyte characters, this returns the list | |
863 @code{(undecided)}. | |
864 @end defun | |
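
One possible use is to test whether a given coding system can encode a
string.  The helper below is only a sketch: the name
@code{my-encodable-p} is hypothetical, and it assumes the returned list
holds base coding systems, so it compares against the result of
@code{coding-system-base}.

@example
(find-coding-systems-string "plain text")
     @result{} (undecided)

(defun my-encodable-p (string coding-system)
  "Return non-nil if CODING-SYSTEM can safely encode STRING."
  (let ((safe (find-coding-systems-string string)))
    ;; @r{The list @code{(undecided)} means any coding system will do.}
    (or (equal safe '(undecided))
        (memq (coding-system-base coding-system) safe))))
@end example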
865 | |
866 @defun find-coding-systems-for-charsets charsets | |
867 This function returns a list of coding systems that could be used to | |
868 encode all the character sets in the list @var{charsets}. | |
869 @end defun | |
870 | |
871 @defun detect-coding-region start end &optional highest | |
872 This function chooses a plausible coding system for decoding the text | |
873 from @var{start} to @var{end}. This text should be a byte sequence | |
874 (@pxref{Explicit Encoding}). | |
875 | |
876 Normally this function returns a list of coding systems that could | |
877 handle decoding the text that was scanned. They are listed in order of | |
878 decreasing priority. But if @var{highest} is non-@code{nil}, then the | |
879 return value is just one coding system, the one that is highest in | |
880 priority. | |
881 | |
882 If the region contains only @acronym{ASCII} characters, except for | |
883 ISO-2022 control characters such as @code{ESC}, the value is | |
884 @code{undecided} or @code{(undecided)}, or a variant specifying | |
885 end-of-line conversion, if that can be deduced from the text. | |
886 @end defun | |
887 | |
888 @defun detect-coding-string string &optional highest | |
889 This function is like @code{detect-coding-region} except that it | |
890 operates on the contents of @var{string} instead of bytes in the buffer. | |
891 @end defun | |
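
Here is a brief sketch of both calling styles.  The byte string in the
second call is an arbitrary example; its result depends on the current
coding system priorities, so no value is shown for it.

@example
;; @r{Pure @acronym{ASCII} text with no line ends to inspect.}
(detect-coding-string "hello" t)
     @result{} undecided
;; @r{Without @var{highest}, the value is a list of candidates.}
(detect-coding-string "caf\303\251")
@end example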
892 | |
893 @xref{Coding systems for a subprocess,, Process Information}, in | |
894 particular the description of the functions | |
895 @code{process-coding-system} and @code{set-process-coding-system}, for | |
896 how to examine or set the coding systems used for I/O to a subprocess. | |
897 | |
898 @node User-Chosen Coding Systems | |
899 @subsection User-Chosen Coding Systems | |
900 | |
901 @cindex select safe coding system | |
902 @defun select-safe-coding-system from to &optional default-coding-system accept-default-p file | |
903 This function selects a coding system for encoding specified text, | |
904 asking the user to choose if necessary. Normally the specified text | |
905 is the text in the current buffer between @var{from} and @var{to}. If | |
906 @var{from} is a string, the string specifies the text to encode, and | |
907 @var{to} is ignored. | |
908 | |
909 If @var{default-coding-system} is non-@code{nil}, that is the first | |
910 coding system to try; if that can handle the text, | |
911 @code{select-safe-coding-system} returns that coding system. It can | |
912 also be a list of coding systems; then the function tries each of them | |
913 one by one. After trying all of them, it next tries the current | |
914 buffer's value of @code{buffer-file-coding-system} (if it is not | |
915 @code{undecided}), then the value of | |
916 @code{default-buffer-file-coding-system} and finally the user's most | |
917 preferred coding system, which the user can set using the command | |
918 @code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing | |
919 Coding Systems, emacs, The GNU Emacs Manual}). | |
920 | |
921 If one of those coding systems can safely encode all the specified | |
922 text, @code{select-safe-coding-system} chooses it and returns it. | |
923 Otherwise, it asks the user to choose from a list of coding systems | |
924 which can encode all the text, and returns the user's choice. | |
925 | |
926 @var{default-coding-system} can also be a list whose first element is | |
927 @code{t} and whose other elements are coding systems. Then, if no coding | |
928 system in the list can handle the text, @code{select-safe-coding-system} | |
929 queries the user immediately, without trying any of the three | |
930 alternatives described above. | |
931 | |
932 The optional argument @var{accept-default-p}, if non-@code{nil}, | |
933 should be a function to determine whether a coding system selected | |
934 without user interaction is acceptable. @code{select-safe-coding-system} | |
935 calls this function with one argument, the base coding system of the | |
936 selected coding system. If @var{accept-default-p} returns @code{nil}, | |
937 @code{select-safe-coding-system} rejects the silently selected coding | |
938 system, and asks the user to select a coding system from a list of | |
939 possible candidates. | |
940 | |
941 @vindex select-safe-coding-system-accept-default-p | |
942 If the variable @code{select-safe-coding-system-accept-default-p} is | |
943 non-@code{nil}, its value overrides the value of | |
944 @var{accept-default-p}. | |
945 | |
946 As a final step, before returning the chosen coding system, | |
947 @code{select-safe-coding-system} checks whether that coding system is | |
948 consistent with what would be selected if the contents of the region | |
949 were read from a file. (If not, this could lead to data corruption in | |
950 a file subsequently re-visited and edited.) Normally, | |
951 @code{select-safe-coding-system} uses @code{buffer-file-name} as the | |
952 file for this purpose, but if @var{file} is non-@code{nil}, it uses | |
953 that file instead (this can be relevant for @code{write-region} and | |
954 similar functions). If it detects an apparent inconsistency, | |
955 @code{select-safe-coding-system} queries the user before selecting the | |
956 coding system. | |
957 @end defun | |
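
A typical call passes a preferred coding system and lets the function
fall back on the alternatives described above.  This is a minimal
sketch; the function name @code{my-region-coding-system} and the choice
of @code{utf-8} as the preferred coding system are assumptions made for
the example.

@example
(defun my-region-coding-system (start end)
  "Return a coding system that can safely encode START to END."
  ;; @r{Try @code{utf-8} first; the user is asked only if no}
  ;; @r{automatically chosen coding system is suitable.}
  (select-safe-coding-system start end 'utf-8))
@end example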
958 | |
959 Here are two functions you can use to let the user specify a coding | |
960 system, with completion. @xref{Completion}. | |
961 | |
962 @defun read-coding-system prompt &optional default | |
963 This function reads a coding system using the minibuffer, prompting with | |
964 string @var{prompt}, and returns the coding system name as a symbol. If | |
965 the user enters null input, @var{default} specifies which coding system | |
966 to return. It should be a symbol or a string. | |
967 @end defun | |
968 | |
969 @defun read-non-nil-coding-system prompt | |
970 This function reads a coding system using the minibuffer, prompting with | |
971 string @var{prompt}, and returns the coding system name as a symbol. If | |
972 the user tries to enter null input, it asks the user to try again. | |
973 @xref{Coding Systems}. | |
974 @end defun | |
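
For example, an interactive command could use
@code{read-coding-system} in its @code{interactive} form.  The command
name @code{my-set-file-coding} below is hypothetical, and the default of
@code{utf-8} is just an assumption for the sketch.

@example
(defun my-set-file-coding (coding-system)
  "Make the current buffer use CODING-SYSTEM when it is saved."
  (interactive
   (list (read-coding-system "Coding system (default utf-8): "
                             'utf-8)))
  (set-buffer-file-coding-system coding-system))
@end example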
975 | |
976 @node Default Coding Systems | |
977 @subsection Default Coding Systems | |
978 | |
979 This section describes variables that specify the default coding | |
980 system for certain files or when running certain subprograms, and the | |
981 function that I/O operations use to access them. | |
982 | |
983 The idea of these variables is that you set them once and for all to the | |
984 defaults you want, and then do not change them again. To specify a | |
985 particular coding system for a particular operation in a Lisp program, | |
986 don't change these variables; instead, override them using | |
987 @code{coding-system-for-read} and @code{coding-system-for-write} | |
988 (@pxref{Specifying Coding Systems}). | |
989 | |
990 @defvar auto-coding-regexp-alist | |
991 This variable is an alist of text patterns and corresponding coding | |
992 systems. Each element has the form @code{(@var{regexp} | |
993 . @var{coding-system})}; a file whose first few kilobytes match | |
994 @var{regexp} is decoded with @var{coding-system} when its contents are | |
995 read into a buffer. The settings in this alist take priority over | |
996 @code{coding:} tags in the files and the contents of | |
997 @code{file-coding-system-alist} (see below). The default value is set | |
998 so that Emacs automatically recognizes mail files in Babyl format and | |
999 reads them with no code conversions. | |
1000 @end defvar | |
1001 | |
1002 @defvar file-coding-system-alist | |
1003 This variable is an alist that specifies the coding systems to use for | |
1004 reading and writing particular files. Each element has the form | |
1005 @code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular | |
1006 expression that matches certain file names. The element applies to file | |
1007 names that match @var{pattern}. | |
1008 | |
1009 The @sc{cdr} of the element, @var{coding}, should be either a coding | |
1010 system, a cons cell containing two coding systems, or a function name (a | |
1011 symbol with a function definition). If @var{coding} is a coding system, | |
1012 that coding system is used for both reading the file and writing it. If | |
1013 @var{coding} is a cons cell containing two coding systems, its @sc{car} | |
1014 specifies the coding system for decoding, and its @sc{cdr} specifies the | |
1015 coding system for encoding. | |
1016 | |
1017 If @var{coding} is a function name, the function should take one | |
1018 argument, a list of all arguments passed to | |
1019 @code{find-operation-coding-system}. It must return a coding system | |
1020 or a cons cell containing two coding systems. This value has the same | |
1021 meaning as described above. | |
1022 | |
1023 If @var{coding} (or the value returned by the above function) is | |
1024 @code{undecided}, the normal code-detection is performed. | |
1025 @end defvar | |
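
The two simplest element forms might look like this in an init file.
The file name patterns @samp{.myext} and @samp{.mylog} are invented for
the example.

@example
;; @r{One coding system for both reading and writing.}
(add-to-list 'file-coding-system-alist
             '("\\.myext\\'" . utf-8))
;; @r{Decode as Latin-1, but encode as UTF-8 with Unix line ends.}
(add-to-list 'file-coding-system-alist
             '("\\.mylog\\'" . (latin-1 . utf-8-unix)))
@end example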
1026 | |
1027 @defvar process-coding-system-alist | |
1028 This variable is an alist specifying which coding systems to use for a | |
1029 subprocess, depending on which program is running in the subprocess. It | |
1030 works like @code{file-coding-system-alist}, except that @var{pattern} is | |
1031 matched against the program name used to start the subprocess. The coding | |
1032 system or systems specified in this alist are used to initialize the | |
1033 coding systems used for I/O to the subprocess, but you can specify | |
1034 other coding systems later using @code{set-process-coding-system}. | |
1035 @end defvar | |
1036 | |
1037 @strong{Warning:} Coding systems such as @code{undecided}, which | |
1038 determine the coding system from the data, do not work entirely reliably | |
1039 with asynchronous subprocess output. This is because Emacs handles | |
1040 asynchronous subprocess output in batches, as it arrives. If the coding | |
1041 system leaves the character code conversion unspecified, or leaves the | |
1042 end-of-line conversion unspecified, Emacs must try to detect the proper | |
1043 conversion from one batch at a time, and this does not always work. | |
1044 | |
1045 Therefore, with an asynchronous subprocess, if at all possible, use a | |
1046 coding system which determines both the character code conversion and | |
1047 the end of line conversion---that is, one like @code{latin-1-unix}, | |
1048 rather than @code{undecided} or @code{latin-1}. | |
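
One way to follow this advice is to bind
@code{coding-system-for-read} and @code{coding-system-for-write}
(@pxref{Specifying Coding Systems}) around the call that starts the
subprocess.  The sketch below assumes a program named @command{cat} is
available; the process and buffer names are arbitrary.

@example
(let ((coding-system-for-read 'utf-8-unix)
      (coding-system-for-write 'utf-8-unix))
  ;; @r{Both directions are now fully specified for this process.}
  (start-process "my-proc" "*my-proc*" "cat"))
@end example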
1049 | |
1050 @defvar network-coding-system-alist | |
1051 This variable is an alist that specifies the coding system to use for | |
1052 network streams. It works much like @code{file-coding-system-alist}, | |
1053 with the difference that the @var{pattern} in an element may be either a | |
1054 port number or a regular expression. If it is a regular expression, it | |
1055 is matched against the network service name used to open the network | |
1056 stream. | |
1057 @end defvar | |
1058 | |
1059 @defvar default-process-coding-system | |
1060 This variable specifies the coding systems to use for subprocess (and | |
1061 network stream) input and output, when nothing else specifies what to | |
1062 do. | |
1063 | |
1064 The value should be a cons cell of the form @code{(@var{input-coding} | |
1065 . @var{output-coding})}. Here @var{input-coding} applies to input from | |
1066 the subprocess, and @var{output-coding} applies to output to it. | |
1067 @end defvar | |
1068 | |
1069 @defvar auto-coding-functions | |
1070 This variable holds a list of functions that try to determine a | |
1071 coding system for a file based on its undecoded contents. | |
1072 | |
1073 Each function in this list should be written to look at text in the | |
1074 current buffer, but should not modify it in any way. The buffer will | |
1075 contain undecoded text of parts of the file. Each function should | |
1076 take one argument, @var{size}, which tells it how many characters to | |
1077 look at, starting from point. If the function succeeds in determining | |
1078 a coding system for the file, it should return that coding system. | |
1079 Otherwise, it should return @code{nil}. | |
1080 | |
1081 If a file has a @samp{coding:} tag, that takes precedence, so these | |
1082 functions won't be called. | |
1083 @end defvar | |
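
As a rough sketch of such a function, the following looks for a
made-up @samp{#encoding=} header line within the first @var{size}
characters.  Both the header convention and the function name are
hypothetical; they only illustrate the calling convention described
above.

@example
(defun my-auto-coding-function (size)
  "Return the coding system named by a `#encoding=' header, if any."
  ;; @r{Leave point and the buffer text unchanged.}
  (save-excursion
    (when (re-search-forward "^#encoding=\\([-a-z0-9]+\\)"
                             (+ (point) size) t)
      (let ((sym (intern (match-string 1))))
        ;; @r{Return the symbol only if it names a coding system.}
        (and (coding-system-p sym) sym)))))

(add-to-list 'auto-coding-functions 'my-auto-coding-function)
@end example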
1084 | |
1085 @defun find-operation-coding-system operation &rest arguments | |
1086 This function returns the coding system to use (by default) for | |
1087 performing @var{operation} with @var{arguments}. The value has this | |
1088 form: | |
1089 | |
1090 @example | |
1091 (@var{decoding-system} . @var{encoding-system}) | |
1092 @end example | |
1093 | |
1094 The first element, @var{decoding-system}, is the coding system to use | |
1095 for decoding (in case @var{operation} does decoding), and | |
1096 @var{encoding-system} is the coding system for encoding (in case | |
1097 @var{operation} does encoding). | |
1098 | |
1099 The argument @var{operation} is a symbol, one of @code{write-region}, | |
1100 @code{start-process}, @code{call-process}, @code{call-process-region}, | |
1101 @code{insert-file-contents}, or @code{open-network-stream}. These are | |
1102 the names of the Emacs I/O primitives that can do character code and | |
1103 eol conversion. | |
1104 | |
1105 The remaining arguments should be the same arguments that might be given | |
1106 to the corresponding I/O primitive. Depending on the primitive, one | |
1107 of those arguments is selected as the @dfn{target}. For example, if | |
1108 @var{operation} does file I/O, whichever argument specifies the file | |
1109 name is the target. For subprocess primitives, the process name is the | |
1110 target. For @code{open-network-stream}, the target is the service name | |
1111 or port number. | |
1112 | |
1113 Depending on @var{operation}, this function looks up the target in | |
1114 @code{file-coding-system-alist}, @code{process-coding-system-alist}, | |
1115 or @code{network-coding-system-alist}. If the target is found in the | |
1116 alist, @code{find-operation-coding-system} returns its association in | |
1117 the alist; otherwise it returns @code{nil}. | |
1118 | |
1119 If @var{operation} is @code{insert-file-contents}, the argument | |
1120 corresponding to the target may be a cons cell of the form | |
1121 @code{(@var{filename} . @var{buffer})}. In that case, @var{filename} | |
1122 is a file name to look up in @code{file-coding-system-alist}, and | |
1123 @var{buffer} is a buffer that contains the file's contents (not yet | |
1124 decoded). If @code{file-coding-system-alist} specifies a function to | |
1125 call for this file, and that function needs to examine the file's | |
1126 contents (as it usually does), it should examine the contents of | |
1127 @var{buffer} instead of reading the file. | |
1128 @end defun | |
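
For example, the calls below ask which coding systems would be used
by default for two operations.  The file and program names are
placeholders, and the results depend on the contents of the alists, so
none are shown.

@example
;; @r{The target is the file name, the third argument here.}
(find-operation-coding-system
 'write-region (point-min) (point-max) "/tmp/example.txt")
;; @r{Arguments are those that @code{call-process} would receive.}
(find-operation-coding-system
 'call-process "ls" nil t nil "-l")
@end example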
1129 | |
1130 @node Specifying Coding Systems | |
1131 @subsection Specifying a Coding System for One Operation | |
1132 | |
1133 You can specify the coding system for a specific operation by binding | |
1134 the variables @code{coding-system-for-read} and/or | |
1135 @code{coding-system-for-write}. | |
1136 | |
1137 @defvar coding-system-for-read | |
1138 If this variable is non-@code{nil}, it specifies the coding system to | |
1139 use for reading a file, or for input from a synchronous subprocess. | |
1140 | |
1141 It also applies to any asynchronous subprocess or network stream, but in | |
1142 a different way: the value of @code{coding-system-for-read} when you | |
1143 start the subprocess or open the network stream specifies the input | |
1144 decoding method for that subprocess or network stream. It remains in | |
1145 use for that subprocess or network stream unless and until overridden. | |
1146 | |
1147 The right way to use this variable is to bind it with @code{let} for a | |
1148 specific I/O operation. Its global value is normally @code{nil}, and | |
1149 you should not globally set it to any other value. Here is an example | |
1150 of the right way to use the variable: | |
1151 | |
1152 @example | |
1153 ;; @r{Read the file with no character code conversion.} | |
1154 ;; @r{Assume @acronym{crlf} represents end-of-line.} | |
1155 (let ((coding-system-for-read 'emacs-mule-dos)) | |
1156 (insert-file-contents filename)) | |
1157 @end example | |
1158 | |
1159 When its value is non-@code{nil}, this variable takes precedence over | |
1160 all other methods of specifying a coding system to use for input, | |
1161 including @code{file-coding-system-alist}, | |
1162 @code{process-coding-system-alist} and | |
1163 @code{network-coding-system-alist}. | |
1164 @end defvar | |
1165 | |
1166 @defvar coding-system-for-write | |
1167 This works much like @code{coding-system-for-read}, except that it | |
1168 applies to output rather than input. It affects writing to files, | |
1169 as well as sending output to subprocesses and net connections. | |
1170 | |
1171 When a single operation does both input and output, as do | |
1172 @code{call-process-region} and @code{start-process}, both | |
1173 @code{coding-system-for-read} and @code{coding-system-for-write} | |
1174 affect it. | |
1175 @end defvar | |
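
By analogy with the example for @code{coding-system-for-read}, here is
a sketch of binding @code{coding-system-for-write} for a single
operation.  The output file name is a placeholder.

@example
;; @r{Write the buffer as UTF-8 with Unix line ends, whatever}
;; @r{the value of @code{buffer-file-coding-system} is.}
(let ((coding-system-for-write 'utf-8-unix))
  (write-region (point-min) (point-max) "/tmp/example.txt"))
@end example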
1176 | |
1177 @defvar inhibit-eol-conversion | |
1178 When this variable is non-@code{nil}, no end-of-line conversion is done, | |
1179 no matter which coding system is specified. This applies to all the | |
1180 Emacs I/O and subprocess primitives, and to the explicit encoding and | |
1181 decoding functions (@pxref{Explicit Encoding}). | |
1182 @end defvar | |
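
As with the coding system variables, it is probably safest to bind
this variable with @code{let} around one operation.  In the following
sketch, the carriage return in the sample string should be left in
place even though a @samp{-dos} variant is requested:

@example
(let ((inhibit-eol-conversion t))
  ;; @r{No end-of-line conversion happens inside this form.}
  (decode-coding-string "one\r\ntwo" 'latin-1-dos))
@end example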

@node Explicit Encoding
@subsection Explicit Encoding and Decoding
@cindex encoding in coding systems
@cindex decoding in coding systems

All the operations that transfer text in and out of Emacs have the
ability to use a coding system to encode or decode the text.
You can also explicitly encode and decode text using the functions
in this section.

The result of encoding, and the input to decoding, are not ordinary
text. They logically consist of a series of byte values; that is, a
series of characters whose codes are in the range 0 through 255. In a
multibyte buffer or string, character codes 128 through 159 are
represented by multibyte sequences, but this is invisible to Lisp
programs.

The usual way to read a file into a buffer as a sequence of bytes, so
you can decode the contents explicitly, is with
@code{insert-file-contents-literally} (@pxref{Reading from Files});
alternatively, specify a non-@code{nil} @var{rawfile} argument when
visiting a file with @code{find-file-noselect}. These methods result in
a unibyte buffer.

The usual way to use the byte sequence that results from explicitly
encoding text is to copy it to a file or process---for example, to write
it with @code{write-region} (@pxref{Writing to Files}), and suppress
encoding by binding @code{coding-system-for-write} to
@code{no-conversion}.
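
The two usual patterns just described can be sketched as a pair of
helpers: one writes already-encoded bytes without further conversion,
the other reads a file literally and decodes it in place. The names
@code{my-write-bytes} and @code{my-read-and-decode} are hypothetical.
This sketch reads into a temporary multibyte buffer, which the decoding
command also accepts.

@example
;; Hypothetical helper: write the unibyte string BYTES to FILE as is.
(defun my-write-bytes (bytes file)
  (let ((coding-system-for-write 'no-conversion))
    (write-region bytes nil file nil 'silent)))

;; Hypothetical helper: read FILE as raw bytes, decode them
;; explicitly with CODING-SYSTEM, and return the decoded text.
(defun my-read-and-decode (file coding-system)
  (with-temp-buffer
    (insert-file-contents-literally file)
    (decode-coding-region (point-min) (point-max) coding-system)
    (buffer-string)))
@end example

For instance, passing the result of
@code{(encode-coding-string text 'utf-8)} to @code{my-write-bytes} and
then calling @code{my-read-and-decode} on the same file with
@code{utf-8} should give back the original text.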

Here are the functions to perform explicit encoding or decoding. The
encoding functions produce sequences of bytes; the decoding functions
are meant to operate on sequences of bytes. All of these functions
discard text properties.

@deffn Command encode-coding-region start end coding-system
This command encodes the text from @var{start} to @var{end} according
to coding system @var{coding-system}. The encoded text replaces the
original text in the buffer. The result of encoding is logically a
sequence of bytes, but the buffer remains multibyte if it was multibyte
before.

This command returns the length of the encoded text.
@end deffn

@defun encode-coding-string string coding-system &optional nocopy
This function encodes the text in @var{string} according to coding
system @var{coding-system}. It returns a new string containing the
encoded text, except when @var{nocopy} is non-@code{nil}, in which
case the function may return @var{string} itself if the encoding
operation is trivial. The result of encoding is a unibyte string.
@end defun
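
A short illustration of @code{encode-coding-string}, using the
character U+00E9 (e with acute accent) as an assumed sample input:

@example
;; The same character encoded in two different ways.
(encode-coding-string "\u00e9" 'utf-8)        ; bytes 303 251 (octal)
(encode-coding-string "\u00e9" 'iso-latin-1)  ; byte 351 (octal)

;; The result is unibyte, so its length counts bytes.
(multibyte-string-p (encode-coding-string "\u00e9" 'utf-8)) ; nil
(length (encode-coding-string "\u00e9" 'utf-8))             ; 2
@end example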

@deffn Command decode-coding-region start end coding-system
This command decodes the text from @var{start} to @var{end} according
to coding system @var{coding-system}. The decoded text replaces the
original text in the buffer. To make explicit decoding useful, the text
before decoding ought to be a sequence of byte values, but both
multibyte and unibyte buffers are acceptable.

This command returns the length of the decoded text.
@end deffn
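
The two region commands can be tried together in a temporary buffer.
The following is only a sketch: the function name
@code{my-utf-8-round-trip} is hypothetical and UTF-8 is an assumed
example encoding.

@example
;; Hypothetical helper: encode TEXT in place, note how many byte
;; values that produced, then decode it in place again.
(defun my-utf-8-round-trip (text)
  (with-temp-buffer
    (insert text)
    (encode-coding-region (point-min) (point-max) 'utf-8)
    (let ((byte-count (buffer-size)))
      (decode-coding-region (point-min) (point-max) 'utf-8)
      (list byte-count (buffer-string)))))

;; Four characters become five bytes and then four characters again.
(my-utf-8-round-trip "caf\u00e9")
@end example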

@defun decode-coding-string string coding-system &optional nocopy
This function decodes the text in @var{string} according to coding
system @var{coding-system}. It returns a new string containing the
decoded text, except when @var{nocopy} is non-@code{nil}, in which
case the function may return @var{string} itself if the decoding
operation is trivial. To make explicit decoding useful, the contents
of @var{string} ought to be a sequence of byte values, but a multibyte
string is acceptable.
@end defun
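
A short illustration of @code{decode-coding-string}; the byte values
are assumed sample inputs:

@example
;; The bytes 303 251 (octal) are the UTF-8 encoding of U+00E9.
(decode-coding-string "\303\251" 'utf-8)

;; The same character as the single Latin-1 byte 351 (octal).
(decode-coding-string "\351" 'iso-latin-1)

;; Decoding undoes encoding for text the coding system can represent.
(let ((text "na\u00efve"))
  (equal text
         (decode-coding-string (encode-coding-string text 'utf-8)
                               'utf-8)))
@end example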

@defun decode-coding-inserted-region from to filename &optional visit beg end replace
This function decodes the text from @var{from} to @var{to} as if
it were being read from file @var{filename} by
@code{insert-file-contents} called with the remaining arguments.

The normal way to use this function is after reading text from a file
without decoding, if you decide you would rather have decoded it.
Instead of deleting the text and reading it again, this time with
decoding, you can call this function.
@end defun
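
A minimal sketch of that usage follows. The function name
@code{my-file-text-decoded-later} is hypothetical. The coding system is
chosen as it would be for a normal read of the file, so the result
depends on the file and on the user's settings.

@example
;; Hypothetical helper: read FILE without decoding, then decode
;; the inserted bytes after the fact and return the text.
(defun my-file-text-decoded-later (file)
  (with-temp-buffer
    (insert-file-contents-literally file)
    (decode-coding-inserted-region (point-min) (point-max) file)
    (buffer-string)))
@end example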

@node Terminal I/O Encoding
@subsection Terminal I/O Encoding

Emacs can decode keyboard input using a coding system, and encode
terminal output. This is useful for terminals that transmit or display
text using a particular encoding such as Latin-1. Emacs does not set
@code{last-coding-system-used} for encoding or decoding for the
terminal.

@defun keyboard-coding-system
This function returns the coding system that is in use for decoding
keyboard input---or @code{nil} if no coding system is to be used.
@end defun

@deffn Command set-keyboard-coding-system coding-system
This command specifies @var{coding-system} as the coding system to
use for decoding keyboard input. If @var{coding-system} is @code{nil},
that means do not decode keyboard input.
@end deffn

@defun terminal-coding-system
This function returns the coding system that is in use for encoding
terminal output---or @code{nil} for no encoding.
@end defun

@deffn Command set-terminal-coding-system coding-system
This command specifies @var{coding-system} as the coding system to use
for encoding terminal output. If @var{coding-system} is @code{nil},
that means do not encode terminal output.
@end deffn
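
As a small sketch, the two query functions can be combined to report
the current terminal setup. The function name
@code{my-terminal-coding-summary} is hypothetical, and the commented
calls assume a terminal that uses UTF-8.

@example
;; Hypothetical helper: report the coding systems currently used
;; for keyboard input and for terminal output.
(defun my-terminal-coding-summary ()
  (list (cons 'keyboard (keyboard-coding-system))
        (cons 'terminal (terminal-coding-system))))

;; A setup for a UTF-8 terminal might then do this:
;; (set-keyboard-coding-system 'utf-8)
;; (set-terminal-coding-system 'utf-8)
@end example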

@node MS-DOS File Types
@subsection MS-DOS File Types
@cindex DOS file types
@cindex MS-DOS file types
@cindex Windows file types
@cindex file types on MS-DOS and Windows
@cindex text files and binary files
@cindex binary files and text files

On MS-DOS and Microsoft Windows, Emacs guesses the appropriate
end-of-line conversion for a file by looking at the file's name. This
feature classifies files as @dfn{text files} and @dfn{binary files}. By
``binary file'' we mean a file of literal byte values that are not
necessarily meant to be characters; Emacs does no end-of-line conversion
and no character code conversion for them. On the other hand, the bytes
in a text file are intended to represent characters; when you create a
new file whose name implies that it is a text file, Emacs uses DOS
end-of-line conversion.

@defvar buffer-file-type
This variable, automatically buffer-local in each buffer, records the
file type of the buffer's visited file. When a buffer does not specify
a coding system with @code{buffer-file-coding-system}, this variable is
used to determine which coding system to use when writing the contents
of the buffer. It should be @code{nil} for text, @code{t} for binary.
If it is @code{t}, the coding system is @code{no-conversion}.
Otherwise, @code{undecided-dos} is used.

Normally this variable is set by visiting a file; it is set to
@code{nil} if the file was visited without any actual conversion.
@end defvar

@defopt file-name-buffer-file-type-alist
This variable holds an alist for recognizing text and binary files.
Each element has the form @code{(@var{regexp} . @var{type})}, where
@var{regexp} is matched against the file name, and @var{type} may be
@code{nil} for text, @code{t} for binary, or a function to call to
compute which. If it is a function, then it is called with a single
argument (the file name) and should return @code{t} or @code{nil}.

When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
which coding system to use when reading a file. For a text file,
@code{undecided-dos} is used. For a binary file, @code{no-conversion}
is used.

If no element in this alist matches a given file name, then
@code{default-buffer-file-type} says how to treat the file.
@end defopt
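
A hedged sketch of adding entries to
@code{file-name-buffer-file-type-alist}. The file name patterns are
hypothetical, and the @code{boundp} check reflects the assumption that
the variable is defined only in MS-DOS and MS-Windows builds.

@example
;; Assumption: the variable exists only on MS-DOS and MS-Windows,
;; so check before touching it.
(when (boundp 'file-name-buffer-file-type-alist)
  ;; Treat files ending in ".dat" as binary ...
  (add-to-list 'file-name-buffer-file-type-alist
               '("\\.dat\\'" . t))
  ;; ... and files ending in ".log" as text.
  (add-to-list 'file-name-buffer-file-type-alist
               '("\\.log\\'" . nil)))
@end example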

@defopt default-buffer-file-type
This variable says how to handle files for which
@code{file-name-buffer-file-type-alist} says nothing about the type.

If this variable is non-@code{nil}, then these files are treated as
binary: the coding system @code{no-conversion} is used. Otherwise,
nothing special is done for them---the coding system is deduced solely
from the file contents, in the usual Emacs fashion.
@end defopt

@node Input Methods
@section Input Methods
@cindex input methods

@dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII}
characters from the keyboard. Unlike coding systems, which translate
non-@acronym{ASCII} characters to and from encodings meant to be read by
programs, input methods provide human-friendly commands. (@xref{Input
Methods,,, emacs, The GNU Emacs Manual}, for information on how users
use input methods to enter text.) How to define input methods is not
yet documented in this manual, but here we describe how to use them.

Each input method has a name, which is currently a string;
in the future, symbols may also be usable as input method names.

@defvar current-input-method
This variable holds the name of the input method now active in the
current buffer. (It automatically becomes local in each buffer when set
in any fashion.) It is @code{nil} if no input method is active in the
buffer now.
@end defvar

@defopt default-input-method
This variable holds the default input method for commands that choose an
input method. Unlike @code{current-input-method}, this variable is
normally global.
@end defopt

@deffn Command set-input-method input-method
This command activates input method @var{input-method} for the current
buffer. It also sets @code{default-input-method} to @var{input-method}.
If @var{input-method} is @code{nil}, this command deactivates any input
method for the current buffer.
@end deffn
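
The interplay of @code{set-input-method} and
@code{current-input-method} can be sketched as follows. The function
name @code{my-try-input-method} is hypothetical, and the sketch assumes
that the named input method (for example @code{"latin-1-prefix"}) is
installed.

@example
;; Hypothetical helper: activate the input method NAME in a
;; scratch buffer and return what `current-input-method' reports.
(defun my-try-input-method (name)
  (with-temp-buffer
    ;; Rebind the default so the global value is left unchanged.
    (let ((default-input-method default-input-method))
      (set-input-method name)
      (prog1 current-input-method
        ;; Passing nil turns the input method off again.
        (set-input-method nil)))))
@end example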

@defun read-input-method-name prompt &optional default inhibit-null
This function reads an input method name with the minibuffer, prompting
with @var{prompt}. If @var{default} is non-@code{nil}, it is returned
when the user enters empty input. However, if
@var{inhibit-null} is non-@code{nil}, empty input signals an error.

The returned value is a string.
@end defun

@defvar input-method-alist
This variable defines all the supported input methods.
Each element defines one input method, and should have the form:

@example
(@var{input-method} @var{language-env} @var{activate-func}
 @var{title} @var{description} @var{args}...)
@end example

Here @var{input-method} is the input method name, a string;
@var{language-env} is another string, the name of the language
environment this input method is recommended for. (That serves only for
documentation purposes.)

@var{activate-func} is a function to call to activate this method. The
@var{args}, if any, are passed as arguments to @var{activate-func}. All
told, the arguments to @var{activate-func} are @var{input-method} and
the @var{args}.

@var{title} is a string to display in the mode line while this method is
active. @var{description} is a string describing this method and what
it is good for.
@end defvar
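
Given the element layout of @code{input-method-alist} described above,
looking up an entry might be sketched like this. The function name
@code{my-input-method-summary} is hypothetical.

@example
;; Hypothetical helper: return the title and description of the
;; input method NAME, or nil if NAME is not registered.
(defun my-input-method-summary (name)
  (let ((entry (assoc name input-method-alist)))
    (when entry
      ;; Positions: 0 name, 1 language environment,
      ;; 2 activation function, 3 title, 4 description.
      (list (nth 3 entry) (nth 4 entry)))))
@end example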

The fundamental interface to input methods is through the
variable @code{input-method-function}. @xref{Reading One Event},
and @ref{Invoking the Input Method}.

@node Locales
@section Locales
@cindex locale

POSIX defines a concept of ``locales'', which control which language
to use in language-related features. These Emacs variables control
how Emacs interacts with these features.

@defvar locale-coding-system
@cindex keyboard input decoding on X
This variable specifies the coding system to use for decoding system
error messages and---on the X Window System only---keyboard input, for
encoding the format argument to @code{format-time-string}, and for
decoding the return value of @code{format-time-string}.
@end defvar

@defvar system-messages-locale
This variable specifies the locale to use for generating system error
messages. Changing the locale can cause messages to come out in a
different language or in a different orthography. If the variable is
@code{nil}, the locale is specified by environment variables in the
usual POSIX fashion.
@end defvar

@defvar system-time-locale
This variable specifies the locale to use for formatting time values.
Changing the locale can cause time values to be formatted according to
the conventions of a different language. If the variable is @code{nil}, the
locale is specified by environment variables in the usual POSIX fashion.
@end defvar
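
A minimal sketch of binding @code{system-time-locale} around a call to
@code{format-time-string}. The function name
@code{my-format-time-in-locale} is hypothetical, and the sketch assumes
a system where the requested locale is available.

@example
;; Hypothetical helper: format TIME (shown as Universal Time) with
;; FORMAT under the locale named LOCALE, leaving the global
;; setting untouched.
(defun my-format-time-in-locale (locale format time)
  (let ((system-time-locale locale))
    (format-time-string format time t)))

;; Time zero fell on a Thursday, so in the "C" locale this should
;; return "Thursday".
(my-format-time-in-locale "C" "%A" '(0 0))
@end example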

@defun locale-info item
This function returns locale data @var{item} for the current POSIX
locale, if available. @var{item} should be one of these symbols:

@table @code
@item codeset
Return the character set as a string (locale item @code{CODESET}).

@item days
Return a 7-element vector of day names (locale items
@code{DAY_1} through @code{DAY_7}).

@item months
Return a 12-element vector of month names (locale items @code{MON_1}
through @code{MON_12}).

@item paper
Return a list @code{(@var{width} @var{height})} for the default paper
size measured in millimeters (locale items @code{PAPER_WIDTH} and
@code{PAPER_HEIGHT}).
@end table

If the system can't provide the requested information, or if
@var{item} is not one of those symbols, the value is @code{nil}. All
strings in the return value are decoded using
@code{locale-coding-system}. @xref{Locales,,, libc, The GNU Libc Manual},
for more information about locales and locale items.
@end defun
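
Because @code{locale-info} may return @code{nil}, callers usually need
to check the result. The following is only a sketch, and the function
name @code{my-locale-day-name} is hypothetical.

@example
;; Hypothetical helper: return the locale's name for entry N
;; (0 through 6) of the day-name vector, or nil when the locale
;; data is not available.
(defun my-locale-day-name (n)
  (let ((days (locale-info 'days)))
    (when (vectorp days)
      (aref days n))))

;; The codeset is a string, or nil if the system cannot tell.
(locale-info 'codeset)
@end example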

@ignore
arch-tag: be705bf8-941b-4c35-84fc-ad7d20ddb7cb
@end ignore