# HG changeset patch # User Richard M. Stallman # Date 982433527 0 # Node ID 0fd801cdb9fd9f70ea04a499b3b081ef0f481bef # Parent 86e871a073b62d46c6d421133d3e7da8274cf296 Clarify undisplayable characters, --unibyte, locales. Clarify self-insertion of non-ASCII 8-bit chars. Clarify coding system detection of escape sequences. Clarify keyboard input methods and coding systems. Comment out the commands to inquire about character sets. Misc cleanups. diff -r 86e871a073b6 -r 0fd801cdb9fd man/mule.texi --- a/man/mule.texi Sat Feb 17 17:52:52 2001 +0000 +++ b/man/mule.texi Sat Feb 17 18:12:07 2001 +0000 @@ -42,7 +42,7 @@ ``MULti-lingual Enhancement to GNU Emacs'') Emacs also supports various encodings of these characters used by -internationalized software, such as word processors, mailers, etc. +other internationalized software, such as word processors and mailers. @menu * International Intro:: Basic concepts of multibyte characters. @@ -80,16 +80,31 @@ @kindex C-h h @findex view-hello-file @cindex undisplayable characters -@cindex ? -@cindex ?? +@cindex @samp{?} in display The command @kbd{C-h h} (@code{view-hello-file}) displays the file @file{etc/HELLO}, which shows how to say ``hello'' in many languages. -This illustrates various scripts. If the font you're using doesn't have -characters for all those different languages, you will see some hollow -boxes instead of characters; see @ref{Fontsets}. On non-windowing -displays, @samp{?} is displayed in place of the hollow box. More than -one @samp{?} is displayed for undisplayable characters that are wider -than one column. +This illustrates various scripts. If some characters can't be +displayed on your terminal, they appear as @samp{?} or as hollow boxes +(@pxref{Undisplayable Characters}). + + Keyboards, even in the countries where these character sets are used, +generally don't have keys for all the characters in them. 
So Emacs +supports various @dfn{input methods}, typically one for each script or +language, to make it convenient to type them. + +@kindex C-x RET + The prefix key @kbd{C-x @key{RET}} is used for commands that pertain +to multibyte characters, coding systems, and input methods. + +@ignore +@c This is commented out because it doesn't fit here, or anywhere. +@c This manual does not discuss "character sets" as they +@c are used in Mule, and it makes no sense to mention these commands +@c except as part of a larger discussion of the topic. +@c But it is not clear that topic is worth mentioning here, +@c since that is more of an implementation concept +@c than a user-level concept. And when we switch to Unicode, +@c character sets in the current sense may not even exist. @findex list-charset-chars @cindex characters in a certain charset @@ -101,15 +116,7 @@ The command @kbd{M-x describe-character-set} prompts for a character set name and displays information about that character set, including its internal representation within Emacs. - - Keyboards, even in the countries where these character sets are used, -generally don't have keys for all the characters in them. So Emacs -supports various @dfn{input methods}, typically one for each script or -language, to make it convenient to type them. - -@kindex C-x RET - The prefix key @kbd{C-x @key{RET}} is used for commands that pertain -to multibyte characters, coding systems, and input methods. +@end ignore @node Enabling Multibyte @section Enabling Multibyte Characters @@ -153,16 +160,22 @@ @cindex unibyte operation, and Lisp files @cindex init file, and non-ASCII characters @cindex environment variables, and non-ASCII characters - Multibyte strings are not created during initialization from the -values of environment variables, @file{/etc/passwd} entries etc.@: that -contain non-ASCII 8-bit characters. 
However, Lisp files, when they are -loaded for running, and in particular the initialization file -@file{.emacs}, are normally read as multibyte---even with -@samp{--unibyte}. To avoid multibyte strings being generated by -non-ASCII characters in Lisp files, put @samp{-*-unibyte: t;-*-} in a -comment on the first line, or specify the coding system @samp{raw-text} -with @kbd{C-x @key{RET} c}. Do the same for initialization files for -packages like Gnus. + With @samp{--unibyte}, multibyte strings are not created during +initialization from the values of environment variables, +@file{/etc/passwd} entries etc.@: that contain non-ASCII 8-bit +characters. + + Emacs normally loads Lisp files as multibyte, regardless of whether +you used @samp{--unibyte}. This includes the Emacs initialization +file, @file{.emacs}, and the initialization files of Emacs packages +such as Gnus. However, you can specify unibyte loading for a +particular Lisp file, by putting @samp{-*-unibyte: t;-*-} in a comment +on the first line. Then that file is always loaded as unibyte text, +even if you did not start Emacs with @samp{--unibyte}. The motivation +for these conventions is that it is more reliable to always load any +particular Lisp file in the same way. However, you can load a Lisp +file as unibyte, on any one occasion, by typing @kbd{C-x @key{RET} c +raw-text @key{RET}} immediately before loading it. The mode line indicates whether multibyte character support is enabled in the current buffer. If it is, there are two or more characters (most @@ -206,13 +219,12 @@ Dutch, Spanish, and Vietnamese. @end quotation -@cindex fonts, for displaying different languages - To be able to display the script(s) used by your language environment -on a windowed display, you need to have a suitable font installed. If -some of the characters appear as empty boxes, download and install the -GNU Intlfonts distribution, which includes fonts for all supported -scripts. 
@xref{Fontsets}, for more details about setting up your -fonts. +@cindex fonts for various scripts + To display the script(s) used by your language environment on a +graphical display, you need to have a suitable font. If some of the +characters appear as empty boxes, you should install the GNU Intlfonts +package, which includes fonts for all supported scripts. +@xref{Fontsets}, for more details about setting up your fonts. @findex set-locale-environment @vindex locale-language-names @@ -220,31 +232,21 @@ @cindex locales Some operating systems let you specify the language you are using by setting the locale environment variables @env{LC_ALL}, @env{LC_CTYPE}, -and @env{LANG}; the first of these which is nonempty specifies your -locale. Emacs handles this during startup by invoking the -@code{set-locale-environment} function, which matches your locale -against entries in the value of the variable +or @env{LANG}.@footnote{If more than one of these is set, the first +one that is nonempty specifies your locale for this purpose.} Emacs +handles this during startup by matching your locale against entries in +the value of the variables @code{locale-charset-language-names} and @code{locale-language-names} and selects the corresponding language -environment if a match is found. But if your locale also matches an -entry in the variable @code{locale-charset-language-names}, this entry -is preferred if its character set disagrees. For example, suppose the -locale @samp{en_GB.ISO8859-15} matches @code{"Latin-1"} in -@code{locale-language-names} and @code{"Latin-9"} in -@code{locale-charset-language-names}; since these two language -environments' character sets disagree, Emacs uses @code{"Latin-9"}. +environment if a match is found. (The former variable overrides the +latter.) It also adjusts the display table and terminal coding +system, the locale coding system, and the preferred coding system as +needed for the locale. 
- If all goes well, the @code{set-locale-environment} function selects -the language environment, since language is part of locale. It also -adjusts the display table and terminal coding system, the locale coding -system, and the preferred coding system as needed for the locale. + If you modify the @env{LC_ALL}, @env{LC_CTYPE}, or @env{LANG} +environment variables while running Emacs, you may want to invoke the +@code{set-locale-environment} function afterwards to readjust the +language environment from the new locale. - Since the @code{set-locale-environment} function is automatically -invoked during startup, you normally do not need to invoke it yourself. -However, if you modify the @env{LC_ALL}, @env{LC_CTYPE}, or @env{LANG} -environment variables, you may want to invoke the -@code{set-locale-environment} function afterwards. - -@findex set-locale-environment @vindex locale-preferred-coding-systems The @code{set-locale-environment} function normally uses the preferred coding system established by the language environment to decode system @@ -255,10 +257,10 @@ @code{locale-preferred-coding-systems}, Emacs uses that encoding even though it might normally use @code{japanese-iso-8bit}. - The environment chosen from the locale when Emacs starts is -overidden by any explicit use of the command -@code{set-language-environment} or customization of -@code{current-language-environment} in your init file. + You can override the language environment chosen at startup with +explicit use of the command @code{set-language-environment}, or with +customization of @code{current-language-environment} in your init +file. @kindex C-h L @findex describe-language-environment @@ -369,8 +371,10 @@ are in the minibuffer). @cindex Leim package -Input methods are implemented in the separate Leim package, which must -be installed with Emacs. + Input methods are implemented in the separate Leim package: they are +available only if the system administrator used Leim when building +Emacs. 
If Emacs was built without Leim, you will find that no input +methods are defined. @node Select Input Method @section Selecting an Input Method @@ -443,11 +447,12 @@ through 0377 (octal) are not really legitimate in the buffer. The valid non-ASCII printing characters have codes that start from 0400. - If you type a self-inserting character in the range 0240 -through 0377, Emacs assumes you intended to use one of the ISO -Latin-@var{n} character sets, and converts it to the Emacs code -representing that Latin-@var{n} character. You select @emph{which} ISO -Latin character set to use through your choice of language environment + If you type a self-inserting character in the range 0240 through +0377, or if you use @kbd{C-q} to insert one, Emacs assumes you +intended to use one of the ISO Latin-@var{n} character sets, and +converts it to the Emacs code representing that Latin-@var{n} +character. You select @emph{which} ISO Latin character set to use +through your choice of language environment @iftex (see above). @end iftex @@ -456,13 +461,12 @@ @end ifinfo If you do not specify a choice, the default is Latin-1. - The same thing happens when you use @kbd{C-q} to enter an octal code -in this range. If you enter a code in the range 0200 through 0237, -which forms the @code{eight-bit-control} character set, it is inserted + If you insert a character in the range 0200 through 0237, which +forms the @code{eight-bit-control} character set, it is inserted literally. You should normally avoid doing this since buffers containing such characters have to be written out in either the -@code{emacs-mule} or @code{raw-text} coding system, which is usually not -what you want. +@code{emacs-mule} or @code{raw-text} coding system, which is usually +not what you want. @node Coding Systems @section Coding Systems @@ -652,24 +656,24 @@ @cindex escape sequences in files By default, the automatic detection of coding system is sensitive to escape sequences. 
If Emacs sees a sequence of characters that begin -with an @key{ESC} character, and the sequence is valid as an ISO-2022 -code, the code is determined as one of ISO-2022 encoding, and the file -is decoded by the corresponding coding system -(e.g. @code{iso-2022-7bit}). +with an escape character, and the sequence is valid as an ISO-2022 +code, that tells Emacs to use one of the ISO-2022 encodings to decode +the file. - However, there may be cases that you want to read escape sequences in -a file as is. In such a case, you can set th variable + However, there may be cases that you want to read escape sequences +in a file as is. In such a case, you can set the variable @code{inhibit-iso-escape-detection} to non-@code{nil}. Then the code -detection will ignore any escape sequences, and so no file is detected -as being encoded in some of ISO-2022 encoding. The result is that all -escape sequences become visible in a buffer. +detection ignores any escape sequences, and never uses an ISO-2022 +encoding. The result is that all escape sequences become visible in +the buffer. The default value of @code{inhibit-iso-escape-detection} is -@code{nil}, and it is strongly recommended not to change it. That's -because many Emacs Lisp source files that contain non-ASCII characters -are encoded in the coding system @code{iso-2022-7bit} in the Emacs -distribution, and they won't be decoded correctly when you visit those -files if you suppress the escape sequence detection. +@code{nil}. We recommend that you not change it permanently, only for +one specific operation. That's because many Emacs Lisp source files +that contain non-ASCII characters are encoded in the coding system +@code{iso-2022-7bit} in the Emacs distribution, and they won't be +decoded correctly when you visit those files if you suppress the +escape sequence detection. 
@vindex coding You can specify the coding system for a particular file using the @@ -700,33 +704,34 @@ the buffer using @code{set-buffer-file-coding-system} (@pxref{Specify Coding}). - While editing a file, you will sometimes insert characters which -cannot be encoded with the coding system stored in -@code{buffer-file-coding-system}. For example, suppose you start with -an ASCII file and insert a few Latin-1 characters into it. Or you could -edit a text file in Polish encoded in @code{iso-8859-2} and add to it -translations of several Polish words into Russian. When you save the -buffer, Emacs can no longer use the previous value of the buffer's -coding system, because the characters you added cannot be encoded by -that coding system. + You can insert any possible character into any Emacs buffer, but +most coding systems can only handle some of the possible characters. +This means that you can insert characters that cannot be encoded with +the coding system that will be used to save the buffer. For example, +you could start with an ASCII file and insert a few Latin-1 characters +into it, or you could edit a text file in Polish encoded in +@code{iso-8859-2} and add to it translations of several Polish words +into Russian. When you save the buffer, Emacs cannot use the current +value of @code{buffer-file-coding-system}, because the characters you +added cannot be encoded by that coding system. When that happens, Emacs tries the most-preferred coding system (set by @kbd{M-x prefer-coding-system} or @kbd{M-x -set-language-environment}), and if that coding system can safely encode -all of the characters in the buffer, Emacs uses it, and stores its value -in @code{buffer-file-coding-system}. Otherwise, Emacs pops up a window -with a list of coding systems suitable for encoding the buffer, and -prompts you to choose one of those coding systems. 
+set-language-environment}), and if that coding system can safely +encode all of the characters in the buffer, Emacs uses it, and stores +its value in @code{buffer-file-coding-system}. Otherwise, Emacs +displays a list of coding systems suitable for encoding the buffer's +contents, and asks you to choose one of those coding systems. - If you insert characters which cannot be encoded by the buffer's -coding system while editing a mail message, Emacs behaves a bit -differently. It additionally checks whether the most-preferred coding -system is recommended for use in MIME messages; if it isn't, Emacs tells -you that the most-preferred coding system is not recommended and prompts -you for another coding system. This is so you won't inadvertently send -a message encoded in a way that your recipient's mail software will have -difficulty decoding. (If you do want to use the most-preferred coding -system, you can type its name to Emacs prompt anyway.) + If you insert the unsuitable characters in a mail message, Emacs +behaves a bit differently. It additionally checks whether the +most-preferred coding system is recommended for use in MIME messages; +if it isn't, Emacs tells you that the most-preferred coding system is +not recommended and prompts you for another coding system. This is so +you won't inadvertently send a message encoded in a way that your +recipient's mail software will have difficulty decoding. (If you do +want to use the most-preferred coding system, you can type its name at the +Emacs prompt anyway.) @vindex sendmail-coding-system When you send a message with Mail mode (@pxref{Sending Mail}), Emacs has @@ -916,13 +921,14 @@ C-w} to specify a new file name for that buffer. @vindex locale-coding-system - The variable @code{locale-coding-system} specifies a coding system to -use when encoding and decoding system strings such as system error -messages and @code{format-time-string} formats and time stamps. 
This -coding system should be compatible with the underlying system's coding -system, which is normally specified by the first environment variable in -the list @env{LC_ALL}, @env{LC_CTYPE}, @env{LANG} whose value is -nonempty. + The variable @code{locale-coding-system} specifies a coding system +to use when encoding and decoding system strings such as system error +messages and @code{format-time-string} formats and time stamps. You +should choose a coding system that is compatible with the underlying +system's text representation, which is normally specified by one of +the environment variables @env{LC_ALL}, @env{LC_CTYPE}, and +@env{LANG}. (The first one whose value is nonempty is the one that +determines the text representation.) @node Fontsets @section Fontsets @@ -941,7 +947,7 @@ course, Emacs fontsets can use only the fonts that the X server supports; if certain characters appear on the screen as hollow boxes, this means that the fontset in use for them has no font for those -characters.@footnote{The installation instructions have information on +characters.@footnote{The Emacs installation instructions have information on additional font support.} Emacs creates two fontsets automatically: the @dfn{standard fontset} @@ -1099,23 +1105,27 @@ @node Undisplayable Characters @section Undisplayable Characters -Your terminal may not be able to display some non-@sc{ascii} characters. -Most non-windowing terminals can only use a single character set, -specified by the variable @code{default-terminal-coding-system} -(@pxref{Specify Coding}) and characters which can't be encoded in it are -displayed as @samp{?} by default. Windowing terminals may not have the -necessary font available to display a given character and display a -hollow box instead. You can change the default behavior. + Your terminal may be unable to display some non-@sc{ascii} +characters. 
Most non-windowing terminals can only use a single +character set (use the variable @code{default-terminal-coding-system} +(@pxref{Specify Coding}) to tell Emacs which one); characters which +can't be encoded in that coding system are displayed as @samp{?} by +default. + + Windowing terminals can display a broader range of characters, but +you may not have fonts installed for all of them; characters that have +no font appear as a hollow box. -If you use Latin-1 characters but your terminal can't display Latin-1, -you can arrange to display mnemonic @sc{ascii} sequences instead, e.g.@: -@samp{"o} for o-umlaut. Load the library @file{iso-ascii} to do this. + If you use Latin-1 characters but your terminal can't display +Latin-1, you can arrange to display mnemonic @sc{ascii} sequences +instead, e.g.@: @samp{"o} for o-umlaut. Load the library +@file{iso-ascii} to do this. -If your terminal can display Latin-1, you can display characters from -other European character sets using a mixture of equivalent Latin-1 -characters and @sc{ascii} mnemonics. Use the Custom option -@code{latin1-display} to enable this. The mnemonic @sc{ascii} sequences -mostly correspond to those of the prefix input methods. + If your terminal can display Latin-1, you can display characters +from other European character sets using a mixture of equivalent +Latin-1 characters and @sc{ascii} mnemonics. Use the Custom option +@code{latin1-display} to enable this. The mnemonic @sc{ascii} +sequences mostly correspond to those of the prefix input methods. @node Single-Byte Character Support @section Single-byte Character Set Support @@ -1172,18 +1182,18 @@ @findex set-keyboard-coding-system @vindex keyboard-coding-system If your keyboard can generate character codes 128 and up, representing -non-ASCII characters, use the command @code{M-x -set-keyboard-coding-system} or the Custom option -@code{keyboard-coding-system} to specify this in the same way as for -multibyte usage (@pxref{Specify Coding}). 
+non-ASCII characters, you can type those character codes directly. -It is not necessary to do this under a window system which can -distinguish 8-bit characters and Meta keys. If you do this on a normal -terminal, you will probably need to use @kbd{ESC} to type Meta -characters.@footnote{In some cases, such as the Linux console and -@code{xterm}, you can arrange for Meta to be converted to @kbd{ESC} and -still be able type 8-bit characters present directly on the keyboard or -using @kbd{Compose} or @kbd{AltGr} keys.} @xref{User Input}. +On a windowing terminal, you should not need to do anything special to +use these keys; they should simply work. On a text-only terminal, you +should use the command @code{M-x set-keyboard-coding-system} or the +Custom option @code{keyboard-coding-system} to specify which coding +system your keyboard uses (@pxref{Specify Coding}). Enabling this +feature will probably require you to use @kbd{ESC} to type Meta +characters; however, on a Linux console or in @code{xterm}, you can +arrange for Meta to be converted to @kbd{ESC} and still be able to type +8-bit characters present directly on the keyboard or using +@kbd{Compose} or @kbd{AltGr} keys. @xref{User Input}. @item You can use an input method for the selected language environment. @@ -1205,7 +1215,7 @@ library is loaded, the @key{ALT} modifier key, if you have one, serves the same purpose as @kbd{C-x 8}; use @key{ALT} together with an accent character to modify the following letter. In addition, if you have keys -for the Latin-1 ``dead accent characters'', they too are defined to +for the Latin-1 ``dead accent characters,'' they too are defined to compose with the following character, once @code{iso-transl} is loaded. Use @kbd{C-x 8 C-h} to list the available translations as mnemonic command names. 
@@ -1215,9 +1225,9 @@ @cindex ISO Accents mode @findex iso-accents-mode @cindex Latin-1, Latin-2 and Latin-3 input mode -For Latin-1, Latin-2 and Latin-3, @kbd{M-x iso-accents-mode} installs a -minor mode which provides a facility like the @code{latin-1-prefix} -input method but independent of the Leim package. This mode is -buffer-local. It can be customized for various languages with @kbd{M-x -iso-accents-customize}. +For Latin-1, Latin-2 and Latin-3, @kbd{M-x iso-accents-mode} installs +a minor mode which works much like the @code{latin-1-prefix} input +method but does not depend on having the input methods installed. This +mode is buffer-local. It can be customized for various languages with +@kbd{M-x iso-accents-customize}. @end itemize