comparison doc/lispref/nonascii.texi @ 100246:1357cec2ef73
(Coding System Basics): Rewrite @ignore'd paragraph to speak about `undecided'.
(Character Properties): Don't explain the meaning of each property; instead,
identify their Unicode Standard names.
author | Eli Zaretskii <eliz@gnu.org> |
---|---|
date | Fri, 05 Dec 2008 16:11:03 +0000 |
parents | 60d9e250ee84 |
children | 3d8b80bc42ba |
100245:53921407de01 | 100246:1357cec2ef73 |
---|---|
358 of character properties. In particular, Emacs supports the | 358 of character properties. In particular, Emacs supports the |
359 @uref{http://www.unicode.org/reports/tr23/, Unicode Character Property | 359 @uref{http://www.unicode.org/reports/tr23/, Unicode Character Property |
360 Model}, and the Emacs character property database is derived from the | 360 Model}, and the Emacs character property database is derived from the |
361 Unicode Character Database (@acronym{UCD}). See the | 361 Unicode Character Database (@acronym{UCD}). See the |
362 @uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character | 362 @uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character |
363 Properties chapter of the Unicode Standard}, for more details about | 363 Properties chapter of the Unicode Standard}, for a detailed description |
364 Unicode character properties and their meaning. | 364 of Unicode character properties and their meaning. This section |
365 assumes you are already familiar with that chapter of the Unicode | |
366 Standard, and want to apply that knowledge to Emacs Lisp programs. | |
365 | 367 |
366 The facilities documented in this section are useful for setting and | 368 The facilities documented in this section are useful for setting and |
367 retrieving properties of characters. | 369 retrieving properties of characters. |
368 | 370 |
369 In Emacs, each property has a name, which is a symbol, and a set of | 371 In Emacs, each property has a name, which is a symbol, and a set of |
370 possible values, whose types depend on the property. Here's the full | 372 possible values, whose types depend on the property; if a character |
371 list of character properties that Emacs knows about: | 373 does not have a certain property, the value is @code{nil}. Here's the |
374 full list of value types for all the character properties that Emacs | |
375 knows about: | |
372 | 376 |
373 @table @code | 377 @table @code |
374 @item name | 378 @item name |
375 The character's canonical unique name. The value of the property is a | 379 This property corresponds to the Unicode @code{Name} property. The |
376 string consisting of upper-case Latin letters A to Z, digits, spaces, | 380 value is a string consisting of upper-case Latin letters A to Z, |
377 and hyphen @samp{-} characters. | 381 digits, spaces, and hyphen @samp{-} characters. |
378 | 382 |
379 @item general-category | 383 @item general-category |
380 This property assigns the character to one of the major classes, such | 384 This property corresponds to the Unicode @code{General_Category} |
381 as letters, punctuation, and symbols, and its important subclasses. | 385 property. The value is a symbol whose name is a 2-letter abbreviation |
382 The value is a symbol whose name is a 2-letter abbreviation. The | 386 of the character's classification. |
383 first letter specifies the character's major class and the second | |
384 letter designates a subclass of that major class. | |
385 | 387 |
386 @item canonical-combining-class | 388 @item canonical-combining-class |
387 This property classifies combining characters into several classes, | 389 Corresponds to the Unicode @code{Canonical_Combining_Class} property. |
388 depending on the details of their behavior in sequences of combining | 390 The value is an integer number. |
389 characters. The property's value is an integer number. | |
390 | 391 |
391 @item bidi-class | 392 @item bidi-class |
392 This property specifies character attributes required for correct | 393 Corresponds to the Unicode @code{Bidi_Class} property. The value is a |
393 display of @dfn{bidirectional text} used by right-to-left scripts, | 394 symbol whose name is the Unicode @dfn{directional type} of the |
394 such as Arabic and Hebrew. The value is a symbol whose name is the | 395 character. |
395 Unicode @dfn{directional type} of the character. | |
396 | 396 |
397 @item decomposition | 397 @item decomposition |
398 This property defines a mapping from a character to a sequence of one | 398 Corresponds to the Unicode @code{Decomposition_Type} and |
399 or more characters that is a canonical or compatibility equivalent to | 399 @code{Decomposition_Value} properties. The value is a list, whose |
400 it. The value is a list, whose first element may be a symbol | 400 first element may be a symbol representing a compatibility formatting |
401 representing a compatibility formatting tag, such as @code{<small>}; | 401 tag, such as @code{small}@footnote{ |
402 the other elements are characters that give the compatibility | 402 Note that Emacs strips the @samp{<..>} brackets from the corresponding |
403 decomposition sequence. | 403 Unicode tags; e.g., Unicode specifies @samp{<small>} where Emacs uses |
404 @samp{small}. | |
405 }; the other elements are characters that give the compatibility | |
406 decomposition sequence of this character. | |
404 | 407 |
405 @item decimal-digit-value | 408 @item decimal-digit-value |
406 This property specifies a numeric value of characters that represent | 409 Corresponds to the Unicode @code{Numeric_Value} property for |
407 decimal digits. The value is an integer number. | 410 characters whose @code{Numeric_Type} is @samp{Decimal}. The value is an |
411 integer number. | |
408 | 412 |
409 @item digit | 413 @item digit |
410 This property specifies a numeric value of characters that represent | 414 Corresponds to the Unicode @code{Numeric_Value} property for |
411 digits, but not necessarily decimal. Examples include compatibility | 415 characters whose @code{Numeric_Type} is @samp{Digit}. The value is |
412 subscript and superscript digits. The value is an integer number. | 416 an integer number. Examples of such characters include compatibility |
417 subscript and superscript digits, for which the value is the | |
418 corresponding number. | |
413 | 419 |
414 @item numeric-value | 420 @item numeric-value |
415 This property specifies whether the character represents a number. | 421 Corresponds to the Unicode @code{Numeric_Value} property for |
416 Examples of characters that do include fractions, subscripts, | 422 characters whose @code{Numeric_Type} is @samp{Numeric}. The value of |
423 this property is an integer or a floating-point number. Examples of | |
424 characters that have this property include fractions, subscripts, | |
417 superscripts, Roman numerals, currency numerators, and encircled | 425 superscripts, Roman numerals, currency numerators, and encircled |
418 numbers. The value is a symbol whose name gives the numeric value; | 426 numbers. For example, the value of this property for the character |
419 for example, the value of this property for the character | 427 @code{U+2155} (@sc{vulgar fraction one fifth}) is @code{0.2}. |
420 @code{U+2155} (@sc{vulgar fraction one fifth}) is the symbol | |
421 @samp{1/5}. | |
422 | 428 |
423 @item mirrored | 429 @item mirrored |
424 This is a property of characters such as parentheses, which need to be | 430 Corresponds to the Unicode @code{Bidi_Mirrored} property. The value |
425 mirrored horizontally in right to left scripts. The value is a | 431 of this property is a symbol, either @samp{Y} or @samp{N}. |
426 symbol, either @samp{Y} or @samp{N}. | |
427 | 432 |
428 @item old-name | 433 @item old-name |
429 This property's value specifies the name, if any, of the character in | 434 Corresponds to the Unicode @code{Unicode_1_Name} property. The value |
430 the old version 1.0 of the Unicode Standard. The value is a string. | 435 is a string. |
431 | 436 |
432 @item iso-10646-comment | 437 @item iso-10646-comment |
433 This character's comment field from the ISO 10646 standard. The value | 438 Corresponds to the Unicode @code{ISO_Comment} property. The value is |
434 is a string, or @code{nil} if there's no comment. | 439 a string. |
435 | 440 |
436 @item uppercase | 441 @item uppercase |
437 If this character has an upper-case equivalent that is a single | 442 Corresponds to the Unicode @code{Simple_Uppercase_Mapping} property. |
438 character, then the value of this property is that upper-case | 443 The value of this property is a single character. |
439 equivalent. Otherwise, the value is @code{nil}. | |
440 | 444 |
441 @item lowercase | 445 @item lowercase |
442 If this character has a lower-case equivalent that is a single | 446 Corresponds to the Unicode @code{Simple_Lowercase_Mapping} property. |
443 character, then the value of this property is that lower-case | 447 The value of this property is a single character. |
444 equivalent. Otherwise, the value is @code{nil}. | |
445 | 448 |
446 @item titlecase | 449 @item titlecase |
450 Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property. | |
447 @dfn{Title case} is a special form of a character used when the first | 451 @dfn{Title case} is a special form of a character used when the first |
448 character of a word needs to be capitalized. If a character has a | 452 character of a word needs to be capitalized. The value of this |
449 title-case equivalent that is a single character, then the value of | 453 property is a single character. |
450 this property is that title-case equivalent. Otherwise, the value is | |
451 @code{nil}. | |
452 @end table | 454 @end table |
453 | 455 |
454 @defun get-char-code-property char propname | 456 @defun get-char-code-property char propname |
455 This function returns the value of @var{char}'s @var{propname} property. | 457 This function returns the value of @var{char}'s @var{propname} property. |
456 | 458 |
791 several variants of ISO 2022. In some cases, Emacs supports several | 793 several variants of ISO 2022. In some cases, Emacs supports several |
792 alternative encodings for the same characters; for example, there are | 794 alternative encodings for the same characters; for example, there are |
793 three coding systems for the Cyrillic (Russian) alphabet: ISO, | 795 three coding systems for the Cyrillic (Russian) alphabet: ISO, |
794 Alternativnyj, and KOI8. | 796 Alternativnyj, and KOI8. |
795 | 797 |
796 @c I think this paragraph is no longer correct. | 798 Every coding system specifies a particular set of character code |
797 @ignore | 799 conversions, but the coding system @code{undecided} is special: it |
798 Most coding systems specify a particular character code for | 800 leaves the choice unspecified, to be chosen heuristically for each |
799 conversion, but some of them leave the choice unspecified---to be chosen | 801 file, based on the file's data. |
800 heuristically for each file, based on the data. | |
801 @end ignore | |
802 | 802 |
803 In general, a coding system doesn't guarantee roundtrip identity: | 803 In general, a coding system doesn't guarantee roundtrip identity: |
804 decoding a byte sequence using a coding system, then encoding the | 804 decoding a byte sequence using a coding system, then encoding the |
805 resulting text in the same coding system, can produce a different byte | 805 resulting text in the same coding system, can produce a different byte |
806 sequence. But some coding systems do guarantee that the byte sequence | 806 sequence. But some coding systems do guarantee that the byte sequence |