comparison doc/lispref/nonascii.texi @ 100246:1357cec2ef73

(Coding System Basics): Rewrite @ignore'd paragraph to speak about `undecided'. (Character Properties): Don't explain the meaning of each property; instead, identify their Unicode Standard names.
author Eli Zaretskii <eliz@gnu.org>
date Fri, 05 Dec 2008 16:11:03 +0000
parents 60d9e250ee84
children 3d8b80bc42ba
100245:53921407de01 100246:1357cec2ef73
358 of character properties. In particular, Emacs supports the 358 of character properties. In particular, Emacs supports the
359 @uref{http://www.unicode.org/reports/tr23/, Unicode Character Property 359 @uref{http://www.unicode.org/reports/tr23/, Unicode Character Property
360 Model}, and the Emacs character property database is derived from the 360 Model}, and the Emacs character property database is derived from the
361 Unicode Character Database (@acronym{UCD}). See the 361 Unicode Character Database (@acronym{UCD}). See the
362 @uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character 362 @uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character
363 Properties chapter of the Unicode Standard}, for more details about 363 Properties chapter of the Unicode Standard}, for detailed description
364 Unicode character properties and their meaning. 364 of Unicode character properties and their meaning. This section
365 assumes you are already familiar with that chapter of the Unicode
366 Standard, and want to apply that knowledge to Emacs Lisp programs.
365 367
366 The facilities documented in this section are useful for setting and 368 The facilities documented in this section are useful for setting and
367 retrieving properties of characters. 369 retrieving properties of characters.
368 370
369 In Emacs, each property has a name, which is a symbol, and a set of 371 In Emacs, each property has a name, which is a symbol, and a set of
370 possible values, whose types depend on the property. Here's the full 372 possible values, whose types depend on the property; if a character
371 list of character properties that Emacs knows about: 373 does not have a certain property, the value is @code{nil}. Here's the
374 full list of value types for all the character properties that Emacs
375 knows about:
372 376
373 @table @code 377 @table @code
374 @item name 378 @item name
375 The character's canonical unique name. The value of the property is a 379 This property corresponds to the Unicode @code{Name} property. The
376 string consisting of upper-case Latin letters A to Z, digits, spaces, 380 value is a string consisting of upper-case Latin letters A to Z,
377 and hyphen @samp{-} characters. 381 digits, spaces, and hyphen @samp{-} characters.
378 382
379 @item general-category 383 @item general-category
380 This property assigns the character to one of the major classes, such 384 This property corresponds to the Unicode @code{General_Category}
381 as letters, punctuation, and symbols, and its important subclasses. 385 property. The value is a symbol whose name is a 2-letter abbreviation
382 The value is a symbol whose name is a 2-letter abbreviation. The 386 of the character's classification.
383 first letter specifies the character's major class and the second
384 letter designates a subclass of that major class.
385 387
386 @item canonical-combining-class 388 @item canonical-combining-class
387 This property classifies combining characters into several classes, 389 Corresponds to the Unicode @code{Canonical_Combining_Class} property.
388 depending on the details of their behavior in sequences of combining 390 The value is an integer number.
389 characters. The property's value is an integer number.
390 391
391 @item bidi-class 392 @item bidi-class
392 This property specifies character attributes required for correct 393 Corresponds to the Unicode @code{Bidi_Class} property. The value is a
393 display of @dfn{bidirectional text} used by right-to-left scripts, 394 symbol whose name is the Unicode @dfn{directional type} of the
394 such as Arabic and Hebrew. The value is a symbol whose name is the 395 character.
395 Unicode @dfn{directional type} of the character.
396 396
397 @item decomposition 397 @item decomposition
398 This property defines a mapping from a character to a sequence of one 398 Corresponds to the Unicode @code{Decomposition_Type} and
399 or more characters that is a canonical or compatibility equivalent to 399 @code{Decomposition_Value} properties. The value is a list, whose
400 it. The value is a list, whose first element may be a symbol 400 first element may be a symbol representing a compatibility formatting
401 representing a compatibility formatting tag, such as @code{<small>}; 401 tag, such as @code{small}@footnote{
402 the other elements are characters that give the compatibility 402 Note that Emacs strips the @samp{<..>} brackets from the corresponding
403 decomposition sequence. 403 Unicode tags; e.g., Unicode specifies @samp{<small>} where Emacs uses
404 @samp{small}.
405 }; the other elements are characters that give the compatibility
406 decomposition sequence of this character.
404 407
405 @item decimal-digit-value 408 @item decimal-digit-value
406 This property specifies a numeric value of characters that represent 409 Corresponds to the Unicode @code{Numeric_Value} property for
407 decimal digits. The value is an integer number. 410 characters whose @code{Numeric_Type} is @samp{Decimal}. The value is an
411 integer number.
408 412
409 @item digit 413 @item digit
410 This property specifies a numeric value of characters that represent 414 Corresponds to the Unicode @code{Numeric_Value} property for
411 digits, but not necessarily decimal. Examples include compatibility 415 characters whose @code{Numeric_Type} is @samp{Digit}. The value is
412 subscript and superscript digits. The value is an integer number. 416 an integer number. Examples of such characters include compatibility
417 subscript and superscript digits, for which the value is the
418 corresponding number.
413 419
414 @item numeric-value 420 @item numeric-value
415 This property specifies whether the character represents a number. 421 Corresponds to the Unicode @code{Numeric_Value} property for
416 Examples of characters that do include fractions, subscripts, 422 characters whose @code{Numeric_Type} is @samp{Numeric}. The value of
423 this property is an integer or a floating-point number. Examples of
424 characters that have this property include fractions, subscripts,
417 superscripts, Roman numerals, currency numerators, and encircled 425 superscripts, Roman numerals, currency numerators, and encircled
418 numbers. The value is a symbol whose name gives the numeric value; 426 numbers. For example, the value of this property for the character
419 for example, the value of this property for the character 427 @code{U+2155} (@sc{vulgar fraction one fifth}) is @code{0.2}.
420 @code{U+2155} (@sc{vulgar fraction one fifth}) is the symbol
421 @samp{1/5}.
422 428
423 @item mirrored 429 @item mirrored
424 This is a property of characters such as parentheses, which need to be 430 Corresponds to the Unicode @code{Bidi_Mirrored} property. The value
425 mirrored horizontally in right to left scripts. The value is a 431 of this property is a symbol, either @samp{Y} or @samp{N}.
426 symbol, either @samp{Y} or @samp{N}.
427 432
428 @item old-name 433 @item old-name
429 This property's value specifies the name, if any, of the character in 434 Corresponds to the Unicode @code{Unicode_1_Name} property. The value
430 the old version 1.0 of the Unicode Standard. The value is a string. 435 is a string.
431 436
432 @item iso-10646-comment 437 @item iso-10646-comment
433 This character's comment field from the ISO 10646 standard. The value 438 Corresponds to the Unicode @code{ISO_Comment} property. The value is
434 is a string, or @code{nil} if there's no comment. 439 a string.
435 440
436 @item uppercase 441 @item uppercase
437 If this character has an upper-case equivalent that is a single 442 Corresponds to the Unicode @code{Simple_Uppercase_Mapping} property.
438 character, then the value of this property is that upper-case 443 The value of this property is a single character.
439 equivalent. Otherwise, the value is @code{nil}.
440 444
441 @item lowercase 445 @item lowercase
442 If this character has an lower-case equivalent that is a single 446 Corresponds to the Unicode @code{Simple_Lowercase_Mapping} property.
443 character, then the value of this property is that lower-case 447 The value of this property is a single character.
444 equivalent. Otherwise, the value is @code{nil}.
445 448
446 @item titlecase 449 @item titlecase
450 Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property.
447 @dfn{Title case} is a special form of a character used when the first 451 @dfn{Title case} is a special form of a character used when the first
448 character of a word needs to be capitalized. If a character has a 452 character of a word needs to be capitalized. The value of this
449 title-case equivalent that is a single character, then the value of 453 property is a single character.
450 this property is that title-case equivalent. Otherwise, the value is
451 @code{nil}.
452 @end table 454 @end table
453 455
454 @defun get-char-code-property char propname 456 @defun get-char-code-property char propname
455 This function returns the value of @var{char}'s @var{propname} property. 457 This function returns the value of @var{char}'s @var{propname} property.
456 458
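As a rough illustration of the property names and value types listed in the table above, the following sketch collects a few properties of one character with get-char-code-property. The helper name my-describe-char-props is hypothetical and not part of Emacs; only get-char-code-property and the property symbols come from the text above.

```elisp
;; `my-describe-char-props' is a hypothetical helper for illustration only.
(defun my-describe-char-props (char)
  "Return an alist mapping a few property names to their values for CHAR.
A property that CHAR does not have shows up with the value nil."
  (mapcar (lambda (prop)
            (cons prop (get-char-code-property char prop)))
          '(name general-category numeric-value uppercase)))

;; Look at a lower-case Latin letter.
(my-describe-char-props ?a)

;; U+2155 VULGAR FRACTION ONE FIFTH: per the description of
;; `numeric-value' above, its value is the floating-point number 0.2.
(get-char-code-property #x2155 'numeric-value)
```

For ?a the alist should show a string for name, a symbol for general-category, nil for numeric-value, and a character for uppercase, matching the value types described in the table.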
791 several variants of ISO 2022. In some cases, Emacs supports several 793 several variants of ISO 2022. In some cases, Emacs supports several
792 alternative encodings for the same characters; for example, there are 794 alternative encodings for the same characters; for example, there are
793 three coding systems for the Cyrillic (Russian) alphabet: ISO, 795 three coding systems for the Cyrillic (Russian) alphabet: ISO,
794 Alternativnyj, and KOI8. 796 Alternativnyj, and KOI8.
795 797
796 @c I think this paragraph is no longer correct. 798 Every coding system specifies a particular set of character code
797 @ignore 799 conversions, but the coding system @code{undecided} is special: it
798 Most coding systems specify a particular character code for 800 leaves the choice unspecified, to be chosen heuristically for each
799 conversion, but some of them leave the choice unspecified---to be chosen 801 file, based on the file's data.
800 heuristically for each file, based on the data.
801 @end ignore
802 802
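The special role of undecided can be sketched as follows. This is only an illustration: the sample string and the variable name bytes are assumptions, and what the heuristics pick depends on the data and on the coding system priorities in effect, so no particular result is claimed.

```elisp
;; Assumption: a unibyte string of UTF-8 bytes stands in for a file's data.
(let ((bytes (encode-coding-string "na\u00efve caf\u00e9" 'utf-8)))
  ;; Ask Emacs which coding systems could have produced these bytes,
  ;; most likely candidate first.
  (message "candidates: %S" (detect-coding-string bytes))
  ;; Decode without naming a conversion: `undecided' leaves the choice
  ;; to the detection heuristics applied to the data itself.
  (decode-coding-string bytes 'undecided))
```

Naming a specific coding system such as utf-8 in place of undecided would fix the conversion instead of leaving it to detection.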
803 In general, a coding system doesn't guarantee roundtrip identity: 803 In general, a coding system doesn't guarantee roundtrip identity:
804 decoding a byte sequence using a coding system, then encoding the 804 decoding a byte sequence using a coding system, then encoding the
805 resulting text in the same coding system, can produce a different byte 805 resulting text in the same coding system, can produce a different byte
806 sequence. But some coding systems do guarantee that the byte sequence 806 sequence. But some coding systems do guarantee that the byte sequence