comparison src/coding.c @ 24425:61c6b3be1d51

Comment for ISO 2022 encoding mechanism modified.
author Kenichi Handa <handa@m17n.org>
date Mon, 01 Mar 1999 11:52:54 +0000
parents 8b7ef7fb9e2e
children be35d27a4bfb
24424:520e8f39c1f8 24425:61c6b3be1d51
523 523
524 524
525 /*** 3. ISO2022 handlers ***/ 525 /*** 3. ISO2022 handlers ***/
526 526
527 /* The following note describes the coding system ISO2022 briefly. 527 /* The following note describes the coding system ISO2022 briefly.
528 Since the intention of this note is to help in understanding of 528 Since the intention of this note is to help understand the
529 the programs in this file, some parts are NOT ACCURATE or OVERLY 529 functions in this file, some parts are NOT ACCURATE or OVERLY
530 SIMPLIFIED. For the thorough understanding, please refer to the 530 SIMPLIFIED. For thorough understanding, please refer to the
531 original document of ISO2022. 531 original document of ISO2022.
532 532
533 ISO2022 provides many mechanisms to encode several character sets 533 ISO2022 provides many mechanisms to encode several character sets
534 in 7-bit and 8-bit environment. If one chooses 7-bite environment, 534 in 7-bit and 8-bit environments. For 7-bite environments, all text
535 all text is encoded by codes of less than 128. This may make the 535 is encoded using bytes less than 128. This may make the encoded
536 encoded text a little bit longer, but the text gets more stability 536 text a little bit longer, but the text passes more easily through
537 to pass through several gateways (some of them strip off the MSB). 537 several gateways, some of which strip off MSB (Most Signigant Bit).
538 538
539 There are two kinds of character set: control character set and 539 There are two kinds of character sets: control character set and
540 graphic character set. The former contains control characters such 540 graphic character set. The former contains control characters such
541 as `newline' and `escape' to provide control functions (control 541 as `newline' and `escape' to provide control functions (control
542 functions are provided also by escape sequences). The latter 542 functions are also provided by escape sequences). The latter
543 contains graphic characters such as ' A' and '-'. Emacs recognizes 543 contains graphic characters such as 'A' and '-'. Emacs recognizes
544 two control character sets and many graphic character sets. 544 two control character sets and many graphic character sets.
545 545
546 Graphic character sets are classified into one of the following 546 Graphic character sets are classified into one of the following
547 four classes, DIMENSION1_CHARS94, DIMENSION1_CHARS96, 547 four classes, according to the number of bytes (DIMENSION) and
548 DIMENSION2_CHARS94, DIMENSION2_CHARS96 according to the number of 548 number of characters in one dimension (CHARS) of the set:
549 bytes (DIMENSION) and the number of characters in one dimension 549 - DIMENSION1_CHARS94
550 (CHARS) of the set. In addition, each character set is assigned an 550 - DIMENSION1_CHARS96
551 identification tag (called "final character" and denoted as <F> 551 - DIMENSION2_CHARS94
552 here after) which is unique in each class. <F> of each character 552 - DIMENSION2_CHARS96
553 set is decided by ECMA(*) when it is registered in ISO. Code range 553
554 of <F> is 0x30..0x7F (0x30..0x3F are for private use only). 554 In addition, each character set is assigned an identification tag,
555 unique for each set, called "final character" (denoted as <F>
556 hereafter). The <F> of each character set is decided by ECMA(*)
557 when it is registered in ISO. The code range of <F> is 0x30..0x7F
558 (0x30..0x3F are for private use only).
555 559
556 Note (*): ECMA = European Computer Manufacturers Association 560 Note (*): ECMA = European Computer Manufacturers Association
557 561
558 Here are examples of graphic character set [NAME(<F>)]: 562 Here are examples of graphic character set [NAME(<F>)]:
559 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... 563 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ...
560 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... 564 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ...
561 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... 565 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ...
562 o DIMENSION2_CHARS96 -- none for the moment 566 o DIMENSION2_CHARS96 -- none for the moment
563 567
564 A code area (1byte=8bits) is divided into 4 areas, C0, GL, C1, and GR. 568 A code area (1 byte=8 bits) is divided into 4 areas, C0, GL, C1, and GR.
565 C0 [0x00..0x1F] -- control character plane 0 569 C0 [0x00..0x1F] -- control character plane 0
566 GL [0x20..0x7F] -- graphic character plane 0 570 GL [0x20..0x7F] -- graphic character plane 0
567 C1 [0x80..0x9F] -- control character plane 1 571 C1 [0x80..0x9F] -- control character plane 1
568 GR [0xA0..0xFF] -- graphic character plane 1 572 GR [0xA0..0xFF] -- graphic character plane 1
569 573
570 A control character set is directly designated and invoked to C0 or 574 A control character set is directly designated and invoked to C0 or
571 C1 by an escape sequence. The most common case is that ISO646's 575 C1 by an escape sequence. The most common case is that:
572 control character set is designated/invoked to C0 and ISO6429's 576 - ISO646's control character set is designated/invoked to C0, and
573 control character set is designated/invoked to C1, and usually 577 - ISO6429's control character set is designated/invoked to C1,
574 these designations/invocations are omitted in a coded text. With 578 and usually these designations/invocations are omitted in encoded
575 7-bit environment, only C0 can be used, and a control character for 579 text. In a 7-bit environment, only C0 can be used, and a control
576 C1 is encoded by an appropriate escape sequence to fit in the 580 character for C1 is encoded by an appropriate escape sequence to
577 environment. All control characters for C1 are defined the 581 fit into the environment. All control characters for C1 are
578 corresponding escape sequences. 582 defined to have corresponding escape sequences.
579 583
580 A graphic character set is at first designated to one of four 584 A graphic character set is at first designated to one of four
581 graphic registers (G0 through G3), then these graphic registers are 585 graphic registers (G0 through G3), then these graphic registers are
582 invoked to GL or GR. These designations and invocations can be 586 invoked to GL or GR. These designations and invocations can be
583 done independently. The most common case is that G0 is invoked to 587 done independently. The most common case is that G0 is invoked to
584 GL, G1 is invoked to GR, and ASCII is designated to G0, and usually 588 GL, G1 is invoked to GR, and ASCII is designated to G0. Usually
585 these invocations and designations are omitted in a coded text. 589 these invocations and designations are omitted in encoded text.
586 With 7-bit environment, only GL can be used. 590 In a 7-bit environment, only GL can be used.
587 591
588 When a graphic character set of CHARS94 is invoked to GL, code 0x20 592 When a graphic character set of CHARS94 is invoked to GL, codes
589 and 0x7F of GL area work as control characters SPACE and DEL 593 0x20 and 0x7F of the GL area work as control characters SPACE and
590 respectively, and code 0xA0 and 0xFF of GR area should not be used. 594 DEL respectively, and codes 0xA0 and 0xFF of the GR area should not
595 be used.
591 596
592 There are two ways of invocation: locking-shift and single-shift. 597 There are two ways of invocation: locking-shift and single-shift.
593 With locking-shift, the invocation lasts until the next different 598 With locking-shift, the invocation lasts until the next different
594 invocation, whereas with single-shift, the invocation works only 599 invocation, whereas with single-shift, the invocation affects the
595 for the following character and doesn't affect locking-shift. 600 following character only and doesn't affect the locking-shift
596 Invocations are done by the following control characters or escape 601 state. Invocations are done by the following control characters or
597 sequences. 602 escape sequences:
598 603
599 ---------------------------------------------------------------------- 604 ----------------------------------------------------------------------
600 function control char escape sequence description 605 abbrev function cntrl escape seq description
601 ---------------------------------------------------------------------- 606 ----------------------------------------------------------------------
602 SI (shift-in) 0x0F none invoke G0 to GL 607 SI/LS0 (shift-in) 0x0F none invoke G0 into GL
603 SO (shift-out) 0x0E none invoke G1 to GL 608 SO/LS1 (shift-out) 0x0E none invoke G1 into GL
604 LS2 (locking-shift-2) none ESC 'n' invoke G2 into GL 609 LS2 (locking-shift-2) none ESC 'n' invoke G2 into GL
605 LS3 (locking-shift-3) none ESC 'o' invoke G3 into GL 610 LS3 (locking-shift-3) none ESC 'o' invoke G3 into GL
606 SS2 (single-shift-2) 0x8E ESC 'N' invoke G2 into GL 611 LS1R (locking-shift-1 right) none ESC '~' invoke G1 into GR (*)
607 SS3 (single-shift-3) 0x8F ESC 'O' invoke G3 into GL 612 LS2R (locking-shift-2 right) none ESC '}' invoke G2 into GR (*)
613 LS3R (locking-shift 3 right) none ESC '|' invoke G3 into GR (*)
614 SS2 (single-shift-2) 0x8E ESC 'N' invoke G2 for one char
615 SS3 (single-shift-3) 0x8F ESC 'O' invoke G3 for one char
608 ---------------------------------------------------------------------- 616 ----------------------------------------------------------------------
609 The first four are for locking-shift. Control characters for these 617 (*) These are not used by any known coding system.
610 functions are defined by macros ISO_CODE_XXX in `coding.h'. 618
611 619 Control characters for these functions are defined by macros
612 Designations are done by the following escape sequences. 620 ISO_CODE_XXX in `coding.h'.
621
622 Designations are done by the following escape sequences:
613 ---------------------------------------------------------------------- 623 ----------------------------------------------------------------------
614 escape sequence description 624 escape sequence description
615 ---------------------------------------------------------------------- 625 ----------------------------------------------------------------------
616 ESC '(' <F> designate DIMENSION1_CHARS94<F> to G0 626 ESC '(' <F> designate DIMENSION1_CHARS94<F> to G0
617 ESC ')' <F> designate DIMENSION1_CHARS94<F> to G1 627 ESC ')' <F> designate DIMENSION1_CHARS94<F> to G1
630 ESC '$' '.' <F> designate DIMENSION2_CHARS96<F> to G2 640 ESC '$' '.' <F> designate DIMENSION2_CHARS96<F> to G2
631 ESC '$' '/' <F> designate DIMENSION2_CHARS96<F> to G3 641 ESC '$' '/' <F> designate DIMENSION2_CHARS96<F> to G3
632 ---------------------------------------------------------------------- 642 ----------------------------------------------------------------------
633 643
634 In this list, "DIMENSION1_CHARS94<F>" means a graphic character set 644 In this list, "DIMENSION1_CHARS94<F>" means a graphic character set
635 of dimension 1, chars 94, and final character <F>, and etc. 645 of dimension 1, chars 94, and final character <F>, etc...
636 646
637 Note (*): Although these designations are not allowed in ISO2022, 647 Note (*): Although these designations are not allowed in ISO2022,
638 Emacs accepts them on decoding, and produces them on encoding 648 Emacs accepts them on decoding, and produces them on encoding
639 CHARS96 character set in a coding system which is characterized as 649 CHARS96 character sets in a coding system which is characterized as
640 7-bit environment, non-locking-shift, and non-single-shift. 650 7-bit environment, non-locking-shift, and non-single-shift.
641 651
642 Note (**): If <F> is '@', 'A', or 'B', the intermediate character 652 Note (**): If <F> is '@', 'A', or 'B', the intermediate character
643 '(' can be omitted. We call this as "short-form" here after. 653 '(' can be omitted. We refer to this as "short-form" hereafter.
644 654
645 Now you may notice that there are a lot of ways for encoding the 655 Now you may notice that there are a lot of ways for encoding the
646 same multilingual text in ISO2022. Actually, there exists many 656 same multilingual text in ISO2022. Actually, there exist many
647 coding systems such as Compound Text (used in X's inter client 657 coding systems such as Compound Text (used in X11's inter client
648 communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR 658 communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR
649 (used in Korean Internet), EUC (Extended UNIX Code, used in Asian 659 (used in Korean internet), EUC (Extended UNIX Code, used in Asian
650 localized platforms), and all of these are variants of ISO2022. 660 localized platforms), and all of these are variants of ISO2022.
651 661
652 In addition to the above, Emacs handles two more kinds of escape 662 In addition to the above, Emacs handles two more kinds of escape
653 sequences: ISO6429's direction specification and Emacs' private 663 sequences: ISO6429's direction specification and Emacs' private
654 sequence for specifying character composition. 664 sequence for specifying character composition.
655 665
656 ISO6429's direction specification takes the following format: 666 ISO6429's direction specification takes the following form:
657 o CSI ']' -- end of the current direction 667 o CSI ']' -- end of the current direction
658 o CSI '0' ']' -- end of the current direction 668 o CSI '0' ']' -- end of the current direction
659 o CSI '1' ']' -- start of left-to-right text 669 o CSI '1' ']' -- start of left-to-right text
660 o CSI '2' ']' -- start of right-to-left text 670 o CSI '2' ']' -- start of right-to-left text
661 The control character CSI (0x9B: control sequence introducer) is 671 The control character CSI (0x9B: control sequence introducer) is
662 abbreviated to the escape sequence ESC '[' in 7-bit environment. 672 abbreviated to the escape sequence ESC '[' in a 7-bit environment.
663 673
664 Character composition specification takes the following format: 674 Character composition specification takes the following form:
665 o ESC '0' -- start character composition 675 o ESC '0' -- start character composition
666 o ESC '1' -- end character composition 676 o ESC '1' -- end character composition
667 Since these are not standard escape sequences of any ISO, the use 677 Since these are not standard escape sequences of any ISO standard,
668 of them for these meaning is restricted to Emacs only. */ 678 the use of them for these meaning is restricted to Emacs only. */
669 679
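As a reading aid for the note above, here is a minimal sketch in C of two of the mechanisms it describes: splitting a byte into the C0/GL/C1/GR code areas, and recognizing a designation escape sequence. All names here (`code_area`, `classify_code_area`, `struct designation`, `parse_designation`) are hypothetical and are not the identifiers used in coding.c. The diff shows only part of the designation table; the sketch assumes the remaining rows follow the usual pattern, with intermediates '(' ')' '*' '+' selecting a CHARS94 set for G0..G3 and ',' '-' '.' '/' selecting a CHARS96 set for G0..G3. The range check on <F> uses 0x30..0x7F as stated in the note.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not taken from coding.c.  */
enum code_area { AREA_C0, AREA_GL, AREA_C1, AREA_GR };

/* Map one byte to the code area it falls in:
   C0 [0x00..0x1F], GL [0x20..0x7F], C1 [0x80..0x9F], GR [0xA0..0xFF].  */
enum code_area
classify_code_area (unsigned char c)
{
  if (c < 0x20)
    return AREA_C0;
  if (c < 0x80)
    return AREA_GL;
  if (c < 0xA0)
    return AREA_C1;
  return AREA_GR;
}

struct designation
{
  int reg;              /* graphic register, 0..3 for G0..G3 */
  int dimension;        /* 1 or 2 bytes per character */
  int chars;            /* 94 or 96 characters per dimension */
  unsigned char final;  /* the final character <F> */
};

/* Try to parse a designation starting at SEQ (LEN bytes available).
   Return the number of bytes consumed, or 0 if SEQ does not start
   with a designation this sketch understands.  */
size_t
parse_designation (const unsigned char *seq, size_t len,
                   struct designation *out)
{
  size_t i = 1;
  int dimension = 1;
  unsigned char inter;

  assert (seq != NULL && out != NULL);
  if (len < 3 || seq[0] != 0x1B)        /* every form needs ESC + 2 bytes */
    return 0;

  if (seq[i] == '$')                    /* '$' marks a DIMENSION2 set */
    {
      dimension = 2;
      i++;
      /* Short form: ESC '$' <F> with <F> in '@', 'A', 'B' omits '('.  */
      if (seq[i] == '@' || seq[i] == 'A' || seq[i] == 'B')
        {
          out->reg = 0;
          out->dimension = 2;
          out->chars = 94;
          out->final = seq[i];
          return i + 1;
        }
    }

  if (i + 1 >= len)                     /* need intermediate and <F> */
    return 0;
  inter = seq[i++];
  if (inter < '(' || inter > '/')       /* not a designation intermediate */
    return 0;
  if (seq[i] < 0x30 || seq[i] > 0x7F)   /* <F> range as given in the note */
    return 0;

  out->reg = (inter - '(') & 3;         /* '(' or ',' -> G0, ..., '+' or '/' -> G3 */
  out->chars = (inter <= '+') ? 94 : 96;
  out->dimension = dimension;
  out->final = seq[i];
  return i + 1;
}
```

For example, passing the three bytes ESC '$' 'B' (the short form used for JISX0208) should fill in register G0, dimension 2, chars 94, final 'B'. The sketch does not model invocation (locking or single shifts), and it does not distinguish the designations marked (*) in the table, which the note says ISO2022 itself does not allow.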
670 enum iso_code_class_type iso_code_class[256]; 680 enum iso_code_class_type iso_code_class[256];
671 681
672 #define CHARSET_OK(idx, charset) \ 682 #define CHARSET_OK(idx, charset) \
673 (coding_system_table[idx] \ 683 (coding_system_table[idx] \