Network Working Group                                         J. Klensin
Request for Comments: 5242
Category: Informational                                    H. Alvestrand
                                                                  Google
                                                            1 April 2008
A Generalized Unified Character Code: Western European and CJK Sections
Status of This Memo
This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.
IESG Note
This is not an IETF document. Readers should be aware of RFC 4690, "Review and Recommendations for Internationalized Domain Names (IDNs)", and its references.
This document is not a candidate for any level of Internet Standard. The IETF disclaims any knowledge of the fitness of this document for any purpose, and in particular notes that it has not had IETF review for such things as security, congestion control, or inappropriate interaction with deployed protocols. The RFC Editor has chosen to publish this document at its discretion. Readers of this document should exercise caution in evaluating its value for implementation and deployment.
Abstract
Many issues have been identified with the use of general-purpose character sets for internationalized domain names and similar purposes. This memo describes a fully unified coded character set for scripts based on Latin, Greek, Cyrillic, and Chinese (CJK) characters. It is not a complete specification of that character set.
Table of Contents
1.  Introduction
  1.1.  Terminology
  1.2.  Discussion
2.  Types of Characters
  2.1.  Base Character
  2.2.  Nonspacing Marks
  2.3.  Case Indicators
  2.4.  Joining Indicators
  2.5.  Character-Matrix Positioning Indicators
  2.6.  Position Shaping Controls
  2.7.  Repetition Indicators
  2.8.  Control Characters
3.  Code Assignment Groupings
4.  Canonical Form
5.  Examples of Graphic Element Codes
6.  Composite Characters and Unicode Equivalences
7.  Ideographic Characters
8.  IANA Considerations
9.  Security Considerations
10. Acknowledgments
11. References
  11.1.  Normative References
  11.2.  Informative References
1. Introduction
Many issues have been identified with the use of general-purpose character sets for internationalized domain names and similar purposes. This memo specifies a fully unified coded character set for scripts based on Latin, Greek, Cyrillic, and Chinese characters.
There are four important principles in this work:
1. If it looks alike, it is alike. The number of base characters and marks should be minimized. Glyphs are more important than character abstractions.
2. If it is the same thing, it is the same thing. Two symbols that have the same semantic meaning in all contexts should be encoded in a way that allows their identity to be discovered by removing modifiers, rather than having to resort to external equivalence tables.
3. For simplicity, when a character form can be evaluated on the basis of either serif or sanserif fonts, the sanserif font is always preferred.
4. The use of combining characters and modifiers is preferred to adding more base characters.
Based on these principles, it becomes obvious that:
o Ligatures, digraphs, and final forms are constructed with special modifiers so that relationships to basic forms are obvious.
o Symbols consisting of multiple marks are always constructed from combining characters and positional modifiers; thus, the "i" character is constructed from the vertical line symbol followed by a combining dot above. Similarly "f" is composed of a centered vertical line, a right hook in the top position, and an appropriately-positioned composing hyphen.
This document draws strongly from the design and terminology of Unicode [Unicode] but represents a radically different approach.
1.1. Terminology
All special-use terms in this document, including descriptions of behaviors and related relationships, are used with their common-sense meanings.
1.2. Discussion
Questions to, and contributions for, this coding system should be addressed to the mailing list unified-ccs@xn--iwem3b1f.xn--90ase1a.bogus.domain.name.
2. Types of Characters
This document defines several types of characters. Note that these definitions are not the same as the Unicode definitions for similar or identical terms.
2.1. Base Character
Any character that is used as an atomic shape, rather than being assembled from such a character in combination with combining (overstriking) marks, symbols, or specially-designed base characters. When used alone, base characters always take up space. For example, a, c, l,...
2.2. Nonspacing Marks
Marks, symbols, and character components that are used to form characters when used in combination with base characters. They do not occupy separate character positions when displayed.
For example, the special combining symbols LeftUpperHook and RightLowerHook, described in Section 5, are nonspacing marks.
2.3. Case Indicators
In scripts with case, only the lower-case characters are base characters. Upper-case forms are represented by using the UC modifier. So the traditional "A" character is represented by "a<UC>". Note that this means that case-independent comparisons are made simply by ignoring the <UC> modifiers rather than by complicated mapping operations.
The initial set of case modifiers consists exclusively of:
UC Upper-case, code value 1 (hexadecimal)
The code values two through four are reserved for the impending encoding of scripts with more than two cases; five is reserved for expansion in case a script with more than four cases is identified.
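As an illustration only, the comparison rule described in this section might be sketched in Python as follows. The function names and the representation of a coded string as a list of integer code values are assumptions made for this sketch and are not defined by this specification. The values 40A and 40C are those listed for LATIN SMALL LETTER A and LATIN SMALL LETTER C in the table of graphic element codes.

```python
# Hypothetical sketch: a coded string is modelled as a list of integer
# code values. UC is the upper-case modifier, code value 1 (hexadecimal).
UC = 0x001


def strip_case(seq):
    """Return a copy of seq with every <UC> modifier removed."""
    return [code for code in seq if code != UC]


def equal_ignoring_case(a, b):
    """Case-independent comparison: ignore <UC>, then compare directly."""
    return strip_case(a) == strip_case(b)


SMALL_A = [0x40A]        # "a"
CAPITAL_A = [0x40A, UC]  # "a<UC>", the traditional "A"
CAPITAL_C = [0x40C, UC]  # "c<UC>", the traditional "C"

print(equal_ignoring_case(SMALL_A, CAPITAL_A))    # True
print(equal_ignoring_case(CAPITAL_A, CAPITAL_C))  # False
```

No mapping table is consulted at any point, which is the property the text above claims for this encoding. The sketch ignores only UC; how the reserved code values two through five would be treated is not stated and is therefore left out.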
2.4. Joining Indicators
Zero-width joiners are used to build characters, not only to separate or join words. As compared to Unicode, a richer set of joiners is used to distinguish between the inter-word and ligature-forming (including half-character forming) cases. Unicode ZWJ and ZWNJ are supplemented by ZWCJ, OJ, and ONJ. ZWCJ is used to modify a spacing basic character into a nonspacing role. For example, there is no "w" character, but only "u<ZWCJ>u". Upper-case "W" is coded as u<ZWCJ>u<UC> -- the ZWCJ binds more tightly than the UC modifier.
The initial set of joining indicators consists exclusively of:
ZWCJ Character joiner (also known as "ligature joiner"), code value 6 (hexadecimal).
OJ Overlay joiner (permits use of a subsequent character that would normally be spacing as nonspacing), code value 7 (hexadecimal).
ONJ Overlay non-joiner (turns a nonspacing mark into a standalone character), code value 8 (hexadecimal). This joiner should not be necessary, and is normally prohibited by the "shortest string" rule. But there may be unanticipated cases.
ZWJ Zero-width joiner for words or word-like constructions, code value 9 (hexadecimal).
ZWNJ Zero-width non-joiner for words or word-like constructions, code value A (hexadecimal).
2.5. Character-Matrix Positioning Indicators
Many characters are defined by constructed glyphs using nonspacing marks. For example, the characters "b" and "d" are coded as o<VerticalLine><PositionLeft> and o<VerticalLine><PositionRight>, respectively. The Catalan ligature that has caused some difficulties in Internationalizing Domain Names in Applications (IDNA) [RFC3490] is coded as l<ZWCJ><.><PositionVMiddle><ZWCJ>l
The initial table of positioning indicators is:
+-------------------+-----------+
| Name              | Hex value |
+-------------------+-----------+
| PositionLeft      | 20        |
| PositionCenter    | 21        |
| PositionRight     | 22        |
| PositionTop       | 30        |
| PositionVMiddle   | 31        |
| PositionBottom    | 32        |
| PositionDescender | 33        |
+-------------------+-----------+
2.6. Position Shaping Controls
These controls designate character form changes for initial or final-form characters. Where the distinction is important, medial-form characters are the default when no qualification occurs. As with case comparisons, comparisons are performed by ignoring these control functions.
+-------------+-----------+
| Name        | Hex value |
+-------------+-----------+
| InitialForm | 71        |
| FinalForm   | 72        |
+-------------+-----------+
2.7. Repetition Indicators
For compactness of coding, two repetition indicators are introduced for double (Repeat2) and triple (Repeat3) characters that may be treated as ligatures or special cases. Two consecutive uses of a character compare equal to the character followed by <Repeat2>. The interpretation of u<ZWCJ>u<Repeat3> is left as an exercise for the reader.
The initial table of repetition indicators is:
+---------+-----------+
| Name    | Hex value |
+---------+-----------+
| Repeat2 | 50        |
| Repeat3 | 51        |
| Repeat1 | 52        |
+---------+-----------+
For larger repeats, these repeats can be combined; the sequence <Repeat2><Repeat3> represents six repeats, while the <Repeat3><Repeat2> represents five repeats. Following the "shortest string" principle (see Section 4), Repeat1 must not ever appear except in combination with Repeat2 and/or Repeat3. The generation of other numbers is left as an exercise for the reader.
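The simplest case of the comparison rule in this section, a single Repeat2 or Repeat3 following a single code value, might be sketched as below. This is a hedged illustration, not a definitive implementation: the function name and the list-of-integers representation are assumptions, the sketch assumes an indicator applies only to the code value immediately before it, and it deliberately refuses Repeat1 and combined indicators, whose arithmetic the text leaves to the reader.

```python
# Hypothetical sketch of the single-indicator case only.
REPEAT2, REPEAT3, REPEAT1 = 0x50, 0x51, 0x52
COUNTS = {REPEAT2: 2, REPEAT3: 3}
INDICATORS = {REPEAT1, REPEAT2, REPEAT3}


def expand_simple_repeats(seq):
    """Expand <Repeat2>/<Repeat3> so sequences can be compared directly."""
    out = []
    i = 0
    while i < len(seq):
        code = seq[i]
        if code in INDICATORS:
            raise ValueError("repeat indicator with no preceding character")
        following = seq[i + 1:i + 3]
        if following and following[0] in INDICATORS:
            combined = len(following) == 2 and following[1] in INDICATORS
            if following[0] == REPEAT1 or combined:
                raise ValueError(
                    "Repeat1 and combined indicators are not handled here")
            out.extend([code] * COUNTS[following[0]])
            i += 2
        else:
            out.append(code)
            i += 1
    return out


# Two consecutive uses compare equal to the character followed by <Repeat2>.
doubled = expand_simple_repeats([0x40A, 0x40A])
with_indicator = expand_simple_repeats([0x40A, REPEAT2])
print(doubled == with_indicator)  # True
```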
2.8. Control Characters
Because it is intended primarily for domain names, this specification has no provision for control or spacing characters.
3. Code Assignment Groupings
Following the reasoning used in Unicode [Unicode], every character occupies exactly 23 bits (conventionally stored as three octets, with the leading bit always zero). This value is chosen because both 3 and 23 are prime numbers, unlike 42.
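A minimal sketch of the three-octet storage described above is shown here. The text does not state an octet order, so big-endian order is an assumption of this sketch, as are the function names.

```python
# Hypothetical sketch: 23-bit code values stored as three octets whose
# leading bit is always zero. Big-endian order is assumed, not specified.
MAX_CODE = (1 << 23) - 1


def to_octets(code):
    """Store one code value as three octets."""
    if not 0 <= code <= MAX_CODE:
        raise ValueError("code value does not fit in 23 bits")
    return code.to_bytes(3, "big")


def from_octets(data):
    """Recover a code value from three octets, checking the leading bit."""
    if len(data) != 3 or data[0] & 0x80:
        raise ValueError("expected three octets with a zero leading bit")
    return int.from_bytes(data, "big")


print(to_octets(0x40A))                # b'\x00\x04\n'
print(from_octets(to_octets(0x40A)))   # 1034
```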
The code point value zero is permanently reserved and will not be used unless it is necessary to expand the code space.
Code values between 1 and 255 (decimal) are reserved for the special character formation codes described in Section 2.3 through Section 2.7.
Code values between 256 and 511 (decimal) are reserved for character formation marks for non-ideographic characters. Most, but not all, of these are nonspacing (combining) characters.
Code values between 512 and 1023 are reserved on general principles and in case it is necessary to invent new rules and make them retroactive.
Code values of 1024 and above are to be allocated for characters, glyphs, and other character elements.
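The groupings above can be summarized in a short sketch. The function name and the returned labels are hypothetical, and the sketch assumes that each "between" range includes both of its endpoints.

```python
# Hypothetical sketch: map a code value to the grouping it falls in.
# Range endpoints are assumed to be inclusive.
def grouping(code):
    if code < 0 or code >= 1 << 23:
        raise ValueError("code value does not fit in 23 bits")
    if code == 0:
        return "permanently reserved"
    if code <= 255:
        return "special character formation code"
    if code <= 511:
        return "character formation mark"
    if code <= 1023:
        return "reserved on general principles"
    return "character, glyph, or other character element"


print(grouping(0x001))  # special character formation code (UC)
print(grouping(0x102))  # character formation mark (VERTICAL LINE)
print(grouping(0x40A))  # character, glyph, or other character element
```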
4. Canonical Form
When glyphs are constructed using the mechanisms described here, there is a single canonical form for representing any given glyph. There are no exceptions to that form, and any sequence of characters and qualifiers that is not consistent with the form is invalid. If there are two possible ways to represent a given character, the shorter one (in octet count) is the only permitted form. If there are two possible ways that are of the same length, the only permitted form is the one that has the smaller value when the numeric values of all of the octets in each are summed.
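The selection rule in the paragraph above might be sketched as follows. The function names are hypothetical, big-endian three-octet storage is assumed, and the sketch takes for granted that the caller has already established that all candidates represent the same glyph; it makes no attempt to check that. The candidate sequences in the example are made up for illustration and are not claimed to be valid encodings.

```python
# Hypothetical sketch: choose the permitted form among candidate
# encodings that are assumed to represent the same glyph.
def octets(seq):
    """Serialize a list of code values, three octets per value."""
    return b"".join(code.to_bytes(3, "big") for code in seq)


def canonical(candidates):
    """Shortest octet string wins; ties go to the smaller octet sum."""
    return min(candidates, key=lambda seq: (len(octets(seq)), sum(octets(seq))))


# The shorter candidate is preferred.
print(canonical([[0x40A, 0x001, 0x008], [0x40A, 0x001]]))  # [1034, 1]

# Equal length: octet sums are 36 and 35, so the second candidate wins.
print(canonical([[0x102, 0x021], [0x102, 0x020]]))         # [258, 32]
```

If two candidates also have equal octet sums, min() returns the first one encountered; the text does not say what should happen in that case.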
The ordering rules are as follows:
1. A base character or composite character (see below) must come first.
2. The base character may be followed by ZWCJ or OJ, but not both, followed by a base or nonspacing character or mark.
3. If ZWCJ appears, the next character must be a base character or nonspacing mark.
4. If OJ appears, the next character must be a base character, since the function of OJ is to make a spacing base character into a nonspacing (overlay) character.
5. That character can be followed by positional qualifiers that apply to it. Vertical positional qualifiers precede horizontal positional qualifiers.
6. That sequence of characters may be followed by a case qualifier.
7. That entire sequence of characters forms a composite character. When the composite character is non-trivial, the rules may be applied to it recursively. If grouping is needed to distinguish between one composite character and the next, ZWNCJ may be used at the beginning of a composite character to identify a group boundary.
5. Examples of Graphic Element Codes
The initial lists of positioning and combining controls appear above. This section shows codes for some base characters. Names in upper case are the Unicode names for the characters. These are followed, for information, by the Unicode code point designations. The code point list is informative, not normative, and may not be complete (especially since additional matching code points may be added to Unicode over time). Note that several Unicode characters that are considered different by Unicode are assigned the same code sequence in the system specified here.
+------------------------+-------+----------------------------------+
| Name                   | Hex   | Comment                          |
|                        | value |                                  |
+------------------------+-------+----------------------------------+
| FULL STOP (U+002E)     | 110   | Used as both base character (in  |
|                        |       | bottom center position) and as   |
|                        |       | movable dot with OJ and          |
|                        |       | positional qualifiers.           |
| HYPHEN-MINUS (U+002D)  | 108   | Used as a spacing base character |
|                        |       | (in horizontally and vertically  |
|                        |       | centered position) and as a      |
|                        |       | movable half-width horizontal    |
|                        |       | line with OJ and positional      |
|                        |       | qualifiers. In the context of    |
|                        |       | this specification, should be    |
|                        |       | known as Half Horizontal Line.   |
| LOW LINE (U+005F)      | 109   | Used as a spacing base character |
|                        |       | (in bottom position) and as a    |
|                        |       | movable full-width horizontal    |
|                        |       | line with OJ and positional      |
|                        |       | qualifiers. In the context of    |
|                        |       | this specification, should be    |
|                        |       | known as Horizontal Line.        |
| VERTICAL LINE (U+007C) | 102   | As with the horizontal lines,    |
|                        |       | normally a spacing base          |
|                        |       | character (in the middle         |
|                        |       | position between left and        |
|                        |       | right), but can be used as a     |
|                        |       | right to left movable            |
|                        |       | full-height vertical line with   |
|                        |       | OJ and/or positional qualifiers. |
| HalfHeightVerticalLine | 105   | Similar to VERTICAL LINE, but    |
|                        |       | only half height.                |
| SOLIDUS (U+002F)       | 103   | Used only for character          |
|                        |       | formation; forward slash         |
| REVERSE SOLIDUS        | 104   | Used only for character          |
| (U+005C)               |       | formation; reverse slash         |
| RightUpperHook         | 131   | Used only for character          |
|                        |       | formation; nonspacing mark.      |
| LeftUpperHook          | 132   | Used only for character          |
|                        |       | formation; nonspacing mark.      |
| LeftLowerHook          | 133   | Used only for character          |
|                        |       | formation; nonspacing mark.      |
| RightLowerHook         | 134   | Used only for character          |
|                        |       | formation; nonspacing mark.      |
| HalfHeightHoop         | 140   | Used only for character          |
|                        |       | formation; nonspacing mark.      |
| HalfHeightInvertedHoop | 141   | Used only for character          |
|                        |       | formation; nonspacing mark.      |
| DIGIT ZERO (U+0030)    | 400   |                                  |
| DIGIT ONE (U+0031)     | 401   |                                  |
| DIGIT TWO (U+0032)     | 402   |                                  |
| DIGIT NINE (U+0039)    | 409   |                                  |
| LATIN SMALL LETTER A   | 40A   |                                  |
| (U+0061)               |       |                                  |
| LATIN SMALL LETTER O   | 418   | Unify with Greek Omicron         |
| (U+006F, U+03BF)       |       |                                  |
| LATIN SMALL LETTER C   | 40C   | Unifying C with Cyrillic ES      |
| (U+0063, U+0441)       |       |                                  |
| GREEK SMALL LETTER     | 491   |                                  |
| SIGMA (U+03C3)         |       |                                  |
+------------------------+-------+----------------------------------+
6. Composite Characters and Unicode Equivalences
This section provides examples of characters that are derived from or based on others, known as "composite characters".
+------------------+--------------+---------------------------------+
| Name             | Hex value    | Comment                         |
+------------------+--------------+---------------------------------+
| LATIN SMALL      | 418 007 102  |                                 |
| LETTER B         | 020          |                                 |
| (U+0062)         |              |                                 |
| LATIN SMALL      | 418 007 102  |                                 |
| LETTER D         | 022          |                                 |
| (U+0064)         |              |                                 |
| LATIN SMALL      | 40C 007 108  |                                 |
| LETTER E         | 031          |                                 |
| (U+0065)         |              |                                 |
| LATIN SMALL      | 40A 006 40C  |                                 |
| LETTER AE        | 007 108 031  |                                 |
| (U+00E6)         |              |                                 |
| LATIN SMALL      | 102 131 030  | Note that 007 is not needed     |
| LETTER F         | 007 108      | before 131 because hooks are    |
| (U+0066)         |              | exclusively nonspacing          |
|                  |              | (combining).                    |
| LATIN SMALL      | 102 020 141  |                                 |
| LETTER H         | 021 032      |                                 |
| (U+0068)         |              |                                 |
| LATIN SMALL      | 105 007 110  |                                 |
| LETTER I         | 021 030      |                                 |
| (U+0069)         |              |                                 |
| LATIN SMALL      | 105 020 141  |                                 |
| LETTER N         | 021 032      |                                 |
| (U+006E)         |              |                                 |
| LATIN SMALL      | 418 007 102  | Unified P, Greek Rho, Cyrillic  |
| LETTER P         | 033 020 033  | ER                              |
| (U+0070, U+03C1, |              |                                 |
| U+0440)          |              |                                 |
| LATIN CAPITAL    | 40A 001      |                                 |
| LETTER A         |              |                                 |
| (U+0041)         |              |                                 |
| LATIN CAPITAL    | 418 007 102  |                                 |
| LETTER B         | 020 001      |                                 |
| (U+0042)         |              |                                 |
| LATIN CAPITAL    | 40C 001      |                                 |
| LETTER C         |              |                                 |
| (U+0043)         |              |                                 |
| LATIN CAPITAL    | 418 007 102  |                                 |
| LETTER D         | 022 001      |                                 |
| (U+0044)         |              |                                 |
| GREEK SMALL      | 491 072      |                                 |
| LETTER FINAL     |              |                                 |
| SIGMA (U+03C2)   |              |                                 |
+------------------+--------------+---------------------------------+
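To make the hex sequences in the table easier to read, the following sketch parses a sequence and names each element. The function names and the NAMES dictionary are assumptions of this sketch; the dictionary holds only a handful of the code values given earlier in this document and is not a complete registry.

```python
# Hypothetical sketch: name the elements of a composite character.
# NAMES is a small excerpt of the code values listed in this document.
NAMES = {
    0x001: "UC",
    0x006: "ZWCJ",
    0x007: "OJ",
    0x020: "PositionLeft",
    0x022: "PositionRight",
    0x031: "PositionVMiddle",
    0x102: "VERTICAL LINE",
    0x108: "HYPHEN-MINUS",
    0x40A: "LATIN SMALL LETTER A",
    0x40C: "LATIN SMALL LETTER C",
    0x418: "LATIN SMALL LETTER O",
}


def parse_sequence(text):
    """Turn a table entry such as '418 007 102 020' into code values."""
    return [int(token, 16) for token in text.split()]


def describe(text):
    """Name each element of a sequence, flagging values not in NAMES."""
    return [NAMES.get(code, f"unknown {code:03X}")
            for code in parse_sequence(text)]


# LATIN SMALL LETTER B from the table above.
print(describe("418 007 102 020"))
# ['LATIN SMALL LETTER O', 'OJ', 'VERTICAL LINE', 'PositionLeft']
```

Read this way, the entry for "b" is an "o" overlaid with a vertical line in the left position, and the entry for capital "B" is the same sequence followed by 001 (UC).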
7. Ideographic Characters
Because of the traditional model of forming characters using selected radicals and strokes in combination, Han-derived ("CJK") characters are even more naturally represented, with less ambiguity, in the system specified here than European ones. The mechanisms used in this specification and represented in the tables (see Section 8) are similar to those described as "Radicals" and "Strokes" in Section 5.1 and in Section 5.2 ("Ideographic Description Characters") of The Unicode Standard [Unicode]. Of course, following the same principles outlined above for European characters, only radicals, strokes, and description controls would be treated as base characters; no distinct compound precomposed ideographic characters are registered.
8. IANA Considerations
IANA is requested to keep the actual registry of characters and code tables. The registry entries consist of a character name (preferably matching the Unicode character name when one is available), the code sequence used to represent the character and optional descriptive information. The characters and codes identified in Section 2, Section 5, and Section 6 above should be used to initialize the table. Since the coding system is user-extensible, registrations should be accepted for new characters as long as they don't look like old ones. A designated expert with a background in calligraphy or abstract art, and considerable experience in evaluating claims about the count of angels on heads of pins, should be selected to advise IANA on "looks like".
9. Security Considerations
The representation of characters in this format should be a significant boon for security. It eliminates many possibilities of phishing attacks, since Principle 1 prevents the existence of two characters that look alike but are different.
By detaching the encoding of characters for domain names from the encoding of characters for other purposes, it also guarantees that reasonable-looking names will have been encoded by competent entities, thereby providing a significant degree of safety by obscurity.
Because of the method by which upper-case forms are encoded and because similarity is sometimes in the mind of the beholder, this specification will not completely eliminate opportunities for visual confusion. For example, because the lower-case characters are quite different, LATIN CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA will never compare equal, even though they look alike.
10. Acknowledgments
The authors would like to acknowledge the many contributions of J.F.C. Morphin for pointing out the inadequacies of trying to address the challenges of internationalization within the context of existing engineering principles. His comments and related ones, in combination with issues encountered in trying to internationalize domain names based on Unicode, have contributed greatly to the frame of mind underlying large parts of the proposal documented here. The theoretical framework for this coding system is based, in part, on Unicode and its collection of names and sample glyphs but represents a very different approach to the coding system itself.
11. References
11.1. Normative References
[Unicode] The Unicode Consortium, "The Unicode Standard, Version 5.0", 2007. Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0
11.2. Informative References
[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003.
Authors' Addresses
John C Klensin
1770 Massachusetts Ave, #322
Cambridge, MA 02140
USA

Phone: +1 617 491 5735
EMail: john+ietf@jck.com
Harald Tveit Alvestrand
Google
Beddingen 10
Trondheim, 7014
Norway

EMail: harald@alvestrand.no
Full Copyright Statement
Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions contained in BCP 78 and at http://www.rfc-editor.org/copyright.html, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.