Network Working Group                                         J. Klensin
Request for Comments: 5137                                 February 2008
BCP: 137
Category: Best Current Practice
ASCII Escaping of Unicode Characters
Status of This Memo
This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements. Distribution of this memo is unlimited.
Abstract
There are a number of circumstances in which an escape mechanism is needed in conjunction with a protocol to encode characters that cannot be represented or transmitted directly. With ASCII coding, the traditional escape has been either the decimal or hexadecimal numeric value of the character, written in a variety of different ways. The move to Unicode, where characters occupy two or more octets and may be coded in several different forms, has further complicated the question of escapes. This document discusses some options now in use and discusses considerations for selecting one for use in new IETF protocols, and protocols that are now being internationalized.
Table of Contents
1. Introduction
   1.1. Context and Background
   1.2. Terminology
   1.3. Discussion List
2. Encodings that Represent Unicode Code Points: Code Position versus UTF-8 or UTF-16 Octets
3. Referring to Unicode Characters
4. Syntax for Code Point Escapes
5. Recommended Presentation Variants for Unicode Code Point Escapes
   5.1. Backslash-U with Delimiters
   5.2. XML and HTML
6. Forms that Are Normally Not Recommended
   6.1. The C Programming Language: Backslash-U
   6.2. Perl: A Hexadecimal String
   6.3. Java: Escaped UTF-16
7. Security Considerations
8. Acknowledgments
9. References
   9.1. Normative References
   9.2. Informative References
Appendix A. Formal Syntax for Forms Not Recommended
   A.1. The C Programming Language Form
   A.2. Perl Form
   A.3. Java Form
1. Introduction
1.1. Context and Background
There are a number of circumstances in which an escape mechanism is needed in conjunction with a protocol to encode characters that cannot be represented or transmitted directly. With ASCII [ASCII] coding, the traditional escape has been either the decimal or hexadecimal numeric value of the character, written in a variety of different ways. For example, in different contexts, we have seen %dNN or %NN for the decimal form, %NN, %xNN, X'nn', and %X'NN' for the hexadecimal form. "%NN" has become popular in recent years to represent a hexadecimal value without further qualification, perhaps as a consequence of its use in URLs and their prevalence. There are even some applications around in which octal forms are used and, while they do not generalize well, the MIME Quoted-Printable and Encoded-word forms can be thought of as yet another set of escapes. So, even for the fairly simple cases of ASCII and standards built by extending ASCII, such as the ISO 8859 family, we have been living with several different escaping forms, each the result of some history.
When one moves to Unicode [Unicode] [ISO10646], where characters occupy two or more octets and may be coded in several different forms, the question of escapes becomes even more complicated. Unicode represents characters as code points: numeric values from 0 to hex 10FFFF. When referencing code points in flowing text, they are represented using the so-called "U+" notation, as values from U+0000 to U+10FFFF. When serialized into octets, these code points can be represented in different forms:
o in UTF-8 with one to four octets [RFC3629]
o in UTF-16 with two or four octets (or one or two seizets -- 16-bit units)
o in UTF-32 with exactly four octets (or one 32-bit unit)
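As an illustration of the three serialized forms listed above, the following minimal Python sketch counts the octets each form needs for a single code point. The function name serialized_lengths is hypothetical and chosen only for this example; the big-endian codecs are used so that no byte order mark is counted.

```python
def serialized_lengths(code_point: int) -> dict:
    """Return the number of octets each encoding form needs for one code point."""
    ch = chr(code_point)
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),  # big-endian, no byte order mark
        "UTF-32": len(ch.encode("utf-32-be")),  # big-endian, no byte order mark
    }


# One ASCII character, two other BMP characters, and one character above U+FFFF.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", serialized_lengths(cp))
```

The code point itself is the same in every case; only the number of octets on the wire differs, which is why the choice of what to escape (code point or octets) matters.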
When escaping characters, we have seen fairly extensive use of hexadecimal representations of both the serialized forms and variations on the U+ notation, known as code point escapes.
In accordance with existing best-practices recommendations [RFC2277], new protocols that are required to carry textual content for human use SHOULD be designed in such a way that the full repertoire of Unicode characters may be represented in that text.
This document proposes that existing protocols being internationalized, and those that need an escape mechanism, SHOULD use some contextually appropriate variation on references to code points as described in Section 2 unless other considerations outweigh those described here.
This recommendation is not applicable to protocols that already accept native UTF-8 or some other encoding of Unicode. In general, when protocols are internationalized, it is preferable to accept those forms rather than using escapes. This recommendation applies to cases, including transition arrangements, in which that is not practical.
In addition to the protocol contexts addressed in this specification, escapes to represent Unicode characters also appear in presentations to users, i.e., in user interfaces (UI). The formats specified in, and the reasoning of, this document may be applicable in UI contexts as well, but this is not a proposal to standardize UI or presentation forms.
This document does not make general recommendations for processing Unicode strings or for their contents. It assumes that the strings that one might want to escape are valid and reasonable and that the definition of "valid and reasonable" is the province of other documents. Recommendations about general treatment of Unicode strings may be found in many places, including the Unicode Standard itself and the W3C Character Model [W3C-CharMod], as well as specific rules in individual protocols.
1.2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
Additional Unicode-specific terminology appears in [UnicodeGlossary], but is not necessary for understanding this specification.
1.3. Discussion List
Discussion of this document should be addressed to the discuss@apps.ietf.org mailing list.
2. Encodings that Represent Unicode Code Points: Code Position versus UTF-8 or UTF-16 Octets
There are two major families of ways to escape Unicode characters. One uses the code point in some representation (see the next section), the other encodes the octets of the UTF-8 encoding or some other encoding in some representation. Some other options are possible, but they have been rare in practice. This specification recommends that, in the absence of compelling reasons to do otherwise, the Unicode code points SHOULD be used rather than a representation of UTF-8 (or UTF-16) octets. There are several reasons for this, including:
o One reason for the success of many IETF protocols is that they use human-interpretable text forms to communicate, rather than encodings that generally require computer programs (or hand simulation of algorithms) to decode. This suggests that the presentation form should reference the Unicode tables for characters and do so as simply as possible.
o Because of the nature of UTF-8, for a human to interpret a decimal or hexadecimal numeral representation of UTF-8 octets requires one or more decoding steps to determine a Unicode code point that can be used to look up the character in a table. That may be appropriate in some cases where the goal is really to represent the UTF-8 form but, in general, it just obscures desired information and makes errors more likely and debugging harder.
o Except for characters in the ASCII subset of Unicode (U+0000 through U+007F), the code point form is generally more compact than forms based on coding UTF-8 octets, sometimes much more compact.
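The reasons listed above can be made concrete with a small Python sketch that escapes the same character both ways. The helper names are hypothetical. The code point escape uses the delimited form described in Section 5.1, and the octet escape uses the "%NN" hexadecimal style mentioned in the introduction; both choices are assumptions made only for the comparison.

```python
def code_point_escape(ch: str) -> str:
    """Escape one character by its Unicode code point (delimited backslash-u form)."""
    return f"\\u'{ord(ch):04X}'"


def utf8_octet_escape(ch: str) -> str:
    """Escape one character by the hexadecimal value of each of its UTF-8 octets."""
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


# EURO SIGN (U+20AC) and a character outside Plane 0 (U+1F600).
for ch in ("\u20ac", "\U0001F600"):
    print(code_point_escape(ch), "versus", utf8_octet_escape(ch))
```

The digits in the code point escape can be looked up in the Unicode tables directly, while the octet form has to be decoded from UTF-8 first.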
The same considerations that apply to representation of the octets of UTF-8 encoding also apply to more compact ACE encodings such as the "bootstring" encoding [RFC3492] with or without its "Punycode" profile.
Similar considerations apply to UTF-16 encoding, such as the \uNNNN form used in Java (see Section 6.3). While those forms are equivalent to code point references for the Basic Multilingual Plane (BMP, Plane 0), a two-stage decoding process is needed to handle surrogates to access higher planes.
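The second decoding stage mentioned above is the arithmetic that combines a high and a low surrogate into one code point. A minimal Python sketch follows; the function name combine_surrogates is hypothetical, while the constants are the surrogate ranges defined by UTF-16.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high and low surrogate into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of the value above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Stage one reads the two 16-bit values D83D and DE00 from the escapes;
# stage two combines them into the code point a reader can look up.
print(f"U+{combine_surrogates(0xD83D, 0xDE00):04X}")
```

A code point escape needs no such step: the digits written in the escape are already the value to look up.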
3. Referring to Unicode Characters
Regardless of what decisions are made about escapes for Unicode characters in protocol or similar contexts, text referring to a Unicode code point SHOULD use the U+NNNN[N[N]] syntax, as specified in the Unicode Standard, where the NNNN... string consists of hexadecimal digits. Text actually containing a Unicode character SHOULD use a syntax more suitable for automated processing.
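A short Python sketch of the U+NNNN[N[N]] notation follows, assuming uppercase hexadecimal digits and a minimum width of four. The function names are hypothetical and exist only for this illustration.

```python
import re

# "U+" followed by four to six hexadecimal digits.
_U_PLUS = re.compile(r"U\+([0-9A-F]{4,6})")


def format_u_plus(code_point: int) -> str:
    """Write a code point in U+ notation, padded to at least four digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    return f"U+{code_point:04X}"


def parse_u_plus(text: str) -> int:
    """Read a code point written in U+ notation."""
    match = _U_PLUS.fullmatch(text)
    if match is None or int(match.group(1), 16) > 0x10FFFF:
        raise ValueError(f"not a U+ reference: {text!r}")
    return int(match.group(1), 16)


print(format_u_plus(0x41), format_u_plus(0x1F600), format_u_plus(0x10FFFF))
```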
4. Syntax for Code Point Escapes
There are many options for code point escapes, some of which are summarized below. All are equivalent in content and semantics -- the differences lie in syntax. The best choice of syntax for a particular protocol or other application depends on that application: one form may simply "fit" better in a given context than others. It is clear, however, that hexadecimal values are preferable to other alternatives: Systems based on decimal or octal offsets SHOULD NOT be used.
Since this specification does not recommend one specific syntax, protocol specifications that use escapes MUST define the syntax they are using, including any necessary escapes to permit the escape sequence to be used literally.
The application designer selecting a format should consider at least the following factors:
o If similar or related protocols already use one form, it may be best to select that form for consistency and predictability.
o A Unicode code point can fall in the range from U+0000 to U+10FFFF. Different escape systems may use four, five, six, or eight hexadecimal digits. To avoid clever syntax tricks and the consequent risk of confusion and errors, forms that use explicit string delimiters are generally preferred over other alternatives. In many contexts, symmetric paired delimiters are easier to recognize and understand than visually unrelated ones.
o Syntax forms starting in "\u", without explicit delimiters, have been used in several different escape systems, including the four or eight digit syntax of C [ISO-C] (see Section 6.1), the UTF-16 encoding of Java [Java] (see Section 6.3), and some arrangements that may follow the "\u" with four, five, or six digits. The possible confusion about which option is actually being used may argue against use of any of these forms.
o Forms that require decoding surrogate pairs share most of the problems that appear with encoding of UTF-8 octets. Internet protocols SHOULD NOT use surrogate pairs.
5. Recommended Presentation Variants for Unicode Code Point Escapes
There are a number of different ways to represent a Unicode code point position. No one of them appears to be "best" for all contexts. In addition, when an escape is needed for the escape mechanism itself, the optimal one of those might differ from one context to another.
Some forms that are in popular use and that might reasonably be considered for use in a given protocol are described below and identified with a current-use context when feasible. The two in this section are recommended for use in Internet Protocols. Other popular ones appear in Section 6 with some discussion of their disadvantages.
5.1. Backslash-U with Delimiters
One of the recommended forms is a variation of the many forms that start in "\u" (see, e.g., Section 6.1, below), but uses explicit delimiters for the reasons discussed elsewhere.
Specifically, in ABNF [RFC5234],
   EmbeddedUnicodeChar =  %x5C.75.27 4*6HEXDIG %x27
      ; starting with lowercase "\u" and "'" and ending with "'".
      ; Note that the encodings are considered to be abstractions
      ; for the relevant characters, not designations of specific
      ; octets.

   HEXDIG =  "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8"
      / "9" / "A" / "B" / "C" / "D" / "E" / "F"
      ; effectively identical with definition in RFC 5234.
Protocol designers of applications using this form should specify a way to escape the introducing backslash ("\"), if needed. "\\" is one obvious possibility, but not the only one.
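The form above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, a literal backslash is escaped by doubling it (the "\\" possibility mentioned above), and every non-ASCII character is escaped.

```python
import re

# One token is either a doubled backslash or \u'NNNN' with 4 to 6 hex digits.
# ABNF literals are case-insensitive, so lowercase hex digits are accepted too.
_TOKEN = re.compile(r"\\\\|\\u'([0-9A-Fa-f]{4,6})'")


def escape_delimited(text: str) -> str:
    """Replace every non-ASCII character with a delimited code point escape."""
    out = []
    for ch in text:
        if ch == "\\":
            out.append("\\\\")  # assumed convention: a literal backslash is doubled
        elif ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append(f"\\u'{ord(ch):04X}'")
    return "".join(out)


def _decode_token(match: re.Match) -> str:
    digits = match.group(1)
    if digits is None:  # the token was a doubled backslash
        return "\\"
    code_point = int(digits, 16)
    if code_point > 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {digits}")
    return chr(code_point)


def unescape_delimited(text: str) -> str:
    """Invert escape_delimited()."""
    return _TOKEN.sub(_decode_token, text)


sample = "caf\u00e9 \U0001F600"
wire = escape_delimited(sample)
print(wire)
assert unescape_delimited(wire) == sample
```

Because the closing "'" marks the end of the digits, four-, five-, and six-digit values can be mixed freely without any lookahead rules.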
5.2. XML and HTML
The other recommended form is the one used in XML. It uses the form "&#xNNNN;". Like the Perl form (Section 6.2), this form has a clear ending delimiter, reducing ambiguity. HTML uses a similar form, but the semicolon may be omitted in some cases. If that is done, the advantages of the delimiter disappear so that the HTML form without the semicolon SHOULD NOT be used. However, this format is often considered ugly and awkward outside of its native HTML, XML, and similar contexts.
In ABNF:
   EmbeddedUnicodeChar =  %x26.23.78 2*6HEXDIG %x3B
      ; starts with "&#x" and ends with ";"
Note that a literal "&" can be expressed by "&amp;" when using this style.
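A minimal Python sketch of this XML-style form is shown below. The function names are hypothetical, and only the two constructs described in this section are handled (hexadecimal character references and "&amp;"); the full set of XML or HTML named entities is deliberately out of scope.

```python
import re

# One token is either "&amp;" or "&#x" + 2 to 6 hex digits + ";".
_XML_TOKEN = re.compile(r"&amp;|&#x([0-9A-Fa-f]{2,6});")


def to_char_refs(text: str) -> str:
    """Replace non-ASCII characters with hexadecimal character references."""
    out = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")  # keeps a literal ampersand unambiguous
        elif ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append(f"&#x{ord(ch):04X};")
    return "".join(out)


def _decode_ref(match: re.Match) -> str:
    digits = match.group(1)
    if digits is None:  # the token was "&amp;"
        return "&"
    code_point = int(digits, 16)
    if code_point > 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {digits}")
    return chr(code_point)


def from_char_refs(text: str) -> str:
    """Invert to_char_refs()."""
    return _XML_TOKEN.sub(_decode_ref, text)


print(to_char_refs("R&D caf\u00e9"))
```

The decoder requires the terminating semicolon, in line with the recommendation above that the form without it not be used.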
6. Forms that Are Normally Not Recommended
6.1. The C Programming Language: Backslash-U
The forms

   \UNNNNNNNN (for any Unicode character) and

   \uNNNN (for Unicode characters in plane 0)

are utilized in the C Programming Language [ISO-C] when an ASCII escape for embedded Unicode characters is needed.
There are disadvantages of this form that may be significant. First, the use of a case variation (between "u" for the four-digit form and "U" for the eight-digit form) may not seem natural in environments where uppercase and lowercase characters are generally considered equivalent and might be confusing to people who are not very familiar with Latin-based alphabets (although those people might have even more trouble reading relevant English text and explanations). Second, as discussed in Section 4, the very fact that there are several different conventions that start in \u or \U may become a source of confusion as people make incorrect assumptions about what they are looking at.
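The confusion described above can be demonstrated with a short Python sketch of the two C forms. The function names are hypothetical, and the sketch ignores the restrictions that the C standard places on which characters may be written this way.

```python
import re

# Exactly eight digits after uppercase \U, or exactly four after lowercase \u.
_C_ESCAPE = re.compile(r"\\U([0-9A-Fa-f]{8})|\\u([0-9A-Fa-f]{4})")


def c_escape(ch: str) -> str:
    """Escape one character in the C style, choosing the form by plane."""
    code_point = ord(ch)
    if code_point <= 0xFFFF:
        return f"\\u{code_point:04X}"
    return f"\\U{code_point:08X}"


def decode_c_escapes(text: str) -> str:
    """Replace every C-style escape in text with the character it names."""
    def replace(match: re.Match) -> str:
        return chr(int(match.group(1) or match.group(2), 16))
    return _C_ESCAPE.sub(replace, text)


print(c_escape("\u00e9"), c_escape("\U0001F600"))

# A writer who assumes that \u may take five digits gets a different result:
# the decoder reads U+1F60 and leaves the trailing "0" as ordinary text.
misread = decode_c_escapes("\\u1F600")
print(len(misread))
```

Without delimiters, the decoder has no way to notice that the writer intended U+1F600, which is the kind of incorrect assumption the paragraph above warns about.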
6.2. Perl: A Hexadecimal String
Perl uses the form \x{NNNN...}. The advantage of this form is that there are explicit delimiters, which resolves the issue of variable-length strings without relying on the case-change mechanism of the C form (Section 6.1) to distinguish between Plane 0 and more general forms. Some other programming languages would tend to favor X'NNNN...' forms for hexadecimal strings and perhaps U'NNNN...' for Unicode-specific strings, but those forms do not seem to be in use around the IETF.
Note that there is a possible ambiguity in how two-character or low-numbered sequences in this notation are understood, i.e., that octets in the range \x{00} through \x{FF} may be construed as being in the local character set, not as Unicode code points. Because of this apparent ambiguity, and because IETF documents do not contain provision for pragmas (see [PERLUniIntro] for more information about the "encoding" pragma in Perl and other details), the Perl forms should be used with extreme caution, if at all.
6.3. Java: Escaped UTF-16
Java [Java] uses the form \uNNNN, but as a reference to UTF-16 values, not to Unicode code points. While it uses a syntax similar to that described in Section 6.1, this relationship to UTF-16 makes it, in many respects, more similar to the encodings of UTF-8 discussed above than to an escape that designates Unicode code points. Note that the UTF-16 form, and hence, the Java escape notation, can represent characters outside Plane 0 (i.e., above U+FFFF) only by the use of surrogate pairs, raising some of the same issues as the use of UTF-8 octets discussed above. For characters in Plane 0, the Java form is indistinguishable from the Plane 0-only form described in Section 6.1. If only for that reason, it SHOULD NOT be used as an escape except in those Java contexts in which it is natural.
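To show why the Java form differs from a code point escape, the following Python sketch produces Java-style escapes for a string. The function name java_escape is hypothetical; the surrogate arithmetic is the standard UTF-16 computation.

```python
def java_escape(text: str) -> str:
    """Escape non-ASCII characters as UTF-16 code units in the Java style."""
    out = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 0x80:
            out.append(ch)
        elif code_point <= 0xFFFF:
            # Plane 0: the UTF-16 code unit equals the code point.
            out.append(f"\\u{code_point:04X}")
        else:
            # Above Plane 0: split the value into a high and a low surrogate.
            offset = code_point - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            out.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(out)


print(java_escape("\u00e9"))       # one escape, same digits as the code point
print(java_escape("\U0001F600"))   # two escapes, neither shows 1F600
```

For the second character, a reader sees two four-digit values and has to undo the surrogate encoding before looking the character up.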
7. Security Considerations
This document proposes a set of rules for encoding Unicode characters when other considerations do not apply. Since all of the recommended encodings are unambiguous and normalization issues are not involved, it should not introduce any security issues that are not present as a result of simple use of non-ASCII characters, no matter how they are encoded. The mechanisms suggested should slightly lower the risks of confusing users with encoded characters by making the identity of the characters being used somewhat more obvious than some of the alternatives.
An escape mechanism such as the one specified in this document can allow characters to be represented in more than one way. Where software interprets the escaped form, there is a risk that security checks, and any necessary checks for, e.g., minimal or normalized forms, are done at the wrong point.
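The risk of checking at the wrong point can be sketched in Python as follows. Everything here is hypothetical: the is_safe policy (reject any "<") stands in for whatever check an application really needs, and the unescape helper handles only the delimited form from Section 5.1.

```python
import re

_ESCAPE = re.compile(r"\\u'([0-9A-Fa-f]{4,6})'")


def unescape(text: str) -> str:
    """Decode delimited backslash-u escapes (sketch; no backslash doubling)."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def is_safe(text: str) -> bool:
    """Hypothetical policy check: the text must not contain '<'."""
    return "<" not in text


wire = "\\u'003C'script>"  # U+003C is "<", written as an escape

checked_too_early = is_safe(wire)           # inspects the escaped form
checked_after = is_safe(unescape(wire))     # inspects what will be used

print(checked_too_early, checked_after)
```

The same character has two spellings on the wire, so a check applied before the escapes are interpreted sees only one of them.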
8. Acknowledgments
This document was produced in response to a series of discussions within the IETF Applications Area and as part of work on email internationalization and internationalized domain name updates. It is a synthesis of a large number of discussions, the comments of the participants in which are gratefully acknowledged. The help of Mark Davis in constructing a list of alternative presentations and selecting among them was especially important.
Tim Bray, Peter Constable, Stephane Bortzmeyer, Chris Newman, Frank Ellermann, Clive D.W. Feather, Philip Guenther, Bjoern Hoehrmann, Simon Josefsson, Bill McQuillan, der Mouse, Phil Pennock, and Julian Reschke provided careful reading and some corrections and suggestions on the various working drafts that preceded this document. Taken together, their suggestions motivated the significant revision of this document and its recommendations between version -00 and version -01 and further improvements in the subsequent versions.
9. References
9.1. Normative References
[ISO10646] International Organization for Standardization, "Information Technology -- Universal Multiple-Octet Coded Character Set (UCS)", ISO/IEC 10646:2003, December 2003.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, November 2003.
[RFC5234] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, January 2008.
[Unicode] The Unicode Consortium, "The Unicode Standard, Version 5.0", 2006. (Addison-Wesley, 2006. ISBN 0-321-48091-0).
9.2. Informative References
[ASCII] American National Standards Institute (formerly United States of America Standards Institute), "USA Code for Information Interchange", ANSI X3.4-1968, 1968.
ANSI X3.4-1968 has been replaced by newer versions with slight modifications, but the 1968 version remains definitive for the Internet.
[ISO-C] International Organization for Standardization, "Information technology -- Programming languages -- C", ISO/IEC 9899:1999, 1999.
[Java] Sun Microsystems, Inc., "Java Language Specification, Third Edition", 2005, <http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#95413p>.
[PERLUniIntro] Hietaniemi, J., "perluniintro", Perl documentation 5.8.8, 2002, <http://perldoc.perl.org/perluniintro.html>.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP 18, RFC 2277, January 1998.
[RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)", RFC 3492, March 2003.
[UnicodeGlossary] The Unicode Consortium, "Glossary of Unicode Terms", June 2007, <http://www.unicode.org/glossary>.
[W3C-CharMod] Duerst, M., "Character Model for the World Wide Web 1.0", W3C Recommendation, February 2005, <http://www.w3.org/TR/charmod/>.
Appendix A. Formal Syntax for Forms Not Recommended
While the syntax for the escape forms that are not recommended above (see Section 6) is not given inline in the hope of discouraging their use, it is provided in this appendix in the hope that those who choose to use them will do so consistently. The reader is cautioned that some of these forms are not defined precisely in the original specifications and that others have evolved over time in ways that are not precisely consistent. Consequently, these definitions are not normative and may not even precisely match reasonable interpretations of their sources.
The definition of "HEXDIG" for the forms that follow appears in Section 5.1.
A.1. The C Programming Language Form
Specifically, in ABNF [RFC5234],
   EmbeddedUnicodeChar =  BMP-form / Full-form

   BMP-form =  %x5C.75 4HEXDIG ; starting with lowercase "\u"
      ; The encodings are considered to be abstractions for the
      ; relevant characters, not designations of specific octets.

   Full-form =  %x5C.55 8HEXDIG ; starting with uppercase "\U"
A.2. Perl Form
   EmbeddedUnicodeChar =  %x5C.78 "{" 2*6HEXDIG "}" ; starts with "\x"
A.3. Java Form
   EmbeddedUnicodeChar =  %x5C.75 4HEXDIG ; starts with "\u"
Author's Address
John C Klensin
1770 Massachusetts Ave, #322
Cambridge, MA 02140
USA

Phone: +1 617 245 1457
EMail: john-ietf@jck.com
Full Copyright Statement
Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.