Network Working Group                                         J. Klensin
Request for Comments: 5198                                  M. Padlipsky
Obsoletes: 698                                                March 2008
Updates: 854
Category: Standards Track
Unicode Format for Network Interchange
Status of This Memo
This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.
Abstract
The Internet today is in need of a standardized form for the transmission of internationalized "text" information, paralleling the specifications for the use of ASCII that date from the early days of the ARPANET. This document specifies that format, using UTF-8 with normalization and specific line-ending sequences.
Table of Contents
   1.  Introduction
     1.1.  Requirement for a Standardized Text Stream Format
     1.2.  Terminology
   2.  Net-Unicode Definition
   3.  Normalization
   4.  Versions of Unicode
   5.  Applicability and Stability of this Specification
     5.1.  Use in IETF Applications Specifications
     5.2.  Unicode Versions and Applicability
   6.  Security Considerations
   7.  Acknowledgments
   Appendix A.  History and Context
   Appendix B.  The ASCII NVT Definition
   Appendix C.  The Line-Ending Problem
   Appendix D.  A Note about Related Future Work
   References
     Normative References
     Informative References
Historically, Internet protocols have been largely ASCII-based and references to "text" in protocols have assumed ASCII text and specifically text in Network Virtual Terminal ("NVT") or "Network ASCII" form (see Appendix A and Appendix B). Protocols and formats that have moved beyond ASCII have included arrangements to specifically identify the character set and often the language being used.
In our more internationalized world, "text" clearly no longer equates unambiguously to "network ASCII". Fortunately, however, we are converging on Unicode [Unicode] [ISO10646] as a single international interchange character coding and no longer need to deal with per-script standards for character sets (e.g., one standard for each of Arabic, Cyrillic, Devanagari, etc., or even standards keyed to languages that are usually considered to share a script, such as French, German, or Swedish). Unfortunately, though, while it is certainly time to define a Unicode-based text type for use as a common text interchange format, "use Unicode" involves even more ambiguity than "use ASCII" did decades ago.
Unicode identifies each character by an integer, called its "code point", in the range 0-0x10ffff. These integers can be encoded into byte sequences for transmission in at least three standard and generally-recognized encoding forms, all of which are completely defined in The Unicode Standard and the documents cited below:
o UTF-8 [RFC3629] defines a variable-length encoding that may be applied uniformly to all code points.
o UTF-16 [RFC2781] encodes the range of Unicode characters whose code points are less than 65536 straightforwardly as 16-bit integers, and provides a "surrogate" mechanism for encoding larger code points in 32 bits.
o UTF-32 (also known as UCS-4) simply encodes each code point as a 32-bit integer.
Older forms and nomenclature, such as the 16-bit UCS-2, are now strongly discouraged.
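The difference between the three encoding forms can be seen by encoding a few code points and counting bytes. The following is a minimal sketch using Python's built-in codecs; the sample code points and the helper name `encoded_lengths` are illustrative choices, not part of this specification.

```python
# Sample code points: 'A', a-with-grave, the euro sign, and one above 0xFFFF.
SAMPLES = [0x41, 0xE0, 0x20AC, 0x1F600]


def encoded_lengths(code_point: int) -> dict:
    """Return the number of bytes each encoding form needs for one code point."""
    ch = chr(code_point)
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-be" codecs fix big-endian order and emit no byte order mark.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


for cp in SAMPLES:
    print(f"U+{cp:04X}", encoded_lengths(cp))
```

The code point above 0xFFFF takes four bytes in UTF-16 because it is carried as a surrogate pair, while UTF-32 always uses four bytes.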
As with ASCII, any of these forms may be used with different line-ending conventions. That flexibility can be an additional source of confusion with, e.g., index (offset) references into documents based on character counts.
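The offset confusion mentioned here is easy to reproduce. The sketch below (the sample text and the helper name `char_offset` are hypothetical) locates the same word in two copies of a text that differ only in line-ending convention.

```python
def char_offset(text: str, needle: str) -> int:
    """Return the character offset of the first occurrence of needle."""
    return text.index(needle)


lf_text = "first\nsecond\nthird"
crlf_text = lf_text.replace("\n", "\r\n")

print(char_offset(lf_text, "third"))    # 13
print(char_offset(crlf_text, "third"))  # 15
```

Each preceding line ending adds one character under CRLF, so an index computed against one form points at the wrong place in the other.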
This document proposes to establish "Net-Unicode" as a new standardized text transmission form for the Internet, to serve as an internationalized alternative for NVT ASCII when specified in new -- and, where appropriate, updated -- protocols. UTF-8 [RFC3629] is chosen for the coding because it has good compatibility properties with ASCII and for other reasons discussed in the existing IETF character set policy [RFC2277]. "Net-Unicode" is specified in Section 2; the subsequent sections of the document provide background and explanation.
Whenever there is a choice, Unicode SHOULD be used with the text encoding specified here. This combination is preferred to the double-byte encoding of "extended ASCII" [RFC0698] or the assorted per-language or per-country character coding systems.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
The Network Unicode format (Net-Unicode) is defined as follows. Parts of this definition are deliberately informal, providing guidance for specific profiles or rules in the protocols that reference this one rather than firm rules that apply globally.
1. Characters MUST be encoded in UTF-8 as defined in [RFC3629].
2. If the protocol has the concept of "lines", line-endings MUST be indicated by the sequence Carriage-Return (CR, U+000D) followed by Line-Feed (LF, U+000A), often known just as CRLF. CR SHOULD NOT appear except when followed by LF. The only other context in which CR is permitted is the combination CR NUL, which is not recommended (see the note at the end of this section).
3. The control characters in the ASCII range (U+0000 to U+001F and U+007F to U+009F) SHOULD generally be avoided. Space (SP, U+0020), CR, LF, and Form Feed (FF, U+000C) are exceptions to this principle, but use of all but the first requires care as discussed elsewhere in this document. The so-called "C1 Controls" (U+0080 through U+009F), which did not appear in ASCII, MUST NOT appear.
FF should be used only with caution: it does not have a standard and universal interpretation and, in particular, if its use assumes a page length, such assumptions may not be appropriate in international contexts (e.g., considering 8.5x11 inch paper versus A4). Other control characters are used to affect display format, control devices, or to structure files. None of those uses is appropriate for streams of plain text.
4. Before transmission, all character sequences SHOULD be normalized according to Unicode normalization form "NFC" (see Section 3).
5. As suggested in Section 6 of RFC 3629, the Byte Order Mark ("BOM") signature MUST NOT appear at the beginning of these text strings.
6. Systems conforming to this specification MUST NOT transmit any string containing any code point that is unassigned in the version of Unicode on which they are dependent. The version of NFC and the version of Unicode used by that system MUST be consistent.
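Several of the numbered rules can be checked mechanically. The following is a minimal sketch, not a conformance test: the function name `net_unicode_problems` and its message strings are hypothetical, "unassigned" is judged against whatever Unicode version the Python interpreter bundles, and `unicodedata.is_normalized` assumes Python 3.8 or later. Line endings (rule 2) are left out here.

```python
import unicodedata

# CR, LF, and FF are the control characters listed as exceptions in rule 3.
ALLOWED_C0 = {"\r", "\n", "\f"}


def net_unicode_problems(data: bytes) -> list:
    """Return findings for rules 1 and 3-6; an empty list means none were found."""
    try:
        text = data.decode("utf-8")  # rule 1: strict decoding rejects invalid UTF-8
    except UnicodeDecodeError as exc:
        return [f"rule 1: not valid UTF-8 ({exc.reason})"]

    problems = []
    if text.startswith("\ufeff"):
        problems.append("rule 5: BOM at start of string")
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            problems.append(f"rule 3: C1 control U+{cp:04X} (MUST NOT appear)")
        elif (cp < 0x20 or cp == 0x7F) and ch not in ALLOWED_C0:
            problems.append(f"rule 3: control U+{cp:04X} (SHOULD be avoided)")
        elif unicodedata.category(ch) == "Cn":
            # "Cn" is the general category of code points with no assigned character.
            problems.append(f"rule 6: unassigned code point U+{cp:04X}")
    if not unicodedata.is_normalized("NFC", text):
        problems.append("rule 4: text is not in NFC")
    return problems


print(net_unicode_problems("plain text\r\n".encode("utf-8")))
print(net_unicode_problems("a\u0300 with NEL\u0085".encode("utf-8")))
```

A sender might treat the MUST NOT findings as fatal and the SHOULD findings as warnings. That policy is a design choice for the referencing protocol rather than something this sketch decides.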
The use of LF without CR is questionable; see Appendix B for more discussion. The newer control characters IND (U+0084) and NEL ("Next Line", U+0085) might have been used to disambiguate the various line-ending situations, but, because their use has not been established on the Internet, because many protocols require CRLF, and because IND and NEL fall within the "C1 Controls" group (see below), they MUST NOT be used. Similar observations apply to the yet newer line and paragraph separators at U+2028 and U+2029 and any future characters that might be defined to serve these functions. For this specification and protocols that depend on it, lines end in CRLF and only in CRLF. Anything that does not end in CRLF is either not a line or is severely malformed.
The NVT specification contained a number of additional provisions, e.g., for the optional use of backspacing and "bare CR" (sent as CR NUL) to generate overstruck character sequences. The much greater number of precomposed characters in Unicode, the availability of combining characters, and the growing use of markup conventions of various types to show, e.g., emphasis (rather than attempting to do that via the use of special characters), should make such sequences largely unnecessary. These sequences SHOULD be avoided if at all possible. However, because they were optional in NVT applications and this specification is an NVT superset, they cannot be prohibited entirely. The most important of these rules is that CR MUST NOT appear unless it is immediately followed by LF (indicating end of line) or NUL. Because NUL (an octet whose value is all zeros, i.e., %x00 in the notation of [RFC5234]) is hostile to programming languages that use that character as a string delimiter, the CR NUL sequence SHOULD be avoided for that reason as well.
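The line-ending rules discussed in this section lend themselves to a simple scan. The sketch below is one possible reading of them: it reports CR not followed by LF or NUL, LF not preceded by CR (which the text calls questionable rather than forbidden), and the alternative separators IND, NEL, U+2028, and U+2029. The pattern names and the function `line_ending_findings` are hypothetical.

```python
import re

# CR is acceptable only when immediately followed by LF or NUL.
BARE_CR = re.compile(r"\r(?![\n\x00])")
# LF without a preceding CR is flagged as questionable.
BARE_LF = re.compile(r"(?<!\r)\n")
# IND, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR.
OTHER_SEPARATORS = re.compile("[\u0084\u0085\u2028\u2029]")


def line_ending_findings(text: str) -> dict:
    """Return the character offsets of each kind of line-ending finding."""
    return {
        "bare_cr": [m.start() for m in BARE_CR.finditer(text)],
        "bare_lf": [m.start() for m in BARE_LF.finditer(text)],
        "other_separators": [m.start() for m in OTHER_SEPARATORS.finditer(text)],
    }


print(line_ending_findings("ok line\r\n"))
print(line_ending_findings("a\rb\nc\u2028d"))
```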
There are cases where strings of Unicode are fundamentally equivalent, essentially representing the same text. These are called "canonical equivalents" in the Unicode Standard. For example, the following pairs of strings are canonically equivalent:
      U+2126 OHM SIGN
      U+03A9 GREEK CAPITAL LETTER OMEGA

      U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT
      U+00E0 LATIN SMALL LETTER A WITH GRAVE
Comparison of strings becomes much easier if any such cases are always represented by a single unique form. The Unicode Consortium specifies a normalization form, known as NFC [NFC], which provides the necessary mappings and mechanisms to convert all canonically equivalent sequences to a single unique form. Typically, this form produces precomposed characters for any sequences that can be represented in that fashion. It also reorders other combining marks so that they have a unique and unambiguous order.
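The two pairs above, and the reordering of combining marks, can be checked with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `nfc` is a local shorthand and the third test string is an illustrative assumption.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text converted to Unicode normalization form NFC."""
    return unicodedata.normalize("NFC", text)


pairs = [
    ("\u2126", "\u03a9"),    # OHM SIGN vs GREEK CAPITAL LETTER OMEGA
    ("a\u0300", "\u00e0"),   # a + COMBINING GRAVE ACCENT vs precomposed a-grave
]
for left, right in pairs:
    # The raw strings differ, but their NFC forms are identical.
    print(left == right, nfc(left) == nfc(right))  # False True

# Combining marks attached above and below the base letter are put into a
# single canonical order, so both input orders normalize to the same string.
print(nfc("a\u0301\u0323") == nfc("a\u0323\u0301"))  # True
```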
Of the various normalization forms defined as part of Unicode, NFC is closest to actual use in practice, minimizes side-effects due to considering characters equivalent that may not be equivalent in all situations, and typically requires the least work when converting from non-Unicode encodings.
The section above requires that, except in very unusual circumstances, all Net-Unicode strings be transmitted in normalized form. Because some application implementations rely on operating system libraries over which they have little control, and in keeping with the robustness principle, receivers of such strings should be prepared to receive unnormalized ones and should not react to them in excessive ways.
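One way a receiver might follow this advice is to normalize what arrives rather than reject it. The sketch below assumes Python 3.8 or later for `unicodedata.is_normalized`; the function name `receive_text` and its return shape are hypothetical.

```python
import unicodedata


def receive_text(data: bytes) -> tuple:
    """Decode received bytes and return (nfc_text, was_already_normalized)."""
    text = data.decode("utf-8")
    already = unicodedata.is_normalized("NFC", text)
    if not already:
        # Accept the unnormalized input quietly and normalize it locally.
        text = unicodedata.normalize("NFC", text)
    return text, already


print(receive_text("a\u0300".encode("utf-8")))
print(receive_text("\u00e0".encode("utf-8")))
```

Returning the flag lets the caller log or count unnormalized input without treating it as an error.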
Unicode changes and expands over time. Large blocks of space are reserved for future expansion. New versions, which appear at regular intervals, add new scripts and characters. Occasionally they also change some property definitions. In retrospect, one of the advantages of ASCII [ASCII] when it was chosen was that the code space was full when the Standard was first published. There was no practical way to add characters or change code point assignments without being obviously incompatible.
While there are some security issues if people deliberately try to trick the system (see Section 6), Unicode version changes should not have a significant impact on the text stream specification of this document for the following reasons:
o The transformation between Unicode code table positions and the corresponding UTF-8 code is algorithmic; it does not depend on whether a code point has been assigned or not.
o The normalization recommended here, NFC (see Section 3), performs a very limited set of mappings, much more limited than those of the more extensive NFKC used in, e.g., Nameprep [RFC3491].
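The first point above can be illustrated by writing the UTF-8 transformation out by hand. The sketch below follows the bit layout defined in RFC 3629 and consults no character property tables, so it treats assigned and unassigned code points alike. The function name `utf8_encode` is a local choice, and the result is compared with Python's built-in codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using only arithmetic on its integer value."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xE0, 0x20AC, 0x1F600, 0x10FFFF):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(), utf8_encode(cp) == chr(cp).encode("utf-8"))
```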
The NFC tables may be updated over time as new characters are added, but the Unicode Consortium has guaranteed the stability of all NFC strings. That is, if a string does not contain any unassigned characters, and it is normalized according to NFC, it will always be normalized according to all future versions of the Unicode Standard. The stability of the Net-Unicode format is thus guaranteed when any implementation that converts text into Net-Unicode format does not permit unassigned characters.
Because Unicode code points that are reserved for private use do not have standard definitions or normalization interpretations, they SHOULD be avoided in strings intended for Internet interchange.
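Private-use code points can be located by their general category, "Co". The sketch below is illustrative only; the function name `private_use_positions` is hypothetical.

```python
import unicodedata


def private_use_positions(text: str) -> list:
    """Return (index, code point label) for every private-use character in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


print(private_use_positions("logo: \ue000"))  # [(6, 'U+E000')]
```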
Were Unicode to be changed in a way that violated these assumptions, i.e., that either invalidated the byte string order specified in RFC 3629 or that changed the stability of NFC as stated above, this specification would not apply. Put differently, this specification applies only to versions of Unicode starting with version 5.0 and extending to, but not including, any version for which changes are made in either the UTF-8 definition or to NFC stability. Such changes would violate established Unicode policies and are hence unlikely, but, should they occur, it would be necessary to evaluate them for compatibility with this specification and other Internet uses of NFC.
If the specification of a protocol references this one, strings that are received by that protocol and that appear to be UTF-8 and are not otherwise identified (e.g., by charset labeling) SHOULD be treated as using UTF-8 in conformance with this specification.
During the development of this specification, there was some confusion about where it would be useful given that, e.g., the individual MIME media types used in email and with HTTP have their own rules about UTF-8 character types and normalization, and the application transport protocols impose their own conventions about line endings. There are three answers. The first is that, in retrospect, it would have been better to have those protocols and content types standardized in the way specified here, even though it is certainly too late to change them at this time. The second is that we have several protocols that are dependent on either the original Telnet design or other arrangements requiring a standard, interoperable, string definition without specific content-labels of one sort or another. Whois [RFC3912] is an example member of this group. As consideration is given to upgrading them for non-ASCII use, this specification provides a normative reference that provides the same stability that NVT has provided the ASCII forms. This specification is intended for use by other specifications that have not yet defined how to use Unicode. Having a preferred standard Internet definition for Unicode text streams -- rather than just one for transmission codings -- may help improve the specification and interoperability of protocols to be developed in the future. This specification is not intended for use with specifications that already allow the use of UTF-8 and precisely define that use.
The IETF faces a practical dilemma with regard to versions of Unicode. Each new version brings with it new characters and sometimes new combining characters. Version 5.0 introduces the new concept of sequences of characters named as if they were individual characters (see [NamedSequences]). The normalization represented by NFC is stable if all strings are transmitted and stored in normalized form, if corrections are never made to character definitions or normalization tables, and if unassigned code points are never used. The latter is important because an unassigned code point always normalizes to itself. However, if the same code point is assigned to a character in a future version, it may participate in some other normalization mapping (some specific difficulties in this regard are discussed in [RFC4690]). It is worth noting that transmission in normalized form is not required by either the IETF's UTF-8 Standard [RFC3629] or by standards dependent on the current version of Stringprep [RFC3454].
All would be well with this as described in Section 4 except for one problem: Applications typically do not perform their own conversions to Unicode and may not perform their own normalizations but instead rely on operating system or language library functions -- functions that may be upgraded or otherwise changed without changes to the application code itself. Consequently, there may be no plausible way for an application to know which version of Unicode, or which version of the normalization procedures, it is utilizing, nor is there any way by which it can guarantee that the two will be consistent.
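Python happens to expose the version of the Unicode Character Database that its `unicodedata` module was built with, which makes the dependency described here visible. In the sketch below, the helper name `unicode_version_tuple` is hypothetical. The value comes from the interpreter build, so upgrading the interpreter can change it without any change to application code.

```python
import sys
import unicodedata


def unicode_version_tuple() -> tuple:
    """Return the Unicode data version bundled with this interpreter as integers."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


print("Python runtime:", sys.version.split()[0])
print("Unicode data version:", unicodedata.unidata_version)
print("At least Unicode 5.0:", unicode_version_tuple() >= (5, 0, 0))
```

Many other environments offer no comparable query, which is the situation the paragraph above describes.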
Because of per-version changes in definitions and tables, Stringprep and documents depending on it are now tied to Unicode Version 3.2 [Unicode32], and full interoperability of Internet Standard UTF-8 [RFC3629], when used with normalization as specified here, is dependent on normalization definitions and the definition of UTF-8 itself not changing after Unicode Version 5.0. These assumptions seem fairly safe, but they are still assumptions. Rather than being linked to the latest available version of Unicode, version 5.0 [Unicode], or to broader concepts of version independence based on specific assumptions and conditions, this specification could reasonably have been tied, like Stringprep and Nameprep, to Unicode 3.2 [Unicode32] or some more recent intermediate version. However, in addition to the obvious disadvantages of having different IETF standards tied to different versions of Unicode, the library-based application implementation behavior described above makes these version linkages nearly meaningless in practice.
In theory, one can get around this problem in four ways:
1. Freeze on a particular version of Unicode and try to insist that applications enforce that version by, e.g., containing lists of unassigned characters and prohibiting their use. Of course, this would prohibit evolution to include newly-added scripts and the tables of unassigned code points would be cumbersome.
2. Require that every Unicode "text" string or file start with a version indication, somewhat akin to the "byte order mark" indicator. It is unlikely that this provision would be practical. More important, it would require that each application implementation be prepared to either support multiple normalization tables and versions or that it reject text from Unicode versions with which it was not prepared to deal.
3. Devise a different set of normalization rules that would, e.g., guarantee that no character assigned to a previously-unassigned code point in Unicode was ever normalized to anything but itself and use those rules instead of NFC. It is not clear whether or not such a set of rules is possible or whether some other completely stable set of rules could be devised, perhaps in combination with restrictions on the ways in which characters were added in future versions of Unicode.
4. Devise a normalization process that is otherwise equivalent to NFC but that rejects code points that are unassigned in the current version of Unicode, rather than mapping those code points to themselves. This would still leave some risk of incompatible corrections in Unicode and possibly a few edge cases, but it is probably stable enough for Internet use in the overwhelming number of cases. This process has been discussed in the Unicode Consortium under the name "Stable NFC".
None of these approaches seems ideal: the ideal procedure would be as stable and predictable as ASCII has been. But that level is simply not feasible as long as Unicode continues to evolve by the addition of new code points and scripts. The fourth option listed above appears to be a reasonable compromise.
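The fourth option can be approximated in a few lines. The sketch below is not the Unicode Consortium's definition of "Stable NFC". It is a reading of the description above, in which "unassigned" means general category "Cn" in the Unicode version bundled with the Python interpreter. The names `stable_nfc` and `UnassignedCodePointError` are hypothetical.

```python
import unicodedata


class UnassignedCodePointError(ValueError):
    """Raised when the input contains a code point with no assigned character."""


def stable_nfc(text: str) -> str:
    """Normalize to NFC, rejecting unassigned code points instead of passing them through."""
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cn":
            raise UnassignedCodePointError(
                f"unassigned code point U+{ord(ch):04X} at index {index}"
            )
    return unicodedata.normalize("NFC", text)


print(stable_nfc("a\u0300") == "\u00e0")  # True
```

Ordinary NFC would map an unassigned code point to itself. Rejecting it up front avoids producing a string whose normalization could change once that code point is assigned.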
This specification provides a standard form for the use of Unicode as "network text". Most of the same security issues that apply to UTF-8, as discussed in [RFC3629], apply to it, although it should be slightly less subject to some risks by virtue of requiring NFC normalization and generally being somewhat more restrictive. However, shifts in Unicode versions, as discussed in Section 5.2, may introduce other security issues.
Programs that receive these streams should use extreme caution about assuming that incoming data are normalized, since it might be possible to use unnormalized forms, as well as invalid UTF-8, as part of an attack. In particular, firewalls and other systems that interpret UTF-8 streams should be developed with the clear knowledge that an attacker may deliberately send unnormalized text, for instance, to avoid detection by naive text-matching systems.
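The evasion described here can be sketched with a byte-level filter and a normalizing one. Everything in the example is hypothetical: the blocked keyword, the function names, and the policy of treating invalid UTF-8 as a match, which is only one possible choice.

```python
import unicodedata

# Hypothetical keyword, stored in NFC (precomposed e-acute).
BLOCKED = unicodedata.normalize("NFC", "caf\u00e9")


def naive_match(payload: bytes) -> bool:
    """Byte-for-byte search; misses canonically equivalent spellings."""
    return BLOCKED.encode("utf-8") in payload


def normalizing_match(payload: bytes) -> bool:
    """Decode strictly, normalize to NFC, then search."""
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        return True  # assumed policy: invalid UTF-8 is itself suspicious
    return BLOCKED in unicodedata.normalize("NFC", text)


# The sender spells the keyword with e + COMBINING ACUTE ACCENT.
evasive = "cafe\u0301 menu".encode("utf-8")
print(naive_match(evasive), normalizing_match(evasive))  # False True
```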
NVT contains a requirement, of necessity repeated here (see Section 2), that the CR character be immediately followed by either LF or ASCII NUL (an octet with all bits zero). NUL may be problematic for some programming languages that use it as a string terminator, and hence a trap for the unwary, unless caution is used. This may be an additional reason to avoid the use of CR entirely, except in sequence with LF, as suggested above.
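The NUL hazard can be shown from Python by viewing a byte string through a C-style buffer with the standard `ctypes` module. The helper name `as_c_string` and the sample data are illustrative assumptions.

```python
import ctypes


def as_c_string(data: bytes) -> bytes:
    """Return what NUL-terminated string handling would see of data."""
    buffer = ctypes.create_string_buffer(data)
    # .value stops at the first NUL, as C string functions do.
    return buffer.value


received = b"overstrike\r\x00hidden"
print(len(received))          # 18
print(as_c_string(received))  # b'overstrike\r'
```

Everything after the CR NUL pair is silently lost to code that treats NUL as a terminator, which is the trap the paragraph above warns about.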
The discussion about Unicode versions above (see Section 4 and Section 5.2) makes several assumptions about future versions of Unicode, about NFC normalization being applied properly, and about UTF-8 being processed and transmitted exactly as specified in RFC 3629. If any of those assumptions are not correct, then there are cases in which strings that would be considered equivalent do not compare equal. Robust code should be prepared for those possibilities.
Many thanks to Mark Davis, Martin Duerst, and Michel Suignard for suggestions about Unicode normalization that led to the format described here, and especially to Mark for providing the paragraphs that describe the role of NFC. Thanks also to Mark, Doug Ewell, and Asmus Freytag for corrected text describing Unicode transmission forms, and to Tim Bray, Carsten Bormann, Stephane Bortzmeyer, Martin Duerst, Frank Ellermann, Clive D.W. Feather, Ted Hardie, Bjoern Hoehrmann, Alfred Hoenes, Kent Karlsson, Bill McQuillan, George Michaelson, Chris Newman, and Marcos Sanz for a number of helpful comments and clarification requests.
This subsection contains a review of prior work in the ARPANET and Internet to establish a standard text type, work that establishes the context and motivation for the approach taken in this document. The text is explanatory rather than normative: nothing in this section is intended to change or update any current specification. Those who are uninterested in this review and analysis can safely skip this section.
One of the earlier application design decisions made in the development of ARPANET, a decision that was carried forward into the Internet, was the decision to standardize on a single and very specific coding for "text" to be passed across the network [RFC0020]. Hosts on the network were then responsible for translating or mapping from whatever character coding conventions were used locally to that common intermediate representation, with sending hosts mapping to it and receiving ones mapping from it to their local forms as needed. It is interesting to note that at the time the ARPANET was being developed, participating host operating systems used at least three different character coding standards: the antiquated BCD (Binary Coded Decimal), the then-dominant major manufacturer-backed EBCDIC (Extended BCD Interchange Code), and the then-still emerging ASCII (American Standard Code for Information Interchange). Since the ARPANET was an "open" project and EBCDIC was intimately linked to a particular hardware vendor, the original Network Working Group agreed that its standard should be ASCII. That ASCII form was precisely "7-bit ASCII in an 8-bit field", which was in effect a compromise between hosts that were natively 7-bit oriented (e.g., with five seven-bit characters in a 36-bit word), those that were 8-bit oriented (using eight-bit characters) and those that placed the seven-bit ASCII characters in 9-bit fields with two leading zero bits (four characters in a 36-bit word).
More standardization was suggested in the first preliminary description of the Telnet protocol [RFC0097]. With the iterations of that protocol [RFC0137] [RFC0139] and the drawing together of an essentially formal definition somewhat later [RFC0318], a standard abstraction, the Network Virtual Terminal (NVT), was established. NVT character-coding conventions (initially called "Telnet ASCII" and later called "NVT ASCII", or, more casually, "network ASCII") included the requirement that Carriage Return followed by Line Feed (CRLF) be the common representation for ending lines of text (given that some participating "Host" operating systems used the one natively, some the other, at least one used both, and a few used neither, preferring variable-length lines with counts or special delimiters or markers instead) and specified conventions for some other characters. Also, since NVT ASCII was restricted to seven-bit characters, use of the high-order bit in octets was reserved for the transmission of control signaling information.
At a very high level, the concept was that a system could use whatever character coding and line representations were appropriate locally, but text transmitted over the network as text must conform to the single "network virtual terminal" convention. Virtually all early Internet protocols that presume transfer of "text" assume this virtual terminal model, although different ones assume or limit it in different ways. Telnet, the command stream and ASCII Type in FTP [RFC0542], the message stream in SMTP transfer [RFC2821], and the strings passed to finger [RFC0742] and whois [RFC0954] are the classic examples. More recently, HTTP [RFC1945] [RFC2616] follows the same general model but permits 8-bit data and leaves the line end sequence unspecified (the latter has been the source of a significant number of problems).
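The sender-maps-to, receiver-maps-from model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `to_network` and `from_network` are hypothetical, the local convention is reduced to a single line-separator string, and the character repertoire is restricted to 7-bit ASCII as in the NVT definition.

```python
import os


def to_network(text: str, local_newline: str = os.linesep) -> bytes:
    """Map locally stored text to the common wire form (hypothetical helper).

    Local line separators become CR LF, and the result is encoded as
    7-bit ASCII; a non-ASCII character raises UnicodeEncodeError.
    """
    lines = text.split(local_newline)
    return "\r\n".join(lines).encode("ascii")


def from_network(data: bytes, local_newline: str = os.linesep) -> str:
    """Map wire-form text back to the local convention (hypothetical helper)."""
    lines = data.decode("ascii").split("\r\n")
    return local_newline.join(lines)


if __name__ == "__main__":
    # A host that stores lines separated by a bare LF.
    wire = to_network("first line\nsecond line\n", local_newline="\n")
    print(wire)
    # A receiving host that (for illustration) uses a bare CR locally.
    print(repr(from_network(wire, local_newline="\r")))
```

Neither host needs to know the other's local convention; each only translates to or from the single network form.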
The main body of this specification is intended as an update to, and internationalized version of, the Net-ASCII definition. The specification is self-contained in that parts of the Net-ASCII definition that are no longer recommended are not included above. Because Net-ASCII evolved somewhat over time and there has been debate about which specification is the "official" Net-ASCII, it is appropriate to review the key elements of that definition here. This review is informal with regard to the contents of Net-ASCII and should not be considered as a normative update or summary of the earlier specifications (Section 2 does specify some normative updates to those specifications and some comments below are consistent with it).
The first part of the section titled "THE NVT PRINTER AND KEYBOARD" in RFC 854 [RFC0854] is generally, although not universally, considered to be the normative definition of the (ASCII) Network Virtual Terminal and hence of Net-ASCII. It includes not only the graphic ASCII characters but a number of control characters. The latter are given Internet-specific meanings that are often more specific than the definitions in the ASCII specification. In today's usage, and for the present specification, the following clarifications and updates to that list should be noted. Each one is accompanied by a brief explanation of the reason why the original specification is no longer appropriate.
1. The "defined but not required" codes -- BEL (U+0007), BS (U+0008), HT (U+0009), VT (U+000B), and FF (U+000C) -- and the undefined control codes ("C0") SHOULD NOT be used unless required by exceptional circumstances. Either their original "network printer" definitions are no longer in general use, common practice has evolved away from the formats specified there, or their use to simulate characters that are better handled by Unicode is no longer appropriate. While the appearance of some of these characters on the list may seem surprising, BS now has an ambiguous interpretation in practice (erasing in some systems but not in others), the width associated with HT varies with the environment, and VT and FF do not have a uniform effect with regard to either vertical positioning or the associated horizontal position result. Of course, telnet escapes are not considered part of the data stream and hence are unaffected by this provision.
2. In Net-ASCII, CR MUST NOT appear except when immediately followed by either NUL or LF, with the latter (CR LF) designating the "new line" function. Today and as specified above, CR should generally appear only when followed by LF. Because page layout is better done in other ways, because NUL has a special interpretation in some programming languages, and to avoid other types of confusion, CR NUL should preferably be avoided as specified above.
3. LF CR SHOULD NOT appear except as a side-effect of multiple CR LF sequences (e.g., CR LF CR LF).
4. The historical NVT documents do not call out either "bare LF" (LF without CR) or HT for special treatment. Both have generally been understood to be problematic. In the case of LF, there is a difference in interpretation as to whether its semantics imply "go to same position on the next line" or "go to the first position on the next line" and interoperability considerations suggest not depending on which interpretation the receiver applies. At the same time, misinterpretation of LF is less harmful than misinterpretation of "bare" CR: in the CR case, text may be erased or made completely unreadable; in the LF one, the worst consequence is a very funny-looking display. Obviously, HT is problematic because there is no standard way to transmit intended tab position or width information in running text. Again, the harm is unlikely to be great if HT is simply interpreted as one or more spaces, but, in general, it cannot be relied upon to format information.
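The line-ending points in items 2 through 4 can be illustrated with a small checker. The following Python sketch is not part of any specification: the function name `line_ending_issues` and the wording of its messages are assumptions made for illustration. It reports CR not followed by LF, LF not preceded by CR, CR NUL, and HT. An LF CR pair that does not arise from consecutive CR LF sequences always involves a bare LF or a bare CR, so item 3 needs no separate test.

```python
from typing import List

CR, LF, NUL, HT = 0x0D, 0x0A, 0x00, 0x09


def line_ending_issues(data: bytes) -> List[str]:
    """Report octets that the review above flags as problematic (illustrative)."""
    issues = []
    for i, octet in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else None
        prev = data[i - 1] if i > 0 else None
        if octet == CR:
            if nxt == NUL:
                # Allowed by Net-ASCII, but better avoided today.
                issues.append(f"offset {i}: CR NUL")
            elif nxt != LF:
                issues.append(f"offset {i}: bare CR")
        elif octet == LF and prev != CR:
            issues.append(f"offset {i}: bare LF")
        elif octet == HT:
            # Tab positions and widths are not conveyed in running text.
            issues.append(f"offset {i}: HT")
    return issues


if __name__ == "__main__":
    for sample in (b"one\r\ntwo\r\n", b"a\rb", b"a\nb", b"a\n\rb", b"a\tb"):
        print(sample, line_ending_issues(sample))
```

A receiver might log such findings rather than reject the text outright; which policy is appropriate depends on the protocol using the stream.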
It is worth noting that the telnet IAC character (an octet consisting of all ones, i.e., %xFF) itself is not a problem for UTF-8 since that particular octet cannot appear in a valid UTF-8 string. However, while few of them have been used, telnet permits other command-introducer characters whose bit sequences in an octet may be part of valid UTF-8 characters. While it causes no ambiguity in UTF-8, Unicode assigns a graphic character ("Latin Small Letter Y with Diaeresis") to U+00FF (octets C3 BF in UTF-8). Some caution is clearly in order in this area.
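The distinction between the octet %xFF and the code point U+00FF can be checked directly. The following Python sketch uses only the standard codec machinery; the helper names `is_valid_utf8` and `encodes_without_iac` are hypothetical and exist only for this illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the octets decode as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def encodes_without_iac(text: str) -> bool:
    """Return True if the UTF-8 form of text contains no %xFF octet."""
    return 0xFF not in text.encode("utf-8")


if __name__ == "__main__":
    # U+00FF is a two-octet sequence in UTF-8, not the single octet %xFF.
    print("\u00ff".encode("utf-8").hex(" "))
    # A lone %xFF (the telnet IAC octet) is never valid UTF-8.
    print(is_valid_utf8(b"abc\xff"))
    # Sample code points from one-, two-, three-, and four-octet ranges.
    print(encodes_without_iac("A\u00ff\u20ac\U0010ffff"))
```

The check covers only the IAC octet; it says nothing about other telnet command codes, some of which can coincide with octets used in valid UTF-8 sequences.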
The definition of how a line ending should be denoted in plain text strings on the wire for the Internet has been controversial from even before the introduction of NVT. Some have argued that recipients should be required to interpret almost anything that a sender might intend as a line ending as actually a line ending. Others have pointed out that this would lead to some ambiguities of interpretation and presentation and would violate the principle that we should minimize the number of forms that are permitted on the wire in order to promote interoperability and eliminate the "every recipient needs to understand every sender format" problem. The design of this specification, like that of NVT, takes the latter approach. Its designers believe that there is little point in a standard if it is to specify "anyone can do whatever they like and the receiver just needs to cope".
A further discussion of the nature and evolution of the line-ending problem appears in Section 5.8 of the Unicode Standard [Unicode] and is suggested for additional reading. If we were starting with the Internet today, it would probably be sensible to follow the recommendation there and use LS (U+2028) exclusively, in preference to CRLF. However, the installed base of use of CRLF and the importance of forward compatibility with NVT and protocols that assume it make that impossible, so it is necessary to continue using CRLF as the "New Line Function" ("NLF", see the terminology section in that reference).
Consideration should be given to a Telnet (or SSH [RFC4251]) option to specify this type of stream and an FTP extension [RFC0959] to permit a new "Unicode text" data TYPE.
References
Normative References
[ISO10646] International Organization for Standardization, "Information Technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane", ISO/IEC 10646-1:2000, October 2000.
[NFC] Davis, M. and M. Duerst, "Unicode Standard Annex #15: Unicode Normalization Forms", October 2006, <http://www.unicode.org/reports/tr15/>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, November 2003.
[RFC5234] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, January 2008.
[Unicode] The Unicode Consortium, "The Unicode Standard, Version 5.0", 2007. Boston, MA, USA: Addison-Wesley. ISBN 0-321-48091-0.
[Unicode32] The Unicode Consortium, "The Unicode Standard, Version 3.0", 2000. (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5). Version 3.2 consists of the definition in that book as amended by the Unicode Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27/) and by the Unicode Standard Annex #28: Unicode 3.2 (http://www.unicode.org/reports/tr28/).
Informative References
[ASCII] American National Standards Institute (formerly United States of America Standards Institute), "USA Code for Information Interchange", ANSI X3.4-1968, 1968. ANSI X3.4-1968 has been replaced by newer versions with slight modifications, but the 1968 version remains definitive for the Internet. ISO 646 International Reference Version (IRV) [ISO.646.1991] is usually considered equivalent to ASCII.
[ISO.646.1991] International Organization for Standardization, "Information technology - ISO 7-bit coded character set for information interchange", ISO Standard 646, 1991.
[NamedSequences] The Unicode Consortium, "NamedSequences-4.1.0.txt", 2005, <http://www.unicode.org/Public/UNIDATA/NamedSequences.txt>.
[RFC0020] Cerf, V., "ASCII format for network interchange", RFC 20, October 1969.
[RFC0097] Melvin, J. and R. Watson, "First Cut at a Proposed Telnet Protocol", RFC 97, February 1971.
[RFC0137] O'Sullivan, T., "Telnet Protocol - a proposed document", RFC 137, April 1971.
[RFC0139] O'Sullivan, T., "Discussion of Telnet Protocol", RFC 139, May 1971.
[RFC0318] Postel, J., "Telnet Protocols", RFC 318, April 1972.
[RFC0542] Neigus, N., "File Transfer Protocol", RFC 542, August 1973.
[RFC0698] Mock, T., "Telnet extended ASCII option", RFC 698, July 1975.
[RFC0742] Harrenstien, K., "NAME/FINGER Protocol", RFC 742, December 1977.
[RFC0854] Postel, J. and J. Reynolds, "Telnet Protocol Specification", STD 8, RFC 854, May 1983.
[RFC0954] Harrenstien, K., Stahl, M., and E. Feinler, "NICNAME/WHOIS", RFC 954, October 1985.
[RFC0959] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9, RFC 959, October 1985.
[RFC1945] Berners-Lee, T., Fielding, R., and H. Nielsen, "Hypertext Transfer Protocol -- HTTP/1.0", RFC 1945, May 1996.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP 18, RFC 2277, January 1998.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
[RFC2781] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646", RFC 2781, February 2000.
[RFC2821] Klensin, J., "Simple Mail Transfer Protocol", RFC 2821, April 2001.
[RFC3454] Hoffman, P. and M. Blanchet, "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002.
[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003.
[RFC3912] Daigle, L., "WHOIS Protocol Specification", RFC 3912, September 2004.
[RFC4251] Ylonen, T. and C. Lonvick, "The Secure Shell (SSH) Protocol Architecture", RFC 4251, January 2006.
[RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and Recommendations for Internationalized Domain Names (IDNs)", RFC 4690, September 2006.
Authors' Addresses
John C Klensin 1770 Massachusetts Ave, #322 Cambridge, MA 02140 USA
Phone: +1 617 491 5735 EMail: john-ietf@jck.com
Michael A. Padlipsky 8011 Stewart Ave. Los Angeles, CA 90045 USA
Phone: +1 310-670-4288 EMail: the.map@alum.mit.edu
Full Copyright Statement
Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.