Network Working Group M. Crispin Request for Comments: 4042 Panda Programming Category: Informational 1 April 2005
Network Working Group M. Crispin Request for Comments: 4042 Panda Programming Category: Informational 1 April 2005
UTF-9 and UTF-18 Efficient Transformation Formats of Unicode
UTF-9和UTF-18高效的Unicode转换格式
Status of This Memo
关于下段备忘
This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.
本备忘录为互联网社区提供信息。它没有规定任何类型的互联网标准。本备忘录的分发不受限制。
Copyright Notice
版权公告
Copyright (C) The Internet Society (2005).
版权所有(C)互联网协会(2005年)。
Abstract
摘要
ISO-10646 defines a large character set called the Universal Character Set (UCS), which encompasses most of the world's writing systems. The same set of codepoints is defined by Unicode, which further defines additional character properties and other implementation details. By policy of the relevant standardization committees, changes to Unicode and amendments and additions to ISO/IEC 646 track each other, so that the character repertoires and code point assignments remain in synchronization.
ISO-10646定义了一个称为通用字符集(UCS)的大型字符集,它包含了世界上大多数的书写系统。同一组代码点由Unicode定义,它进一步定义了其他字符属性和其他实现细节。根据相关标准化委员会的政策,对Unicode的更改和对ISO/IEC 646的修订和添加相互跟踪,以便字符库和代码点分配保持同步。
The current representation formats for Unicode (UTF-7, UTF-8, UTF-16) are not storage and computation efficient on platforms that utilize the 9 bit nonet as a natural storage unit instead of the 8 bit octet.
当前的Unicode表示格式(UTF-7、UTF-8、UTF-16)在使用9位nonet作为自然存储单元而不是8位八位组的平台上,存储和计算效率不高。
This document describes a transformation format of Unicode that takes advantage of the nonet so that the format will be storage and computation efficient.
本文档描述了一种Unicode转换格式,该格式利用了nonet,因此该格式具有存储和计算效率。
A number of Internet sites utilize platforms that are not based upon the traditional 8-bit byte or octet. One such platform is the PDP-10, which is based upon a 36-bit word. On these platforms, it is wasteful to represent data in octets, since 4 bits are left unused in each word. The 9-bit nonet is a much more sensible representation.
许多互联网站点使用的平台不是基于传统的8位字节或八位字节。一个这样的平台是基于36位字的PDP-10。在这些平台上,以八位字节表示数据是浪费的,因为每个字中有4位未使用。9位nonet是一种更为合理的表示形式。
Although these platforms support IETF standards, many of these platforms still utilize a text representation based upon the septet,
尽管这些平台支持IETF标准,但其中许多平台仍然使用基于septet的文本表示,
which is only suitable for [US-ASCII] (although it has been used for various ISO 10646 national variants).
仅适用于[US-ASCII](尽管它已用于各种ISO10646国家标准变体)。
To maximize international and multi-lingual interoperability, the IAB has recommended ([IAB-CHARACTER]) that [ISO-10646] be the default coded character set.
为了最大限度地实现国际和多语言互操作性,IAB建议([IAB-CHARACTER])将[ISO-10646]作为默认编码字符集。
Although other transformation formats of [UNICODE] exist, and conceivably can be used on nonet-oriented machines (most notably [UTF-8]), they suffer significant disadvantages:
尽管存在[UNICODE]的其他转换格式,并且可以在面向nonet的机器上使用(最显著的是[UTF-8]),但它们存在显著的缺点:
[UTF-8] requires one to three octets to represent codepoints in the Basic Multilingual Plane (BMP), four octets to represent [UNICODE] codepoints outside the BMP, and six octets to represent non-[UNICODE] codepoints. When stored in nonets, this results in as many as four wasted bits per [UNICODE] character.
[UTF-8]需要一到三个八位字节来表示基本多语言平面(BMP)中的码点,四个八位字节来表示BMP之外的[UNICODE]码点,六个八位字节来表示非[UNICODE]码点。当存储在NONET中时,这会导致每个[UNICODE]字符浪费多达四个位。
[UTF-16] requires a hexadecet to represent codepoints in the BMP, and two hexadecets to represent [UNICODE] codepoints outside the BMP. When stored in nonet pairs, this results in as many as four wasted bits per [UNICODE] character. This transformation format requires complex surrogates to represent codepoints outside the BMP, and can not represent non-[UNICODE] codepoints at all.
[UTF-16]需要一个十六进制来表示BMP中的代码点,需要两个十六进制来表示BMP之外的[UNICODE]代码点。当存储在nonet对中时,这会导致每个[UNICODE]字符浪费多达四个位。这种转换格式需要复杂的代理来表示BMP之外的代码点,并且根本不能表示非[UNICODE]代码点。
[UTF-7] requires one to five septets to represent codepoints in the BMP, and as many as eight septets to represent codepoints outside the BMP. When stored in nonets, this results in as many as sixteen wasted bits per character. This transformation format requires very complex and computationally expensive shifting and "modified BASE64" processing, and can not represent non-[UNICODE] codepoints at all.
[UTF-7]需要一到五个SEPTET来表示BMP中的代码点,最多需要八个SEPTET来表示BMP之外的代码点。当存储在NONET中时,这会导致每个字符浪费多达16位。这种转换格式需要非常复杂且计算代价高昂的移位和“修改的BASE64”处理,并且根本不能表示非[UNICODE]码点。
By comparison, UTF-9 uses one to two nonets to represent codepoints in the BMP, three nonets to represent [UNICODE] codepoints outside the BMP, and three or four nonets to represent non-[UNICODE] codepoints. There are no wasted bits, and as the examples in this document demonstrate, the computational processing is minimal.
相比之下,UTF-9使用一到两个NONET表示BMP中的代码点,三个NONET表示BMP之外的[UNICODE]代码点,三个或四个NONET表示非[UNICODE]代码点。没有浪费的比特,正如本文中的示例所示,计算处理是最小的。
Transformation between [UTF-8] and UTF-9 is straightforward, with most of the complexity in the handling of [UTF-8]. It is hoped that future extensions to protocols such as SMTP will permit the use of UTF-9 in these protocols between nonet platforms without the use of [UTF-8] as an "on the wire" format.
[UTF-8]和UTF-9之间的转换非常简单,处理[UTF-8]的过程非常复杂。希望将来对SMTP等协议的扩展将允许在nonet平台之间的这些协议中使用UTF-9,而无需将[UTF-8]用作“在线”格式。
Similarly, transformation between [UNICODE] codepoints and UTF-18 is also quite simple. Although (like UCS-2) UTF-18 only represents a subset of the available [UNICODE] codepoints, it encompasses the non-private codepoints that are currently assigned in [UNICODE].
类似地,[UNICODE]代码点和UTF-18之间的转换也非常简单。尽管(与UCS-2类似)UTF-18仅表示可用的[UNICODE]码点的子集,但它包含当前在[UNICODE]中分配的非私有码点。
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC 2119 [KEYWORDS].
本文件中的关键词“必须”、“不得”、“必需”、“应”、“不应”、“应”、“不应”、“建议”、“可”和“可选”应按照BCP 14、RFC 2119[关键词]中的描述进行解释。
UTF-9 encodes [UNICODE] codepoints in the low order 8 bits of a nonet, using the high order bit to indicate continuation. Surrogates are not used.
UTF-9将[UNICODE]码点编码为nonet的低阶8位,使用高阶位表示继续。不使用代理。
[UNICODE] codepoints in the range U+0000 - U+00FF ([US-ASCII] and Latin 1) are represented by a single nonet; codepoints in the range U+0100 - U+FFFF (the remainder of the BMP) are represented by two nonets; and codepoints in the range U+1000 - U+10FFFF (remainder of [UNICODE]) are represented by three nonets.
U+0000-U+00FF([US-ASCII]和拉丁语1)范围内的[UNICODE]码点由一个nonet表示;U+0100-U+FFFF(BMP的剩余部分)范围内的代码点由两个非集合表示;U+1000-U+10FFFF范围内的代码点(UNICODE的剩余部分)由三个非集合表示。
Non-[UNICODE] codepoints in [ISO-10646] (that is, codepoints in the range 0x110000 - 0x7fffffff) can also be represented in UTF-9 by obvious extension, but this is not discussed further as these codepoints have been removed from [ISO-10646] by ISO.
[ISO-10646]中的非[UNICODE]码点(即0x110000-0x7fffffff范围内的码点)也可以通过明显的扩展在UTF-9中表示,但由于这些码点已被ISO从[ISO-10646]中删除,因此不作进一步讨论。
UTF-18 encodes [UNICODE] codepoints in the Basic Multilingual Plane (BMP, plane 0), Supplementary Multilingual Plane (SMP, plane 1), Supplementary Ideographic Plane (SIP, plane 2), and Supplementary Special-purpose Plane (SSP, plane 14) in a single 18-bit value. It does not encode planes 3 though 13, which are currently unused; nor planes 15 or 16, which are private spaces.
UTF-18以单个18位值对基本多语言平面(BMP,平面0)、补充多语言平面(SMP,平面1)、补充表意平面(SIP,平面2)和补充专用平面(SSP,平面14)中的[UNICODE]码点进行编码。它不编码当前未使用的平面3到13;也不是飞机15或16,它们是私人空间。
Normally, UTF-9 and UTF-18 should only be used in the context of 9 bit storage and transport. Although some protocols, e.g., [FTP], support transport of nonets, the current IETF protocol suite is quite deficient in this area. The IETF is urged to take action to improve IETF protocol support for nonets.
通常,UTF-9和UTF-18只能在9位存储和传输环境中使用。尽管一些协议(例如[FTP])支持NONET的传输,但当前的IETF协议套件在这方面相当不足。敦促IETF采取行动,改进IETF对NONET的协议支持。
A UTF-9 stream represents [ISO-10646] codepoints using 9 bit nonets. The low order 8-bits of a nonet is an octet, and the high order bit indicates continuation.
UTF-9流使用9位NONET表示[ISO-10646]码点。nonet的低阶8位是八位字节,高阶位表示继续。
UTF-9 does not use surrogates; consequently a UTF-16 value must be transformed into the UCS-4 equivalent, and U+D800 - U+DBFF are never transmitted in UTF-9.
UTF-9不使用替代品;因此,UTF-16值必须转换为UCS-4等效值,并且U+D800-U+DBFF永远不会在UTF-9中传输。
Octets of the [UNICODE] codepoint value are then copied into successive UTF-9 nonets, starting with the most-significant non-zero octet. All but the least significant octet have the continuation bit set in the associated nonet.
然后将[UNICODE]码点值的八位字节复制到连续的UTF-9非字节中,从最重要的非零八位字节开始。除最低有效八位字节外,所有八位字节都在相关的nonet中设置了连续位。
Examples:
示例:
Character Name UTF-9 (in octal) --------- ---- ---------------- U+0041 LATIN CAPITAL LETTER A 101 U+00C0 LATIN CAPITAL LETTER A WITH GRAVE 300 U+0391 GREEK CAPITAL LETTER ALPHA 403 221 U+611B <CJK ideograph meaning "love"> 541 33 U+10330 GOTHIC LETTER AHSA 401 403 60 U+E0041 TAG LATIN CAPITAL LETTER A 416 400 101 U+10FFFD <Plane 16 Private Use, Last> 420 777 375 0x345ecf1b (UCS-4 value not in [UNICODE]) 464 536 717 33
Character Name UTF-9 (in octal) --------- ---- ---------------- U+0041 LATIN CAPITAL LETTER A 101 U+00C0 LATIN CAPITAL LETTER A WITH GRAVE 300 U+0391 GREEK CAPITAL LETTER ALPHA 403 221 U+611B <CJK ideograph meaning "love"> 541 33 U+10330 GOTHIC LETTER AHSA 401 403 60 U+E0041 TAG LATIN CAPITAL LETTER A 416 400 101 U+10FFFD <Plane 16 Private Use, Last> 420 777 375 0x345ecf1b (UCS-4 value not in [UNICODE]) 464 536 717 33
A UTF-18 stream represents [ISO-10646] codepoints using a pair of 9 bit nonets to form an 18-bit value.
UTF-18流表示[ISO-10646]码点,使用一对9位非字节来形成18位值。
UTF-18 does not use surrogates; consequently a UTF-16 value must be transformed into the UCS-4 equivalent, and U+D800 - U+DBFF are never transmitted in UTF-18.
UTF-18不使用替代品;因此,UTF-16值必须转换为UCS-4等效值,并且U+D800-U+DBFF永远不会在UTF-18中传输。
[UNICODE] codepoint values in the range U+0000 - U+2FFFF are copied as the same value into a UTF-18 value. [UNICODE] codepoint values in the range U+E0000 - U+EFFFF are copied as values 0x30000 - 0x3ffff; that is, these values are shifted by 0x70000. Other codepoint values can not be represented in UTF-18.
U+0000-U+2FFF范围内的[UNICODE]码点值作为相同的值复制到UTF-18值中。U+E0000-U+EFFF范围内的[UNICODE]码点值作为值0x30000-0x3ffff复制;也就是说,这些值被移位0x70000。UTF-18中不能表示其他代码点值。
Examples:
示例:
Character Name UTF-18 (in octal) --------- ---- ---------------- U+0041 LATIN CAPITAL LETTER A 000101 U+00C0 LATIN CAPITAL LETTER A WITH GRAVE 000300 U+0391 GREEK CAPITAL LETTER ALPHA 001621 U+611B <CJK ideograph meaning "love"> 060433 U+10330 GOTHIC LETTER AHSA 201460 U+E0041 TAG LATIN CAPITAL LETTER A 600101
Character Name UTF-18 (in octal) --------- ---- ---------------- U+0041 LATIN CAPITAL LETTER A 000101 U+00C0 LATIN CAPITAL LETTER A WITH GRAVE 000300 U+0391 GREEK CAPITAL LETTER ALPHA 001621 U+611B <CJK ideograph meaning "love"> 060433 U+10330 GOTHIC LETTER AHSA 201460 U+E0041 TAG LATIN CAPITAL LETTER A 600101
The following routines demonstrate conversion from UCS-4 to UTF-9. For simplicity, these routines do not do any validity checking. Routines used in applications SHOULD reject invalid UTF-9 sequences; that is, the first nonet with a value of 400 octal (0x100), or sequences that result in an overflow (exceeding 0x10ffff for [UNICODE]), or codepoints used for UTF-16 surrogates.
以下例程演示了从UCS-4到UTF-9的转换。为简单起见,这些例程不进行任何有效性检查。应用程序中使用的例程应拒绝无效的UTF-9序列;也就是说,值为400八进制(0x100)的第一个nonet,或导致溢出的序列(对于[UNICODE]超过0x10ffff),或用于UTF-16代理的代码点。
; Return UCS-4 value from UTF-9 string (PDP-10 assembly version) ; Accepts: P1/ 9-bit byte pointer to UTF-9 string ; Returns +1: Always, T1/ UCS-4 value, P1/ updated byte pointer ; Clobbers T2
; Return UCS-4 value from UTF-9 string (PDP-10 assembly version) ; Accepts: P1/ 9-bit byte pointer to UTF-9 string ; Returns +1: Always, T1/ UCS-4 value, P1/ updated byte pointer ; Clobbers T2
UT92U4: TDZA T1,T1 ; start with zero U92U41: XOR T1,T2 ; insert octet into UCS-4 value LSH T1,^D8 ; shift UCS-4 value ILDB T2,P1 ; get next nonet TRZE T2,400 ; extract octet, any continuation? JRST U92U41 ; yes, continue XOR T1,T2 ; insert final octet POPJ P,
UT92U4:TDZA T1,T1;从零开始U92U41:XOR T1,T2;将八位字节插入UCS-4值LSH T1,^D8;移位UCS-4值ILDB T2,P1;获取下一个nonet TRZE T2400;提取八重奏,有续集吗?JRST U92U41;是,继续XOR T1,T2;插入最后一个八位组POPJ P,
/* Return UCS-4 value from UTF-9 string (C version) * Accepts: pointer to pointer to UTF-9 string * Returns: UCS-4 character, nonet pointer updated */
/* Return UCS-4 value from UTF-9 string (C version) * Accepts: pointer to pointer to UTF-9 string * Returns: UCS-4 character, nonet pointer updated */
UINT31 UTF9_to_UCS4 (UINT9 **utf9PP) { UINT9 nonet; UINT31 ucs4; for (ucs4 = (nonet = *(*utf9PP)++) & 0xff; nonet & 0x100; ucs4 |= (nonet = *(*utf9PP)++) & 0xff) ucs4 <<= 8; return ucs4; }
UINT31 UTF9_to_UCS4 (UINT9 **utf9PP) { UINT9 nonet; UINT31 ucs4; for (ucs4 = (nonet = *(*utf9PP)++) & 0xff; nonet & 0x100; ucs4 |= (nonet = *(*utf9PP)++) & 0xff) ucs4 <<= 8; return ucs4; }
The following routines demonstrate conversion from UTF-9 to UCS-4. For simplicity, these routines do not do any validity checking. Routines used in applications SHOULD reject invalid UCS-4 codepoints; that is, codepoints used for UTF-16 surrogates or codepoints with values exceeding 0x10ffff for [UNICODE].
以下例程演示了从UTF-9到UCS-4的转换。为简单起见,这些例程不进行任何有效性检查。应用程序中使用的例程应拒绝无效的UCS-4代码点;也就是说,用于UTF-16代理的代码点或[UNICODE]值超过0x10ffff的代码点。
; Write UCS-4 character to UTF-9 string (PDP-10 assembly version) ; Accepts: P1/ 9-bit byte pointer to UTF-9 string ; T1/ UCS-4 character to write ; Returns +1: Always, P1/ updated byte pointer ; Clobbers T1, T2; (T1, T2) must be an accumulator pair
; Write UCS-4 character to UTF-9 string (PDP-10 assembly version) ; Accepts: P1/ 9-bit byte pointer to UTF-9 string ; T1/ UCS-4 character to write ; Returns +1: Always, P1/ updated byte pointer ; Clobbers T1, T2; (T1, T2) must be an accumulator pair
U42UT9: SETO T2, ; we'll need some of these 1-bits later ASHC T1,-^D8 ; low octet becomes nonet with high 0-bit U32U91: JUMPE T1,U42U9X ; done if no more octets LSHC T1,-^D8 ; shift next octet into T2 ROT T2,-1 ; turn it into nonet with high 1 bit PUSHJ P,U42U91 ; recurse for remainder U42U9X: LSHC T1,^D9 ; get next nonet back from T2 IDPB T1,P1 ; write nonet POPJ P,
U42UT9:SETO T2;稍后我们需要这些1位的ASHC T1,-^D8;低八位字节变为nonet与高0位U32U91:跳线T1、U42U9X;如果没有更多的八位字节LSHC T1,-^D8,则完成;将下一个八位组移到T2-ROT-T2-1;将其转换为高1位PUSHJ P的nonet,U42U91;对余数U42U9X递归:LSHC T1,^D9;从T2 IDPB T1,P1获取下一个nonet;写下nonet POPJ P,
/* Write UCS-4 character to UTF-9 string (C version) * Accepts: pointer to nonet string * UCS-4 character to write * Returns: updated pointer */
/* Write UCS-4 character to UTF-9 string (C version) * Accepts: pointer to nonet string * UCS-4 character to write * Returns: updated pointer */
UINT9 *UCS4_to_UTF9 (UINT9 *utf9P,UINT31 ucs4) { if (ucs4 > 0x100) { if (ucs4 > 0x10000) { if (ucs4 > 0x1000000) *utf9P++ = 0x100 | ((ucs4 >> 24) & 0xff); *utf9P++ = 0x100 | ((ucs4 >> 16) & 0xff); } *utf9P++ = 0x100 | ((ucs4 >> 8) & 0xff); } *utf9P++ = ucs4 & 0xff; return utf9P; }
UINT9 *UCS4_to_UTF9 (UINT9 *utf9P,UINT31 ucs4) { if (ucs4 > 0x100) { if (ucs4 > 0x10000) { if (ucs4 > 0x1000000) *utf9P++ = 0x100 | ((ucs4 >> 24) & 0xff); *utf9P++ = 0x100 | ((ucs4 >> 16) & 0xff); } *utf9P++ = 0x100 | ((ucs4 >> 8) & 0xff); } *utf9P++ = ucs4 & 0xff; return utf9P; }
As the sample routines demonstrate, it is quite simple to implement UTF-9 and UTF-18 on a nonet-based architecture. More sophisticated routines can be found in ftp://panda.com/tops-20/utools.mac.txt or from lingling.panda.com via the file <UTF9>UTOOLS.MAC via ANONYMOUS [FTP].
如示例例程所示,在基于nonet的体系结构上实现UTF-9和UTF-18非常简单。更复杂的例程可以在ftp://panda.com/tops-20/utools.mac.txt 或者通过文件<UTF9>UTOOLS.MAC通过匿名[FTP]从玲玲.panda.com下载。
We are now in the process of implementing support for nonet-based text files and automated transformation between septet, octet, and nonet textual data.
我们现在正在实现对基于nonet的文本文件的支持,以及septet、octet和nonet文本数据之间的自动转换。
[FTP] Postel, J. and J. Reynolds, "File Transfer Protocol", STD 9, RFC 959, October 1985.
[FTP]Postel,J.和J.Reynolds,“文件传输协议”,STD 9,RFC 959,1985年10月。
[IAB-CHARACTER] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., Atkinson, R., Crispin, M., and P. Svanberg, "The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996", RFC 2130, April 1997.
[IAB-CHARACTER]Weider,C.,Preston,C.,Simonsen,K.,Alvestrand,H.,Atkinson,R.,Crispin,M.,和P.Svanberg,“1996年2月29日至3月1日举行的IAB字符集研讨会报告”,RFC 21301997年4月。
[ISO-10646] International Organization for Standardization, "Information Technology - Universal Multiple-octet coded Character Set (UCS)", ISO/IEC Standard 10646, comprised of ISO/IEC 10646-1:2000, "Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane", ISO/IEC 10646-2:2001, "Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 2: Supplementary Planes" and ISO/IEC 10646-1:2000/Amd 1:2002, "Mathematical symbols and other characters".
[ISO-10646]国际标准化组织,“信息技术-通用多八位编码字符集(UCS)”,ISO/IEC标准10646,由ISO/IEC 10646-1:2000组成,“信息技术-通用多八位编码字符集(UCS)-第1部分:体系结构和基本多语言平面”,ISO/IEC 10646-2:2001,“信息技术.通用多八位编码字符集(UCS).第2部分:补充平面”和ISO/IEC 10646-1:2000/Amd 1:2002,“数学符号和其他字符”。
[KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[关键词]Bradner,S.,“RFC中用于表示需求水平的关键词”,BCP 14,RFC 2119,1997年3月。
[UNICODE] The Unicode Consortium, "The Unicode Standard - Version 3.2", defined by The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 and by the Unicode Standard Annex #28: Unicode 3.2, March 2002.
[UNICODE]UNICODE联盟,“UNICODE标准-版本3.2”,由UNICODE标准版本3.0(雷丁,马萨诸塞州,Addison-Wesley,2000年。ISBN 0-201-61633-5)定义,经UNICODE标准附录27:UNICODE 3.1和UNICODE标准附录28:UNICODE 3.2修订,2002年3月。
[US-ASCII] American National Standards Institute, "Coded Character Set - 7-bit American Standard Code for Information Interchange", ANSI X3.4, 1986.
[US-ASCII]美国国家标准协会,“编码字符集-信息交换用7位美国标准代码”,ANSI X3.41986。
[UTF-16] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646", RFC 2781, February 2000.
[UTF-16]Hoffman,P.和F.Yergeau,“UTF-16,ISO 10646编码”,RFC 2781,2000年2月。
[UTF-7] Goldsmith, D. and M. Davis, "UTF-7 A Mail-Safe Transformation Format of Unicode", RFC 2152, May 1997.
[UTF-7]Goldsmith,D.和M.Davis,“UTF-7是Unicode的邮件安全转换格式”,RFC 2152,1997年5月。
[UTF-8] Sollins, K., "Architectural Principles of Uniform Resource Name Resolution", RFC 2276, January 1998.
[UTF-8]Sollins,K.,“统一资源名称解析的体系结构原则”,RFC 2276,1998年1月。
As with UTF-8, UTF-9 can represent codepoints that are not in [UNICODE]. Applications should validate UTF-9 strings to ensure that all codepoints do not exceed the [UNICODE] maximum of U+10FFFF.
与UTF-8一样,UTF-9可以表示非UNICODE格式的代码点。应用程序应验证UTF-9字符串,以确保所有代码点不超过U+10FFFF的[UNICODE]最大值。
The sample routines in this document are for example purposes, and make no attempt to validate their arguments, e.g., test for overflow ([UNICODE] values great than 0x10ffff) or codepoints used for surrogates. Besides resulting in invalid data, this can also create covert channels.
本文档中的示例例程仅用于示例目的,不尝试验证其参数,例如,测试溢出([UNICODE]值大于0x10ffff)或用于代理的代码点。除了导致无效数据外,这还可能创建隐蔽通道。
The IANA shall reserve the charset names "UTF-9" and "UTF-18" for future assignment.
IANA应保留字符集名称“UTF-9”和“UTF-18”以备将来分配。
Author's Address
作者地址
Mark R. Crispin Panda Programming 6158 NE Lariat Loop Bainbridge Island, WA 98110-2098
Mark R.Crispin熊猫计划6158北拉里亚特环班布里奇岛,华盛顿州98110-2098
Phone: (206) 842-2385 EMail: UTF9@Lingling.Panda.COM
电话:(206)842-2385电子邮件:UTF9@Lingling.Panda.COM
Full Copyright Statement
完整版权声明
Copyright (C) The Internet Society (2005).
版权所有(C)互联网协会(2005年)。
This document is subject to the rights, licenses and restrictions contained in BCP 78 and at www.rfc-editor.org/copyright.html, and except as set forth therein, the authors retain all their rights.
本文件受BCP 78和www.rfc-editor.org/copyright.html中包含的权利、许可和限制的约束,除其中规定外,作者保留其所有权利。
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
本文件及其包含的信息是按“原样”提供的,贡献者、他/她所代表或赞助的组织(如有)、互联网协会和互联网工程任务组不承担任何明示或暗示的担保,包括但不限于任何保证,即使用本文中的信息不会侵犯任何权利,或对适销性或特定用途适用性的任何默示保证。
Intellectual Property
知识产权
The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
IETF对可能声称与本文件所述技术的实施或使用有关的任何知识产权或其他权利的有效性或范围,或此类权利下的任何许可可能或可能不可用的程度,不采取任何立场;它也不表示它已作出任何独立努力来确定任何此类权利。有关RFC文件中权利的程序信息,请参见BCP 78和BCP 79。
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.
向IETF秘书处披露的知识产权副本和任何许可证保证,或本规范实施者或用户试图获得使用此类专有权利的一般许可证或许可的结果,可从IETF在线知识产权存储库获取,网址为http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.
IETF邀请任何相关方提请其注意任何版权、专利或专利申请,或其他可能涵盖实施本标准所需技术的专有权利。请将信息发送至IETF的IETF-ipr@ietf.org.
Acknowledgement
确认
Funding for the RFC Editor function is currently provided by the Internet Society.
RFC编辑功能的资金目前由互联网协会提供。