Network Working Group                                         M. Crispin
Request for Comments: 5051                      University of Washington
Category: Standards Track                                   October 2007
        
Network Working Group                                         M. Crispin
Request for Comments: 5051                      University of Washington
Category: Standards Track                                   October 2007
        

i;unicode-casemap - Simple Unicode Collation Algorithm

我unicode casemap-简单的unicode排序算法

Status of This Memo

关于下段备忘

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

本文件规定了互联网社区的互联网标准跟踪协议,并要求进行讨论和提出改进建议。有关本协议的标准化状态和状态,请参考当前版本的“互联网官方协议标准”(STD 1)。本备忘录的分发不受限制。

Abstract

摘要

This document describes "i;unicode-casemap", a simple case-insensitive collation for Unicode strings. It provides equality, substring, and ordering operations.

本文档描述了“i;unicode casemap”,这是一种简单的unicode字符串不区分大小写的排序规则。它提供相等、子字符串和排序操作。

1. Introduction
1. 介绍

The "i;ascii-casemap" collation described in [COMPARATOR] is quite simple to implement and provides case-independent comparisons for the 26 Latin alphabetics. It is specified as the default and/or baseline comparator in some application protocols, e.g., [IMAP-SORT].

[COMPARATOR]中描述的“i;ascii casemap”排序规则实现起来非常简单,并为26个拉丁字母提供了独立于大小写的比较。在某些应用程序协议中,它被指定为默认和/或基线比较器,例如[IMAP-SORT]。

However, the "i;ascii-casemap" collation does not produce satisfactory results with non-ASCII characters. It is possible, with a modest extension, to provide a more sophisticated collation with greater multilingual applicability than "i;ascii-casemap". This extension provides case-independent comparisons for a much greater number of characters. It also collates characters with diacriticals with the non-diacritical character forms.

但是,“i;ascii casemap”排序规则对于非ascii字符不会产生令人满意的结果。通过适度的扩展,可以提供比“i;ascii casemap”更复杂的排序规则,具有更大的多语言适用性。此扩展为更多的字符提供了独立于大小写的比较。它还将带变音符号的字符和不带变音符号的字符形式进行比较。

This collation, "i;unicode-casemap", is intended to be an alternative to, and preferred over, "i;ascii-casemap". It does not replace the "i;basic" collation described in [BASIC].

此排序规则“i;unicode casemap”旨在替代“i;ascii casemap”,并优于“i;ascii casemap”。它不替换[basic]中描述的“i;basic”排序规则。

2. Unicode Casemap Collation Description
2. Unicode Casemap排序说明

The "i;unicode-casemap" collation is a simple collation which is case-insensitive in its treatment of characters. It provides equality, substring, and ordering operations. The validity test operation returns "valid" for any input.

“i;unicode casemap”排序规则是一种简单的排序规则,在处理字符时不区分大小写。它提供相等、子字符串和排序操作。有效性测试操作返回任何输入的“有效”。

This collation allows strings in arbitrary (and mixed) character sets, as long as the character set for each string is identified and it is possible to convert the string to Unicode. Strings which have an unidentified character set and/or cannot be converted to Unicode are not rejected, but are treated as binary.

此排序规则允许使用任意(和混合)字符集中的字符串,只要识别每个字符串的字符集并且可以将该字符串转换为Unicode。具有未识别字符集和/或无法转换为Unicode的字符串不会被拒绝,但会被视为二进制字符串。

Each input string is prepared by converting it to a "titlecased canonicalized UTF-8" string according to the following steps, using UnicodeData.txt ([UNICODE-DATA]):

通过使用UNICODE-DATA.txt([UNICODE-DATA]),按照以下步骤将每个输入字符串转换为“基于标题的规范化UTF-8”字符串,从而准备好每个输入字符串:

(1) A Unicode codepoint is obtained from the input string.

(1) 从输入字符串中获取Unicode码点。

(a) If the input string is in a known charset that can be converted to Unicode, a sequence in the string's charset is read and checked for validity according to the rules of that charset. If the sequence is valid, it is converted to a Unicode codepoint. Note that for input strings in UTF-8, the UTF-8 sequence must be valid according to the rules of [UTF-8]; e.g., overlong UTF-8 sequences are invalid.

(a) 如果输入字符串位于可以转换为Unicode的已知字符集中,则将读取该字符串字符集中的序列,并根据该字符集的规则检查其有效性。如果序列有效,则将其转换为Unicode码点。注意,对于UTF-8中的输入字符串,根据[UTF-8]的规则,UTF-8序列必须有效;e、 例如,过长的UTF-8序列无效。

(b) If the input string is in an unknown charset, or an invalid sequence occurs in step (1)(a), conversion ceases. No further preparation is performed, and any partial preparation results are discarded. The original string is used unchanged with the i;octet comparator.

(b) 如果输入字符串位于未知字符集中,或者在步骤(1)(a)中出现无效序列,则转换停止。不进行进一步的准备,并且丢弃任何部分准备结果。原始字符串的使用与i相同;八位组比较器。

(2) The following steps, using UnicodeData.txt ([UNICODE-DATA]), are performed on the resulting codepoint from step (1)(a).

(2) 使用UNICODE-DATA.txt([UNICODE-DATA])对步骤(1)(a)中得到的代码点执行以下步骤。

(a) If the codepoint has a titlecase property in UnicodeData.txt (this is normally the same as the uppercase property), the codepoint is converted to the codepoints in the titlecase property.

(a) 如果代码点在UnicodeData.txt中具有titlecase属性(通常与大写属性相同),则代码点将转换为titlecase属性中的代码点。

(b) If the resulting codepoint from (2)(a) has a decomposition property of any type in UnicodeData.txt, the codepoint is converted to the codepoints in the decomposition property. This step is recursively applied to each of the resulting codepoints until no more decomposition is possible (effectively Normalization Form KD).

(b) 如果(2)(a)中得到的代码点在UnicodeData.txt中具有任何类型的分解属性,则该代码点将转换为分解属性中的代码点。该步骤递归地应用于每个生成的代码点,直到不再可能进行分解(有效地规范化KD)。

          Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
          has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
          WITH SMALL LETTER Z WITH CARON).  Codepoint U+01C5 has a
          decomposition property of U+0044 (LATIN CAPITAL LETTER D)
          U+017E (LATIN SMALL LETTER Z WITH CARON).  U+017E has a
          decomposition property of U+007A (LATIN SMALL LETTER Z) U+030c
        
          Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
          has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
          WITH SMALL LETTER Z WITH CARON).  Codepoint U+01C5 has a
          decomposition property of U+0044 (LATIN CAPITAL LETTER D)
          U+017E (LATIN SMALL LETTER Z WITH CARON).  U+017E has a
          decomposition property of U+007A (LATIN SMALL LETTER Z) U+030c
        

(COMBINING CARON). Neither U+0044, U+007A, nor U+030C have any decomposition properties. Therefore, U+01C4 is converted to U+0044 U+007A U+030C by this step.

(卡隆)。U+0044、U+007A和U+030C都没有任何分解特性。因此,该步骤将U+01C4转换为U+0044 U+007A U+030C。

(3) The resulting codepoint(s) from step (2) is/are appended, in UTF-8 format, to the "titlecased canonicalized UTF-8" string.

(3) 步骤(2)产生的代码点以UTF-8格式附加到“基于标题的规范化UTF-8”字符串中。

(4) Repeat from step (1) until there is no more data in the input string.

(4) 重复步骤(1),直到输入字符串中没有更多数据。

Following the above preparation process on each string, the equality, ordering, and substring operations are as for i;octet.

在对每个字符串执行上述准备过程之后,相等、排序和子字符串操作与i相同;八重奏。

It is permitted to use an alternative implementation of the above preparation process if it produces the same results. For example, it may be more convenient for an implementation to convert all input strings to a sequence of UTF-16 or UTF-32 values prior to performing any of the step (2) actions. Similarly, if all input strings are (or are convertible to) Unicode, it may be possible to use UTF-32 as an alternative to UTF-8 in step (3).

如果上述制备过程产生相同的结果,则允许使用上述制备过程的替代实施方案。例如,在执行任何步骤(2)操作之前,实现将所有输入字符串转换为UTF-16或UTF-32值序列可能更方便。类似地,如果所有输入字符串都是Unicode(或可转换为Unicode),则可以在步骤(3)中将UTF-32用作UTF-8的替代品。

Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3), because UTF-16 surrogates will cause i;octet to collate codepoints U+E0000 through U+FFFF after non-BMP codepoints.

注:UTF-16不适合作为步骤(3)中UTF-8的替代品,因为UTF-16替代物会导致i;八位字节,用于在非BMP代码点之后,通过U+FFFF整理代码点U+E0000。

This collation is not locale sensitive. Consequently, care should be taken when using OS-supplied functions to implement this collation. Functions such as strcasecmp and toupper are sometimes locale sensitive and may inconsistently casemap letters.

此排序规则不区分区域设置。因此,在使用操作系统提供的函数来实现此排序规则时,应小心。诸如strcasecmp和toupper之类的函数有时是区域设置敏感的,并且可能不一致地使用大小写映射字母。

The i;unicode-casemap collation is well suited to use with many Internet protocols and computer languages. Use with natural language is often inappropriate; even though the collation apparently supports languages such as Swahili and English, in real-world use it tends to mis-sort a number of types of string:

我;unicode casemap排序非常适合与许多Internet协议和计算机语言一起使用。使用自然语言通常是不合适的;尽管排序规则显然支持斯瓦希里语和英语等语言,但在实际使用中,它往往会对多种类型的字符串进行错误排序:

o people and place names containing scripts that are not collated according to "alphabetical order". o words with characters that have diacriticals. However, i;unicode-casemap generally does a better job than i;ascii-casemap for most (but not all) languages. For example, German umlaut letters will sort correctly, but some Scandinavian letters will not. o names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike in English), o strings containing other non-letter symbols; e.g., euro and pound sterling symbols, quotation marks other than '"', dashes/hyphens, etc.

o 包含未按“字母顺序”整理的脚本的人名和地名。o带有带变音符号的字符的单词。但是我,;unicode casemap通常比我做得更好;大多数(但不是所有)语言的ascii casemap。例如,德语乌姆劳特字母将正确排序,但某些斯堪的纳维亚字母不会。o名称,如“Lloyd”(与英语不同,威尔士语在“Lyon”之后排序),o包含其他非字母符号的字符串;e、 例如,欧元和英镑符号、除“'”以外的引号、破折号/连字符等。

3. Unicode Casemap Collation Registration
3. Unicode案例地图排序注册
   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="5051" scope="global" intendedUse="common">
   <identifier>i;unicode-casemap</identifier>
   <title>Unicode Casemap</title>
   <operations>equality order substring</operations>
   <specification>RFC 5051</specification>
   <owner>IETF</owner>
   <submitter>mrc@cac.washington.edu</submitter>
   </collation>
        
   <?xml version='1.0'?>
   <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
   <collation rfc="5051" scope="global" intendedUse="common">
   <identifier>i;unicode-casemap</identifier>
   <title>Unicode Casemap</title>
   <operations>equality order substring</operations>
   <specification>RFC 5051</specification>
   <owner>IETF</owner>
   <submitter>mrc@cac.washington.edu</submitter>
   </collation>
        
4. Security Considerations
4. 安全考虑

The security considerations for [UTF-8], [STRINGPREP], and [UNICODE-SECURITY] apply and are normative to this specification.

[UTF-8]、[STRINGPREP]和[UNICODE-security]的安全注意事项适用于本规范,并且是本规范的规范性内容。

The results from this comparator will vary depending upon the implementation for several reasons. Implementations MUST consider whether these possibilities are a problem for their use case:

由于若干原因,该比较结果将因实施情况而异。实现必须考虑这些可能性是否是用例的问题:

1) New characters added in Unicode may have decomposition or titlecase properties that will not be known to an implementation based upon an older revision of Unicode. This impacts step (2).

1) 在Unicode中添加的新字符可能具有基于旧版本Unicode的实现所不知道的分解或标题酶属性。这将影响第(2)步。

2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that does not require normalization of out-of-order diacriticals. However, an implementation MAY use an NFKD library routine that does such normalization. This impacts step (2)(b) and possibly also step (1)(a), and is an issue only with ill-formed UTF-8 input.

2) 步骤(2)(b)定义了标准化形式KD(NFKD)的子集,该子集不需要对无序变音键进行标准化。然而,一个实现可以使用一个NFKD库例程来进行这种规范化。这会影响第(2)(b)步,也可能会影响第(1)(a)步,并且仅在格式错误的UTF-8输入中存在问题。

3) The set of charsets handled in step (1)(a) is open-ended. UTF-8 (and, by extension, US-ASCII) are the only mandatory-to-implement charsets. This impacts step (1)(a).

3) 步骤(1)(a)中处理的字符集是开放式的。UTF-8(以及扩展后的US-ASCII)是实现字符集的唯一强制要求。这会影响步骤(1)(a)。

Implementations SHOULD, as far as feasible, support all the charsets they are likely to encounter in the input data, in order to avoid poor collation caused by the fall through to the (1)(b) rule.

在可行的情况下,实现应该尽可能支持它们在输入数据中可能遇到的所有字符集,以避免由于(1)(b)规则的错误而导致糟糕的排序。

4) Other charsets may have revisions which add new characters that are not known to an implementation based upon an older revision. This impacts step (1)(a) and possibly also step (1)(b).

4) 其他字符集可能有版本,这些版本添加了基于旧版本的实现所不知道的新字符。这会影响步骤(1)(a),也可能影响步骤(1)(b)。

An attacker may create input that is ill-formed or in an unknown charset, with the intention of impacting the results of this comparator or exploiting other parts of the system which process this input in different ways. Note, however, that even well-formed data in a known charset can impact the result of this comparator in unexpected ways. For example, an attacker can substitute U+0041 (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or U+0410 (CYRILLIC CAPITAL LETTER A) in the intention of causing a non-match of strings which visually appear the same and/or causing the string to appear elsewhere in a sort.

攻击者可能创建格式错误或未知字符集的输入,目的是影响此比较器的结果,或利用系统中以不同方式处理此输入的其他部分。但是,请注意,即使是已知字符集中格式良好的数据也可能以意外的方式影响此比较器的结果。例如,攻击者可以将U+0041(拉丁文大写字母A)替换为U+0391(希腊语大写字母ALPHA)或U+0410(西里尔语大写字母A),目的是造成视觉上看起来相同的字符串不匹配和/或导致该字符串出现在排序的其他位置。

5. IANA Considerations
5. IANA考虑

The i;unicode-casemap collation defined in section 2 has been added to the registry of collations defined in [COMPARATOR].

我;第2节中定义的unicode casemap排序规则已添加到[COMPARATOR]中定义的排序规则注册表中。

6. Normative References
6. 规范性引用文件

[COMPARATOR] Newman, C., Duerst, M., and A. Gulbrandsen, "Internet Application Protocol Collation Registry", RFC 4790, February 2007.

[比较方]Newman,C.,Duerst,M.,和A.Gulbrandsen,“互联网应用程序协议整理注册表”,RFC 47902007年2月。

[STRINGPREP] Hoffman, P. and M. Blanchet, "Preparation of Internationalized Strings ("stringprep")", RFC 3454, December 2002.

[STRINGPREP]Hoffman,P.和M.Blanchet,“国际化弦的准备(“STRINGPREP”)”,RFC 3454,2002年12月。

[UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, November 2003.

[UTF-8]Yergeau,F.,“UTF-8,ISO 10646的转换格式”,STD 63,RFC 3629,2003年11月。

   [UNICODE-DATA]        <http://www.unicode.org/Public/UNIDATA/
                         UnicodeData.txt>
        
   [UNICODE-DATA]        <http://www.unicode.org/Public/UNIDATA/
                         UnicodeData.txt>
        

Although the UnicodeData.txt file referenced here is part of the Unicode standard, it is subject to change as new characters are added to Unicode and errors are corrected in Unicode revisions. As a result, it may be less stable than might otherwise be implied by the standards status of this specification.

尽管此处引用的UnicodeData.txt文件是Unicode标准的一部分,但随着新字符添加到Unicode,并且在Unicode修订版中更正错误,该文件可能会发生更改。因此,其稳定性可能低于本规范的标准状态所暗示的稳定性。

[UNICODE-SECURITY] Davis, M. and M. Suignard, "Unicode Security Considerations", February 2006, <http://www.unicode.org/reports/tr36/>.

[UNICODE-SECURITY]Davis,M.和M.Suignard,“UNICODE安全注意事项”,2006年2月<http://www.unicode.org/reports/tr36/>.

7. Informative References
7. 资料性引用

[BASIC] Newman, C., Duerst, M., and A. Gulbrandsen, "i;basic - the Unicode Collation Algorithm", Work in Progress, March 2007.

[BASIC]Newman,C.,Duerst,M.,和A.Gulbrandsen,“i;BASIC-Unicode排序算法”,正在进行的工作,2007年3月。

[IMAP-SORT] Crispin, M. and K. Murchison, "Internet Message Access Protocol - SORT and THREAD Extensions", Work in Progress, September 2007.

[IMAP-SORT]Crispin,M.和K.Murchison,“互联网消息访问协议-排序和线程扩展”,正在进行的工作,2007年9月。

Author's Address

作者地址

Mark R. Crispin Networks and Distributed Computing University of Washington 4545 15th Avenue NE Seattle, WA 98105-4527

Mark R. Crispin网络与分布式计算华盛顿大学西雅图第十五大街4545号WA98105-427

   Phone: +1 (206) 543-5762
   EMail: MRC@CAC.Washington.EDU
        
   Phone: +1 (206) 543-5762
   EMail: MRC@CAC.Washington.EDU
        

Full Copyright Statement

完整版权声明

Copyright (C) The IETF Trust (2007).

版权所有(C)IETF信托基金(2007年)。

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

本文件受BCP 78中包含的权利、许可和限制的约束,除其中规定外,作者保留其所有权利。

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

本文件及其包含的信息以“原样”为基础提供,贡献者、他/她所代表或赞助的组织(如有)、互联网协会、IETF信托基金和互联网工程任务组不承担任何明示或暗示的担保,包括但不限于任何保证,即使用本文中的信息不会侵犯任何权利,或对适销性或特定用途适用性的任何默示保证。

Intellectual Property

知识产权

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

IETF对可能声称与本文件所述技术的实施或使用有关的任何知识产权或其他权利的有效性或范围,或此类权利下的任何许可可能或可能不可用的程度,不采取任何立场;它也不表示它已作出任何独立努力来确定任何此类权利。有关RFC文件中权利的程序信息,请参见BCP 78和BCP 79。

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

向IETF秘书处披露的知识产权副本和任何许可证保证,或本规范实施者或用户试图获得使用此类专有权利的一般许可证或许可的结果,可从IETF在线知识产权存储库获取,网址为http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

IETF邀请任何相关方提请其注意任何版权、专利或专利申请,或其他可能涵盖实施本标准所需技术的专有权利。请将信息发送至IETF的IETF-ipr@ietf.org.