Network Working Group                                          B. Curtin
Request for Comments: 2640            Defense Information Systems Agency
Updates: 959                                                   July 1999
Category: Proposed Standard
Internationalization of the File Transfer Protocol
Status of this Memo
This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (1999). All Rights Reserved.
Abstract
The File Transfer Protocol, as defined in RFC 959 [RFC959] and RFC 1123 Section 4 [RFC1123], is one of the oldest and most widely used protocols on the Internet. The protocol's primary character set, 7 bit ASCII, has served the protocol well through the early growth years of the Internet. However, as the Internet becomes more global, there is a need to support character sets beyond 7 bit ASCII.
This document addresses the internationalization (I18n) of FTP, which includes supporting the multiple character sets and languages found throughout the Internet community. This is achieved by extending the FTP specification and giving recommendations for proper internationalization support.
Table of Contents
ABSTRACT
1 INTRODUCTION
  1.1 Requirements Terminology
2 INTERNATIONALIZATION
  2.1 International Character Set
  2.2 Transfer Encoding Set
3 PATHNAMES
  3.1 General compliance
  3.2 Servers compliance
  3.3 Clients compliance
4 LANGUAGE SUPPORT
  4.1 The LANG command
  4.2 Syntax of the LANG command
  4.3 Feat response for LANG command
    4.3.1 Feat examples
5 SECURITY CONSIDERATIONS
6 ACKNOWLEDGMENTS
7 GLOSSARY
8 BIBLIOGRAPHY
9 AUTHOR'S ADDRESS
ANNEX A - IMPLEMENTATION CONSIDERATIONS
  A.1 General Considerations
  A.2 Transition Considerations
ANNEX B - SAMPLE CODE AND EXAMPLES
  B.1 Valid UTF-8 check
  B.2 Conversions
    B.2.1 Conversion from Local Character Set to UTF-8
    B.2.2 Conversion from UTF-8 to Local Character Set
    B.2.3 ISO/IEC 8859-8 Example
    B.2.4 Vendor Codepage Example
  B.3 Pseudo Code for Translating Servers
Full Copyright Statement
1 Introduction
As the Internet grows throughout the world the requirement to support character sets outside of the ASCII [ASCII] / Latin-1 [ISO-8859] character set becomes ever more urgent. For FTP, because of the large installed base, it is paramount that this is done without breaking existing clients and servers. This document addresses this need. In doing so it defines a solution which will still allow the installed base to interoperate with new clients and servers.
This document enhances the capabilities of the File Transfer Protocol by removing the 7-bit restriction on pathnames used in client commands and server responses. It RECOMMENDS the use of the Universal Character Set (UCS) defined in ISO/IEC 10646 [ISO-10646], RECOMMENDS the UCS transformation format (UTF) UTF-8 [UTF-8], and defines a new command for language negotiation.
The recommendations made in this document are consistent with the recommendations expressed by the IETF policy related to character sets and languages as defined in RFC 2277 [RFC2277].
1.1 Requirements Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [BCP14].
2 Internationalization
The File Transfer Protocol was developed when the predominant character sets were 7 bit ASCII and 8 bit EBCDIC. Today these character sets cannot support the wide range of characters needed by multinational systems. Given that there are a number of character sets in current use that provide more characters than 7-bit ASCII, it makes sense to decide on a convenient way to represent the union of those possibilities. Working globally requires either support for a number of character sets and the ability to convert between them, or the use of a single preferred character set. To assure global interoperability this document RECOMMENDS the latter approach and defines a single character set, in addition to NVT ASCII and EBCDIC, which is understandable by all systems. For FTP this character set SHALL be ISO/IEC 10646:1993. For support of global compatibility it is STRONGLY RECOMMENDED that clients and servers use UTF-8 encoding when exchanging pathnames. Clients and servers are, however, under no obligation to perform any conversion on the contents of a file for operations such as STOR or RETR.
The character set used to store files SHALL remain a local decision and MAY depend on the capability of local operating systems. Prior to the exchange of pathnames they SHOULD be converted into an ISO/IEC 10646 format and UTF-8 encoded. This approach, while allowing international exchange of pathnames, will still allow backward compatibility with older systems because the code set positions for ASCII characters are identical to the one byte sequence in UTF-8.
Sections 2.1 and 2.2 give a brief description of the international character set and transfer encoding RECOMMENDED by this document. A more thorough description of UTF-8, ISO/IEC 10646, and UNICODE [UNICODE], beyond that given in this document, can be found in RFC 2279 [RFC2279].
2.1 International Character Set
The character set defined for international support of FTP SHALL be the Universal Character Set as defined in ISO 10646:1993 as amended. This standard incorporates the character sets of many existing international, national, and corporate standards. ISO/IEC 10646 defines two alternate forms of encoding, UCS-4 and UCS-2. UCS-4 is a four byte (31 bit) encoding containing 2**31 code positions divided into 128 groups of 256 planes. Each plane consists of 256 rows of 256 cells. UCS-2 is a 2 byte (16 bit) character set consisting of plane zero or the Basic Multilingual Plane (BMP). Currently, no codesets have been defined outside of the 2 byte BMP.
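The group, plane, row and cell arithmetic described above can be illustrated with a short sketch. The names `ucs4_position` and `in_bmp` and the dictionary keys are illustrative assumptions; they do not come from ISO/IEC 10646 or from this document.

```python
def ucs4_position(code: int) -> dict:
    """Split a 31-bit UCS-4 code position into group, plane, row and cell."""
    if not 0 <= code < 2**31:
        raise ValueError("UCS-4 code positions occupy 31 bits")
    return {
        "group": code >> 24,           # 7 bits: 128 groups
        "plane": (code >> 16) & 0xFF,  # 256 planes per group
        "row": (code >> 8) & 0xFF,     # 256 rows per plane
        "cell": code & 0xFF,           # 256 cells per row
    }


def in_bmp(code: int) -> bool:
    """True when the code position lies in group 0, plane 0 (the BMP)."""
    pos = ucs4_position(code)
    return pos["group"] == 0 and pos["plane"] == 0
```

A code position that fits in 16 bits always lands in the BMP, which is why UCS-2 can represent it in two bytes.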
The Unicode standard version 2.0 [UNICODE] is consistent with the UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0 includes the repertoire of IS 10646 characters, amendments 1-7 of IS 10646, and editorial and technical corrigenda.
2.2 Transfer Encoding Set
UCS Transformation Format 8 (UTF-8), in the past referred to as UTF-2 or UTF-FSS, SHALL be used as a transfer encoding to transmit the international character set. UTF-8 is a file safe encoding which avoids the use of byte values that have special significance during the parsing of pathname character strings. UTF-8 is an 8 bit encoding of the characters in the UCS. Some of UTF-8's benefits are that it is compatible with 7 bit ASCII, so it doesn't affect programs that give special meanings to various ASCII characters; it is immune to synchronization errors; its encoding rules allow for easy identification; and it has enough space to support a large number of character sets.
UTF-8 encoding represents each UCS character as a sequence of 1 to 6 bytes in length. For all sequences of one byte the most significant bit is ZERO. For all sequences of more than one byte the number of ONE bits at the start of the first byte, counted from the most significant bit position and terminated by a ZERO bit, indicates the number of bytes in the UTF-8 sequence. For example, the first byte of a 3 byte UTF-8 sequence would have 1110 as its most significant bits. Each additional byte (continuing byte) in the UTF-8 sequence contains a ONE bit followed by a ZERO bit as its most significant bits. The remaining free bit positions in the continuing bytes are used to identify characters in the UCS. The relationship between UCS and UTF-8 is demonstrated in the following table:
UCS-4 range (hex)     UTF-8 byte sequence (binary)
00000000 - 0000007F   0xxxxxxx
00000080 - 000007FF   110xxxxx 10xxxxxx
00000800 - 0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
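As an illustration of the table, the following sketch encodes a single UCS-4 code position into the 1 to 6 byte forms shown above. The function name `ucs4_to_utf8` is an assumption made for this example. Note that Python's built-in encoder follows a later, more restrictive definition of UTF-8 (at most 4 bytes), so the two agree only for the code positions that both accept.

```python
def ucs4_to_utf8(code: int) -> bytes:
    """Encode one UCS-4 code position using the byte forms in the table."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code position outside the 31-bit UCS-4 range")
    if code <= 0x7F:
        return bytes([code])  # single byte, most significant bit is ZERO
    # (upper limit of the range, first-byte marker, number of continuing bytes)
    forms = [
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    ]
    for limit, marker, continuing in forms:
        if code <= limit:
            break
    # First byte: marker bits plus the highest-order payload bits.
    out = [marker | (code >> (6 * continuing))]
    # Continuing bytes: 10xxxxxx, six payload bits each, high bits first.
    for shift in range(6 * (continuing - 1), -1, -6):
        out.append(0x80 | ((code >> shift) & 0x3F))
    return bytes(out)
```

For example, `ucs4_to_utf8(0x20AC)` yields the three bytes E2 82 AC, matching the third row of the table.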
A beneficial property of UTF-8 is that its single byte sequence is consistent with the ASCII character set. This feature will allow a transition where old ASCII-only clients can still interoperate with new servers that support the UTF-8 encoding.
Another feature is that the encoding rules make it very unlikely that a character sequence from a different character set will be mistaken for a UTF-8 encoded character sequence. Clients and servers can use a simple routine to determine if the character set being exchanged is valid UTF-8. Section B.1 shows a code example of this check.
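A minimal sketch of such a routine is given here for orientation; it is not the sample code of Section B.1. The name `looks_like_utf8` is an assumption, and the check is purely structural: it verifies first bytes and continuing bytes against the table in this section and does not reject overlong forms.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Structural check: every first byte is followed by the right number
    of continuing bytes of the form 10xxxxxx."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            follow = 0          # 0xxxxxxx
        elif 0xC0 <= b <= 0xDF:
            follow = 1          # 110xxxxx
        elif 0xE0 <= b <= 0xEF:
            follow = 2          # 1110xxxx
        elif 0xF0 <= b <= 0xF7:
            follow = 3          # 11110xxx
        elif 0xF8 <= b <= 0xFB:
            follow = 4          # 111110xx
        elif 0xFC <= b <= 0xFD:
            follow = 5          # 1111110x
        else:
            return False        # stray continuing byte, or 0xFE / 0xFF
        end = i + 1 + follow
        if end > n:
            return False        # sequence is truncated
        for c in data[i + 1:end]:
            if c & 0xC0 != 0x80:
                return False    # not a continuing byte
        i = end
    return True
```

A Latin-1 name such as `caf\xe9.txt` fails the check because 0xE9 announces a 3 byte sequence but is followed by ASCII bytes, which illustrates why other character sets are rarely mistaken for UTF-8.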
3 Pathnames
3.1 General compliance
- The 7-bit restriction for pathnames exchanged is dropped.
- Many operating systems allow the use of spaces <SP>, carriage return <CR>, and line feed <LF> characters as part of the pathname. The exchange of pathnames with these special command characters will cause the pathnames to be parsed improperly. This is because FTP commands associated with pathnames have the form:
COMMAND <SP> <pathname> <CRLF>.
To allow the exchange of pathnames containing these characters, the definition of pathname is changed from
   <pathname> ::= <string>          ; in BNF format
to
   pathname = 1*(%x01..%xFF)        ; in ABNF format [ABNF].
To avoid mistaking these characters within pathnames as special command characters the following rules will apply:
There MUST be only one <SP> between an FTP command and the pathname. Implementations MUST assume <SP> characters following the initial <SP> as part of the pathname. For example the pathname in STOR <SP><SP><SP>foo.bar<CRLF> is <SP><SP>foo.bar.
Current implementations, which may allow multiple <SP> characters as separators between the command and pathname, MUST assure that they comply with this single <SP> convention. Note: Implementations which treat 3 character commands (e.g. CWD, MKD, etc.) as a fixed 4 character command by padding the command with a trailing <SP> do not comply with this specification.
When a <CR> character is encountered as part of a pathname it MUST be padded with a <NUL> character prior to sending the command. On receipt of a pathname containing a <CR><NUL> sequence the <NUL> character MUST be stripped away. This approach is described in the Telnet protocol [RFC854] on pages 11 and 12. For example, to store a pathname foo<CR><LF>boo.bar the pathname would become
foo<CR><NUL><LF>boo.bar prior to sending the command STOR <SP>foo<CR><NUL><LF>boo.bar<CRLF>. Upon receipt of the altered pathname the <NUL> character following the <CR> would be stripped away to form the original pathname.
- Conforming clients and servers MUST support UTF-8 for the transfer and receipt of pathnames. Clients and servers MAY in addition give users a choice of specifying interpretation of pathnames in another encoding. Note that configuring clients and servers to use character sets / encoding other than UTF-8 is outside of the scope of this document. While it is recognized that in certain operational scenarios this may be desirable, this is left as a quality of implementation and operational issue.
- Pathnames are sequences of bytes. The encoding of names that are valid UTF-8 sequences is assumed to be UTF-8. The character set of other names is undefined. Clients and servers, unless otherwise configured to support a specific native character set, MUST check for a valid UTF-8 byte sequence to determine if the pathname being presented is UTF-8.
- To avoid data loss, clients and servers SHOULD use the UTF-8 encoded pathnames when unable to convert them to a usable code set.
- There may be cases when the code set / encoding presented to the server or client cannot be determined. In such cases the raw bytes SHOULD be used.
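The framing rules in this section (a single <SP> separator, <NUL> padding after <CR>, and treating valid UTF-8 names as UTF-8 while keeping other names as raw bytes) can be sketched as follows. The function names are assumptions made for illustration. Python's strict decoder is used as the validity check; it follows a later and stricter definition of UTF-8 than the 6 byte form described in this document.

```python
from typing import Tuple, Union


def build_command(verb: str, pathname: bytes) -> bytes:
    """Frame a pathname command: one <SP>, <CR> padded with <NUL>, then <CRLF>."""
    if not pathname or b"\x00" in pathname:
        raise ValueError("pathname must be 1 or more octets in %x01..%xFF")
    padded = pathname.replace(b"\r", b"\r\x00")
    return verb.encode("ascii") + b" " + padded + b"\r\n"


def parse_command(line: bytes) -> Tuple[str, bytes]:
    """Split a received command line at the first <SP> only."""
    if not line.endswith(b"\r\n"):
        raise ValueError("command line must end with <CRLF>")
    verb, _, rest = line[:-2].partition(b" ")
    # Every further <SP> belongs to the pathname; strip <NUL> after <CR>.
    return verb.decode("ascii").upper(), rest.replace(b"\r\x00", b"\r")


def interpret_pathname(raw: bytes) -> Union[str, bytes]:
    """Valid UTF-8 is assumed to be UTF-8; anything else stays raw bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw
```

With these helpers, `build_command("STOR", b"foo\r\nboo.bar")` produces the padded form from the example above, and `parse_command` restores the original pathname on receipt.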
3.2 Servers compliance
- Servers MUST support the UTF-8 feature in response to the FEAT command [RFC2389]. The UTF-8 feature is a line containing the exact string "UTF8". This string is not case sensitive, but SHOULD be transmitted in upper case. The response to a FEAT command SHOULD be:
C> feat
S> 211- <any descriptive text>
S>  ...
S>  UTF8
S>  ...
S> 211 end
The ellipses indicate placeholders where other features may be included, but are NOT REQUIRED. The one space indentation of the feature lines is mandatory [RFC2389].
- Mirror servers may want to exactly reflect the site that they are mirroring. In such cases servers MAY store and present the exact pathname bytes that they received from the main server.
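A client-side reading of the FEAT reply described in this section might look like the sketch below. The function name `parse_feat` and the choice of a dictionary as the result are assumptions. The reply is taken as a list of already received lines, and only lines that begin with a single space are treated as feature lines.

```python
def parse_feat(reply_lines):
    """Collect feature lines from a multi-line 211 FEAT reply.

    Returns a dict mapping the upper-cased feature name to its
    parameter string (empty when the feature has no parameters).
    """
    features = {}
    for line in reply_lines[1:-1]:          # skip "211-..." and "211 end"
        line = line.rstrip("\r\n")
        if not line.startswith(" "):
            continue                        # not a feature line
        name, _, params = line[1:].partition(" ")
        if not name:
            continue                        # more than one leading space
        features[name.upper()] = params     # names are not case sensitive
    return features


reply = ["211- Extensions supported", " UTF8", " LANG EN*;FR", "211 end"]
supports_utf8 = "UTF8" in parse_feat(reply)
```

Upper-casing the feature name reflects the statement that the "UTF8" string is not case sensitive.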
3.3 Clients compliance
- Clients which do not require display of pathnames are under no obligation to do so. Non-display clients do not need to conform to requirements associated with display.
- Clients that are presented with UTF-8 pathnames by the server SHOULD parse UTF-8 correctly and attempt to display the pathname within the limitations of the resources available.
- Clients MUST support the FEAT command and recognize the "UTF8" feature (defined in 3.2 above) to determine if a server supports UTF-8 encoding.
- Character semantics of other names shall remain undefined. If a client detects that a server is non UTF-8, it SHOULD change its display appropriately. How a client implementation handles non UTF-8 is a quality of implementation issue. It MAY try to assume some other encoding, give the user a chance to try to assume something, or save encoding assumptions for a server from one FTP session to another.
- Glyph rendering is outside the scope of this document. How a client presents characters it cannot display is a quality of implementation issue. This document RECOMMENDS that octets corresponding to non-displayable characters SHOULD be presented in URL %HH format defined in RFC 1738 [RFC1738]. They MAY, however, display them as question marks, with their UCS hexadecimal value, or in any other suitable fashion.
- Many existing clients interpret 8-bit pathnames as being in the local character set. They MAY continue to do so for pathnames that are not valid UTF-8.
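One way a display client could apply the %HH recommendation is sketched below. The function name `render_pathname` and the parameter `terminal_charset` (the character set the display can show) are assumptions for illustration. A character counts as displayable here if it is printable and can be encoded in that character set, which is only one of several reasonable definitions.

```python
def render_pathname(raw: bytes, terminal_charset: str = "ascii") -> str:
    """Show a pathname, replacing non-displayable characters with %HH octets."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: keep printable ASCII, escape every other octet.
        return "".join(
            chr(b) if 0x20 <= b < 0x7F else f"%{b:02X}" for b in raw
        )
    out = []
    for ch in text:
        try:
            ch.encode(terminal_charset)
            displayable = ch.isprintable()
        except UnicodeEncodeError:
            displayable = False
        if displayable:
            out.append(ch)
        else:
            # Escape each UTF-8 octet of the character as %HH.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)
```

This sketch does not escape a literal "%" in the name, so the output is meant for display only and cannot be reversed reliably.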
4 Language Support
The Character Set Workshop Report [RFC2130] suggests that clients and servers SHOULD negotiate a language for "greetings" and "error messages". This specification interprets the use of the term "error message", by RFC 2130, to mean any explanatory text string returned by server-PI in response to a user-PI command.
Implementers SHOULD note that FTP commands and numeric responses are protocol elements. As such, their use is not affected by any guidance expressed by this specification.
Language support of greetings and command responses shall be the default language supported by the server or the language supported by the server and selected by the client.
It may be possible to achieve language support through a virtual host as described in [MLST]. However, an FTP server might not support virtual servers, or virtual servers might be configured to support an environment without regard for language. To allow language negotiation this specification defines a new LANG command. Clients and servers that comply with this specification MUST support the LANG command.
4.1 The LANG command
A new command "LANG" is added to the FTP command set to allow server-FTP process to determine in which language to present server greetings and the textual part of command responses. The parameter associated with the LANG command SHALL be one of the language tags defined in RFC 1766 [RFC1766]. If a LANG command without a parameter is issued the server's default language will be used.
Greetings and responses issued prior to language negotiation SHALL be in the server's default language. Paragraph 4.5 of [RFC2277] states that this "default language MUST be understandable by an English-speaking person". This specification RECOMMENDS that the server default language be English encoded using ASCII. This text may be augmented by text from other languages. Once negotiated, server-PI MUST return server messages and textual part of command responses in the negotiated language and encoded in UTF-8. Server-PI MAY wish to re-send previously issued server messages in the newly negotiated language.
The LANG command only affects presentation of greeting messages and explanatory text associated with command responses. No attempt should be made by the server to translate protocol elements (FTP commands and numeric responses) or data transmitted over the data connection.
User-PI MAY issue the LANG command at any time during an FTP session. In order to gain the full benefit of this command, it SHOULD be presented prior to authentication. In general, it will be issued after the HOST command [MLST]. Note that the issuance of a HOST or REIN command [RFC959] will negate the effect of the LANG command. User-PI SHOULD be capable of supporting UTF-8 encoding for the language negotiated. Guidance on interpretation and rendering of UTF-8, defined in section 3, SHALL apply.
Although NOT REQUIRED by this specification, a user-PI SHOULD issue a FEAT command [RFC2389] prior to a LANG command. This will allow the user-PI to determine if the server supports the LANG command and which language options it offers.
In order to aid the server in identifying whether a connection has been established with a client which conforms to this specification or an older client, user-PI MUST send a HOST [MLST] and/or LANG command prior to issuing any other command (other than FEAT [RFC2389]). If user-PI issues a HOST command, and the server's default language is acceptable, it need not issue a LANG command. However, if the implementation does not support the HOST command, a LANG command MUST be issued. Until server-PI is presented with either a HOST or LANG command it SHOULD assume that the user-PI does not comply with this specification.
4.2 Syntax of the LANG command
The LANG command is defined as follows:
lang-command          = "Lang" [(SP lang-tag)] CRLF
lang-tag              = Primary-tag *( "-" Sub-tag)
Primary-tag           = 1*8ALPHA
Sub-tag               = 1*8ALPHA

lang-response         = lang-ok / error-response
lang-ok               = "200" [SP *(%x00..%xFF) ] CRLF
error-response        = command-unrecognized / bad-argument /
                        not-implemented / unsupported-parameter
command-unrecognized  = "500" [SP *(%x01..%xFF) ] CRLF
bad-argument          = "501" [SP *(%x01..%xFF) ] CRLF
not-implemented       = "502" [SP *(%x01..%xFF) ] CRLF
unsupported-parameter = "504" [SP *(%x01..%xFF) ] CRLF
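The lang-command grammar above translates fairly directly into a regular expression. The sketch below is one possible server-side reading; the name `parse_lang_command` and the convention of returning `None` for a LANG command without a parameter are assumptions made for this example.

```python
import re
from typing import Optional

# lang-tag = Primary-tag *( "-" Sub-tag), each part being 1*8ALPHA
_LANG_TAG = r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*"
_LANG_COMMAND = re.compile(
    r"lang(?: (" + _LANG_TAG + r"))?\r\n", re.IGNORECASE
)


def parse_lang_command(line: bytes) -> Optional[str]:
    """Return the requested lang-tag, or None when no parameter was given.

    Raises ValueError when the line does not match lang-command, which a
    server would typically answer with a 501 response.
    """
    match = _LANG_COMMAND.fullmatch(line.decode("latin-1"))
    if match is None:
        raise ValueError("not a syntactically correct LANG command")
    return match.group(1)
```

The tag is returned exactly as received; since language tags are case insensitive, a caller would normally fold case before comparing it with the tags it supports.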
The "lang" command word is case independent and may be specified in any character case desired. Therefore "LANG", "lang", "Lang", and "lAnG" are equivalent commands.
The OPTIONAL "Lang-tag" given as a parameter specifies the primary language tag and zero or more sub-tags as defined in [RFC1766]. As described in [RFC1766] language tags are treated as case insensitive. If omitted, server-PI MUST use the server's default language.
Server-FTP responds to the "Lang" command with either "lang-ok" or "error-response". "lang-ok" MUST be sent if Server-FTP supports the "Lang" command and can support some form of the "lang-tag". Support SHOULD be as follows:
- If server-FTP receives "Lang" with no parameters it SHOULD return messages and command responses in the server default language.
- If server-FTP receives "Lang" with only a primary tag argument (e.g. en, fr, de, ja, zh, etc.), which it can support, it SHOULD return messages and command responses in the language associated with that primary tag. It is possible that server-FTP will only support the primary tag when combined with a sub-tag (e.g. en-US, en-UK, etc.). In such cases, server-FTP MAY determine the appropriate variant to use during the session. How server-FTP makes that determination is outside the scope of this specification. If server-FTP cannot determine if a sub-tag variant is appropriate it SHOULD return an "unsupported-parameter" (504) response.
- If server-FTP receives "Lang" with a primary tag and sub-tag(s) argument, which is implemented, it SHOULD return messages and command responses in support of the language argument. It is possible that server-FTP can support the primary tag of the "Lang" argument but not the sub-tag(s). In such cases server-FTP MAY return messages and command responses in the most appropriate variant of the primary tag that has been implemented. How server-FTP makes that determination is outside the scope of this specification. If server-FTP cannot determine if a sub-tag variant is appropriate it SHOULD return an "unsupported-parameter" (504) response.
For example, if client-FTP sends a "LANG en-AU" command and server-FTP has implemented language tags en-US and en-UK it may decide that the most appropriate language tag is en-UK and return "200 en-AU not supported. Language set to en-UK". The numeric response is a protocol element and cannot be changed. The associated string is for illustrative purposes only.
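How server-FTP chooses a variant is outside the scope of this specification, so the following is only one hypothetical policy: prefer an exact (case-insensitive) match, otherwise fall back to the first implemented tag that shares the primary tag, otherwise answer 504. The function name `negotiate` and its return shape are assumptions.

```python
from typing import List, Optional, Tuple


def negotiate(
    requested: Optional[str], supported: List[str], default: str
) -> Tuple[int, Optional[str]]:
    """Return (reply code, selected lang-tag) for a LANG request."""
    if requested is None:
        return 200, default                 # LANG without a parameter
    wanted = requested.lower()
    by_lower = {tag.lower(): tag for tag in supported}
    if wanted in by_lower:
        return 200, by_lower[wanted]        # exact match
    primary = wanted.split("-")[0]
    for tag in supported:
        if tag.lower().split("-")[0] == primary:
            return 200, tag                 # first variant of the primary tag
    return 504, None                        # unsupported-parameter
```

With `supported = ["en-UK", "en-US"]`, a request for en-AU selects en-UK under this policy, matching the example above; a different ordering or a smarter preference table would be equally valid.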
Clients and servers that conform to this specification MUST support the LANG command. Clients SHOULD, however, anticipate receiving a 500 or 502 command response, in cases where older or non-compliant servers do not recognize or have not implemented the "Lang" command. A 501 response SHOULD be sent if the argument to the "Lang" command is not syntactically correct. A 504 response SHOULD be sent if the "Lang" argument, while syntactically correct, is not implemented. As noted above, an argument may be considered a lexicon match even though it is not an exact syntax match.
4.3 Feat response for LANG command
A server-FTP process that supports the LANG command, and language support for messages and command responses, MUST include in the response to the FEAT command [RFC2389], a feature line indicating that the LANG command is supported and a fact list of the supported language tags. A response to a FEAT command SHALL be in the following format:
   Lang-feat    = SP "LANG" SP lang-fact CRLF
   lang-fact    = lang-tag ["*"] *(";" lang-tag ["*"])

   lang-tag     = Primary-tag *( "-" Sub-tag)
   Primary-tag  = 1*8ALPHA
   Sub-tag      = 1*8ALPHA
The lang-feat response contains the string "LANG" followed by a language fact. This string is not case sensitive, but SHOULD be transmitted in upper case, as recommended in [RFC2389]. The initial space shown in the Lang-feat response is REQUIRED by the FEAT command. It MUST be a single space character. More or less space characters are not permitted. The lang-fact SHALL include the lang-tags which server-FTP can support. At least one lang-tag MUST be included with the FEAT response. The lang-tag SHALL be in the form described earlier in this document. The OPTIONAL asterisk, when present, SHALL indicate the current lang-tag being used by server-FTP for messages and responses.
   C> feat
   S> 211- <any descriptive text>
   S>  ...
   S>  LANG EN*
   S>  ...
   S> 211 end
In this example server-FTP can only support English, which is the current language (as shown by the asterisk) being used by the server for messages and command responses.
   C> feat
   S> 211- <any descriptive text>
   S>  ...
   S>  LANG EN*;FR
   S>  ...
   S> 211 end

   C> LANG fr
   S> 200 Le response sera changez au francais

   C> feat
   S> 211- <quelconque descriptif texte>
   S>  ...
   S>  LANG EN;FR*
   S>  ...
   S> 211 end
In this example server-FTP supports both English and French as shown by the initial response to the FEAT command. The asterisk indicates that English is the current language in use by server-FTP. After a LANG command is issued to change the language to French, the FEAT response shows French as the current language in use.
In the above examples ellipses indicate placeholders where other features may be included, but are NOT REQUIRED.
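A client could parse the LANG feature line along the lines of the following sketch. It assumes the line has already been isolated from the FEAT response; the function name parse_lang_feat, the LANG_TAG_MAX limit and the return conventions are hypothetical choices made for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define LANG_TAG_MAX 64  /* assumed upper bound on a stored tag, with NUL */

/* Parses SP "LANG" SP lang-fact [CRLF]. Stores each lang-tag in tags[],
 * sets *current to the index of the tag marked "*" (or -1 if none), and
 * returns the number of tags, or -1 on a syntax error. */
int parse_lang_feat(const char *line, char tags[][LANG_TAG_MAX],
                    size_t max_tags, int *current)
{
    static const char keyword[] = "LANG";
    const char *p = line;
    size_t count = 0, i;

    assert(line != NULL && tags != NULL && current != NULL);
    *current = -1;

    if (*p++ != ' ')                       /* exactly one leading space */
        return -1;
    for (i = 0; i < 4; i++)                /* keyword, case-insensitive */
        if (toupper((unsigned char)p[i]) != keyword[i])
            return -1;
    p += 4;
    if (*p++ != ' ')
        return -1;

    for (;;) {
        size_t len = 0;
        if (count == max_tags)
            return -1;
        for (;;) {                         /* groups of 1*8ALPHA, '-' between */
            size_t group = 0;
            while (isalpha((unsigned char)*p)) {
                if (++group > 8 || len + 1 >= LANG_TAG_MAX)
                    return -1;
                tags[count][len++] = *p++;
            }
            if (group == 0)
                return -1;
            if (*p != '-')
                break;
            if (len + 1 >= LANG_TAG_MAX)
                return -1;
            tags[count][len++] = *p++;
        }
        tags[count][len] = '\0';
        if (*p == '*') {                   /* marks the language in use */
            *current = (int)count;
            p++;
        }
        count++;
        if (*p != ';')
            break;
        p++;
    }
    /* Accept either end of string or a trailing CRLF. */
    if (*p == '\0' || strcmp(p, "\r\n") == 0)
        return (int)count;
    return -1;
}
```

For the line " LANG EN*;FR" this yields two tags, EN and FR, with EN reported as the current language. The sketch relies on isalpha() in the default "C" locale, where it accepts only ASCII letters.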
5 Security Considerations
This document addresses the support of character sets beyond 1 byte and a new language negotiation command. Conformance to this document should not induce a security risk.
6 Acknowledgments
The following people have contributed to this document:
D. J. Bernstein, Martin J. Duerst, Mark Harris, Paul Hethmon, Alun Jones, Gregory Lundberg, James Matthews, Keith Moore, Sandra O'Donnell, Benjamin Riefenstahl, Stephen Tihor
(and others from the FTPEXT working group)
7 Glossary
BIDI - abbreviation for Bi-directional, a reference to mixed right-to-left and left-to-right text.
Character Set - a collection of characters used to represent textual information in which each character has a numeric value.
Code Set - (see character set).
Glyph - a character image represented on a display device.
I18N - "I eighteen N", the first and last letters of the word "internationalization" and the eighteen letters in between.
UCS-2 - the ISO/IEC 10646 two octet Universal Character Set form.
UCS-4 - the ISO/IEC 10646 four octet Universal Character Set form.
UTF-8 - the UCS Transformation Format represented in 8 bits.
UTF-16 - A 16-bit format including the BMP (directly encoded) and surrogate pairs to represent characters in planes 01-16; equivalent to Unicode.
8 Bibliography
[ABNF] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", RFC 2234, November 1997.
[ASCII] ANSI X3.4:1986 Coded Character Sets - 7 Bit American National Standard Code for Information Interchange (7-bit ASCII)
[ISO-8859] ISO 8859. International standard -- Information processing -- 8-bit single-byte coded graphic character sets -- Part 1: Latin alphabet No. 1 (1987) -- Part 2: Latin alphabet No. 2 (1987) -- Part 3: Latin alphabet No. 3 (1988) -- Part 4: Latin alphabet No. 4 (1988) -- Part 5: Latin/Cyrillic alphabet (1988) -- Part 6: Latin/Arabic alphabet (1987) -- Part 7: Latin/Greek alphabet (1987) -- Part 8: Latin/Hebrew alphabet (1988) -- Part 9: Latin alphabet No. 5 (1989) -- Part 10: Latin alphabet No. 6 (1992)
[BCP14] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[ISO-10646] ISO/IEC 10646-1:1993. International standard -- Information technology -- Universal multiple-octet coded character set (UCS) -- Part 1: Architecture and basic multilingual plane.
[MLST] Elz, R. and P. Hethmon, "Extensions to FTP", Work in Progress.
[RFC854] Postel, J. and J. Reynolds, "Telnet Protocol Specification", STD 8, RFC 854, May 1983.
[RFC959] Postel, J. and J. Reynolds, "File Transfer Protocol (FTP)", STD 9, RFC 959, October 1985.
[RFC1123] Braden, R., "Requirements for Internet Hosts -- Application and Support", STD 3, RFC 1123, October 1989.
[RFC1738] Berners-Lee, T., Masinter, L. and M. McCahill, "Uniform Resource Locators (URL)", RFC 1738, December 1994.
[RFC1766] Alvestrand, H., "Tags for the Identification of Languages", RFC 1766, March 1995.
[RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., Atkinson, R., Crispin, M. and P. Svanberg, "Character Set Workshop Report", RFC 2130, April 1997.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and Languages", RFC 2277, January 1998.
[RFC2279] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC 2279, January 1998.
[RFC2389] Elz, R. and P. Hethmon, "Feature Negotiation Mechanism for the File Transfer Protocol", RFC 2389, August 1998.
[UNICODE] The Unicode Consortium, "The Unicode Standard - Version 2.0", Addison-Wesley Developers Press, July 1996.
[UTF-8] ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS Transformation Format 8 (UTF-8).
9 Author's Address
   Bill Curtin
   JIEO
   Attn: JEBBD
   Ft. Monmouth, N.J. 07703-5613

   EMail: curtinw@ftm.disa.mil
Annex A - Implementation Considerations
- Implementers should ensure that their code accounts for potential problems, such as using a NULL character to terminate a string or no longer being able to steal the high order bit for internal use, when supporting the extended character set.
- Implementers should be aware that there is a chance that pathnames that are non UTF-8 may be parsed as valid UTF-8. The probabilities are low for some encodings and statistically zero for others. A recent non-scientific analysis found that EUC encoded Japanese words had a 2.7% false reading; SJIS had a 0.0005% false reading; other encodings such as ASCII or KOI-8 have a 0% false reading. This probability is highest for short pathnames and decreases as pathname size increases. Implementers may want to look for signs that pathnames which parse as UTF-8 are not valid UTF-8, such as the existence of multiple local character sets in short pathnames. Hopefully, as more implementations conform to UTF-8 transfer encoding there will be a smaller need to guess at the encoding.
- Client developers should be aware that it will be possible for pathnames to contain mixed characters (e.g. //Latin1DirectoryName/HebrewFileName). They should be prepared to handle the Bi-directional (BIDI) display of these character sets (i.e. left to right display for the directory and right to left display for the filename). While bi-directional display is outside the scope of this document and more complicated than the above example, an algorithm for bi-directional display can be found in the UNICODE 2.0 [UNICODE] standard. Also note that pathnames can have different byte ordering yet be logically and display-wise equivalent due to the insertion of BIDI control characters at different points during composition. Mixed character sets may also present problems with font swapping.
- A server that copies pathnames transparently from a local filesystem may continue to do so. It is then up to the local file creators to use UTF-8 pathnames.
- Servers can support charset labeling of files and/or directories, such that different pathnames may have different charsets. The server should attempt to convert all pathnames to UTF-8, but if it can't then it should leave that name in its raw form.
- Some servers' operating systems do not mandate character sets, but allow administrators to configure them in the FTP server. These servers should be configured to use a particular mapping table (either external or built-in). This will allow the flexibility of defining different charsets for different directories.
- If the server's OS does not mandate the character set and the FTP server cannot be configured, the server should simply use the raw bytes in the file name. They might be ASCII or UTF-8.
- If the server is a mirror, and wants to look just like the site it is mirroring, it should store the exact file name bytes that it received from the main server.
- Servers which support this specification, when presented a pathname from an old client (one which does not support this specification), can nearly always tell whether the pathname is in UTF-8 (see B.1) or in some other code set. In order to support these older clients, servers may wish to default to a non UTF-8 code set. However, how a server supports non UTF-8 is outside the scope of this specification.
- Clients which support this specification will be able to determine if the server can support UTF-8 (i.e. supports this specification) by the ability of the server to support the FEAT command and the UTF8 feature (defined in 3.2). If a newer client determines that the server does not support UTF-8, it may wish to default to a different code set. Client developers should take into consideration that pathnames, associated with older servers, might be stored in UTF-8. However, how a client supports non UTF-8 is outside the scope of this specification.
- Clients and servers can transition to UTF-8 either by converting to/from the local encoding or by having users store UTF-8 filenames directly. The former approach is easier on tightly controlled file systems (e.g. PCs and Macs). The latter approach is easier on more free form file systems (e.g. Unix).
- For interactive use attention should be focused on user interface and ease of use. Non-interactive use requires a consistent and controlled behavior.
- There may be many applications which reference files under their old raw pathname (e.g. linked URLs). Changing the pathname to UTF-8 will cause access to the old URL to fail. A solution may be for the server to act as if there were two different pathnames associated with the file. This might be done internal to the server on controlled file systems or by using symbolic links on free form systems. While this approach may work for single file transfer non-interactive use, a non-interactive transfer of all of the files in a directory will produce duplicates. Interactive users may be presented with lists of files which are double the actual number of files.
Annex B - Sample Code and Examples
The following routine checks if a byte sequence is valid UTF-8. This is done by checking for the proper tagging of the first and following bytes to make sure they conform to the UTF-8 format. It then checks to assure that the data part of the UTF-8 sequence conforms to the proper range allowed by the encoding. Note: This routine will not detect characters that have not been assigned and therefore do not exist.
   int utf8_valid(const unsigned char *buf, unsigned int len)
   {
     const unsigned char *endbuf = buf + len;
     unsigned char byte2mask = 0x00, c;
     int trailing = 0;  // trailing (continuation) bytes to follow

     while (buf != endbuf)
     {
       c = *buf++;
       if (trailing)
       {
         if ((c & 0xC0) == 0x80)  // Does trailing byte follow UTF-8 format?
         {
           if (byte2mask)         // Need to check 2nd byte for proper range?
           {
             if (c & byte2mask)   // Are appropriate bits set?
               byte2mask = 0x00;
             else
               return 0;
           }
           trailing--;
         }
         else
           return 0;
       }
       else if ((c & 0x80) == 0x00)
         continue;                    // valid 1 byte UTF-8
       else if ((c & 0xE0) == 0xC0)   // valid 2 byte UTF-8
       {
         if (c & 0x1E)                // Is UTF-8 byte in proper range?
           trailing = 1;
         else
           return 0;
       }
       else if ((c & 0xF0) == 0xE0)   // valid 3 byte UTF-8
       {
         if (!(c & 0x0F))             // Is UTF-8 byte in proper range?
           byte2mask = 0x20;          // If not, set mask to check next byte
         trailing = 2;
       }
       else if ((c & 0xF8) == 0xF0)   // valid 4 byte UTF-8
       {
         if (!(c & 0x07))             // Is UTF-8 byte in proper range?
           byte2mask = 0x30;          // If not, set mask to check next byte
         trailing = 3;
       }
       else if ((c & 0xFC) == 0xF8)   // valid 5 byte UTF-8
       {
         if (!(c & 0x03))             // Is UTF-8 byte in proper range?
           byte2mask = 0x38;          // If not, set mask to check next byte
         trailing = 4;
       }
       else if ((c & 0xFE) == 0xFC)   // valid 6 byte UTF-8
       {
         if (!(c & 0x01))             // Is UTF-8 byte in proper range?
           byte2mask = 0x3C;          // If not, set mask to check next byte
         trailing = 5;
       }
       else
         return 0;
     }
     return trailing == 0;
   }
The code examples in this section closely reflect the algorithm in ISO 10646 and may not present the most efficient solution for converting to / from UTF-8 encoding. If efficiency is an issue, implementers should use the appropriate bitwise operators.
Additional code examples and numerous mapping tables can be found at the Unicode site, HTTP://www.unicode.org or FTP://unicode.org.
Note that the conversion examples below assume that the local character set supported in the operating system is something other than UCS2/UTF-16. There are some operating systems that already support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case no conversion will be necessary from the local character set to the UCS.
Conversion from the local filesystem character set to UTF-8 will normally involve a two step process. First convert the local character set to the UCS; then convert the UCS to UTF-8.
The first step in the process can be performed by maintaining a mapping table that includes the local character set code and the corresponding UCS code. For instance the ISO/IEC 8859-8 [ISO-8859] code for the Hebrew letter "VAV" is 0xE5. The corresponding 4 byte ISO/IEC 10646 code is 0x000005D5.
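As a rough illustration of this first step, the sketch below maps only two parts of ISO/IEC 8859-8: the ASCII range, which is unchanged, and the Hebrew letters, which occupy 0xE0 through 0xFA and correspond to U+05D0 through U+05EA (placing VAV, U+05D5, at 0xE5). A real implementation would use a complete mapping table; the function names and the UCS4_UNMAPPED sentinel are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define UCS4_UNMAPPED 0xFFFFFFFFUL  /* hypothetical "no mapping" marker */

/* Partial ISO/IEC 8859-8 to UCS-4 mapping (ASCII and Hebrew letters only). */
unsigned long iso8859_8_to_ucs4(unsigned char c)
{
    if (c <= 0x7F)
        return (unsigned long)c;            /* ASCII maps to itself */
    if (c >= 0xE0 && c <= 0xFA)
        return 0x05D0UL + (unsigned long)(c - 0xE0);
    return UCS4_UNMAPPED;                   /* not covered by this sketch */
}

/* Reverse direction. Returns 1 and stores the byte on success, else 0. */
int ucs4_to_iso8859_8(unsigned long ucs4, unsigned char *out)
{
    assert(out != NULL);
    if (ucs4 <= 0x7F) {
        *out = (unsigned char)ucs4;
        return 1;
    }
    if (ucs4 >= 0x05D0UL && ucs4 <= 0x05EAUL) {
        *out = (unsigned char)(0xE0 + (ucs4 - 0x05D0UL));
        return 1;
    }
    return 0;
}
```

Because the Hebrew letters form a contiguous run in both code sets, an offset is enough here; character sets without that property need an explicit table.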
The next step is to convert the UCS character code to the UTF-8 encoding. The following routine can be used to determine and encode the correct number of bytes based on the UCS-4 character code:
   // The caller must supply a utf8_buf large enough for the result
   // (at most 6 bytes per UCS-4 character).
   unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf,
                              unsigned int ucs4_len,
                              unsigned char *utf8_buf)
   {
     const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
     unsigned int utf8_len = 0;             // return value for UTF8 size
     unsigned char *t_utf8_buf = utf8_buf;  // Temporary pointer
                                            // to load UTF8 values

     while (ucs4_buf != ucs4_endbuf)
     {
       if ( *ucs4_buf <= 0x7F )    // ASCII chars no conversion needed
       {
         *t_utf8_buf++ = (unsigned char) *ucs4_buf;
         utf8_len++;
         ucs4_buf++;
       }
       else if ( *ucs4_buf <= 0x07FF )  // In the 2 byte utf-8 range
       {
         *t_utf8_buf++ = (unsigned char) (0xC0 + (*ucs4_buf/0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + (*ucs4_buf%0x40));
         utf8_len += 2;
         ucs4_buf++;
       }
       else if ( *ucs4_buf <= 0xFFFF )  /* In the 3 byte utf-8 range.
                      The values 0x0000FFFE, 0x0000FFFF and
                      0x0000D800 - 0x0000DFFF do not occur in UCS-4 */
       {
         *t_utf8_buf++ = (unsigned char) (0xE0 + (*ucs4_buf/0x1000));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((*ucs4_buf/0x40)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + (*ucs4_buf%0x40));
         utf8_len += 3;
         ucs4_buf++;
       }
       else if ( *ucs4_buf <= 0x1FFFFF )  // In the 4 byte utf-8 range
       {
         *t_utf8_buf++ = (unsigned char) (0xF0 + (*ucs4_buf/0x040000));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((*ucs4_buf/0x1000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((*ucs4_buf/0x40)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + (*ucs4_buf%0x40));
         utf8_len += 4;
         ucs4_buf++;
       }
       else if ( *ucs4_buf <= 0x03FFFFFF )  // In the 5 byte utf-8 range
       {
         *t_utf8_buf++ = (unsigned char) (0xF8 + (*ucs4_buf/0x01000000));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((*ucs4_buf/0x040000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((*ucs4_buf/0x1000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((*ucs4_buf/0x40)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + (*ucs4_buf%0x40));
         utf8_len += 5;
         ucs4_buf++;
       }
       else if ( *ucs4_buf <= 0x7FFFFFFF )  // In the 6 byte utf-8 range
       {
         *t_utf8_buf++ = (unsigned char) (0xFC + (*ucs4_buf/0x40000000));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((*ucs4_buf/0x01000000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((*ucs4_buf/0x040000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((*ucs4_buf/0x1000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((*ucs4_buf/0x40)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + (*ucs4_buf%0x40));
         utf8_len += 6;
         ucs4_buf++;
       }
       else
         ucs4_buf++;  // Not a valid UCS-4 value; skip it so the loop ends
     }
     return (utf8_len);
   }
When moving from UTF-8 encoding to the local character set the reverse procedure is used. First the UTF-8 encoding is transformed into the UCS-4 character set. The UCS-4 is then converted to the local character set from a mapping table (i.e. the opposite of the table used to form the UCS-4 character code).
To convert from UTF-8 to UCS-4 the free bits (those that do not define UTF-8 sequence size or signify continuation bytes) in a UTF-8 sequence are concatenated as a bit string. The bits are then distributed into a four-byte sequence starting from the least significant bits. Those bits not assigned a bit in the four-byte sequence are padded with ZERO bits. The following routine converts the UTF-8 encoding to UCS-4 character codes:
   // The input is expected to have passed utf8_valid first; this routine
   // does not check continuation bytes or guard against truncated
   // sequences.
   int utf8_to_ucs4 (unsigned long *ucs4_buf, unsigned int utf8_len,
                     unsigned char *utf8_buf)
   {
     const unsigned char *utf8_endbuf = utf8_buf + utf8_len;
     unsigned int ucs_len = 0;

     while (utf8_buf != utf8_endbuf)
     {
       if ((*utf8_buf & 0x80) == 0x00)  /* ASCII chars no conversion
                                           needed */
       {
         *ucs4_buf++ = (unsigned long) *utf8_buf;
         utf8_buf++;
         ucs_len++;
       }
       else if ((*utf8_buf & 0xE0) == 0xC0)  // In the 2 byte utf-8 range
       {
         *ucs4_buf++ = (unsigned long)
                       (((*utf8_buf - 0xC0) * 0x40)
                        + ( *(utf8_buf+1) - 0x80));
         utf8_buf += 2;
         ucs_len++;
       }
       else if ((*utf8_buf & 0xF0) == 0xE0)  /* In the 3 byte utf-8
                                                range */
       {
         *ucs4_buf++ = (unsigned long)
                       (((*utf8_buf - 0xE0) * 0x1000)
                        + (( *(utf8_buf+1) - 0x80) * 0x40)
                        + ( *(utf8_buf+2) - 0x80));
         utf8_buf += 3;
         ucs_len++;
       }
       else if ((*utf8_buf & 0xF8) == 0xF0)  /* In the 4 byte utf-8
                                                range */
       {
         *ucs4_buf++ = (unsigned long)
                       (((*utf8_buf - 0xF0) * 0x040000)
                        + (( *(utf8_buf+1) - 0x80) * 0x1000)
                        + (( *(utf8_buf+2) - 0x80) * 0x40)
                        + ( *(utf8_buf+3) - 0x80));
         utf8_buf += 4;
         ucs_len++;
       }
       else if ((*utf8_buf & 0xFC) == 0xF8)  /* In the 5 byte utf-8
                                                range */
       {
         *ucs4_buf++ = (unsigned long)
                       (((*utf8_buf - 0xF8) * 0x01000000)
                        + (( *(utf8_buf+1) - 0x80) * 0x040000)
                        + (( *(utf8_buf+2) - 0x80) * 0x1000)
                        + (( *(utf8_buf+3) - 0x80) * 0x40)
                        + ( *(utf8_buf+4) - 0x80));
         utf8_buf += 5;
         ucs_len++;
       }
       else if ((*utf8_buf & 0xFE) == 0xFC)  /* In the 6 byte utf-8
                                                range */
       {
         *ucs4_buf++ = (unsigned long)
                       (((*utf8_buf - 0xFC) * 0x40000000)
                        + (( *(utf8_buf+1) - 0x80) * 0x01000000)
                        + (( *(utf8_buf+2) - 0x80) * 0x040000)
                        + (( *(utf8_buf+3) - 0x80) * 0x1000)
                        + (( *(utf8_buf+4) - 0x80) * 0x40)
                        + ( *(utf8_buf+5) - 0x80));
         utf8_buf += 6;
         ucs_len++;
       }
       else
         utf8_buf++;  // Not a valid lead byte; skip it so the loop ends
     }
     return (ucs_len);
   }
This example demonstrates mapping ISO/IEC 8859-8 character set to UTF-8 and back to ISO/IEC 8859-8. As noted earlier, the Hebrew letter "VAV" is converted from the ISO/IEC 8859-8 character code 0xE5 to the corresponding 4 byte ISO/IEC 10646 code of 0x000005D5 by a simple lookup of a conversion/mapping file.
The UCS-4 character code is transformed into UTF-8 using the ucs4_to_utf8 routine described earlier by:
   1. Because the UCS-4 character is between 0x80 and 0x07FF it will map to a 2 byte UTF-8 sequence.
   2. The first byte is defined by (0xC0 + (0x000005D5 / 0x40)) = 0xD7.
   3. The second byte is defined by (0x80 + (0x000005D5 % 0x40)) = 0x95.
The UTF-8 encoding is transferred back to UCS-4 by using the utf8_to_ucs4 routine described earlier by:
   1. Because the first byte of the sequence, when the '&' operator with a value of 0xE0 is applied, will produce 0xC0 (0xD7 & 0xE0 = 0xC0) the UTF-8 is a 2 byte sequence.
   2. The four byte UCS-4 character code is produced by (((0xD7 - 0xC0) * 0x40) + (0x95 - 0x80)) = 0x000005D5.
Finally, the UCS-4 character code is converted to ISO/IEC 8859-8 character code (using the mapping table which matches ISO/IEC 8859-8 to UCS-4) to produce the original 0xE5 code for the Hebrew letter "VAV".
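The arithmetic in this example can be checked with a short sketch restricted to the 2 byte case. The helper names encode_2byte and decode_2byte are hypothetical; they simply isolate the 2 byte branches of the conversion routines shown earlier.

```c
#include <assert.h>

/* Encodes a UCS-4 value in the range 0x80..0x07FF as 2 UTF-8 bytes. */
void encode_2byte(unsigned long ucs4, unsigned char out[2])
{
    assert(ucs4 >= 0x80 && ucs4 <= 0x07FF);
    out[0] = (unsigned char)(0xC0 + (ucs4 / 0x40));  /* upper 5 bits */
    out[1] = (unsigned char)(0x80 + (ucs4 % 0x40));  /* lower 6 bits */
}

/* Decodes a 2 byte UTF-8 sequence back into its UCS-4 value. */
unsigned long decode_2byte(const unsigned char in[2])
{
    assert((in[0] & 0xE0) == 0xC0 && (in[1] & 0xC0) == 0x80);
    return ((unsigned long)(in[0] - 0xC0) * 0x40)
           + (unsigned long)(in[1] - 0x80);
}
```

Calling encode_2byte with 0x05D5 produces the bytes 0xD7 and 0x95, and decode_2byte on those bytes returns 0x05D5, matching the steps above.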
This example demonstrates the mapping of a codepage to UTF-8 and back to a vendor codepage. Mapping between vendor codepages can be done in a very similar manner as described above. For instance both the PC and Mac codepages reflect the character set from the Thai standard TIS 620-2533. The character code on both platforms for the Thai letter "SO SO" is 0xAB. This character can then be mapped into the UCS-4 by way of a conversion/mapping file to produce the UCS-4 code of 0x0E0B.
此示例演示了将代码页映射到UTF-8并返回到供应商代码页的过程。供应商代码页之间的映射可以按照上述非常类似的方式完成。例如,PC和Mac代码页都反映了泰国标准TIS 620-2533中的字符集。两种平台上泰语字母“SO SO”的字符代码均为0xAB。然后,可以通过转换/映射文件将此字符映射到UCS-4,以生成UCS-4代码0x0E0B。
The UCS-4 character code is transformed into UTF-8 using the ucs4_to_utf8 routine described earlier by:
使用前面所述的ucs4到utf8例程将UCS-4字符代码转换为UTF-8:
1. Because the UCS-4 character is between 0x0800 and 0xFFFF it will map to a 3 byte UTF-8 sequence. 2. The first byte is defined by (0xE0 + (0x00000E0B / 0x1000) = 0xE0.
1. 由于UCS-4字符介于0x0800和0xFFFF之间,因此它将映射到一个3字节的UTF-8序列。
2. 第一个字节由(0xE0+(0x00000E0B/0x1000))=0xE0定义。
3. The second byte is defined by (0x80 + ((0x00000E0B / 0x40) % 0x40)) = 0xB8.
4. The third byte is defined by (0x80 + (0x00000E0B % 0x40)) = 0x8B.
3. 第二个字节由(0x80+((0x00000E0B/0x40)%0x40))=0xB8定义。
4. 第三个字节由(0x80+(0x00000E0B%0x40))=0x8B定义。
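The three byte formulas in steps 2 through 4 can be sketched as follows. `encode_three_byte` is a hypothetical name; the sketch covers only the 0x0800 to 0xFFFF range and, as a simplification, does not filter out the surrogate range 0xD800 to 0xDFFF.

```python
def encode_three_byte(ucs4: int) -> bytes:
    """Encode a code point in 0x0800..0xFFFF as a 3 byte UTF-8 sequence.

    Hypothetical helper; surrogate code points are not rejected here.
    """
    if not 0x0800 <= ucs4 <= 0xFFFF:
        raise ValueError("code point is outside the 3 byte UTF-8 range")
    first = 0xE0 + (ucs4 // 0x1000)            # lead byte, 1110xxxx
    second = 0x80 + ((ucs4 // 0x40) % 0x40)    # middle six bits, 10xxxxxx
    third = 0x80 + (ucs4 % 0x40)               # low six bits, 10xxxxxx
    return bytes([first, second, third])


so_so_utf8 = encode_three_byte(0x00000E0B)
print(so_so_utf8.hex())  # e0b88b
```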
The UTF-8 encoding is converted back to UCS-4 using the utf8_to_ucs4 routine described earlier, as follows:
使用前面所述的utf8_to_ucs4例程将UTF-8编码转换回UCS-4,步骤如下:
1. Because applying the '&' operator with a value of 0xF0 to the first byte of the sequence produces 0xE0 (0xE0 & 0xF0 = 0xE0), the UTF-8 encoding is a 3 byte sequence.
2. The four byte UCS-4 character code is produced by (((0xE0 - 0xE0) * 0x1000) + ((0xB8 - 0x80) * 0x40) + (0x8B - 0x80)) = 0x00000E0B.
1. 由于对序列的第一个字节应用值为0xF0的“&”运算符将产生0xE0(0xE0&0xF0=0xE0),因此该UTF-8编码是一个3字节序列。
2. 四字节UCS-4字符代码由(((0xE0-0xE0)*0x1000)+((0xB8-0x80)*0x40)+(0x8B-0x80))=0x00000E0B生成。
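A matching sketch for the three byte decode is shown here. `decode_three_byte` is a hypothetical name, and the final range check is an added assumption that the steps above do not mention.

```python
def decode_three_byte(seq: bytes) -> int:
    """Decode a 3 byte UTF-8 sequence into its UCS-4 code point.

    Hypothetical helper covering only the 3 byte case from the text.
    """
    if len(seq) != 3:
        raise ValueError("expected exactly 3 bytes")
    if (seq[0] & 0xF0) != 0xE0:                  # lead byte must be 1110xxxx
        raise ValueError("first byte does not start a 3 byte sequence")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):  # both must be 10xxxxxx
        raise ValueError("malformed continuation byte")
    value = (((seq[0] - 0xE0) * 0x1000)
             + ((seq[1] - 0x80) * 0x40)
             + (seq[2] - 0x80))
    if value < 0x0800:                           # assumption: reject overlong forms
        raise ValueError("overlong 3 byte encoding")
    return value


print(f"0x{decode_three_byte(bytes([0xE0, 0xB8, 0x8B])):08X}")  # 0x00000E0B
```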
Finally, the UCS-4 character code is converted to either the PC or Mac codepage character code (using the mapping table which matches the codepage to UCS-4) to produce the original 0xAB code for the Thai letter "SO SO".
最后,将UCS-4字符代码转换为PC或Mac代码页字符代码(使用将代码页与UCS-4匹配的映射表),以生成泰语字母“SO SO”的原始0xAB代码。
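The whole Thai round trip can also be tried end to end. In this sketch Python's standard `tis_620` and `cp874` codecs stand in for the conversion/mapping files; treating `cp874` as the PC codepage is an assumption made for illustration, and both helper names are hypothetical.

```python
def codepage_to_utf8(data: bytes, codepage: str) -> bytes:
    """Map codepage bytes to Unicode, then serialize them as UTF-8."""
    return data.decode(codepage).encode("utf-8")


def utf8_to_codepage(data: bytes, codepage: str) -> bytes:
    """Parse UTF-8 into Unicode, then map it to the target codepage."""
    return data.decode("utf-8").encode(codepage)


# Sender side: Thai letter SO SO (0xAB) in TIS-620 becomes UTF-8 for transfer.
wire = codepage_to_utf8(b"\xab", "tis_620")
print(wire.hex())

# Receiver side: the same UTF-8 bytes are mapped into a different codepage.
local = utf8_to_codepage(wire, "cp874")
print(local.hex())
```

Because both sides convert through UCS-4 and UTF-8, neither side needs a direct table from one vendor codepage to the other.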
   if utf8_valid(fn) {
     attempt to convert fn to the local charset, producing localfn
     if (conversion fails temporarily) return error
     if (conversion succeeds) {
       attempt to open localfn
       if (open fails temporarily) return error
       if (open succeeds) return success
     }
   }
   attempt to open fn
   if (open fails temporarily) return error
   if (open succeeds) return success
   return permanent error
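The pseudocode above might be realized along the following lines. This is a minimal sketch under several assumptions: every name is hypothetical, the set of errno values treated as temporary is an arbitrary illustrative choice, and a failed charset conversion is treated as non-temporary, since Python's codecs have no notion of a temporary conversion failure.

```python
import errno

# Assumption: these errno values are treated as "temporary" failures.
TEMPORARY_ERRNOS = {errno.EAGAIN, errno.EINTR, errno.EMFILE,
                    errno.ENFILE, errno.ENOMEM}


class TemporaryFailure(Exception):
    """The open failed in a way that may succeed if retried."""


class PermanentFailure(Exception):
    """Neither the converted nor the raw pathname could be opened."""


def _try_open(path: bytes):
    """Return an open file, None on a non-temporary failure."""
    try:
        return open(path, "rb")
    except OSError as exc:
        if exc.errno in TEMPORARY_ERRNOS:
            raise TemporaryFailure(str(exc)) from exc
        return None


def open_requested_file(fn: bytes, local_charset: str):
    """Try fn as UTF-8 converted to the local charset, then as raw bytes."""
    try:
        text = fn.decode("utf-8")            # plays the role of utf8_valid(fn)
    except UnicodeDecodeError:
        text = None
    if text is not None:
        try:
            localfn = text.encode(local_charset)
        except UnicodeEncodeError:
            localfn = None                   # conversion failed, fall through
        if localfn is not None:
            handle = _try_open(localfn)
            if handle is not None:
                return handle
    handle = _try_open(fn)                   # last resort: the raw pathname
    if handle is not None:
        return handle
    raise PermanentFailure(repr(fn))
```

A caller would pass the pathname bytes exactly as received, for example `open_requested_file(received_name, "tis_620")`, and map `TemporaryFailure` and `PermanentFailure` onto its own temporary and permanent error responses.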
Full Copyright Statement
完整版权声明
Copyright (C) The Internet Society (1999). All Rights Reserved.
版权所有(C)互联网协会(1999年)。版权所有。
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.
本文件及其译本可复制并提供给他人,对其进行评论或解释或协助其实施的衍生作品可全部或部分编制、复制、出版和分发,不受任何限制,前提是上述版权声明和本段包含在所有此类副本和衍生作品中。但是,不得以任何方式修改本文件本身,例如删除版权通知或对互联网协会或其他互联网组织的引用,除非出于制定互联网标准的需要,在这种情况下,必须遵循互联网标准过程中定义的版权程序,或根据需要将其翻译成英语以外的其他语言。
The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.
上述授予的有限许可是永久性的,互联网协会或其继承人或受让人不会撤销。
This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
本文件和其中包含的信息是按“原样”提供的,互联网协会和互联网工程任务组否认所有明示或暗示的保证,包括但不限于任何保证,即使用本文中的信息不会侵犯任何权利,或对适销性或特定用途适用性的任何默示保证。
Acknowledgement
确认
Funding for the RFC Editor function is currently provided by the Internet Society.
RFC编辑功能的资金目前由互联网协会提供。