Internet Engineering Task Force (IETF)                         T. Talpey
Request for Comments: 5666                                  Unaffiliated
Category: Standards Track                                   B. Callaghan
ISSN: 2070-1721                                                    Apple
                                                            January 2010
Remote Direct Memory Access Transport for Remote Procedure Call
Abstract
This document describes a protocol providing Remote Direct Memory Access (RDMA) as a new transport for Remote Procedure Call (RPC). The RDMA transport binding conveys the benefits of efficient, bulk-data transport over high-speed networks, while providing for minimal change to RPC applications and with no required revision of the application RPC protocol, or the RPC protocol itself.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741.
Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc5666.
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English.
Table of Contents
1. Introduction
   1.1. Requirements Language
2. Abstract RDMA Requirements
3. Protocol Outline
   3.1. Short Messages
   3.2. Data Chunks
   3.3. Flow Control
   3.4. XDR Encoding with Chunks
   3.5. XDR Decoding with Read Chunks
   3.6. XDR Decoding with Write Chunks
   3.7. XDR Roundup and Chunks
   3.8. RPC Call and Reply
   3.9. Padding
4. RPC RDMA Message Layout
   4.1. RPC-over-RDMA Header
   4.2. RPC-over-RDMA Header Errors
   4.3. XDR Language Description
5. Long Messages
   5.1. Message as an RDMA Read Chunk
   5.2. RDMA Write of Long Replies (Reply Chunks)
6. Connection Configuration Protocol
   6.1. Initial Connection State
   6.2. Protocol Description
7. Memory Registration Overhead
8. Errors and Error Recovery
9. Node Addressing
10. RPC Binding
11. Security Considerations
12. IANA Considerations
13. Acknowledgments
14. References
   14.1. Normative References
   14.2. Informative References
1. Introduction

Remote Direct Memory Access (RDMA) [RFC5040, RFC5041], [IB] is a technique for efficient movement of data between end nodes, which becomes increasingly compelling over high-speed transports. By directing data into destination buffers as it is sent on a network, and placing it via direct memory access by hardware, the double benefit of faster transfers and reduced host overhead is obtained.
Open Network Computing Remote Procedure Call (ONC RPC, or simply, RPC) [RFC5531] is a remote procedure call protocol that has been run over a variety of transports. Most RPC implementations today use UDP or TCP. RPC messages are defined in terms of an eXternal Data Representation (XDR) [RFC4506], which provides a canonical data representation across a variety of host architectures. An XDR data stream is conveyed differently on each type of transport. On UDP, RPC messages are encapsulated inside datagrams, while on a TCP byte stream, RPC messages are delineated by a record marking protocol. An RDMA transport also conveys RPC messages in a unique fashion that must be fully described if client and server implementations are to interoperate.
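As a point of comparison with the RDMA framing defined later, the TCP record marking mentioned above can be sketched as follows. This is an illustrative encoding of the 4-byte record-marking word used by RPC over TCP (the high bit flags the final fragment, the low 31 bits carry the fragment length); the helper names are not from this specification.

```python
import struct

def frame_record(rpc_msg: bytes, last: bool = True) -> bytes:
    """Prefix an RPC message fragment with the TCP record-marking word:
    a big-endian 32-bit value whose high bit marks the last fragment and
    whose low 31 bits give the fragment length."""
    marker = (0x80000000 if last else 0) | (len(rpc_msg) & 0x7FFFFFFF)
    return struct.pack(">I", marker) + rpc_msg

def unframe_record(data: bytes) -> tuple[bytes, bool]:
    """Strip one record-marking word; return (fragment, is_last)."""
    (marker,) = struct.unpack_from(">I", data)
    length = marker & 0x7FFFFFFF
    return data[4 : 4 + length], bool(marker & 0x80000000)
```

On an RDMA transport, no such byte-stream delimiter exists; instead, each message is carried in its own Send and is described by the RPC-over-RDMA header discussed below.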
RDMA transports present new semantics unlike the behaviors of either UDP or TCP alone. They retain message delineations like UDP while also providing a reliable, sequenced data transfer like TCP. Also, they provide the new efficient, bulk-transfer service of RDMA. RDMA transports are therefore naturally viewed as a new transport type by RPC.
RDMA as a transport will benefit the performance of RPC protocols that move large "chunks" of data, since RDMA hardware excels at moving data efficiently between host memory and a high-speed network with little or no host CPU involvement. In this context, the Network File System (NFS) protocol, in all its versions [RFC1094] [RFC1813] [RFC3530] [RFC5661], is an obvious beneficiary of RDMA. A complete problem statement is discussed in [RFC5532], and related NFSv4 issues are discussed in [RFC5661]. Many other RPC-based protocols will also benefit.
Although the RDMA transport described here provides relatively transparent support for any RPC application, the proposal goes further in describing mechanisms that can optimize the use of RDMA with more active participation by the RPC application.
1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
2. Abstract RDMA Requirements

An RPC transport is responsible for conveying an RPC message from a sender to a receiver. An RPC message is either an RPC call from a client to a server, or an RPC reply from the server back to the client. An RPC message contains an RPC call header followed by arguments if the message is an RPC call, or an RPC reply header followed by results if the message is an RPC reply. The call header contains a transaction ID (XID) followed by the program and procedure number as well as a security credential. An RPC reply header begins with an XID that matches that of the RPC call message, followed by a security verifier and results. All data in an RPC message is XDR encoded. For a complete description of the RPC protocol and XDR encoding, see [RFC5531] and [RFC4506].
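The call header layout described above can be sketched as a minimal XDR encoding. This is illustrative only: it packs the fixed fields with an empty AUTH_NONE credential and verifier, and the helper name and constants are assumptions, not part of this specification.

```python
import struct

def encode_call_header(xid: int, prog: int, vers: int, proc: int) -> bytes:
    """Sketch of an RPC call header (per RFC 5531): all fields are
    big-endian 32-bit XDR unsigned integers, with an AUTH_NONE
    credential and verifier (flavor plus zero-length opaque body)."""
    CALL = 0          # message type: call (a reply uses 1)
    RPC_VERSION = 2   # RPC protocol version
    AUTH_NONE = 0     # null authentication flavor
    return struct.pack(
        ">10I",
        xid,              # transaction ID, echoed in the matching reply
        CALL,
        RPC_VERSION,
        prog, vers, proc,
        AUTH_NONE, 0,     # credential: flavor, zero-length body
        AUTH_NONE, 0,     # verifier:   flavor, zero-length body
    )
```

With AUTH_NONE this fixed portion is 40 bytes; real credentials (e.g., AUTH_SYS) enlarge it, and the XDR-encoded arguments follow immediately after.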
This protocol assumes the following abstract model for RDMA transports. These terms, common in the RDMA lexicon, are used in this document. A more complete glossary of RDMA terms can be found in [RFC5040].
o  Registered Memory
   All data moved via tagged RDMA operations is resident in registered memory at its destination. This protocol assumes that each segment of registered memory MUST be identified with a steering tag of no more than 32 bits and memory addresses of up to 64 bits in length.
o  RDMA Send
   The RDMA provider supports an RDMA Send operation with completion signaled at the receiver when data is placed in a pre-posted buffer. The amount of transferred data is limited only by the size of the receiver's buffer. Sends complete at the receiver in the order they were issued at the sender.
o  RDMA Write
   The RDMA provider supports an RDMA Write operation to directly place data in the receiver's buffer. An RDMA Write is initiated by the sender and completion is signaled at the sender. No completion is signaled at the receiver. The sender uses a steering tag, memory address, and length of the remote destination buffer. RDMA Writes are not necessarily ordered with respect to one another, but are ordered with respect to RDMA Sends; a subsequent RDMA Send completion obtained at the receiver guarantees that prior RDMA Write data has been successfully placed in the receiver's memory.
o  RDMA Read
   The RDMA provider supports an RDMA Read operation to directly place peer source data in the requester's buffer. An RDMA Read is initiated by the receiver and completion is signaled at the receiver. The receiver provides steering tags, memory addresses, and a length for the remote source and local destination buffers. Since the peer at the data source receives no notification of RDMA Read completion, there is an assumption that on receiving the data, the receiver will signal completion with an RDMA Send message, so that the peer can free the source buffers and the associated steering tags.
This protocol is designed to be carried over all RDMA transports meeting the stated requirements. This protocol conveys to the RPC peer information sufficient for that RPC peer to direct an RDMA layer to perform transfers containing RPC data and to communicate their result(s). For example, it is readily carried over RDMA transports such as Internet Wide Area RDMA Protocol (iWARP) [RFC5040, RFC5041], or InfiniBand [IB].
3. Protocol Outline

An RPC message can be conveyed in identical fashion, whether it is a call or reply message. In each case, the transmission of the message proper is preceded by transmission of a transport-specific header for use by RPC-over-RDMA transports. This header is analogous to the record marking used for RPC over TCP, but is more extensive, since RDMA transports support several modes of data transfer; it is important to allow the upper-layer protocol to specify the most efficient mode for each of the segments in a message. Multiple segments of a message may thereby be transferred in different ways to different remote memory destinations.
All transfers of a call or reply begin with an RDMA Send that transfers at least the RPC-over-RDMA header, usually with the call or reply message appended, or at least some part thereof. Because the size of what may be transmitted via RDMA Send is limited by the size of the receiver's pre-posted buffer, the RPC-over-RDMA transport provides a number of methods to reduce the amount transferred by means of the RDMA Send, when necessary, by transferring various parts of the message using RDMA Read and RDMA Write.
RPC-over-RDMA framing replaces all other RPC framing (such as TCP record marking) when used atop an RPC/RDMA association, even though the underlying RDMA protocol may itself be layered atop a protocol with a defined RPC framing (such as TCP). It is however possible for RPC/RDMA to be dynamically enabled, in the course of negotiating the use of RDMA via an upper-layer exchange. Because RPC framing delimits an entire RPC request or reply, the resulting shift in framing must occur between distinct RPC messages, and in concert with the transport.
3.1. Short Messages

Many RPC messages are quite short. For example, the NFS version 3 GETATTR request is only 56 bytes: 20 bytes of RPC header, plus a 32-byte file handle argument and 4 bytes of length. The reply to this common request is about 100 bytes.
There is no benefit in transferring such small messages with an RDMA Read or Write operation. The overhead in transferring steering tags and memory addresses is justified only by large transfers. The critical message size that justifies RDMA transfer will vary depending on the RDMA implementation and network, but is typically of the order of a few kilobytes. It is appropriate to transfer a short message with an RDMA Send to a pre-posted buffer. The RPC-over-RDMA header with the short message (call or reply) immediately following is transferred using a single RDMA Send operation.
Short RPC messages over an RDMA transport:
        RPC Client                         RPC Server
            |            RPC Call              |
       Send |  ------------------------------> |
            |                                  |
            |            RPC Reply             |
            |  <------------------------------ | Send
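The inline-versus-chunk decision described above can be sketched as a simple policy function. The 4-kilobyte threshold and 8-kilobyte receive buffer size are assumed, illustrative values; as noted, the actual break-even size depends on the RDMA implementation and network.

```python
# Illustrative break-even size below which direct placement is not worth
# the steering-tag and registration overhead (an assumption, not a
# protocol constant).
INLINE_THRESHOLD = 4096

def transfer_method(payload_len: int, inline_so_far: int,
                    max_send: int = 8192) -> str:
    """Return "inline" when the data item is small enough and still fits
    the receiver's pre-posted buffer alongside what has already been
    marshaled; otherwise "chunk" for direct RDMA placement."""
    if (payload_len < INLINE_THRESHOLD
            and inline_so_far + payload_len <= max_send):
        return "inline"
    return "chunk"
```

A short GETATTR-sized message would go inline in the RDMA Send, while a large READ or WRITE payload would be advertised as a chunk.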
3.2. Data Chunks

Some protocols, like NFS, have RPC procedures that can transfer very large chunks of data in the RPC call or reply and would cause the maximum send size to be exceeded if one tried to transfer them as part of the RDMA Send. These large chunks typically range from a kilobyte to a megabyte or more. An RDMA transport can transfer large chunks of data more efficiently via the direct placement of an RDMA Read or RDMA Write operation. Using direct placement instead of inline transfer not only avoids expensive data copies, but provides correct data alignment at the destination.
3.3. Flow Control

It is critical to provide RDMA Send flow control for an RDMA connection. RDMA receive operations will fail if a pre-posted receive buffer is not available to accept an incoming RDMA Send, and repeated occurrences of such errors can be fatal to the connection. This is a departure from conventional TCP/IP networking where buffers are allocated dynamically on an as-needed basis, and where pre-posting is not required.
It is not practical to provide for fixed credit limits at the RPC server. Fixed limits scale poorly, since posted buffers are dedicated to the associated connection until consumed by receive operations. Additionally, for protocol correctness, the RPC server must always be able to reply to client requests, whether or not new buffers have been posted to accept future receives. (Note that the RPC server may in fact be a client at some other layer. For example, NFSv4 callbacks are processed by the NFSv4 client, acting as an RPC server. The credit discussions apply equally in either case.)
Flow control for RDMA Send operations is implemented as a simple request/grant protocol in the RPC-over-RDMA header associated with each RPC message. The RPC-over-RDMA header for RPC call messages contains a requested credit value for the RPC server, which MAY be dynamically adjusted by the caller to match its expected needs. The RPC-over-RDMA header for the RPC reply messages provides the granted result, which MAY have any value except it MUST NOT be zero when no in-progress operations are present at the server, since such a value would result in deadlock. The value MAY be adjusted up or down at each opportunity to match the server's needs or policies.
The RPC client MUST NOT send unacknowledged requests in excess of this granted RPC server credit limit. If the limit is exceeded, the RDMA layer may signal an error, possibly terminating the connection. Even if an error does not occur, it is OPTIONAL that the server handle the excess request(s), and it MAY return an RPC error to the client. Also note that the never-zero requirement implies that an RPC server MUST always provide at least one credit to each connected RPC client from which no requests are outstanding. The client would deadlock otherwise, unable to send another request.
While RPC calls complete in any order, the current flow control limit at the RPC server is known to the RPC client from the Send ordering properties. It is always the most recent server-granted credit value minus the number of requests in flight.
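The client-side credit accounting just described might be modeled as follows. The class and method names are hypothetical; the invariants they enforce (never exceed the granted limit, and the usable limit is the most recent grant minus requests in flight) are the ones stated above.

```python
class SendCredits:
    """Client-side view of RPC-over-RDMA send credits."""

    def __init__(self, granted: int = 1):
        # The server MUST always grant at least one credit to a client
        # with no requests outstanding, or the client would deadlock.
        self.granted = granted
        self.in_flight = 0

    def can_send(self) -> bool:
        """Current limit: most recent grant minus requests in flight."""
        return self.in_flight < self.granted

    def send(self) -> None:
        if not self.can_send():
            raise RuntimeError("would exceed granted credit limit")
        self.in_flight += 1

    def reply_received(self, new_grant: int) -> None:
        # Each reply's RPC-over-RDMA header carries a fresh grant, which
        # the server may adjust up or down to match its needs or policies.
        self.in_flight -= 1
        self.granted = new_grant
```

Because Sends complete in order, every reply gives the client a consistent snapshot of the server's current grant.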
Certain RDMA implementations may impose additional flow control restrictions, such as limits on RDMA Read operations in progress at the responder. Because these operations are outside the scope of this protocol, they are not addressed and SHOULD be provided for by other layers. For example, a simple upper-layer RPC consumer might perform single-issue RDMA Read requests, while a more sophisticated, multithreaded RPC consumer might implement its own First In, First Out (FIFO) queue of such operations. For further discussion of possible protocol implementations capable of negotiating these values, see Section 6 "Connection Configuration Protocol" of this document, or [RFC5661].
3.4. XDR Encoding with Chunks

The data comprising an RPC call or reply message is marshaled or serialized into a contiguous stream by an XDR routine. XDR data types such as integers, strings, arrays, and linked lists are commonly implemented over two very simple functions that encode either an XDR data unit (32 bits) or an array of bytes.
Normally, the separate data items in an RPC call or reply are encoded as a contiguous sequence of bytes for network transmission over UDP or TCP. However, in the case of an RDMA transport, local routines such as XDR encode can determine that (for instance) an opaque byte array is large enough to be more efficiently moved via an RDMA data transfer operation like RDMA Read or RDMA Write.
Semantically speaking, the protocol has no restriction regarding data types that may or may not be represented by a read or write chunk. In practice however, efficiency considerations lead to the conclusion that certain data types are not generally "chunkable". Typically, only those opaque and aggregate data types that may attain substantial size are considered to be eligible. With today's hardware, this size may be a kilobyte or more. However, any object MAY be chosen for chunking in any given message.
The eligibility of XDR data items to be candidates for being moved as data chunks (as opposed to being marshaled inline) is not specified by the RPC-over-RDMA protocol. Chunk eligibility criteria MUST be determined by each upper-layer in order to provide for an interoperable specification. One such example with rationale, for the NFS protocol family, is provided in [RFC5667].
The interface by which an upper-layer implementation communicates the eligibility of a data item locally to RPC for chunking is out of scope for this specification. In many implementations, it is possible to implement a transparent RPC chunking facility. However, such implementations may lead to inefficiencies, either because they require the RPC layer to perform expensive registration and de-registration of memory "on the fly", or they may require using RDMA chunks in reply messages, along with the resulting additional handshaking with the RPC-over-RDMA peer. However, these issues are internal and generally confined to the local interface between RPC and its upper layers, one in which implementations are free to innovate. The only requirement is that the resulting RPC RDMA protocol sent to the peer is valid for the upper layer. See, for example, [RFC5667].
When sending any message (request or reply) that contains an eligible large data chunk, the XDR encoding routine avoids moving the data into the XDR stream. Instead, it does not encode the data portion, but records the address and size of each chunk in a separate "read chunk list" encoded within RPC RDMA transport-specific headers. Such chunks will be transferred via RDMA Read operations initiated by the receiver.
When the read chunks are to be moved via RDMA, the memory for each chunk is registered. This registration may take place within XDR itself, providing for full transparency to upper layers, or it may be performed by any other specific local implementation.
Additionally, when making an RPC call that can result in bulk data transferred in the reply, write chunks MAY be provided to accept the data directly via RDMA Write. These write chunks will therefore be pre-filled by the RPC server prior to responding, and XDR decode of the data at the client will not be required. These chunks undergo a similar registration and advertisement via "write chunk lists" built as a part of XDR encoding.
Some RPC client implementations are not able to determine where an RPC call's results reside during the "encode" phase. This makes it difficult or impossible for the RPC client layer to encode the write chunk list at the time of building the request. In this case, it is difficult for the RPC implementation to provide transparency to the RPC consumer, which may require recoding to provide result information at this earlier stage.
Therefore, if the RPC client does not make a write chunk list available to receive the result, then the RPC server MAY return data inline in the reply, or if the upper-layer specification permits, it MAY be returned via a read chunk list. It is NOT RECOMMENDED that upper-layer RPC client protocol specifications omit write chunk lists for eligible replies, due to the lower performance of the additional handshaking to perform data transfer, and the requirement that the RPC server must expose (and preserve) the reply data for a period of time. In the absence of a server-provided read chunk list in the reply, if the encoded reply overflows the posted receive buffer, the RPC will fail with an RDMA transport error.
When any data within a message is provided via either read or write chunks, the chunk itself refers only to the data portion of the XDR stream element. In particular, for counted fields (e.g., a "<>" encoding) the byte count that is encoded as part of the field remains in the XDR stream, and is also encoded in the chunk list. The data portion is however elided from the encoded XDR stream, and is transferred as part of chunk list processing. It is important to maintain upper-layer implementation compatibility -- both the count and the data must be transferred as part of the logical XDR stream. While the chunk list processing results in the data being available to the upper-layer peer for XDR decoding, the length present in the chunk list entries is not. Any byte count in the XDR stream MUST match the sum of the byte counts present in the corresponding read or write chunk list. If they do not agree, an RPC protocol encoding error results.
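The byte-count consistency rule above can be expressed as a simple check. This helper is illustrative, not part of the protocol; it merely states the invariant a decoder would enforce before raising an RPC encoding error.

```python
def chunks_consistent(xdr_byte_count: int, chunk_lengths: list[int]) -> bool:
    """The byte count encoded in the XDR stream for a counted field must
    equal the sum of the byte counts in its corresponding read or write
    chunk list; a mismatch is an RPC protocol encoding error."""
    return xdr_byte_count == sum(chunk_lengths)
```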
The following items are contained in a chunk list entry.
Handle
   Steering tag or handle obtained when the chunk memory is registered for RDMA.
Length
   The length of the chunk in bytes.
Offset
   The offset or beginning memory address of the chunk. In order to support the widest array of RDMA implementations, as well as the most general steering tag scheme, this field is unconditionally included in each chunk list entry.
While zero-based offset schemes are available in many RDMA implementations, their use by RPC requires individual registration of each read or write chunk. On many such implementations, this can be a significant overhead. By providing an offset in each chunk, many pre-registration or region-based registrations can be readily supported, and by using a single, universal chunk representation, the RPC RDMA protocol implementation is simplified to its most general form.
Position For data that is to be encoded, the position in the XDR stream where the chunk would normally reside. Note that the chunk therefore inserts its data into the XDR stream at this position,
but its transfer is no longer "inline". Also note therefore that all chunks belonging to a single RPC argument or result will have the same position. For data that is to be decoded, no position is used.
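The items above can be sketched as a simple in-memory record; this is an illustrative data structure of ours, not the on-the-wire encoding (the XDR definitions appear later in this document):

```python
# Illustrative in-memory view of a chunk list entry: handle, length,
# offset, and (for data to be encoded only) the XDR stream position.
from typing import NamedTuple, Optional

class ChunkEntry(NamedTuple):
    handle: int                      # steering tag from RDMA registration
    length: int                      # chunk length in bytes
    offset: int                      # beginning memory address or offset
    position: Optional[int] = None   # XDR position; None on the decode side

# All chunks belonging to one RPC argument carry the same position:
arg_chunks = [
    ChunkEntry(handle=0x5, length=8192, offset=0x1000, position=20),
    ChunkEntry(handle=0x5, length=8192, offset=0x3000, position=20),
]
```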
When XDR marshaling is complete, the chunk list is XDR encoded, then sent to the receiver prepended to the RPC message. Any source data for a read chunk, or the destination of a write chunk, remain behind in the sender's registered memory, and their actual payload is not marshaled into the request or reply.
   +----------------+----------------+-------------
   | RPC-over-RDMA  |                |
   | header w/      |   RPC Header   | Non-chunk args/results
   | chunks         |                |
   +----------------+----------------+-------------
Read chunk lists and write chunk lists are structured somewhat differently. This is due to the different usage -- read chunks are decoded and indexed by their argument's or result's position in the XDR data stream; their size is always known. Write chunks, on the other hand, are used only for results, and have neither a preassigned offset in the XDR stream nor a size until the results are produced, since the buffers may be only partially filled, or may not be used for results at all. Their presence in the XDR stream is therefore not known until the reply is processed. The mapping of write chunks onto designated NFS procedures and their results is described in [RFC5667].
Therefore, read chunks are encoded into a read chunk list as a single array, with each entry tagged by its (known) size and its argument's or result's position in the XDR stream. Write chunks are encoded as a list of arrays of RDMA buffers, with each list element (an array) providing buffers for a separate result. Individual write chunk list elements MAY thereby result in being partially or fully filled, or in fact not being filled at all. Unused write chunks, or unused bytes in write chunk buffer lists, are not returned as results, and their memory is returned to the upper layer as part of RPC completion. However, the RPC layer MUST NOT assume that the buffers have not been modified.
The XDR decode process moves data from an XDR stream into a data structure provided by the RPC client or server application. Where elements of the destination data structure are buffers or strings, the RPC application can either pre-allocate storage to receive the
data or leave the string or buffer fields null and allow the XDR decode stage of RPC processing to automatically allocate storage of sufficient size.
When decoding a message from an RDMA transport, the receiver first XDR decodes the chunk lists from the RPC-over-RDMA header, then proceeds to decode the body of the RPC message (arguments or results). Whenever the XDR offset in the decode stream matches that of a chunk in the read chunk list, the XDR routine initiates an RDMA Read to bring over the chunk data into locally registered memory for the destination buffer.
When processing an RPC request, the RPC receiver (RPC server) acknowledges its completion of use of the source buffers by simply replying to the RPC sender (client), and the peer may then free all source buffers advertised by the request.
When processing an RPC reply, after completing such a transfer, the RPC receiver (client) MUST issue an RDMA_DONE message (described in Section 3.8) to notify the peer (server) that the source buffers can be freed.
The read chunk list is constructed and used entirely within the RPC/XDR layer. Other than specifying the minimum chunk size, the management of the read chunk list is automatic and transparent to an RPC application.
When a write chunk list is provided for the results of the RPC call, the RPC server MUST provide any corresponding data via RDMA Write to the memory referenced in the chunk list entries. The RPC reply conveys this by returning the write chunk list to the client with the lengths rewritten to match the actual transfer. The XDR decode of the reply therefore performs no local data transfer but merely returns the length obtained from the reply.
Each decoded result consumes one entry in the write chunk list, which in turn consists of an array of RDMA segments. The length is therefore the sum of all returned lengths in all segments comprising the corresponding list entry. As each list entry is decoded, the entire entry is consumed.
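A minimal sketch of this rule, with illustrative tuples standing in for the XDR segment structure:

```python
# Each decoded result consumes one write chunk list entry, which is an
# array of RDMA segments; its length is the sum of the lengths the
# server rewrote to match the actual transfer. Names are illustrative.

def result_length(write_chunk_entry):
    """Sum the rewritten lengths across all segments of one entry."""
    return sum(length for _handle, length, _offset in write_chunk_entry)

# e.g., a result the server wrote into two segments of one entry:
entry = [(0x10, 4000, 0x8000), (0x10, 96, 0x9000)]
```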
The write chunk list is constructed and used by the RPC application. The RPC/XDR layer simply conveys the list between client and server and initiates the RDMA Writes back to the client. The mapping of
write chunk list entries to procedure arguments MUST be determined for each protocol. An example of a mapping is described in [RFC5667].
The XDR protocol requires 4-byte alignment of each new encoded element in any XDR stream. This requirement is for efficiency and ease of decode/unmarshaling at the receiver -- if the XDR stream buffer begins on a native machine boundary, then the XDR elements will lie on similarly predictable offsets in memory.
Within XDR, when non-4-byte encodes (such as an odd-length string or bulk data) are marshaled, their length is encoded literally, while their data is padded to begin the next element at a 4-byte boundary in the XDR stream. For TCP or RDMA inline encoding, this minimal overhead is required because the transport-specific framing relies on the fact that the relative offset of the elements in the XDR stream from the start of the message determines the XDR position during decode.
On the other hand, RPC/RDMA Read chunks carry the XDR position of each chunked element and length of the Chunk segment, and can be placed by the receiver exactly where they belong in the receiver's memory without regard to the alignment of their position in the XDR stream. Since any rounded-up data is not actually part of the upper layer's message, the receiver will not reference it, and there is no reason to set it to any particular value in the receiver's memory.
When roundup is present at the end of a sequence of chunks, the length of the sequence will terminate it at a non-4-byte XDR position. When the receiver proceeds to decode the remaining part of the XDR stream, it inspects the XDR position indicated by the next chunk. Because this position will not match (else roundup would not have occurred), the receiver decoding will fall back to inspecting the remaining inline portion. If, in turn, no data remains to be decoded from the inline portion, then the receiver MUST conclude that roundup is present, and therefore it advances the XDR decode position to that indicated by the next chunk (if any). In this way, roundup is passed without ever actually transferring additional XDR bytes.
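The 4-byte roundup that the decoder must account for can be computed as below; this is a hedged sketch, since implementations may track stream positions differently:

```python
# XDR rounds every element up to a 4-byte boundary; with chunks, the
# pad bytes advance the XDR position but are never transferred.

def xdr_padded_length(nbytes):
    """Stream space an nbytes element occupies after 4-byte roundup."""
    return (nbytes + 3) & ~3

# A 5-byte opaque advances the XDR position by 8 bytes, though only
# 5 bytes of chunk data need cross the wire.
```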
Some protocol operations over RPC/RDMA, for instance NFS writes of data encountered at the end of a file or in direct I/O situations, commonly yield these roundups within RDMA Read Chunks. Because any roundup bytes are not actually present in the data buffers being written, memory for these bytes would come from noncontiguous buffers, either as an additional memory registration segment or as an additional Chunk. The overhead of these operations can be
significant for the sender, which must marshal them, and higher still for the receiver, to which they must be transferred. Senders SHOULD therefore avoid encoding individual RDMA Read Chunks for roundup whenever possible. It is acceptable, but not necessary, to include roundup data in an existing RDMA Read Chunk, but only if it is already present in the XDR stream to carry upper-layer data.
Note that there is no exposure of additional data at the sender due to eliding roundup data from the XDR stream, since any additional sender buffers are never exposed to the peer. The data is literally not there to be transferred.
For RDMA Write Chunks, a simpler encoding method applies. Again, roundup bytes are not transferred, instead the chunk length sent to the receiver in the reply is simply increased to include any roundup. Because of the requirement that the RDMA Write Chunks are filled sequentially without gaps, this situation can only occur on the final chunk receiving data. Therefore, there is no opportunity for roundup data to insert misalignment or positional gaps into the XDR stream.
The RDMA transport for RPC provides three methods of moving data between RPC client and server:
Inline Data is moved between RPC client and server within an RDMA Send.
RDMA Read Data is moved between RPC client and server via an RDMA Read operation, using a steering tag, address, and offset obtained from a read chunk list.
RDMA Write Result data is moved from RPC server to client via an RDMA Write operation, using a steering tag, address, and offset obtained from a write chunk list or reply chunk in the client's RPC call message.
These methods of data movement may occur in combinations within a single RPC. For instance, an RPC call may contain some inline data along with some large chunks to be transferred via RDMA Read to the server. The reply to that call may have some result chunks that the server RDMA Writes back to the client. The following protocol interactions illustrate RPC calls that use these methods to move RPC message data:
An RPC with write chunks in the call message:
      RPC Client                           RPC Server
          |     RPC Call + Write Chunk list     |
     Send |   ------------------------------>   |
          |                                     |
          |               Chunk 1               |
          |   <------------------------------   | Write
          |                  :                  |
          |               Chunk n               |
          |   <------------------------------   | Write
          |                                     |
          |               RPC Reply             |
          |   <------------------------------   | Send
In the presence of write chunks, RDMA ordering provides the guarantee that all data in the RDMA Write operations has been placed in memory prior to the client's RPC reply processing.
An RPC with read chunks in the call message:
      RPC Client                           RPC Server
          |     RPC Call + Read Chunk list      |
     Send |   ------------------------------>   |
          |                                     |
          |               Chunk 1               |
          |   +------------------------------   | Read
          |   v----------------------------->   |
          |                  :                  |
          |               Chunk n               |
          |   +------------------------------   | Read
          |   v----------------------------->   |
          |                                     |
          |               RPC Reply             |
          |   <------------------------------   | Send
An RPC with read chunks in the reply message:
      RPC Client                           RPC Server
          |               RPC Call              |
     Send |   ------------------------------>   |
          |                                     |
          |     RPC Reply + Read Chunk list     |
          |   <------------------------------   | Send
          |                                     |
          |               Chunk 1               |
     Read |   ------------------------------+   |
          |   <-----------------------------v   |
          |                  :                  |
          |               Chunk n               |
     Read |   ------------------------------+   |
          |   <-----------------------------v   |
          |                                     |
          |               Done                  |
     Send |   ------------------------------>   |
The final Done message allows the RPC client to signal the server that it has received the chunks, so the server can de-register and free the memory holding the chunks. A Done completion is not necessary for an RPC call, since the RPC reply Send is itself a receive completion notification. In the event that the client fails to return the Done message within some timeout period, the server MAY conclude that a protocol violation has occurred and close the RPC connection, or it MAY proceed with a de-register and free its chunk buffers. This may result in a fatal RDMA error if the client later attempts to perform an RDMA Read operation, which amounts to the same thing.
The use of read chunks in RPC reply messages is much less efficient than providing write chunks in the originating RPC calls, due to the additional message exchanges, the need for the RPC server to advertise buffers to the peer, the necessity of the server maintaining a timer for the purpose of recovery from misbehaving clients, and the need for additional memory registration. Their use is NOT RECOMMENDED by upper layers where efficiency is a primary concern [RFC5667]. However, they MAY be employed by upper-layer protocol bindings that are primarily concerned with transparency, since they can frequently be implemented completely within the RPC lower layers.
It is important to note that the Done message consumes a credit at the RPC server. The RPC server SHOULD provide sufficient credits to the client to allow the Done message to be sent without deadlock (driving the outstanding credit count to zero). The RPC client MUST
account for its required Done messages to the server in its accounting of available credits, and the server SHOULD replenish any credit consumed by its use of such exchanges at its earliest opportunity.
Finally, it is possible to conceive of RPC exchanges that involve any or all combinations of write chunks in the RPC call, read chunks in the RPC call, and read chunks in the RPC reply. Support for such exchanges is straightforward from a protocol perspective, but in practice such exchanges would be quite rare, limited to upper-layer protocol exchanges that transferred bulk data in both the call and corresponding reply.
Alignment of specific opaque data enables certain scatter/gather optimizations. Padding leverages the useful property that RDMA transfers preserve alignment of data, even when they are placed into pre-posted receive buffers by Sends.
Many servers can make good use of such padding. Padding allows the chaining of RDMA receive buffers such that any data transferred by RDMA on behalf of RPC requests will be placed into appropriately aligned buffers on the system that receives the transfer. In this way, the need for servers to perform RDMA Read to satisfy all but the largest client writes is obviated.
The effect of padding is demonstrated below showing prior bytes on an XDR stream ("XXX" in the figure below) followed by an opaque field consisting of four length bytes ("LLLL") followed by data bytes ("DDD"). The receiver of the RDMA Send has posted two chained receive buffers. Without padding, the opaque data is split across the two buffers. With the addition of padding bytes ("ppp") prior to the first data byte, the data can be forced to align correctly in the second buffer.
                           Buffer 1              Buffer 2
   Unpadded            --------------        --------------

   XXXXXXXLLLLDDDDDDDDDDDDDD   --->   XXXXXXXLLLLDDD  DDDDDDDDDDD

   Padded

   XXXXXXXLLLLpppDDDDDDDDDDDDDD  --->  XXXXXXXLLLLppp  DDDDDDDDDDDDDD
Padding is implemented completely within the RDMA transport encoding, flagged with a specific message type. Where padding is applied, two values are passed to the peer: an "rdma_align", which is the padding value used, and "rdma_thresh", which is the opaque data size at or above which padding is applied. For instance, if the server is using chained 4 KB receive buffers, then up to (4 KB - 1) padding bytes could be used to achieve alignment of the data. The XDR routine at the peer MUST consult these values when decoding opaque values. Where the decoded length exceeds the rdma_thresh, the XDR decode MUST skip over the appropriate padding as indicated by rdma_align and the current XDR stream position.
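A decode-side sketch of this rule follows; the function name and the exact position arithmetic are our assumptions, since the protocol specifies only that padding up to the rdma_align boundary is applied at or above rdma_thresh:

```python
# Skip logic for RDMA_MSGP padding: when a decoded opaque length is at
# or above rdma_thresh, the decoder skips the pad bytes the sender
# inserted to bring the data to an rdma_align boundary.

def padding_to_skip(opaque_len, xdr_pos, rdma_align, rdma_thresh):
    """Pad bytes preceding the data, or 0 when below the threshold."""
    if opaque_len < rdma_thresh or rdma_align == 0:
        return 0
    return (rdma_align - xdr_pos % rdma_align) % rdma_align

# e.g., 3 pad bytes carry an 11-byte prefix (7 prior bytes plus 4
# length bytes) to a hypothetical 14-byte buffer boundary.
```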
RPC call and reply messages are conveyed across an RDMA transport with a prepended RPC-over-RDMA header. The RPC-over-RDMA header includes data for RDMA flow control credits, padding parameters, and lists of addresses that provide direct data placement via RDMA Read and Write operations. The layout of the RPC message itself is unchanged from that described in [RFC5531] except for the possible exclusion of large data chunks that will be moved by RDMA Read or Write operations. If the RPC message (along with the RPC-over-RDMA header) is too long for the posted receive buffer (even after any large chunks are removed), then the entire RPC message MAY be moved separately as a chunk, leaving just the RPC-over-RDMA header in the RDMA Send.
The RPC-over-RDMA header begins with four 32-bit fields that are always present and that control the RDMA interaction including RDMA-specific flow control. These are then followed by a number of items such as chunk lists and padding that MAY or MUST NOT be present depending on the type of transmission. The four fields that are always present are:
1. Transaction ID (XID). The XID generated for the RPC call and reply. Having the XID at the beginning of the message makes it easy to establish the message context. This XID MUST be the same as the XID in the RPC header. The receiver MAY perform its processing based solely on the XID in the RPC-over-RDMA header, and thereby ignore the XID in the RPC header, if it so chooses.
2. Version number. This version of the RPC RDMA message protocol is 1. The version number MUST be increased by 1 whenever the format of the RPC RDMA messages is changed.
3. Flow control credit value. When sent in an RPC call message, the requested value is provided. When sent in an RPC reply message, the granted value is returned. RPC calls SHOULD NOT be sent in excess of the currently granted limit.
4. Message type.
o RDMA_MSG = 0 indicates that chunk lists and RPC message follow.
o RDMA_NOMSG = 1 indicates that after the chunk lists there is no RPC message. In this case, the chunk lists provide information to allow the message proper to be transferred using RDMA Read or Write and thus is not appended to the RPC-over-RDMA header.
o RDMA_MSGP = 2 indicates that a chunk list and RPC message with some padding follow.
o RDMA_DONE = 3 indicates that the message signals the completion of a chunk transfer via RDMA Read.
o RDMA_ERROR = 4 is used to signal any detected error(s) in the RPC RDMA chunk encoding.
Because the version number is encoded as part of this header, and the RDMA_ERROR message type is used to indicate errors, these first four fields and the start of the following message body MUST always remain aligned at these fixed offsets for all versions of the RPC-over-RDMA header.
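A sketch of packing these four fixed fields; XDR encodes each as a big-endian unsigned 32-bit integer. The function name is illustrative, not part of the protocol:

```python
import struct

# The four fixed 32-bit fields opening every RPC-over-RDMA header:
# XID, version (1), credit value, and message type (RDMA_MSG = 0).
RDMA_MSG = 0

def pack_fixed_header(xid, vers, credits, msg_type):
    # XDR uint32 fields are encoded in network (big-endian) byte order.
    return struct.pack(">IIII", xid, vers, credits, msg_type)

hdr = pack_fixed_header(xid=0x12345678, vers=1, credits=32,
                        msg_type=RDMA_MSG)
```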
For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write chunk lists follow. If the Read chunk list is null (a 32-bit word of zeros), then there are no chunks to be transferred separately and the RPC message follows in its entirety. If non-null, then it's the beginning of an XDR encoded sequence of Read chunk list entries. If the Write chunk list is non-null, then an XDR encoded sequence of Write chunk entries follows.
If the message type is RDMA_MSGP, then two additional fields that specify the padding alignment and threshold are inserted prior to the Read and Write chunk lists.
A header of message type RDMA_MSG or RDMA_MSGP MUST be followed by the RPC call or RPC reply message body, beginning with the XID. The XID in the RDMA_MSG or RDMA_MSGP header MUST match this.
   +--------+---------+---------+-----------+-------------+----------
   |        |         |         |  Message  |    NULLs    | RPC Call
   |  XID   | Version | Credits |   Type    |     or      |    or
   |        |         |         |           | Chunk Lists | Reply Msg
   +--------+---------+---------+-----------+-------------+----------
Note that in the case of RDMA_DONE and RDMA_ERROR, no chunk list or RPC message follows. As an implementation hint: a gather operation on the Send of the RDMA RPC message can be used to marshal the initial header, the chunk list, and the RPC message itself.
When a peer receives an RPC RDMA message, it MUST perform the following basic validity checks on the header and chunk contents. If such errors are detected in the request, an RDMA_ERROR reply MUST be generated.
Two types of errors are defined, version mismatch and invalid chunk format. When the peer detects an RPC-over-RDMA header version that it does not support (currently this document defines only version 1), it replies with an error code of ERR_VERS, and provides the low and high inclusive version numbers it does, in fact, support. The version number in this reply MUST be any value otherwise valid at the receiver. When other decoding errors are detected in the header or chunks, either an RPC decode error MAY be returned or the RPC/RDMA error code ERR_CHUNK MUST be returned.
Here is the message layout in XDR language.
   struct xdr_rdma_segment {
      uint32 handle;           /* Registered memory handle */
      uint32 length;           /* Length of the chunk in bytes */
      uint64 offset;           /* Chunk virtual address or offset */
   };
   struct xdr_read_chunk {
      uint32 position;         /* Position in XDR stream */
      struct xdr_rdma_segment target;
   };
   struct xdr_read_list {
      struct xdr_read_chunk entry;
      struct xdr_read_list *next;
   };
   struct xdr_write_chunk {
      struct xdr_rdma_segment target<>;
   };
   struct xdr_write_list {
      struct xdr_write_chunk entry;
      struct xdr_write_list *next;
   };
   struct rdma_msg {
      uint32 rdma_xid;         /* Mirrors the RPC header xid */
      uint32 rdma_vers;        /* Version of this protocol */
      uint32 rdma_credit;      /* Buffers requested/granted */
      rdma_body rdma_body;
   };
   enum rdma_proc {
      RDMA_MSG=0,   /* An RPC call or reply msg */
      RDMA_NOMSG=1, /* An RPC call or reply msg - separate body */
      RDMA_MSGP=2,  /* An RPC call or reply msg with padding */
      RDMA_DONE=3,  /* Client signals reply completion */
      RDMA_ERROR=4  /* An RPC RDMA encoding error */
   };
   union rdma_body switch (rdma_proc proc) {
      case RDMA_MSG:
        rpc_rdma_header rdma_msg;
      case RDMA_NOMSG:
        rpc_rdma_header_nomsg rdma_nomsg;
      case RDMA_MSGP:
        rpc_rdma_header_padded rdma_msgp;
      case RDMA_DONE:
        void;
      case RDMA_ERROR:
        rpc_rdma_error rdma_error;
   };
   struct rpc_rdma_header {
      struct xdr_read_list *rdma_reads;
      struct xdr_write_list *rdma_writes;
      struct xdr_write_chunk *rdma_reply;
      /* rpc body follows */
   };
   struct rpc_rdma_header_nomsg {
      struct xdr_read_list *rdma_reads;
      struct xdr_write_list *rdma_writes;
      struct xdr_write_chunk *rdma_reply;
   };
   struct rpc_rdma_header_padded {
      uint32 rdma_align;       /* Padding alignment */
      uint32 rdma_thresh;      /* Padding threshold */
      struct xdr_read_list *rdma_reads;
      struct xdr_write_list *rdma_writes;
      struct xdr_write_chunk *rdma_reply;
      /* rpc body follows */
   };
   enum rpc_rdma_errcode {
      ERR_VERS = 1,
      ERR_CHUNK = 2
   };
   union rpc_rdma_error switch (rpc_rdma_errcode err) {
      case ERR_VERS:
        uint32 rdma_vers_low;
        uint32 rdma_vers_high;
      case ERR_CHUNK:
        void;
      default:
        uint32 rdma_extra[8];
   };
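As a hedged example of the error arm just defined, an ERR_VERS body encodes the error code followed by the inclusive supported version range, each as an XDR uint32 in big-endian order. The function name is ours:

```python
import struct

ERR_VERS = 1  # from enum rpc_rdma_errcode above

def pack_err_vers(vers_low, vers_high):
    """Encode the ERR_VERS arm of rpc_rdma_error (illustrative)."""
    return struct.pack(">III", ERR_VERS, vers_low, vers_high)

# This document defines only protocol version 1:
body = pack_err_vers(1, 1)
```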
The receiver of RDMA Send messages is required by RDMA to have previously posted one or more adequately sized buffers. The RPC client can inform the server of the maximum size of its RDMA Send messages via the Connection Configuration Protocol described later in this document.
Since RPC messages are frequently small, memory savings can be achieved by posting small buffers. Even large messages like NFS READ or WRITE will be quite small once the chunks are removed from the message. However, there may be large messages that would demand a very large buffer be posted, where the contents of the buffer may not be a chunkable XDR element. A good example is an NFS READDIR reply, which may contain a large number of small filename strings. Also, the NFS version 4 protocol [RFC3530] features COMPOUND request and reply messages of unbounded length.
Ideally, each upper layer will negotiate these limits. However, it is frequently necessary to provide a transparent solution.
One relatively simple method is to have the client identify any RPC message that exceeds the RPC server's posted buffer size and move it separately as a chunk, i.e., reference it as the first entry in the read chunk list with an XDR position of zero.
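As a sketch, that client-side decision might look like the following fragment. The function, the simplified chunk representation, and the use of the message-type names as strings are illustrative only and are not part of the protocol.

```python
def prepare_call(rpc_msg: bytes, server_recv_size: int):
    """Decide whether an RPC call is sent inline (RDMA_MSG) or moved
    separately as a long message: referenced as the first read chunk
    list entry with an XDR position of zero (RDMA_NOMSG).

    Illustrative sketch only; returns (msg_type, inline_body, chunk).
    """
    if len(rpc_msg) <= server_recv_size:
        # The whole call fits in the server's posted receive buffer.
        return ("RDMA_MSG", rpc_msg, None)
    # Too large: register the message and reference it as the first
    # entry in the read chunk list, at XDR position zero; only the
    # RPC-over-RDMA header itself goes out in the Send.
    return ("RDMA_NOMSG", b"", {"position": 0, "length": len(rpc_msg)})
```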
Normal Message
+--------+---------+---------+------------+-------------+----------
|        |         |         |            |             | RPC Call
|  XID   | Version | Credits |  RDMA_MSG  | Chunk Lists |    or
|        |         |         |            |             | Reply Msg
+--------+---------+---------+------------+-------------+----------
Long Message
+--------+---------+---------+------------+-------------+
|        |         |         |            |             |
|  XID   | Version | Credits | RDMA_NOMSG | Chunk Lists |
|        |         |         |            |             |
+--------+---------+---------+------------+-------------+
                                   |
                                   |   +----------
                                   |   |
                                   +-->| Long RPC Call
                                       |     or
                                       | Reply Message
                                       +----------
If the receiver gets an RPC-over-RDMA header with a message type of RDMA_NOMSG and finds an initial read chunk list entry with a zero XDR position, it allocates a registered buffer and issues an RDMA Read of the long RPC message into it. The receiver then proceeds to XDR decode the RPC message as if it had received it inline with the Send data. Further decoding may issue additional RDMA Reads to bring over additional chunks.
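The receiver path just described can be sketched as below. The `rdma_read` and `decode` callables stand in for an RDMA Read into a freshly registered buffer and for the normal XDR decoder; the dictionary-based header is an illustrative simplification, not the wire format.

```python
def receive(header, rdma_read, decode):
    """Sketch of the receiver path for RPC-over-RDMA messages.

    For RDMA_NOMSG with an initial read chunk at XDR position zero,
    pull the long RPC message over with an RDMA Read, then decode it
    exactly as if it had arrived inline with the Send data.
    """
    if header["proc"] == "RDMA_NOMSG":
        first = header["read_chunks"][0]
        assert first["position"] == 0      # long message: chunk at XDR offset 0
        body = rdma_read(first)            # RDMA Read into a registered buffer
    else:
        body = header["inline_body"]       # RDMA_MSG: body came with the Send
    return decode(body)                    # decoding may issue further Reads
```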
Although the handling of long messages requires one extra network turnaround, in practice these messages will be rare if the posted receive buffers are correctly sized, and of course they will be non-existent for RDMA-aware upper layers.
A long call RPC with request supplied via RDMA Read
RPC Client                             RPC Server
    |        RDMA-over-RPC Header          |
Send|   ------------------------------>    |
    |                                      |
    |        Long RPC Call Msg             |
    |   +------------------------------    | Read
    |   v----------------------------->    |
    |                                      |
    |        RDMA-over-RPC Reply           |
    |   <------------------------------    | Send
An RPC with long reply returned via RDMA Read
RPC Client                             RPC Server
    |            RPC Call                  |
Send|   ------------------------------>    |
    |                                      |
    |        RDMA-over-RPC Header          |
    |   <------------------------------    | Send
    |                                      |
    |        Long RPC Reply Msg            |
Read|   ------------------------------+    |
    |   <-----------------------------v    |
    |                                      |
    |               Done                   |
Send|   ------------------------------>    |
It is possible for a single RPC procedure to employ both a long call for its arguments and a long reply for its results. However, such an operation is atypical, as few upper layers define such exchanges.
A superior method of handling long RPC replies is to have the RPC client post a large buffer into which the server can write a large RPC reply. This has the advantage that an RDMA Write may be slightly faster in network latency than an RDMA Read, and does not require the server to wait for the completion as it must for RDMA Read. Additionally, for a reply it removes the need for an RDMA_DONE message if the large reply is returned as a Read chunk.
This protocol supports direct return of a large reply via the inclusion of an OPTIONAL rdma_reply write chunk after the read chunk list and the write chunk list. The client allocates a buffer sized to receive a large reply and enters its steering tag, address, and length in the rdma_reply write chunk. If the reply message is too long to return inline with an RDMA Send (exceeds the size of the client's posted receive buffer), even with read chunks removed, then the RPC server performs an RDMA Write of the RPC reply message into the buffer indicated by the rdma_reply chunk. If the client doesn't provide an rdma_reply chunk, or if it's too small, then if the upper-layer specification permits, the message MAY be returned as a Read chunk.
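The server's choice among these reply paths can be sketched as follows. The function and field names are illustrative, not protocol elements; whether the read-chunk fallback is permitted is a property of the upper-layer specification.

```python
def send_reply(reply_msg: bytes, client_recv_size: int,
               rdma_reply_chunk, read_chunk_allowed=True):
    """Sketch of a server choosing how to return an RPC reply:
    inline Send first, then the client-provided rdma_reply write
    chunk, then (if the upper layer permits) a read chunk, which the
    client must fetch and acknowledge with RDMA_DONE."""
    if len(reply_msg) <= client_recv_size:
        return "send_inline"
    if rdma_reply_chunk and rdma_reply_chunk["length"] >= len(reply_msg):
        # RDMA Write into the client's buffer; no RDMA_DONE needed.
        return "rdma_write_to_reply_chunk"
    if read_chunk_allowed:
        # Client issues an RDMA Read, then sends RDMA_DONE.
        return "return_as_read_chunk"
    raise ValueError("reply too large and no transfer method permitted")
```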
An RPC with long reply returned via RDMA Write
RPC Client                             RPC Server
    |      RPC Call with rdma_reply        |
Send|   ------------------------------>    |
    |                                      |
    |        Long RPC Reply Msg            |
    |   <------------------------------    | Write
    |                                      |
    |        RDMA-over-RPC Header          |
    |   <------------------------------    | Send
The use of RDMA Write to return long replies requires that the client application anticipate a long reply and have some knowledge of its size so that an adequately sized buffer can be allocated. This is certainly true of NFS READDIR replies, where the client already provides an upper bound on the size of the encoded directory fragment to be returned by the server.
The use of these "reply chunks" is highly efficient and convenient for both RPC client and server. Their use is encouraged for eligible RPC operations such as NFS READDIR, which would otherwise require extensive chunk management within the results or use of RDMA Read and a Done message [RFC5667].
RDMA Send operations require the receiver to post one or more buffers at the RDMA connection endpoint, each large enough to receive the largest Send message. Buffers are consumed as Send messages are received. If a buffer is too small, or if there are no buffers posted, the RDMA transport MAY return an error and break the RDMA connection. The receiver MUST post sufficient, adequately sized buffers to avoid buffer overrun or capacity errors.
The protocol described above includes only a mechanism for managing the number of such receive buffers and no explicit features to allow the RPC client and server to provision or control buffer sizing, nor any other session parameters.
In the past, this type of connection management has not been necessary for RPC. RPC over UDP or TCP does not have a protocol to negotiate the link. The server can get a rough idea of the maximum size of messages from the server protocol code. However, a protocol to negotiate transport features on a more dynamic basis is desirable.
The Connection Configuration Protocol allows the client to pass its connection requirements to the server, and allows the server to inform the client of its connection limits.
Use of the Connection Configuration Protocol by an upper layer is OPTIONAL.
This protocol MAY be used for connection setup prior to the use of another RPC protocol that uses the RDMA transport. It operates in-band, i.e., it uses the connection itself to negotiate the connection parameters. To provide a basis for connection negotiation, the connection is assumed to provide a basic level of interoperability: the ability to exchange at least one RPC message at a time that is at least 1 KB in size. The server MAY exceed this basic level of configuration, but the client MUST NOT assume more than one, and MUST receive a valid reply from the server carrying the actual number of available receive messages, prior to sending its next request.
Version 1 of the Connection Configuration Protocol consists of a single procedure that allows the client to inform the server of its connection requirements and the server to return connection information to the client.
The maxcall_sendsize argument is the maximum size of an RPC call message that the client MAY send inline in an RDMA Send message to the server. The server MAY return a maxcall_sendsize value that is smaller or larger than the client's request. The client MUST NOT send an inline call message larger than what the server will accept. The maxcall_sendsize limits only the size of inline RPC calls. It does not limit the size of long RPC messages transferred as an initial chunk in the Read chunk list.
The maxreply_sendsize is the maximum size of an inline RPC message that the client will accept from the server.
The maxrdmaread is the maximum number of RDMA Reads that may be active at the peer. This number correlates to the RDMA incoming RDMA Read count ("IRD") configured into each originating endpoint by the client or server. If more than this number of RDMA Read operations by the connected peer are issued simultaneously, connection loss or suboptimal flow control may result; therefore, the value SHOULD be observed at all times. The peers' values need not be equal. If zero, the peer MUST NOT issue requests that require RDMA Read to satisfy, as no transfer will be possible.
The align value is the value recommended by the server for opaque data values such as strings and counted byte arrays. The client MAY use this value to compute the number of prepended pad bytes when XDR encoding opaque values in the RPC call message.
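For example, the number of prepended pad bytes for an opaque value at a given XDR offset could be computed as below. This is a sketch of one plausible use of the align value; the protocol only says the client MAY use it this way.

```python
def pad_bytes(xdr_position: int, align: int) -> int:
    """Pad bytes to prepend so the opaque data that follows starts at
    the server's recommended alignment (illustrative helper)."""
    if align == 0:
        return 0
    # Distance from the current XDR position up to the next multiple
    # of `align`.
    return (-xdr_position) % align
```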
typedef unsigned int uint32;
struct config_rdma_req {
        uint32 maxcall_sendsize;   /* max size of inline RPC call */
        uint32 maxreply_sendsize;  /* max size of inline RPC reply */
        uint32 maxrdmaread;        /* max active RDMA Reads at client */
};
struct config_rdma_reply {
        uint32 maxcall_sendsize;   /* max call size accepted by server */
        uint32 align;              /* server's receive buffer alignment */
        uint32 maxrdmaread;        /* max active RDMA Reads at server */
};
program CONFIG_RDMA_PROG {
        version VERS1 {
                /*
                 * Config call/reply
                 */
                config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
        } = 1;
} = 100417;
RDMA requires that all data be transferred between registered memory regions at the source and destination. All protocol headers as well as separately transferred data chunks use registered memory. Since the cost of registering and de-registering memory can be a large proportion of the RDMA transaction cost, it is important to minimize registration activity. This is easily achieved within RPC controlled memory by allocating chunk list data and RPC headers in a reusable way from pre-registered pools.
The data chunks transferred via RDMA MAY occupy memory that persists outside the bounds of the RPC transaction. Hence, the default behavior of an RPC-over-RDMA transport is to register and de-register these chunks on every transaction. However, this is not a limitation of the protocol -- only of the existing local RPC API. The API is easily extended through such functions as rpc_control(3) to change the default behavior so that the application can assume responsibility for controlling memory registration through an RPC-provided registered memory allocator.
RPC RDMA protocol errors are described in Section 4. RPC errors and RPC error recovery are not affected by the protocol, and proceed as for any RPC error condition. RDMA transport error reporting and recovery are outside the scope of this protocol.
It is assumed that the link itself will provide some degree of error detection and retransmission. iWARP's Marker PDU Aligned (MPA) layer (when used over TCP), Stream Control Transmission Protocol (SCTP), as well as the InfiniBand link layer all provide Cyclic Redundancy Check (CRC) protection of the RDMA payload, and CRC-class protection is a general attribute of such transports. Additionally, the RPC layer itself can accept errors from the link level and recover via retransmission. RPC recovery can handle complete loss and re-establishment of the link.
See Section 11 for further discussion of the use of RPC-level integrity schemes to detect errors and related efficiency issues.
In setting up a new RDMA connection, the first action by an RPC client will be to obtain a transport address for the server. The mechanism used to obtain this address, and to open an RDMA connection is dependent on the type of RDMA transport, and is the responsibility of each RPC protocol binding and its local implementation.
RPC services normally register with a portmap or rpcbind [RFC1833] service, which associates an RPC program number with a service address. (In the case of UDP or TCP, the service address for NFS is normally port 2049.) This policy is no different with RDMA interconnects, although it may require the allocation of port numbers appropriate to each upper-layer binding that uses the RPC framing defined here.
When mapped atop the iWARP [RFC5040, RFC5041] transport, which uses IP port addressing due to its layering on TCP and/or SCTP, port mapping is trivial and consists merely of issuing the port in the connection process. The NFS/RDMA protocol service address has been assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP.
When mapped atop InfiniBand [IB], which uses a Group Identifier (GID)-based service endpoint naming scheme, a translation MUST be employed. One such translation is defined in the InfiniBand Port Addressing Annex [IBPORT], which is appropriate for translating IP port addressing to the InfiniBand network. Therefore, in this case, IP port addressing may be readily employed by the upper layer.
When a mapping standard or convention exists for IP ports on an RDMA interconnect, there are several possibilities for each upper layer to consider:
One possibility is to have an upper-layer server register its mapped IP port with the rpcbind service, under the netid (or netid's) defined here. An RPC/RDMA-aware client can then resolve its desired service to a mappable port, and proceed to connect. This is the most flexible and compatible approach, for those upper layers that are defined to use the rpcbind service.
A second possibility is to have the server's portmapper register itself on the RDMA interconnect at a "well known" service address. (On UDP or TCP, this corresponds to port 111.) A client could connect to this service address and use the portmap protocol to obtain a service address in response to a program number, e.g., an iWARP port number, or an InfiniBand GID.
Alternatively, the client could simply connect to the mapped well-known port for the service itself, if it is appropriately defined. By convention, the NFS/RDMA service, when operating atop such an InfiniBand fabric, will use the same 20049 assignment as for iWARP.
Historically, different RPC protocols have taken different approaches to their port assignment; therefore, the specific method is left to each RPC/RDMA-enabled upper-layer binding, and not addressed here.
In Section 12, "IANA Considerations", this specification defines two new "netid" values, to be used for registration of upper layers atop iWARP [RFC5040, RFC5041] and (when a suitable port translation service is available) InfiniBand [IB]. Additional RDMA-capable networks MAY define their own netids, or if they provide a port translation, MAY share the one defined here.
RPC provides its own security via the RPCSEC_GSS framework [RFC2203]. RPCSEC_GSS can provide message authentication, integrity checking, and privacy. This security mechanism will be unaffected by the RDMA transport. The data integrity and privacy features alter the body of the message, presenting it as a single chunk. For large messages the chunk may be large enough to qualify for RDMA Read transfer. However, there is much data movement associated with computation and verification of integrity, or encryption/decryption, so certain performance advantages may be lost.
For efficiency, a more appropriate security mechanism for RDMA links may be link-level protection, such as certain configurations of IPsec, which may be co-located in the RDMA hardware. The use of link-level protection MAY be negotiated through the use of the new RPCSEC_GSS mechanism defined in [RFC5403] in conjunction with the Channel Binding mechanism [RFC5056] and IPsec Channel Connection Latching [RFC5660]. Use of such mechanisms is REQUIRED where integrity and/or privacy is desired, and where efficiency is required.
An additional consideration is the protection of the integrity and privacy of local memory by the RDMA transport itself. The use of RDMA by RPC MUST NOT introduce any vulnerabilities to system memory contents, or to memory owned by user processes. These protections are provided by the RDMA layer specifications, and specifically their security models. It is REQUIRED that any RDMA provider used for RPC transport be conformant to the requirements of [RFC5042] in order to satisfy these protections.
Once delivered securely by the RDMA provider, any RDMA-exposed addresses will contain only RPC payloads in the chunk lists, transferred under the protection of RPCSEC_GSS integrity and privacy. By these means, the data will be protected end-to-end, as required by the RPC layer security model.
Where upper-layer protocols choose to supply results to the requester via read chunks, a server resource deficit can arise if the client does not promptly acknowledge their status via the RDMA_DONE message. This can potentially lead to a denial-of-service situation, with a single client unfairly (and unnecessarily) consuming server RDMA resources. Servers for such upper-layer protocols MUST protect against this situation, originating from one or many clients. For example, a time-based window of buffer availability may be offered; if the client fails to obtain the data within the window, it will simply retry using ordinary RPC retry semantics. Or, a more severe method would be for the server simply to close the client's RDMA connection, freeing the RDMA resources and allowing the server to reclaim them.
A fairer and more useful method is provided by the protocol itself. The server MAY use the rdma_credit value to limit the number of outstanding requests for each client. By including the number of outstanding RDMA_DONE completions in the computation of available client credits, the server can limit its exposure to each client, and therefore provide uninterrupted service as its resources permit.
However, the server must ensure that it does not decrease the credit count to zero with this method, since the RDMA_DONE message is not acknowledged. If the credit count were to drop to zero solely due to outstanding RDMA_DONE messages, the client would deadlock since it would never obtain a new credit with which to continue. Therefore, if the server adjusts credits to zero for outstanding RDMA_DONE, it MUST withhold its reply to at least one message in order to provide the next credit. The time-based window (or any other appropriate method) SHOULD be used by the server to recover resources in the event that the client never returns.
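The credit computation described above can be sketched as follows. The function name and the simple arithmetic are illustrative; a real server would fold this into its per-client flow-control state.

```python
def grant_credits(max_credits: int, outstanding_done: int) -> int:
    """Compute the credit value to advertise to a client, counting
    unacknowledged RDMA_DONE completions against its allowance, while
    never advertising zero: a zero grant with RDMA_DONE outstanding
    would deadlock the client, since RDMA_DONE itself is never
    acknowledged.  Illustrative sketch only."""
    available = max_credits - outstanding_done
    # Withhold at least one reply rather than advertise zero credits.
    return max(available, 1)
```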
The Connection Configuration Protocol, when used, MUST be protected by an appropriate RPC security flavor, to ensure it is not attacked in the process of initiating an RPC/RDMA connection.
Three new assignments are specified by this document:
- A new set of RPC "netids" for resolving RPC/RDMA services
- Optional service port assignments for upper-layer bindings
- An RPC program number assignment for the configuration protocol
These assignments have been established, as below.
The new RPC transport has been assigned an RPC "netid", which is an rpcbind [RFC1833] string used to describe the underlying protocol in order for RPC to select the appropriate transport framing, as well as the format of the service addresses and ports.
The following "Netid" registry strings are defined for this purpose:
NC_RDMA     "rdma"
NC_RDMA6    "rdma6"
These netids MAY be used for any RDMA network satisfying the requirements of Section 2, and able to identify service endpoints using IP port addressing, possibly through use of a translation service as described above in Section 10, "RPC Binding". The "rdma" netid is to be used when IPv4 addressing is employed by the underlying transport, and "rdma6" for IPv6 addressing.
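The resulting netid selection is trivial; as a sketch (the function name is illustrative):

```python
def netid_for(addr_family: str) -> str:
    """Map an endpoint's IP address family to the RPC netid registered
    for RPC-over-RDMA (sketch; names per the registry entries above)."""
    return {"ipv4": "rdma", "ipv6": "rdma6"}[addr_family]
```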
The netid assignment policy and registry are defined in [RFC5665].
As a new RPC transport, this protocol has no effect on RPC program numbers or existing registered port numbers. However, new port numbers MAY be registered for use by RPC/RDMA-enabled services, as appropriate to the new networks over which the services will operate.
For example, the NFS/RDMA service defined in [RFC5667] has been assigned the port 20049, in the IANA registry:
nfsrdma    20049/tcp    Network File System (NFS) over RDMA
nfsrdma    20049/udp    Network File System (NFS) over RDMA
nfsrdma    20049/sctp   Network File System (NFS) over RDMA
The OPTIONAL Connection Configuration Protocol described herein requires an RPC program number assignment. The value "100417" has been assigned:
rdmaconfig 100417 rpc.rdmaconfig
The RPC program number assignment policy and registry are defined in [RFC5531].
The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak, Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve Kleiman, Mike Eisler, Mark Wittle, Shantanu Mehendale, David Robinson, and Mallikarjun Chadalapaka for their contributions to this document.
[RFC1833] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", RFC 1833, August 1995.

[RFC2203] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol Specification", RFC 2203, September 1997.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC4506] Eisler, M., Ed., "XDR: External Data Representation Standard", STD 67, RFC 4506, May 2006.

[RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement Protocol (DDP) / Remote Direct Memory Access Protocol (RDMAP) Security", RFC 5042, October 2007.

[RFC5056] Williams, N., "On the Use of Channel Bindings to Secure Channels", RFC 5056, November 2007.

[RFC5403] Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, February 2009.

[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol Specification Version 2", RFC 5531, May 2009.

[RFC5660] Williams, N., "IPsec Channels: Connection Latching", RFC 5660, October 2009.

[RFC5665] Eisler, M., "IANA Considerations for Remote Procedure Call (RPC) Network Identifiers and Universal Address Formats", RFC 5665, January 2010.

[RFC1094] Sun Microsystems, "NFS: Network File System Protocol specification", RFC 1094, March 1989.

[RFC1813] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 Protocol Specification", RFC 1813, June 1995.

[RFC3530] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, C., Eisler, M., and D. Noveck, "Network File System (NFS) version 4 Protocol", RFC 3530, April 2003.

[RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. Garcia, "A Remote Direct Memory Access Protocol Specification", RFC 5040, October 2007.

[RFC5041] Shah, H., Pinkerton, J., Recio, R., and P. Culley, "Direct Data Placement over Reliable Transports", RFC 5041, October 2007.

[RFC5532] Talpey, T. and C. Juszczak, "Network File System (NFS) Remote Direct Memory Access (RDMA) Problem Statement", RFC 5532, May 2009.

[RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System Version 4 Minor Version 1 Protocol", RFC 5661, January 2010.

[RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS) Direct Data Placement", RFC 5667, January 2010.

[IB] InfiniBand Trade Association, InfiniBand Architecture Specifications, available from http://www.infinibandta.org.

[IBPORT] InfiniBand Trade Association, "IP Addressing Annex", available from http://www.infinibandta.org.
Authors' Addresses
Tom Talpey
170 Whitman St.
Stow, MA 01775
USA
EMail: tmtalpey@gmail.com
Brent Callaghan
Apple Computer, Inc.
MS: 302-4K
2 Infinite Loop
Cupertino, CA 95014
USA
EMail: brentc@apple.com