Independent Submission                                            M. Fox
Request for Comments: 7609                                   C. Kassimis
Category: Informational                                       J. Stevens
ISSN: 2070-1721                                                      IBM
                                                             August 2015
        

IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Abstract

This document describes IBM's Shared Memory Communications over RDMA (SMC-R) protocol. This protocol provides Remote Direct Memory Access (RDMA) communications to TCP endpoints in a manner that is transparent to socket applications. It further provides for dynamic discovery of partner RDMA capabilities and dynamic setup of RDMA connections, as well as transparent high availability and load balancing when redundant RDMA network paths are available. It maintains many of the traditional TCP/IP qualities of service such as filtering that enterprise users demand, as well as TCP socket semantics such as urgent data.

Status of This Memo

This document is not an Internet Standards Track specification; it is published for informational purposes.

This is a contribution to the RFC Series, independently of any other RFC stream. The RFC Editor has chosen to publish this document at its discretion and makes no statement about its value for implementation or deployment. Documents approved for publication by the RFC Editor are not a candidate for any level of Internet Standard; see Section 2 of RFC 5741.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc7609.

Copyright Notice

Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Table of Contents

   1. Introduction
      1.1. Protocol Overview
           1.1.1. Hardware Requirements
      1.2. Definition of Common Terms
      1.3. Conventions Used in This Document
   2. Link Architecture
      2.1. Remote Memory Buffers (RMBs)
      2.2. SMC-R Link Groups
           2.2.1. Link Group Types
           2.2.2. Maximum Number of Links in Link Group
           2.2.3. Forming and Managing Link Groups
           2.2.4. SMC-R Link Identifiers
      2.3. SMC-R Resilience and Load Balancing
   3. SMC-R Rendezvous Architecture
      3.1. TCP Options
      3.2. Connection Layer Control (CLC) Messages
      3.3. LLC Messages
      3.4. CDC Messages
      3.5. Rendezvous Flows
           3.5.1. First Contact
                  3.5.1.1. Pre-negotiation of TCP Options
                  3.5.1.2. Client Proposal
                  3.5.1.3. Server Acceptance
                  3.5.1.4. Client Confirmation
                  3.5.1.5. Link (QP) Confirmation
                  3.5.1.6. Second SMC-R Link Setup
                           3.5.1.6.1. Client Processing of ADD LINK LLC Message from Server
                           3.5.1.6.2. Server Processing of ADD LINK Reply LLC Message from Client
                           3.5.1.6.3. Exchange of RKeys on Second SMC-R Link
                           3.5.1.6.4. Aborting SMC-R and Falling Back to IP
           3.5.2. Subsequent Contact
                  3.5.2.1. SMC-R Proposal
                  3.5.2.2. SMC-R Acceptance
                  3.5.2.3. SMC-R Confirmation
                  3.5.2.4. TCP Data Flow Race with SMC Confirm CLC Message
           3.5.3. First Contact Variation: Creating a Parallel Link Group
           3.5.4. Normal SMC-R Link Termination
           3.5.5. Link Group Management Flows
                  3.5.5.1. Adding and Deleting Links in an SMC-R Link Group
                           3.5.5.1.1. Server-Initiated ADD LINK Processing
                           3.5.5.1.2. Client-Initiated ADD LINK Processing
                           3.5.5.1.3. Server-Initiated DELETE LINK Processing
                           3.5.5.1.4. Client-Initiated DELETE LINK Request
                  3.5.5.2. Managing Multiple RKeys over Multiple SMC-R Links in a Link Group
                           3.5.5.2.1. Adding a New RMB to an SMC-R Link Group
                           3.5.5.2.2. Deleting an RMB from an SMC-R Link Group
                           3.5.5.2.3. Adding a New SMC-R Link to a Link Group with Multiple RMBs
                  3.5.5.3. Serialization of LLC Exchanges, and Collisions
                           3.5.5.3.1. Collisions with ADD LINK / CONFIRM LINK Exchange
                           3.5.5.3.2. Collisions during DELETE LINK Exchange
                           3.5.5.3.3. Collisions during CONFIRM RKEY Exchange
   4. SMC-R Memory-Sharing Architecture
      4.1. RMB Element Allocation Considerations
      4.2. RMB and RMBE Format
      4.3. RMBE Control Information
      4.4. Use of RMBEs
           4.4.1. Initializing and Accessing RMBEs
           4.4.2. RMB Element Reuse and Conflict Resolution
      4.5. SMC-R Protocol Considerations
           4.5.1. SMC-R Protocol Optimized Window Size Updates
           4.5.2. Small Data Sends
           4.5.3. TCP Keepalive Processing
      4.6. TCP Connection Failover between SMC-R Links
           4.6.1. Validating Data Integrity
           4.6.2. Resuming the TCP Connection on a New SMC-R Link
      4.7. RMB Data Flows
           4.7.1. Scenario 1: Send Flow, Window Size Unconstrained
           4.7.2. Scenario 2: Send/Receive Flow, Window Size Unconstrained
           4.7.3. Scenario 3: Send Flow, Window Size Constrained
           4.7.4. Scenario 4: Large Send, Flow Control, Full Window Size Writes
           4.7.5. Scenario 5: Send Flow, Urgent Data, Window Size Unconstrained
           4.7.6. Scenario 6: Send Flow, Urgent Data, Window Size Closed
      4.8. Connection Termination
           4.8.1. Normal SMC-R Connection Termination Flows
           4.8.2. Abnormal SMC-R Connection Termination Flows
           4.8.3. Other SMC-R Connection Termination Conditions
   5. Security Considerations
      5.1. VLAN Considerations
      5.2. Firewall Considerations
      5.3. Host-Based IP Filters
      5.4. Intrusion Detection Services
      5.5. IP Security (IPsec)
      5.6. TLS/SSL
   6. IANA Considerations
   7. Normative References
   Appendix A. Formats
     A.1. TCP Option
     A.2. CLC Messages
          A.2.1. Peer ID Format
          A.2.2. SMC Proposal CLC Message Format
          A.2.3. SMC Accept CLC Message Format
          A.2.4. SMC Confirm CLC Message Format
          A.2.5. SMC Decline CLC Message Format
     A.3. LLC Messages
          A.3.1. CONFIRM LINK LLC Message Format
          A.3.2. ADD LINK LLC Message Format
          A.3.3. ADD LINK CONTINUATION LLC Message Format
          A.3.4. DELETE LINK LLC Message Format
          A.3.5. CONFIRM RKEY LLC Message Format
          A.3.6. CONFIRM RKEY CONTINUATION LLC Message Format
          A.3.7. DELETE RKEY LLC Message Format
          A.3.8. TEST LINK LLC Message Format
     A.4. Connection Data Control (CDC) Message Format
   Appendix B. Socket API Considerations
     B.1. setsockopt() / getsockopt() Considerations
   Appendix C. Rendezvous Error Scenarios
     C.1. SMC Decline during CLC Negotiation
     C.2. SMC Decline during LLC Negotiation
     C.3. The SMC Decline Window
     C.4. Out-of-Sync Conditions during SMC-R Negotiation
     C.5. Timeouts during CLC Negotiation
     C.6. Protocol Errors during CLC Negotiation
     C.7. Timeouts during LLC Negotiation
          C.7.1. Recovery Actions for LLC Timeouts and Failures
     C.8. Failure to Add Second SMC-R Link to a Link Group
   Authors' Addresses
        
1. Introduction

This document specifies IBM's Shared Memory Communications over RDMA (SMC-R) protocol. SMC-R is a protocol for Remote Direct Memory Access (RDMA) communication between TCP socket endpoints. SMC-R runs over networks that support RDMA over Converged Ethernet (RoCE). It is designed to permit existing TCP applications to benefit from RDMA without requiring modifications to the applications or predefinition of RDMA partners.

SMC-R provides dynamic discovery of the RDMA capabilities of TCP peers and automatic setup of RDMA connections that those peers can use. SMC-R also provides transparent high availability and load-balancing capabilities that are demanded by enterprise installations but are missing from current RDMA protocols. If redundant RoCE-capable hardware such as RDMA-capable Network Interface Cards (RNICs) and RoCE-capable switches is present, SMC-R can load-balance over that redundant hardware and can also non-disruptively move TCP traffic from failed paths to surviving paths, all seamlessly to the application and the sockets layer. Because SMC-R preserves socket semantics and the TCP three-way handshake, many TCP qualities of service such as filtering, load balancing, and Secure Socket Layer (SSL) encryption are preserved, as are TCP features such as urgent data.

Because of the dynamic discovery and setup of SMC-R connectivity between peers, no RDMA connection manager (RDMA-CM) is required. This also means that support for Unreliable Datagram (UD) Queue Pairs (QPs) is not required.

It is recommended that the SMC-R services be implemented in kernel space, which enables optimizations such as resource-sharing between connections across multiple processes and also permits applications using SMC-R to spawn multiple processes (e.g., fork) without losing SMC-R functionality. A user-space implementation is compatible with this architecture, but it may not support spawned processes (e.g., fork), which limits sharing and resource optimization to TCP connections that originate from the same process. This might be an appropriate design choice if the use case is a system that hosts a large single process application that creates many TCP connections to a peer host, or in implementations where a kernel-space implementation is not possible or introduces excessive overhead for "kernel space to user space" context switches.

1.1. Protocol Overview

SMC-R defines the concept of the SMC-R link, which is a logical point-to-point link using reliably connected queue pairs between TCP/IP stack peers over a RoCE fabric. An SMC-R link is bound to a specific hardware path, meaning a specific RNIC on each peer. SMC-R links are created and maintained by an SMC-R layer, which may reside in kernel space or user space, depending upon operating system and implementation requirements. The SMC-R layer resides below the sockets layer and directs data traffic for TCP connections between connected peers over the RoCE fabric using RDMA rather than over a TCP connection. The TCP/IP stack, with its requirements for fragmentation, packetization, etc., is bypassed, and the application data is moved between peers using RDMA.

Multiple SMC-R links between the same two TCP/IP stack peers are also supported. A set of SMC-R links called a link group can be logically bonded together to provide redundant connectivity. If there is redundant hardware -- for example, two RNICs on each peer -- separate SMC-R links are created between the peers to exploit that redundant hardware. The link group architecture with redundant links provides load balancing and increased bandwidth, as well as seamless failover.
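
The link group behavior described above can be illustrated with a minimal sketch. It is not part of the protocol definition: the class names, the RNIC labels, and the least-loaded placement policy are assumptions made for illustration only, since this overview does not state how connections are distributed across links.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """One SMC-R link, bound to a specific RNIC on each peer (labels are illustrative)."""
    link_id: int
    local_rnic: str
    remote_rnic: str
    active: bool = True
    connections: set = field(default_factory=set)


class LinkGroup:
    """A set of links between the same two peers, bonded for redundancy."""

    def __init__(self, links):
        self.links = list(links)

    def assign(self, conn_id):
        """Place a TCP connection on the least-loaded active link (assumed policy)."""
        candidates = [lk for lk in self.links if lk.active]
        if not candidates:
            raise RuntimeError("no usable SMC-R link in the group")
        link = min(candidates, key=lambda lk: len(lk.connections))
        link.connections.add(conn_id)
        return link

    def fail(self, link_id):
        """Mark a link as failed and move its connections to surviving links."""
        failed = next(lk for lk in self.links if lk.link_id == link_id)
        failed.active = False
        moved = sorted(failed.connections)
        failed.connections.clear()
        # Returns a map of connection id -> id of the link it now uses.
        return {conn: self.assign(conn).link_id for conn in moved}
```

With two links in the group, new connections alternate between them, and a call such as fail(1) moves every connection from link 1 to link 2 without involving the application. If no link survives, the sketch raises an error, which loosely corresponds to the connection being reset.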

Each SMC-R link group is associated with areas of memory called Remote Memory Buffers (RMBs), which are available for SMC-R peers to write into using RDMA writes. Multiple TCP connections between peers may be multiplexed over a single SMC-R link, in which case the SMC-R layer manages the partitioning of the RMBs between the TCP connections. This multiplexing reduces the RDMA resources, such as QPs and RMBs, that are required to support multiple connections between peers, and it also reduces the processing and delays related to setting up QPs, pinning memory, and other RDMA setup tasks when new TCP connections are created. In a kernel-space SMC-R implementation in which the RMBs reside in kernel storage, this sharing and optimization works across multiple processes executing on the same host. In a user-space SMC-R implementation in which the RMBs reside in user space, this sharing and optimization is limited to multiple TCP connections created by a single process, as separate RMBs and QPs will be required for each process.
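
As a rough illustration of how one RMB might be partitioned between several TCP connections, the following sketch hands each connection its own slice of a shared buffer. The class and method names are hypothetical, and the use of equal-sized elements is an assumption made to keep the example short; the actual RMB and RMB element formats are defined later in this document.

```python
class RemoteMemoryBuffer:
    """Toy model of an RMB split into equal-sized elements, one per TCP connection."""

    def __init__(self, total_size, element_size):
        if total_size <= 0 or element_size <= 0 or total_size % element_size:
            raise ValueError("total_size must be a positive multiple of element_size")
        self.element_size = element_size
        self.free = list(range(total_size // element_size))  # free element indexes
        self.owner = {}  # connection id -> element index

    def attach(self, conn_id):
        """Give a new TCP connection its own slice of the shared buffer."""
        if not self.free:
            raise MemoryError("RMB is full; another RMB would be needed")
        index = self.free.pop(0)
        self.owner[conn_id] = index
        start = index * self.element_size
        return start, start + self.element_size  # byte range [start, end)

    def detach(self, conn_id):
        """Return a connection's element to the free list for later reuse."""
        self.free.append(self.owner.pop(conn_id))
```

Because the buffer is registered once and then shared, attaching a new connection in this model is only a bookkeeping step. That mirrors the point made above: a new TCP connection does not need its own QP setup or memory pinning.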

SMC-R also introduces a rendezvous protocol that is used to dynamically discover the RDMA capabilities of TCP connection partners and exchange credentials necessary to exploit that capability if present. TCP connections are set up using the normal TCP three-way handshake [RFC793], with the addition of a new TCP option that indicates SMC-R capability. If both partners indicate SMC-R capability, then at the completion of the three-way TCP handshake the SMC-R layers in each peer take control of the TCP connection and use it to exchange additional Connection Layer Control (CLC) messages to negotiate SMC-R credentials such as QP information; addressability over the RoCE fabric; RMB buffer sizes; and keys and addresses for accessing RMBs over RDMA. If at any time during this negotiation a failure or decline occurs, the TCP connection falls back to using the IP fabric.
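
The outcome of the rendezvous described above can be summarized in a short sketch. The function name, its parameters, and the clc_exchange callable are all hypothetical stand-ins; the real CLC message exchange is specified in Section 3 and Appendix A.

```python
from enum import Enum


class Transport(Enum):
    SMC_R = "smc-r"
    IP = "ip"


def negotiate(client_smc_option, server_smc_option, clc_exchange):
    """Decide which fabric carries the connection's data after the TCP handshake.

    clc_exchange is a hypothetical callable standing in for the CLC message
    exchange. It returns True on success; it returns False or raises an
    exception on a decline or failure.
    """
    # Both partners must indicate SMC-R capability during the handshake.
    if not (client_smc_option and server_smc_option):
        return Transport.IP
    try:
        ok = clc_exchange()
    except Exception:
        # Any failure during negotiation falls back to the IP fabric.
        ok = False
    return Transport.SMC_R if ok else Transport.IP
```

For example, negotiate(True, True, lambda: True) selects Transport.SMC_R, while a missing option on either side, a decline, or an error during the exchange all select Transport.IP.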

If the SMC-R negotiation succeeds and either a new SMC-R link is set up or an existing SMC-R link is chosen for the TCP connection, then the SMC-R layers open the sockets to the applications and the applications use the sockets as normal. The SMC-R layer intercepts the socket reads and writes and moves the TCP connection data over the SMC-R link, "out of band" to the TCP connection, which remains open and idle over the IP fabric, except for termination flows and possible keepalive flows. Regular TCP sequence numbering methods are used for the TCP flows that do occur; data flowing over RDMA does not use or affect TCP sequence numbers.

This architecture does not support fallback of active SMC-R connections to IP. Once connection data has completed the switch to RDMA, a TCP connection cannot be switched back to IP and will reset if RDMA becomes unusable.

The SMC-R protocol defines the format of the RMBs that are used to receive TCP connection data written over RDMA, as well as the semantics for managing and writing to these buffers using Connection Data Control (CDC) messages.

Finally, SMC-R defines Link Layer Control (LLC) messages that are exchanged over the RoCE fabric between peer SMC-R layers to manage the SMC-R links and link groups. These include messages to test and confirm connectivity over an SMC-R link, add and delete SMC-R links to or from the link group, and exchange RMB addressability information.

最后,SMC-R定义了在对等SMC-R层之间通过RoCE结构交换的链路层控制(LLC)消息,以管理SMC-R链路和链路组。这些消息包括测试和确认SMC-R链路上的连接性、向链路组添加和删除SMC-R链路或从链路组中删除SMC-R链路以及交换RMB可寻址信息的消息。

1.1.1. Hardware Requirements
1.1.1. 硬件要求

SMC-R does not require full Converged Enhanced Ethernet switch functionality. SMC-R functions over standard Ethernet fabrics, provided that endpoint RNICs are provided and IEEE 802.3x Global Pause Frame is supported and enabled in the switch fabric.

SMC-R不需要完整的聚合增强以太网交换机功能。SMC-R在标准以太网结构上运行,前提是提供了端点RNIC,并且在交换机结构中支持并启用了IEEE 802.3x全局暂停帧。

While SMC-R as specified in this document is designed to operate over RoCE fabrics, adjustments to the rendezvous methods could enable it to run over other RDMA fabrics, such as InfiniBand [RoCE] and iWARP.

虽然本文件中规定的SMC-R设计用于在RoCE结构上运行,但对会合方法的调整可使其能够在其他RDMA结构上运行,如InfiniBand[RoCE]和iWARP。

1.2. Definition of Common Terms
1.2. 通用术语的定义

This section provides definitions of terms that have a specific meaning to the SMC-R protocol and are used throughout this document.

本节提供了对SMC-R协议具有特定含义的术语定义,并在本文件中使用。

SMC-R Link

SMC-R链路

An SMC-R link is a logical point-to-point connection over the RoCE fabric via specific physical adapters (Media Access Control / Global Identifier (MAC/GID)). The link is formed during the "first contact" sequence of the TCP/IP three-way handshake sequence that occurs over the IP fabric. During this handshake, an RDMA reliably connected queue pair (RC-QP) connection is formed between the two peer SMC hosts and is defined as the SMC-R link. The SMC-R link can then support multiple TCP connections between the two peers. An SMC-R link is associated with a single LAN (or VLAN) segment and is not routable.

SMC-R链路是通过特定物理适配器(媒体访问控制/全局标识符(MAC/GID))在RoCE结构上的逻辑点对点连接。链路在IP结构上发生的TCP/IP三方握手序列的“第一次接触”序列期间形成。在此握手期间,在两个对等SMC主机之间形成RDMA可靠连接队列对(RC-QP)连接,并将其定义为SMC-R链路。然后,SMC-R链路可以支持两个对等方之间的多个TCP连接。SMC-R链路与单个LAN(或VLAN)段关联,不可路由。

SMC-R Link Group

SMC-R链路组

An SMC-R link group is a group of SMC-R links between the same two SMC-R peers, typically with each link over unique RoCE adapters. Each link in the link group has equal characteristics, such as the same VLAN ID (if VLANs are in use), access to the same RMB(s), and access to the same TCP server/client.

SMC-R链路组是相同两个SMC-R对等点之间的一组SMC-R链路,通常每个链路通过唯一的RoCE适配器。链路组中的每个链路具有相同的特征,例如相同的VLAN ID(如果VLAN正在使用)、对相同RMB的访问以及对相同TCP服务器/客户端的访问。

SMC-R Peer

SMC-R对等机

The SMC-R peer is the peer software stack within the peer operating system with respect to the Shared Memory Communications (messaging) protocol.

SMC-R对等机是对等操作系统中与共享内存通信(消息传递)协议相关的对等软件堆栈。

SMC-R Rendezvous

SMC-R交会

SMC-R Rendezvous is the SMC-R peer discovery and handshake sequence. It occurs transparently over the IP (Ethernet) fabric during and immediately after the TCP three-way handshake, and it exchanges the SMC-R capabilities and credentials using an experimental TCP option and CLC messages.

SMC-R Rendezvous是SMC-R对等发现和握手序列,通过使用实验性TCP选项和CLC消息交换SMC-R能力和凭证,在TCP连接三方握手期间和之后立即通过IP(以太网)结构透明地发生。

RoCE SendMsg

RoCE SendMsg

RoCE SendMsg is a send operation posted to a reliably connected queue pair with inline data, for the purpose of transferring control information between peers.

RoCE SendMsg是一种发送操作,发送到具有内联数据的可靠连接队列对,用于在对等方之间传输控制信息。

TCP Client

TCP客户端

The TCP client is the TCP socket-based peer that initiates a TCP connection.

TCP客户端是启动TCP连接的基于TCP套接字的对等方。

TCP Server

TCP服务器

The TCP server is the TCP socket-based peer that accepts a TCP connection.

TCP服务器是接受TCP连接的基于TCP套接字的对等方。

CLC Messages

CLC消息

The SMC-R protocol defines a set of Connection Layer Control messages that flow over the TCP connection and are used to manage SMC-R link rendezvous at TCP connection setup time. This mechanism is analogous to SSL setup messages.

SMC-R协议定义了一组连接层控制消息,这些消息流经TCP连接,用于在TCP连接设置时管理SMC-R链路会合。此机制类似于SSL设置消息。

LLC Commands

LLC命令

The SMC-R protocol defines a set of RoCE Link Layer Control commands that flow over the RoCE fabric using RoCE SendMsg and are used to manage SMC-R links, SMC-R link groups, and SMC-R link group RMB expansion and contraction.

SMC-R协议定义了一组RoCE链路层控制命令,这些命令使用RoCE SendMsg流经RoCE结构,用于管理SMC-R链路、SMC-R链路组和SMC-R链路组的扩展和收缩。

CDC Message

CDC信息

The SMC-R protocol defines a Connection Data Control message that flows over the RoCE fabric using RoCE SendMsg and is used to manage the SMC-R connection data. This message provides information about data being transferred over the out-of-band RDMA connection, such as data cursors, sequence numbers, and data flags (for example, urgent data). The receipt of this message also provides an interrupt to inform the receiver that it has received RDMA data.

SMC-R协议使用用于管理SMC-R连接数据的RoCE SendMsg定义了一条连接数据控制消息,该消息在RoCE结构上流动。此消息提供有关通过带外RDMA连接传输的数据的信息,例如数据游标、序列号和数据标志(例如,紧急数据)。该消息的接收还提供一个中断,通知接收器它已接收到RDMA数据。

RMB

RMB

A Remote (RDMA) Memory Buffer is a fixed or pinned buffer allocated in each of the peer hosts for a TCP (via SMC-R) connection. The RMB is registered to the RNIC and allows remote access by the remote peer using RDMA semantics. Each host is passed the peer's RMB-specific access information (RMB Key (RKey) and RMB element offset) during the SMC-R Rendezvous process. The host stores socket application user data directly into the peer's RMB using RDMA over RoCE.

远程(RDMA)内存缓冲区是在每个对等主机中为TCP(通过SMC-R)连接分配的固定或固定缓冲区。RMB注册到RNIC,并允许远程对等方使用RDMA语义进行远程访问。在SMC-R会合过程中,向每个主机传递对等方的RMB特定访问信息(RMB密钥(RKey)和RMB元素偏移)。主机使用RDMA over RoCE将套接字应用程序用户数据直接存储到对等方的RMB中。

RToken

RToken

The RToken is the combination of an RMB's RKey and RDMA virtual address. An RToken provides RMB addressability information to an RDMA peer.

RToken是RMB的RKey和RDMA虚拟地址的组合。RToken向RDMA对等方提供RMB可寻址信息。

RMBE

RMBE

The Remote Memory Buffer Element (RMBE) is an area of an RMB that is allocated to a specific TCP connection. The RMBE contains data for the TCP connection. The RMBE represents the TCP receive buffer, whereby the remote peer writes into the RMBE and the local peer reads from the local RMBE. The alert token resolves to a specific RMBE.

远程内存缓冲区元素(RMBE)是分配给特定TCP连接的RMB区域。RMBE包含TCP连接的数据。RMBE表示TCP接收缓冲区,远程对等方写入RMBE,本地对等方从本地RMBE读取。警报令牌解析为特定的RMBE。

Alert Token

警报令牌

The SMC-R alert token is a 4-byte value that uniquely identifies the TCP connection over an SMC-R connection. The alert token allows the SMC peer to quickly identify the target TCP connection that now has new work. The format of the token is defined by the owning SMC-R endpoint and is considered opaque to the remote peer. However, the token should not simply be an index to an RMBE; it should reference a TCP connection and be able to be validated to avoid reading data from stale connections.

SMC-R警报令牌是一个4字节的值,用于唯一标识SMC-R连接上的TCP连接。警报令牌允许SMC对等方快速识别现在有新工作的目标TCP连接。令牌的格式由所属SMC-R端点定义,对远程对等方来说是不透明的。然而,代币不应该只是RMBE的索引;它应该引用TCP连接,并且能够进行验证,以避免从过时的连接读取数据。
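
One way to meet the requirement that the alert token is more than a bare RMBE index, and can be validated against stale connections, is to combine a slot number with a generation counter. The sketch below assumes an 8-bit slot and a 24-bit generation; that split, and the `AlertTokenTable` class itself, are illustrative choices, since the token format is left to the owning endpoint.

```python
from typing import Dict, Optional


class AlertTokenTable:
    """Hypothetical owner-side table mapping alert tokens to TCP connections."""

    INDEX_BITS = 8  # assumed layout: low 8 bits slot, high 24 bits generation

    def __init__(self) -> None:
        self._generation: Dict[int, int] = {}  # slot -> current generation
        self._conn: Dict[int, str] = {}        # slot -> connection identifier

    def assign(self, slot: int, conn_id: str) -> int:
        """Bind a connection to a slot and return a fresh 4-byte token."""
        if not 0 <= slot < (1 << self.INDEX_BITS):
            raise ValueError("slot out of range")
        gen = (self._generation.get(slot, 0) + 1) & 0xFFFFFF
        self._generation[slot] = gen
        self._conn[slot] = conn_id
        return (gen << self.INDEX_BITS) | slot

    def release(self, slot: int) -> None:
        """Drop the connection bound to a slot; its old token becomes stale."""
        self._conn.pop(slot, None)

    def resolve(self, token: int) -> Optional[str]:
        """Return the connection for a token, or None if the token is stale."""
        slot = token & ((1 << self.INDEX_BITS) - 1)
        gen = token >> self.INDEX_BITS
        if self._generation.get(slot) != gen:
            return None
        return self._conn.get(slot)
```

Because the generation changes each time a slot is reused, a token issued for an earlier connection no longer resolves once the slot has been reassigned.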

RNIC

RNIC

The RDMA-capable Network Interface Card (RNIC) is an Ethernet NIC that supports RDMA semantics and verbs using RoCE.

支持RDMA的网络接口卡(RNIC)是一个以太网NIC,支持使用RoCE的RDMA语义和动词。

First Contact

第一次接触

"First contact" describes an SMC-R negotiation to set up the first link in a link group.

“第一次接触”描述了SMC-R协商以建立链路组中的第一个链路。

Subsequent Contact

后续联系

"Subsequent contact" describes an SMC-R negotiation between peers who are using an already-existing SMC-R link group.

“后续联系”描述了使用现有SMC-R链路组的对等方之间的SMC-R协商。

1.3. Conventions Used in This Document
1.3. 本文件中使用的公约
   In the rendezvous flow diagrams, dashed lines (----) are used to
   indicate flows over the TCP/IP fabric and dotted lines (....) are
   used to indicate flows over the RoCE fabric.
        
   In the data transfer ladder diagrams, dashed lines (----) are used to
   indicate RDMA write operations and dotted lines (....) are used to
   indicate CDC messages, which are RDMA messages with inline data that
   contain control information for the connection.
        
2. Link Architecture
2. 链路结构

An SMC-R link is based on reliably connected queue pairs (QPs) that form a "logical point-to-point link" between the two SMC-R peers over a RoCE fabric. An SMC-R link extends from SMC-R peer to SMC-R peer, where typically each peer would be a TCP/IP stack and would reside on separate hosts.

SMC-R链路基于可靠连接的队列对(QP),它们通过RoCE结构在两个SMC-R对等方之间形成“逻辑点对点链路”。SMC-R链路从SMC-R对等点扩展到SMC-R对等点,其中每个对等点通常是TCP/IP堆栈,并驻留在单独的主机上。

                            ,,.--..,_
     +----+             _-``         `-,           +-----+
     |QP 8|            -   RoCE         ',         |QP 64|
     |    |          /     VLAN M         .        |     |
     +----+--------+/                     \+-------+-----+
      | RNIC 1     |    SMC-R Link         | RNIC 2     |
      |            |<--------------------->|            |
      +------------+ ,                    /+------------+
              MAC A (GID A)             MAC B (GID B)
                       .                .`
                        `',          ,-`
                           ``''--''``
        

Figure 1: SMC-R Link Overview

图1:SMC-R链路概述

Figure 1 illustrates an overview of the basic concepts of SMC-R peer-to-peer connectivity; this is called the SMC-R link. The SMC-R link forms a logical point-to-point connection between two SMC-R peers via RoCE. The SMC-R link is defined and identified by the following attributes:

图1说明了SMC-R点对点连接的基本概念概述;这称为SMC-R链路。SMC-R链路通过RoCE在两个SMC-R对等点之间形成逻辑点对点连接。SMC-R链路由以下属性定义和标识:

SMC-R link = RC QPs (source VMAC GID QP + target VMAC GID QP + VLAN ID)

SMC-R链路=RC QPs(源VMAC GID QP+目标VMAC GID QP+VLAN ID)
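
The identifying attributes listed above map naturally onto an immutable record. The following is a minimal sketch; the class and field names (`Endpoint`, `SmcrLinkId`, `vlan_id`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    mac: str  # (virtual) MAC address of the RNIC
    gid: str  # RoCE Global Identifier
    qp: int   # reliably connected queue pair number


@dataclass(frozen=True)
class SmcrLinkId:
    source: Endpoint
    target: Endpoint
    vlan_id: Optional[int] = None  # None when VLANs are not in use
```

Frozen dataclasses compare by value and are hashable, so such an identifier can be used directly as a dictionary key when looking up an existing link.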

The SMC-R link can optionally be associated with a VLAN ID. If VLANs are in use for the associated IP (LAN) connection, then the VLAN attribute is carried over on the SMC-R link. When VLANs are in use, each SMC-R link group is associated with a single and specific VLAN. The RoCE fabric is the same physical Ethernet LAN used for standard TCP/IP-over-Ethernet communications, with switches as described in Section 1.1.1.

SMC-R链路可以选择性地与VLAN ID关联。如果VLAN用于关联的IP(LAN)连接,则VLAN属性将在SMC-R链路上传递。当使用VLAN时,每个SMC-R链路组与单个特定VLAN相关联。RoCE结构与用于标准TCP/IP over Ethernet通信的物理以太网LAN相同,具有第1.1.1节所述的交换机。

An SMC-R link is designed to support multiple TCP connections between the same two peers. An SMC-R link is intended to be long lived, while the underlying TCP connections can dynamically come and go. The associated RMBs can also be dynamically added and removed from the link as needed. The first TCP connection between the peers establishes the SMC-R link. Subsequent TCP connections then use the previously established link. When the last TCP connection terminates, the link can then be terminated, typically after an implementation-defined idle timeout period has elapsed. The TCP server is responsible for initiating and terminating the SMC-R link.

SMC-R链路设计用于支持同两个对等方之间的多个TCP连接。SMC-R链路的寿命较长,而底层TCP连接可以动态地来来去去。相关的RMB也可以根据需要动态添加和从链接中删除。对等点之间的第一个TCP连接建立SMC-R链路。随后的TCP连接使用先前建立的链接。当最后一个TCP连接终止时,通常在经过实现定义的空闲超时时间后,可以终止链路。TCP服务器负责启动和终止SMC-R链路。
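
The link lifetime rules above (the first connection establishes the link, later connections reuse it, and the link may be terminated after an idle period) can be modeled as follows. The class `LinkLifecycle` and its method names are assumptions made for illustration, and time is passed in explicitly to keep the sketch deterministic.

```python
from typing import Optional


class LinkLifecycle:
    """Tracks TCP connections on one SMC-R link and when it may be torn down."""

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout  # implementation-defined, in seconds
        self.established = False
        self.connections = 0
        self.idle_since: Optional[float] = None  # when the last connection closed

    def open_connection(self) -> str:
        """Return 'establish' if the link had to be set up, else 'reuse'."""
        self.connections += 1
        self.idle_since = None  # any pending idle timer is cancelled
        if not self.established:
            self.established = True
            return "establish"
        return "reuse"

    def close_connection(self, now: float) -> None:
        if self.connections == 0:
            raise RuntimeError("no open connections")
        self.connections -= 1
        if self.connections == 0:
            self.idle_since = now  # start the idle timer

    def may_terminate(self, now: float) -> bool:
        """True once the link has carried no connections for idle_timeout."""
        return (
            self.established
            and self.connections == 0
            and self.idle_since is not None
            and now - self.idle_since >= self.idle_timeout
        )
```

A new connection that arrives during the idle period simply reuses the link, which is why the timer is cleared in `open_connection`.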

2.1. Remote Memory Buffers (RMBs)
2.1. 远程内存缓冲区(RMBs)

Figure 2 shows the hosts -- Hosts X and Y -- and their associated RMBs within each host. With the SMC-R link, and the associated RKeys and RDMA virtual addresses, each SMC-R-enabled TCP/IP stack can remotely access its peer's RMBs using RDMA. The RKeys and virtual addresses are exchanged during the rendezvous processing when the link is established. The combination of the RKey and the virtual address is the RToken. Note that the SMC-R link ends at the QP providing access to the RMB (via the link + RToken).

图2显示了主机X和Y及其在每个主机内的关联RMB。通过SMC-R链路以及相关的RKey和RDMA虚拟地址,每个支持SMC-R的TCP/IP堆栈都可以使用RDMA远程访问其对等方的RMB。当链路建立时,在会合处理期间交换RKey和虚拟地址。RKey和虚拟地址的组合是RToken。请注意,SMC-R链路在QP处结束,提供对RMB的访问(通过链路+RToken)。

          Host X                                     Host Y
     +-------------------+        ,.--.,_       +-------------------+
     |                   |     .'`       '.     |                   |
     | Protection        |   ,'            `,   |    Protection     |
     | Domain X          |  /                \  |    Domain Y       |
     |            +------+ /                  \ +------+            |
     |       QP 8 |RNIC 1| |   SMC-R Link     | |RNIC 2|  QP 64     |
     |        |   |      |<-------------------->|      |   |        |
     |        |   |      ||                    ||      |   |        |
     |        |   +------+|    VLAN A          |+------+   |        |
     |        |          ||                    ||          |        |
     |        |          | |   RoCE           | |          |        |
     |        |RToken X  | \                  / |RToken Y  |        |
     |        |          |  \                /  |          |        |
     |        V          |   `.            ,'   |          V        |
     | +--------+        |     '._       ,'     |        +--------+ |
     | |        |        |        `''-'``       |        |        | |
     | | RMB    |        |                      |        | RMB    | |
     | |        |        |                      |        |        | |
     | +--------+        |                      |        +--------+ |
     +-------------------+                      +-------------------+
        

Figure 2: SMC-R Link and RMBs

图2:SMC-R链路和RMBs

An SMC-R link can support multiple RMBs that are independently managed by each peer. The number and the size of RMBs are managed by the peers based on the host's unique memory management requirements; however, the maximum number of RMBs that can be associated to a link group on one peer is 255. The QP has a single protection domain, but each RMB has a unique RToken. All RTokens must be exchanged with the peer.

SMC-R链路可以支持由每个对等方独立管理的多个RMB。RMB的数量和大小由对等方根据主机独特的内存管理要求进行管理;但是,在一个对等机上可以关联到链路组的最大RMB数为255。QP有一个单一的保护域,但每个RMB有一个唯一的RToken。必须与对等方交换所有RTOKEN。

Each peer manages the RMBs in its local memory for its remote SMC-R peer by sharing access to the RMBs via RTokens with its peers. The remote peer writes into the RMBs via RDMA, and the local peer (RMB owner) then reads from the RMBs.

每个对等方通过其对等方通过RTokens共享对RMBs的访问,在其本地内存中为其远程SMC-R对等方管理RMBs。远程对等方通过RDMA写入RMBs,然后本地对等方(RMB所有者)读取RMBs。

When two peers decide to use SMC-R for a given TCP connection, they each allocate a local RMB element for the TCP connection and communicate the location of this local RMB element during rendezvous processing. To that end, RMB elements are created in pairs, with one RMB element allocated locally on each peer of the SMC-R link.

当两个对等方决定对给定的TCP连接使用SMC-R时,它们各自为TCP连接分配一个本地RMB元素,并在会合处理过程中告知该本地RMB元素的位置。为此,RMB元素成对创建,在SMC-R链路的每个对等点上本地分配一个RMB元素。

                  ---  +------------+---------------+
                  /\   |Eye Catcher |               |
                   |   +------------+               |
                   |   |                            |
         RMB Element 1 |                            |
                   |   |   Receive Buffer           |
                   |   |                            |
                   |   |                            |
                  \/   |                            |
                  ---  +------------+---------------+
                  /\   |Eye Catcher |               |
                   |   +------------+               |
                   |   |                            |
         RMB Element 2 |                            |
                   |   |   Receive Buffer           |
                   |   |                            |
                   |   |                            |
                  \/   |                            |
                  ---  +----------------------------+
                       |            .               |
                       |            .               |
                       |            .               |
                       |            .               |
                       |    (up to 255 elements)    |
                       +----------------------------+
        

Figure 3: RMB Format

图3:RMB格式

Figure 3 illustrates the basic format of an RMB. The RMB is a virtual memory buffer whose backing real memory is pinned; it can support up to 255 TCP connections to exactly one remote SMC-R peer. Each RMB is therefore associated with the SMC-R links within a link group for the two peers and a specific RoCE Protection Domain. Other than the two peers identified by the SMC-R link, no other SMC-R peers can have RDMA access to an RMB; this requires a unique Protection Domain for every SMC-R link. This is critical to ensure integrity of SMC-R communications.

图3显示了RMB的基本格式。RMB是一个虚拟内存缓冲区,其后备实际内存是固定的;它最多可以支持到恰好一个远程SMC-R对等方的255个TCP连接。因此,每个RMB与两个对等方的链路组内的SMC-R链路以及特定的RoCE保护域相关联。除了SMC-R链路标识的两个对等方外,其他SMC-R对等方都不能通过RDMA访问RMB;这要求每个SMC-R链路具有唯一的保护域。这对于确保SMC-R通信的完整性至关重要。

RMBs are subdivided into multiple elements for efficiency, with each RMB Element (RMBE) associated with a single TCP connection. Therefore, multiple TCP connections across an SMC-R link group can share the same memory for RDMA purposes, reducing the overhead of having to register additional memory with the RNIC for every new TCP connection. The number of elements in an RMB and the size of each RMBE are entirely governed by the owning peer, subject to the SMC-R architecture rules; however, all RMB elements within a given RMB must be the same size. Each peer can decide the level of resource-sharing that is desirable across TCP connections based on local constraints, such as available system memory.

为了提高效率,RMB被细分为多个元素,每个RMB元素(RMBE)与一个TCP连接相关联。因此,一个SMC-R链路组中的多个TCP连接可以出于RDMA目的共享同一内存,从而减少了为每个新TCP连接向RNIC注册额外内存的开销。RMB中的元素数量和每个RMBE的大小完全由拥有的对等方控制,并遵守SMC-R体系结构规则;但是,给定RMB中的所有RMB元素必须大小相同。每个对等方都可以根据本地约束(例如可用的系统内存)决定TCP连接间所需的资源共享级别。

An RMB element is identified to the remote SMC-R peer via an RMB Element Token, which consists of the following:

RMB元素通过RMB元素令牌标识给远程SMC-R对等方,该令牌包括以下内容:

o RMB RToken: The combination of the RKey and virtual address provided by the RNIC that identifies the start of the RMB for RDMA operations.

o RMB RToken:RNIC提供的RKey和虚拟地址的组合,用于标识RDMA操作的RMB的开始。

o RMB Index: Identifies the RMB element index in the RMB. Used to locate a specific RMB element within an RMB. Valid value range is 1-255.

o RMB索引:标识RMB中的RMB元素索引。用于在RMB中定位特定的RMB元素。有效值范围为1-255。

o RMB Element Length: The length of the RMB element's eye catcher plus the length of the receive buffer. This length is equal for all RMB elements in a given RMB. This length can be variable across different RMBs.

o RMB元素长度:RMB元素的eye catcher长度加上接收缓冲区的长度。给定RMB中所有RMB元素的该长度相同。不同RMB之间该长度可以不同。
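
The three components of the RMB Element Token listed above can be sketched as a small record. The class names are hypothetical, and `element_address` assumes that elements are packed back to back from the start of the RMB, as Figure 3 suggests; the text above does not state that formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RToken:
    rkey: int   # RKey provided by the RNIC for the registered RMB
    vaddr: int  # RDMA virtual address of the start of the RMB


@dataclass(frozen=True)
class RmbElementToken:
    rtoken: RToken
    index: int           # RMB element index, valid range 1-255
    element_length: int  # eye catcher plus receive buffer, in bytes

    def __post_init__(self) -> None:
        if not 1 <= self.index <= 255:
            raise ValueError("RMB index must be in the range 1-255")
        if self.element_length <= 0:
            raise ValueError("element length must be positive")

    def element_address(self) -> int:
        """Start of this element, assuming equally sized, contiguous elements."""
        return self.rtoken.vaddr + (self.index - 1) * self.element_length
```

Since every element in a given RMB has the same length, the index and length together are enough to locate an element relative to the RToken's virtual address under this layout assumption.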

Multiple RMBs can be associated to an SMC-R link group, and each peer in an SMC-R link group manages allocation of its RMBs. RMB allocation can be asymmetric. For example, Server X can allocate two RMBs to an SMC-R link group while Server Y allocates five. This provides maximum implementation flexibility to allow hosts to optimize RMB management for their own local requirements. The maximum number of RMBs that can be allocated on one peer to a link group is 255. If more RMBs are required, the peer may fall back to IP for subsequent connections or, if the peer is the server, create a parallel link group.

多个RMB可以关联到SMC-R链路组,SMC-R链路组中的每个对等方管理其RMB的分配。RMB分配可以是不对称的。例如,服务器X可以向SMC-R链路组分配两个RMB,而服务器Y分配五个RMB。这提供了最大的实现灵活性,允许主机根据其本地需求优化RMB管理。可在一个对等方上分配给链路组的最大RMB数为255。如果需要更多RMB,对等方可以退回到IP进行后续连接,或者,如果对等方是服务器,则创建并行链路组。

One use case for multiple RMBs is multiple receive buffer sizes. Since every element in an RMB must be the same size, multiple RMBs with different element sizes can be allocated if varying receive buffer sizes are required.

多个RMB的一个用例是多个接收缓冲区大小。由于RMB中的每个元素必须具有相同的大小,如果需要不同的接收缓冲区大小,则可以分配具有不同元素大小的多个RMB。

Also, since the maximum number of TCP connections whose receive buffers can be allocated to an RMB is 255, multiple RMBs may be required to provide capacity for large numbers of TCP connections between two peers.

此外,由于其接收缓冲区可分配给RMB的TCP连接的最大数量为255,因此可能需要多个RMB来为两个对等方之间的大量TCP连接提供容量。
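
The two limits above (255 elements per RMB and 255 RMBs per peer per link group) lead to simple capacity arithmetic. The helper names below are hypothetical, and the element sizes in the usage note are arbitrary example values.

```python
from typing import Dict

MAX_CONNECTIONS_PER_RMB = 255  # receive buffers (elements) per RMB
MAX_RMBS_PER_LINK_GROUP = 255  # RMBs one peer can associate with a link group


def rmbs_needed(connections: int) -> int:
    """RMBs of one element size needed to hold this many receive buffers."""
    if connections < 0:
        raise ValueError("connections must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return (connections + MAX_CONNECTIONS_PER_RMB - 1) // MAX_CONNECTIONS_PER_RMB


def fits_in_one_link_group(connections_by_size: Dict[int, int]) -> bool:
    """connections_by_size maps a receive buffer size to a connection count.

    Each distinct size needs its own RMBs, because all elements in an RMB
    must be the same size.
    """
    total = sum(rmbs_needed(n) for n in connections_by_size.values())
    return total <= MAX_RMBS_PER_LINK_GROUP
```

For example, 300 connections with 64 KB buffers plus 10 connections with 256 KB buffers need 2 + 1 = 3 RMBs. Taken together, the two limits imply at most 255 x 255 = 65,025 receive buffers on one peer per link group; beyond that, the text above describes falling back to IP or, on the server, creating a parallel link group.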

Separately from the RMB, the TCP/IP stack that owns each RMB maintains control data for each RMB element within its local control structures. The control data contains flags for maintaining the state of the TCP data (for example, urgent data indicator) and, most importantly, the following two cursors, which are illustrated below in Figure 4:

与RMB不同,拥有每个RMB的TCP/IP堆栈在其本地控制结构中维护每个RMB元素的控制数据。控制数据包含用于维护TCP数据状态的标志(例如,紧急数据指示器)以及最重要的以下两个游标,如图4所示:

o The peer producer cursor: This is a wrapping offset into the RMB element's receive buffer that points to the next byte of data to be written by the remote peer. This cursor is provided by the remote peer in a Connection Data Control (CDC) message, which is sent using RoCE SendMsg processing, and tells the local peer how far it can consume data in the RMBE buffer.

o 对等生产者游标:这是RMB元素接收缓冲区中的换行偏移量,指向远程对等方要写入的下一个数据字节。此游标由远程对等方在连接数据控制(CDC)消息中提供,该消息使用RoCE SendMsg处理发送,并告知本地对等方它可以在RMBE缓冲区中使用数据的距离。

o The peer consumer cursor: This is a wrapping offset into the remote peer's RMB element's receive buffer that points to the next byte of data to be consumed by the remote peer in its own RMBE. The local peer cannot write into the remote peer's RMBE beyond this point without causing data loss. This cursor is also provided by the peer using a Connection Data Control message.

o 对等消费者游标:这是远程对等方的RMB元素的接收缓冲区中的换行偏移量,指向远程对等方在其自身RMBE中要消费的下一个数据字节。本地对等方无法写入远程对等方的RMBE超过此点,而不会导致数据丢失。该游标也由对等方使用连接数据控制消息提供。

Each TCP connection peer maintains its cursors for a TCP connection's RMBE in its local control structures. In other words, the peer who writes into a remote peer's RMBE provides its producer cursor to the peer whose RMBE it has written into. The peer who reads from its RMBE provides its consumer cursor to the writing peer. In this manner, the reads and writes between peers are kept coordinated.

每个TCP连接对等方在其本地控制结构中为TCP连接的RMBE维护其游标。换句话说,写入远程对等方的RMBE的对等方将其生产者游标提供给其已写入RMBE的对等方。从RMBE读取的对等方将其消费者光标提供给写入对等方。以这种方式,对等点之间的读写保持协调。

For example, referring to Figure 4, Peer B writes the data shown as the hashed area into the receive buffer of Peer A's RMBE. After that write completes, Peer B uses a CDC message to update its producer cursor to Peer A, to indicate to Peer A how much data is available for Peer A to consume. The CDC message that Peer B sends to Peer A wakes up Peer A and notifies it that there is data to be consumed.

例如,参考图4,对等方B将散列数据写入对等方A的RMBE的接收缓冲区。写入完成后,对等方B使用CDC消息将其生产者游标更新到对等方a,以向对等方a指示有多少数据可供对等方a使用。对等方B发送给对等方A的CDC消息唤醒对等方A并通知其有数据要消耗。

Similarly, when Peer A consumes data written by Peer B, it uses a CDC message to update its consumer cursor to Peer B to let Peer B know how much data it has consumed, so Peer B knows how much space is available for further writes. If Peer B were to write enough data to Peer A that it would wrap the RMBE receive buffer and exceed the consumer cursor, data loss would result.

类似地,当对等方A使用对等方B写入的数据时,它使用CDC消息将其使用者光标更新到对等方B,以让对等方B知道它已消耗了多少数据,因此对等方B知道有多少空间可用于进一步写入。如果对等方B向对等方A写入足够的数据,使其能够包装RMBE接收缓冲区并超过使用者游标,则会导致数据丢失。
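
The cursor coordination described above can be illustrated from the writer's side. This is a simplified sketch: it tracks running byte totals so that a full buffer can be told apart from an empty one, which is an assumption of this example rather than the cursor representation used by the protocol, and the class and method names are hypothetical.

```python
class RmbeWriterView:
    """Writer-side view of the peer's RMBE receive buffer (simplified)."""

    def __init__(self, size: int) -> None:
        self.size = size   # receive buffer size in bytes
        self.produced = 0  # total bytes this peer has written so far
        self.consumed = 0  # total bytes the peer has reported consuming

    @property
    def producer_cursor(self) -> int:
        """Wrapping offset of the next byte to be written."""
        return self.produced % self.size

    @property
    def consumer_cursor(self) -> int:
        """Wrapping offset of the next byte the peer will consume."""
        return self.consumed % self.size

    def free_space(self) -> int:
        """Bytes that can be written without passing the consumer cursor."""
        return self.size - (self.produced - self.consumed)

    def write(self, nbytes: int) -> int:
        """Accept at most free_space() bytes; return how many were accepted."""
        accepted = min(nbytes, self.free_space())
        self.produced += accepted
        return accepted

    def on_consumer_update(self, consumed_total: int) -> None:
        """Apply a consumer cursor update received from the peer."""
        if not self.consumed <= consumed_total <= self.produced:
            raise ValueError("consumer update out of range")
        self.consumed = consumed_total
```

Clamping each write to `free_space()` is what prevents the writer from wrapping the buffer past the consumer cursor, which is the data-loss case described above.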

Note that this is a simplistic description of the control flows, and they are optimized to minimize the number of CDC messages required, as described in Section 4.7 ("RMB Data Flows").

请注意,这是对控制流的过于简单的描述,它们经过优化以最小化所需的CDC消息数量,如第4.7节(“RMB数据流”)所述。

      Peer A's RMBE Control Info            Peer B's RMBE Control Info
     +--------------------------+          +--------------------------+
     |                          |          |                          |
      /----Peer producer cursor |    +-----+-Peer consumer cursor     |
    /|                          |    |     |                          |
   | +--------------------------+    |     +--------------------------+
   |  Peer A's RMBE                  |
   | +--------------------------+    |
   | |            +------------------+
   | |            |             |
   | |            \/            |
   | |             +------------|
   | |-------------+/////////// |
   | |//RDMA data written by ///|
   | |/// Peer B that is ////// |
   | |/available to be consumed/|
   | |///////////////////////// |
   | |///////// +---------------|
   | |----------+/\             |
   | |            |             |
    \|            |             |
     \           /              |
     |\---------/               |
     |                          |
     |                          |
        

Figure 4: RMBE Cursors

图4:RMBE游标

Additional flags and indicators are communicated between peers. In all cases, these flags and indicators are updated by the peer using CDC messages, which are sent using RoCE SendMsg. More details on these additional flags and indicators are described in Section 4.3 ("RMBE Control Information").

其他标志和指示器在对等方之间进行通信。在所有情况下,这些标志和指标都由对等方使用CDC消息更新,CDC消息使用RoCE SendMsg发送。第4.3节(“RMBE控制信息”)中描述了有关这些附加标志和指示器的更多详细信息。

2.2. SMC-R Link Groups
2.2. SMC-R链路组

SMC-R links are logically grouped together to form an SMC-R link group. The purpose of the link group is to support multiple links between the same two peers in order to provide:

SMC-R链路在逻辑上分组在一起以形成SMC-R链路组。链路组的目的是支持同两个对等点之间的多个链路,以提供:

o Resilience: Provides transparent and dynamic switching of the link used by existing TCP connections during link failures, typically hardware related. TCP traffic using the failing link can be switched to an active link within the link group, thereby avoiding disruptions to application workloads.

o 弹性:在链路故障(通常与硬件相关)期间,为现有TCP连接使用的链路提供透明和动态切换。使用故障链路的TCP通信可以切换到链路组内的活动链路,从而避免中断应用程序工作负载。

o Link utilization: Provides an active/active link usage model allowing TCP traffic to be balanced across the links, which increases bandwidth and also avoids hardware imbalances and bottlenecks. Note that both adapter and switch utilization can become potential resource constraint issues.

o 链路利用率:提供一个主动/主动链路使用模型,允许跨链路平衡TCP流量,从而增加带宽,同时避免硬件失衡和瓶颈。请注意,适配器和交换机的利用率都可能成为潜在的资源约束问题。

SMC-R link group support is required. Resilience is not optional. However, the user can elect to provision a single RNIC (on one or both hosts).

需要SMC-R链路组支持。恢复力不是可选的。但是,用户可以选择提供单个RNIC(在一台或两台主机上)。
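
The resilience and link utilization goals above can be illustrated with a toy link group that spreads connections over its active links and moves them when a link fails. The least-loaded placement policy and the `LinkGroup` class are assumptions for illustration only; the text above does not prescribe a balancing algorithm.

```python
from typing import Dict, Iterable, Set


class LinkGroup:
    """Toy model of TCP connection placement across the links of a link group."""

    def __init__(self, link_ids: Iterable[str]) -> None:
        # Active links only; each maps to the connections it carries.
        self.load: Dict[str, Set[str]] = {link_id: set() for link_id in link_ids}

    def assign(self, conn: str) -> str:
        """Place a connection on the least-loaded active link."""
        if not self.load:
            raise RuntimeError("no active link in the link group")
        link = min(self.load, key=lambda link_id: len(self.load[link_id]))
        self.load[link].add(conn)
        return link

    def fail_link(self, link_id: str) -> Dict[str, str]:
        """Remove a failed link and move its connections to surviving links.

        Returns a mapping of connection to its new link. Raises RuntimeError
        if no link survives, in which case the connections cannot be moved.
        """
        orphans = self.load.pop(link_id)
        return {conn: self.assign(conn) for conn in sorted(orphans)}
```

With a single provisioned RNIC the group holds one link, so `fail_link` has nowhere to move connections; that is the practical cost of the single-RNIC option mentioned above.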

Multiple links that are formed between the same two peers fall into two distinct categories:

在相同的两个对等点之间形成的多个链路分为两个不同的类别:

1. Equal Links: Links providing equal access to the same RMB(s) at both endpoints, whereby all TCP connections associated with the links must have the same VLAN ID and have the same TCP server and TCP client roles or relationship.

1. 相等链接:在两个端点上提供对相同RMB的平等访问的链接,因此与链接关联的所有TCP连接必须具有相同的VLAN ID,并且具有相同的TCP服务器和TCP客户端角色或关系。

2. Unequal Links: Links providing access to unique, unrelated and isolated RMB(s) (i.e., for unique VLANs or unique and isolated application workloads, etc.) or having unique TCP server or client roles.

2. 不平等链路:提供对唯一、无关和隔离的RMB(即,针对唯一VLAN或唯一和隔离的应用程序工作负载等)的访问或具有唯一TCP服务器或客户端角色的链路。

Links that are logically grouped together forming an SMC-R link group must be equal links.

逻辑分组在一起形成SMC-R链路组的链路必须是相等的链路。
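
The equal-link rule above can be expressed as a comparison over the attributes that must match. The record `LinkTraits`, its fields, and the string identifiers for RMBs and hosts are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class LinkTraits:
    vlan_id: Optional[int]   # None when VLANs are not in use
    rmb_ids: FrozenSet[str]  # RMBs reachable over the link (illustrative names)
    tcp_server: str          # peer acting as TCP server
    tcp_client: str          # peer acting as TCP client


def are_equal_links(a: LinkTraits, b: LinkTraits) -> bool:
    """Equal links share the VLAN ID, the RMB set, and the server/client roles."""
    return (
        a.vlan_id == b.vlan_id
        and a.rmb_ids == b.rmb_ids
        and a.tcp_server == b.tcp_server
        and a.tcp_client == b.tcp_client
    )


def can_join_group(group: Iterable[LinkTraits], candidate: LinkTraits) -> bool:
    """A link may join a link group only if it is equal to every member."""
    return all(are_equal_links(member, candidate) for member in group)
```

A link with a different VLAN ID, or with the server and client roles reversed, is an unequal link under this check and would belong to a different link group.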

2.2.1. Link Group Types
2.2.1. 链接组类型

Equal links within a link group also have another "Link Group Type" attribute based on the link's associated underlying physical path. The following SMC-R link types are defined:

链接组中的相等链接还具有另一个基于链接的关联基础物理路径的“链接组类型”属性。定义了以下SMC-R链路类型:

1. Single link: the only active link within a link group

1. 单链接:链接组中唯一的活动链接

2. Parallel link: not allowed -- SMC-R links having the same physical RNIC at both hosts

2. 并行链路:不允许--SMC-R链路在两台主机上具有相同的物理RNIC

3. Asymmetric link: links that have unique RNIC adapters at one host but share a single adapter at the peer host

3. 非对称链路:在一台主机上具有唯一RNIC适配器但在对等主机上共享单个适配器的链路

4. Symmetric link: links that have unique RNIC adapters at both hosts

4. 对称链路:在两台主机上都有唯一RNIC适配器的链路

These link group types are further explained in the following figures and descriptions.

这些链接组类型在下面的图和描述中进一步解释。

Figure 2 above shows the single-link case. The single link illustrated in Figure 2 also establishes the SMC-R link group. A link group is intended to contain multiple links, but when only one RNIC is available at each host, only a single link can be created. This is expected to be a transient case.

上面的图2显示了单链路情况。图2中所示的单个链路也建立了SMC-R链路组。链路组应具有多个链路,但当两台主机上只有一个RNIC可用时,则只能创建一个链路。预计这将是一个暂时的情况。

Figure 5 shows the symmetric-link case. Both hosts have unique and redundant RNIC adapters. This configuration provides the full RoCE redundancy needed to reach the level of resilience that high availability for SMC-R requires. While this configuration is not required, it is a strongly recommended "best practice" for the exploitation of SMC-R. Single and asymmetric links must be supported but are intended to provide for short-term transient conditions -- for example, during a temporary outage or recycle of an RNIC.

图5显示了对称链路的情况。两台主机都有唯一的冗余RNIC适配器。此配置提供了SMC-R高可用性所需弹性级别的完整RoCE冗余。虽然此配置不是必需的,但它是使用SMC-R时强烈建议的“最佳实践”。必须支持单链路和非对称链路,但它们旨在应对短期的瞬态条件,例如RNIC的临时停机或重启期间。

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |                   |                      |                   |
     | Protection        |                      |    Protection     |
     | Domain X          |                      |    Domain Y       |
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2|  QP 64     |
     |RToken X|   |      |<-------------------->|      |   |        |
     |        |   |      |                      |      |   |RToken Y|
     |       \/   +------+                      +------+  \/        |
     |+--------+         |                      |        +--------+ |
     ||        |         |                      |        |        | |
     || RMB    |         |                      |        | RMB    | |
     ||        |         |                      |        |        | |
     |+--------+         |                      |        +--------+ |
     |       /\   +------+                      +------+  /\        |
     |RToken Z|   |      |     SMC-R Link 2     |      |   |RToken W|
     |        |   |RNIC 3|<-------------------->|RNIC 4|   |        |
     |       QP 9 |      |                      |      |  QP 65     |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+
        

Figure 5: Symmetric SMC-R Links

图5:对称SMC-R链路

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |                   |                      |                   |
     | Protection        |                      |    Protection     |
     | Domain X          |                      |    Domain Y       |
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2|  QP 64     |
     |RToken X|   |      |<-------------------->|      |   |        |
     |        |   |      |                   .->|      |   |RToken Y|
     |       \/   +------+                 .`   +------+  \/        |
     |+--------+         |               .`     |        +--------+ |
     ||        |         |             .`       |        |        | |
     || RMB    |         |           .`         |        | RMB    | |
     ||        |         |         .`SMC-R      |        |        | |
     |+--------+         |       .` Link 2      |        +--------+ |
     |       /\   +------+     .`               +------+            |
     |RToken Z|   |      |   .`                 |      |down or     |
     |        |   |RNIC 3|<-`                   |RNIC 4|unavailable |
     |       QP 9 |      |                      |      |            |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+
        

Figure 6: Asymmetric SMC-R Links

图6:不对称SMC-R链路

In the example provided by Figure 6, Host X has two RNICs but Host Y only has one RNIC because RNIC 4 is not available. This configuration allows for the creation of an asymmetric link. While an asymmetric link will provide some resilience (for example, when RNIC 1 fails), ideally each host should provide two redundant RNICs. This should be a transient case, and when RNIC 4 becomes available, this configuration must transition to a symmetric-link configuration. This transition is accomplished by first creating the new symmetric link and then deleting the asymmetric link with reason code "Asymmetric link no longer needed" specified in the DELETE LINK LLC message.

在图6提供的示例中,主机X有两个RNIC,但主机Y只有一个RNIC,因为RNIC 4不可用。此配置允许创建非对称链路。虽然非对称链路将提供一定的恢复能力(例如,当RNIC 1出现故障时),但理想情况下,每个主机应提供两个冗余RNIC。这应该是暂时的情况,当RNIC 4可用时,此配置必须转换为对称链路配置。通过首先创建新的对称链接,然后删除原因代码为“不再需要不对称链接”的非对称链接(在DELETE link LLC消息中指定),可以完成此转换。

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |                   |                      |                   |
     | Protection        |                      |    Protection     |
     | Domain X          |                      |    Domain Y       |
     |            +------+  SMC-R Link 1        +------+            |
     |       QP 8 |RNIC 1|<-------------------->|RNIC 2|  QP 64     |
     |RToken X|   |      |                      |      |   |        |
     |        |   |      |<-------------------->|      |   |RToken Y|
     |       \/   +------+  SMC-R Link 2        +------+  \/        |
     |+--------+   QP 9  |                      | QP 65  +--------+ |
     ||        |    |    |                      |  |     |        | |
     || RMB    |<-- +    |                      |  +---->| RMB    | |
     ||        |         |                      |        |        | |
     |+--------+         |                      |        +--------+ |
     |            +------+                      +------+            |
     |     down or|      |                      |      |down or     |
     | unavailable|RNIC 3|                      |RNIC 4|unavailable |
     |            |      |                      |      |            |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+
        

Figure 7: SMC-R Parallel Links (Not Supported)

图7:SMC-R并联链路(不支持)

Figure 7 shows parallel links, which are two links in the link group that use the same hardware. This configuration is not permitted. Because SMC-R multiplexes multiple TCP connections over an SMC-R link and both links are using the exact same hardware, there is no additional redundancy or capacity benefit obtained from this configuration. In addition to providing no real benefit, this configuration adds the unnecessary overhead of additional queue pairs, generation of additional RKeys, etc.

图7显示了并行链路,这是链路组中使用相同硬件的两个链路。不允许此配置。由于SMC-R在SMC-R链路上多路复用多个TCP连接,并且两个链路使用完全相同的硬件,因此此配置不会带来额外的冗余或容量优势。除了不提供任何实际好处之外,此配置还增加了额外队列对的不必要开销、生成额外的RKEY等。
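
The four link types described above can be told apart by comparing the RNICs used at each host. The following is a minimal sketch, not part of the architecture: it assumes a link can be modeled as a pair of adapter labels, and the names `Link` and `classify_new_link` are hypothetical.

```python
from typing import List, Tuple

# Assumption: a link is modeled as (local_rnic, peer_rnic) labels.
Link = Tuple[str, str]


def classify_new_link(existing: List[Link], candidate: Link) -> str:
    """Classify a candidate link against the links already in the group."""
    if not existing:
        return "single"  # the only active link in the link group
    shares_adapter = False
    for local, peer in existing:
        same_local = local == candidate[0]
        same_peer = peer == candidate[1]
        if same_local and same_peer:
            return "parallel"  # same RNIC at both hosts: not permitted
        if same_local or same_peer:
            shares_adapter = True  # unique at one host, shared at the other
    return "asymmetric" if shares_adapter else "symmetric"
```

With the adapters of Figures 5 through 7, a second link over RNIC 3 and RNIC 4 is classified as symmetric, one over RNIC 3 and RNIC 2 as asymmetric, and a second link over RNIC 1 and RNIC 2 as parallel.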

2.2.2. Maximum Number of Links in Link Group
2.2.2. 链接组中的最大链接数

The SMC-R protocol defines a maximum of eight symmetric SMC-R links within a single SMC-R link group. This allows up to eight unique physical paths between peer hosts. However, to meet the basic requirements for redundancy, support for at least two symmetric links must be implemented. Supporting more than two links also simplifies implementation for practical matters relating to dynamically adding and removing links -- for example, starting a third SMC-R link prior to taking down one of the two existing links. Recall that all links within a link group must have equal access to all associated RMBs.

SMC-R协议在单个SMC-R链路组中最多定义八个对称SMC-R链路。这允许在对等主机之间支持多达八条唯一的物理路径。然而,为了满足冗余的基本要求,必须实现对至少两个对称链路的支持。支持两个以上的链接还简化了与动态添加和删除链接相关的实际问题的实现——例如,在关闭两个现有链接之一之前启动第三个SMC-R链接。回想一下,链接组中的所有链接都必须具有对所有相关RMB的平等访问权。

The SMC-R protocol allows an implementation to assign an implementation-specific and appropriate value for maximum symmetric links. The implementation value must not exceed the architecture limit of 8; also, the value must not be lower than 2, because the SMC-R protocol requires redundancy. This does not mean that two RNICs are physically required to enable SMC-R connectivity, but at least two RNICs for redundancy are strongly recommended.

SMC-R协议允许实现为最大对称链路分配特定于实现的适当值。实现值不得超过架构限制8;此外,该值不得低于2,因为SMC-R协议需要冗余。这并不意味着需要两个RNIC才能实现SMC-R连接,但强烈建议至少两个RNIC用于冗余。

The SMC-R peers exchange their implementation maximum link values during the link group establishment using the defined maximum link value in the CONFIRM LINK LLC command. Once the initial exchange completes, the value is set for the life of the link group. The maximum link value can be provided by both the server and client. The server must supply a value, whereas the client maximum link value is optional. When the client does not supply a value, it indicates that the client accepts the server-supplied maximum value. If the client provides a value, it cannot exceed the server-supplied maximum value. If the client passes a lower value, this lower value then becomes the final negotiated maximum number of symmetric links for this link group. Again, the minimum value is 2.

在链路组建立期间,SMC-R对等方使用CONFIRM link LLC命令中定义的最大链路值交换它们的实现最大链路值。初始交换完成后,将为链接组的生命周期设置该值。服务器和客户端都可以提供最大链接值。服务器必须提供一个值,而客户端最大链接值是可选的。当客户端不提供值时,表示客户端接受服务器提供的最大值。如果客户端提供值,则不能超过服务器提供的最大值。如果客户端传递较低的值,则该较低的值将成为此链路组的最终协商最大对称链路数。同样,最小值是2。
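
The negotiation rule above reduces to a small function. This is an illustrative sketch only: the function name and the use of `None` for "client supplied no value" are assumptions, and the actual encoding in the CONFIRM LINK LLC message is not modeled here.

```python
from typing import Optional

ARCH_MAX_LINKS = 8  # architecture limit on symmetric links per link group
MIN_LINKS = 2       # lower bound, because the protocol requires redundancy


def negotiate_max_links(server_max: int, client_max: Optional[int] = None) -> int:
    """Return the negotiated maximum number of symmetric links."""
    if not MIN_LINKS <= server_max <= ARCH_MAX_LINKS:
        raise ValueError("server value must be between 2 and 8")
    if client_max is None:
        # No client value means the client accepts the server's maximum.
        return server_max
    if not MIN_LINKS <= client_max <= server_max:
        raise ValueError("client value must be between 2 and the server value")
    # A lower client value becomes the final negotiated maximum.
    return client_max
```

How a peer reacts to an out-of-range value is not covered by the text above; raising an error here is just one way to surface the violation.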

During run time, the client must never request that the server add a symmetric link to a link group that would exceed the negotiated maximum link value. Likewise, the server must never attempt to add a symmetric link to a link group that would exceed the negotiated maximum value.

在运行时期间,客户端决不能请求服务器向链路组添加超过协商的最大链路值的对称链路。同样,服务器决不能试图向链路组添加超过协商最大值的对称链路。

In terms of counting the number of active links within a link group, the initial (or only/last) link is always counted as 1. Then, as additional links are added, they are either symmetric or asymmetric links.

在计算链路组中活动链路的数量时,初始链路(或唯一/最后一个)始终计为1。然后,随着附加链接的添加,它们要么是对称链接,要么是非对称链接。

With regard to enforcing the maximum link rules, asymmetric links are an exception with their own set of rules:

关于强制执行最大链接规则,非对称链接是一个例外,具有一组唯一的规则:

o At most one asymmetric link is allowed per link group.

o 非对称链接始终限于每个链接组允许的一个非对称链接。

o Asymmetric links must not be counted in the maximum symmetric-link count calculation. When tracking the current count or enforcing the negotiated maximum number of links, an asymmetric link is not to be counted.

o 非对称链路不得计入最大对称链路计数计算中。跟踪当前计数或强制执行协商的最大链接数时,不计算非对称链接。
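
One way to track these counting rules is to keep the counted links and the asymmetric link in separate counters. The class below is a hedged sketch under that assumption; `LinkGroupCounter` and its methods are hypothetical names, not part of the protocol.

```python
class LinkGroupCounter:
    """Tracks link counts for one link group (illustrative only)."""

    def __init__(self, negotiated_max: int) -> None:
        self.negotiated_max = negotiated_max
        self.counted_links = 1     # the initial (or only/last) link counts as 1
        self.asymmetric_links = 0  # tracked separately, never counted

    def can_add(self, kind: str) -> bool:
        if kind == "symmetric":
            return self.counted_links < self.negotiated_max
        if kind == "asymmetric":
            return self.asymmetric_links == 0  # at most one per link group
        raise ValueError(f"unknown link kind: {kind}")

    def add(self, kind: str) -> None:
        if not self.can_add(kind):
            raise RuntimeError(f"adding a {kind} link would break the link rules")
        if kind == "symmetric":
            self.counted_links += 1
        else:
            self.asymmetric_links += 1
```

With a negotiated maximum of 2, adding an asymmetric link leaves `counted_links` at 1, so a symmetric link can still be added afterwards.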

2.2.3. Forming and Managing Link Groups
2.2.3. 建立和管理链接组

SMC-R link groups are self-defining. The first SMC-R link in a link group is created using TCP option flows on the TCP three-way handshake followed by CLC message flows over the TCP connection. Subsequent SMC-R links in the link group are created by sending LLC messages over an SMC-R link that already exists in the link group. Once an SMC-R link group is created, no additional SMC-R links in that group are created using TCP and CLC negotiation. Because subsequent SMC-R links are created exclusively by sending LLC messages over an existing SMC-R link in a link group, the membership of SMC-R links in a link group is self-defining.

SMC-R链路组是自定义的。链路组中的第一个SMC-R链路是使用TCP三方握手上的TCP选项流创建的,然后是TCP连接上的CLC消息流。链路组中的后续SMC-R链路是通过在链路组中已经存在的SMC-R链路上发送LLC消息来创建的。创建SMC-R链路组后,不会使用TCP和CLC协商在该组中创建其他SMC-R链路。由于后续SMC-R链路是通过在链路组中的现有SMC-R链路上发送LLC消息专门创建的,因此链路组中SMC-R链路的成员身份是自定义的。

This architecture does not define a specific identifier for an SMC-R link group. This identification may be useful for network management and may be assigned in a platform-specific manner, or in an extension to this architecture.

此体系结构未定义SMC-R链路组的特定标识符。该标识可用于网络管理,并可以特定于平台的方式分配,或在该体系结构的扩展中分配。

In each SMC-R link group, one peer is the server for all TCP connections and the other peer is the client. If there are additional TCP connections between the peers that use SMC-R and have the client and server roles reversed, another SMC-R link group is set up between them with the opposite client-server relationship.

在每个SMC-R链路组中,一个对等方是所有TCP连接的服务器,另一个对等方是客户端。如果使用SMC-R的对等方之间存在其他TCP连接,并且客户端和服务器角色颠倒,则会在它们之间设置另一个SMC-R链路组,并使用相反的客户端-服务器关系。

This is required because there are specific responsibilities divided between the client and server in the management of an SMC-R link group.

这是必需的,因为在SMC-R链路组的管理中,客户机和服务器之间有特定的职责分工。

In this architecture, the decision of whether to use an existing SMC-R link group or create a new SMC-R link group for a TCP connection is made exclusively by the server.

在此体系结构中,是使用现有的SMC-R链路组还是为TCP连接创建新的SMC-R链路组的决定完全由服务器决定。

Management of the links in an SMC-R link group is also a server responsibility. The server is responsible for adding and deleting links in a link group. The client may request that the server take certain actions, but the final responsibility is the server's.

管理SMC-R链路组中的链路也是服务器的责任。服务器负责在链接组中添加和删除链接。客户机可能会要求服务器采取某些操作,但最终的责任是服务器的。

2.2.4. SMC-R Link Identifiers
2.2.4. SMC-R链路标识符

This architecture defines multiple identifiers to identify SMC-R links and peers.

该体系结构定义了多个标识符来标识SMC-R链路和对等点。

o Link number: This is a 1-byte value that identifies an SMC-R link within a link group. Both the server and the client use this number to distinguish an SMC-R link from other links within the same link group. It is only unique within a link group. In order to prevent timing windows that may occur when a server creates a new link while the client is still cleaning up a previously existing link, link numbers cannot be reused until the entire link numbering space has been exhausted.

o 链路编号:这是一个1字节的值,用于标识链路组中的SMC-R链路。服务器和客户端都使用此编号来区分SMC-R链路与同一链路组中的其他链路。它仅在链接组中是唯一的。为了防止在客户端仍在清理以前存在的链接时,服务器创建新链接时出现计时窗口,在耗尽整个链接编号空间之前,不能重用链接编号。

o Link user ID: This is an architecturally opaque 4-byte value that a peer uses to uniquely define an SMC-R link within its own space. This means that a link user ID is unique within one peer only. Each peer defines its own link user ID for a link. The peers exchange this information once during link setup, and it is never used architecturally again. The purpose of this identifier is for network management, display, and debugging. For example, an operator on a client could give the operator on the server the server's link user ID when asking the server's operator to check on a link that the client is having trouble with.

o 链接用户ID:这是一个架构不透明的4字节值,对等方使用该值在其自己的空间内唯一定义SMC-R链接。这意味着链接用户ID仅在一个对等方中是唯一的。每个对等方为链接定义自己的链接用户ID。对等方在链接设置期间交换此信息一次,并且在体系结构上不再使用此信息。此标识符用于网络管理、显示和调试。例如,客户机上的操作员可以向服务器上的操作员提供服务器的链接用户ID,前提是他需要服务器的操作员检查客户机遇到问题的链接的操作。

o Peer ID: The SMC-R peer ID uniquely identifies a specific instance of a specific TCP/IP stack. It is required because in clustered and load-balancing environments, an IP address does not uniquely identify a TCP/IP stack. An RNIC's MAC/GID also doesn't uniquely or reliably identify a TCP/IP stack, because RNICs can go up and down and even be redeployed to other TCP/IP stacks in a multiple-partitioned or virtualized environment. The peer ID is not only unique per TCP/IP stack but is also unique per instance of a TCP/IP stack, meaning that if a TCP/IP stack is restarted, its peer ID changes.

o 对等ID:SMC-R对等ID唯一标识特定TCP/IP堆栈的特定实例。这是必需的,因为在集群和负载平衡环境中,IP地址不能唯一标识TCP/IP堆栈。RNIC的MAC/GID也不能唯一或可靠地标识TCP/IP堆栈,因为RNIC可以上下移动,甚至可以重新部署到多分区或虚拟化环境中的其他TCP/IP堆栈。对等ID不仅在每个TCP/IP堆栈中是唯一的,而且在每个TCP/IP堆栈实例中也是唯一的,这意味着如果重新启动TCP/IP堆栈,其对等ID将发生更改。
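
The rule that link numbers are not reused until the numbering space has been exhausted can be met with a wrapping counter. The sketch below assumes that all 256 values of the 1-byte field are usable, which the text above does not state; an implementation that reserves some values would need to skip them. The class name is hypothetical.

```python
class LinkNumberAllocator:
    """Hands out 1-byte link numbers, cycling through the whole space."""

    SPACE = 256  # assumption: every value of the 1-byte field is usable

    def __init__(self) -> None:
        self._next = 0
        self._in_use = set()

    def allocate(self) -> int:
        # Always move forward, so a freed number is only handed out again
        # after the counter has wrapped around the entire space.
        for _ in range(self.SPACE):
            number = self._next
            self._next = (self._next + 1) % self.SPACE
            if number not in self._in_use:
                self._in_use.add(number)
                return number
        raise RuntimeError("no free link number in this link group")

    def release(self, number: int) -> None:
        self._in_use.discard(number)
```

Because the counter only moves forward, a link number released while the client is still cleaning up the old link is not handed out again until every other value has been tried.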

2.3. SMC-R Resilience and Load Balancing
2.3. SMC-R弹性和负载平衡

The SMC-R multilink architecture provides resilience for network high availability via failover capability to an alternate RoCE adapter.

SMC-R多链路体系结构通过向备用RoCE适配器提供故障切换功能,为网络高可用性提供恢复能力。

The SMC-R multilink architecture does not define primary, secondary, or alternate roles for the links. Instead, there are multiple active links representing multiple redundant RoCE paths over the same LAN.

SMC-R多链路体系结构未定义链路的主要、次要或备用角色。相反,存在多个活动链路,表示同一LAN上的多个冗余RoCE路径。

Assignment of TCP connections to links is unidirectional and asymmetric. This means that the client and server may each choose a separate link for their RDMA writes associated with a specific TCP connection.

TCP连接到链路的分配是单向和不对称的。这意味着客户机和服务器可以各自为与特定TCP连接关联的RDMA写入选择单独的链接。

If a hardware failure or a QP failure associated with an individual link occurs, then the TCP connections that were associated with the failing link are dynamically and transparently switched to use another available link. The server or the client can detect a failure, immediately move their TCP connections, and then notify their peer via the DELETE LINK LLC command. While the client can notify the server of an apparent link failure with the DELETE LINK LLC command, the server performs the actual link deletion.

如果发生硬件故障或与单个链路相关联的QP故障,则与故障链路相关联的TCP连接将动态、透明地切换到使用另一个可用链路。服务器或客户端可以检测到故障,立即移动其TCP连接,然后通过DELETE-LINK-LLC命令通知其对等方。当客户端可以使用DELETE-link-LLC命令通知服务器明显的链路故障时,服务器执行实际的链路删除。

The movement of TCP connections to another link can be accomplished with minimal coordination between the peers. The TCP connection movement is also transparent to, and non-disruptive to, the TCP socket application workloads for most failure scenarios. After a failure, the surviving links and all associated hardware must handle the link group's workload.

TCP连接到另一个链路的移动可以通过对等方之间的最小协调来完成。对于大多数故障场景,TCP连接移动对TCP套接字应用程序工作负载是透明的,并且不会中断。发生故障后,幸存的链路和所有相关硬件必须处理链路组的工作负载。

As each SMC-R peer begins to move active TCP connections to another link, all current RDMA write operations must be allowed to complete. The moving peer then sends a signal to verify receipt of the last successful write by its peer. If this verification fails, the TCP connection must be reset. Once this verification is complete, all writes that failed may then be retried, in order, over the new link. Any data writes or CDC messages for which the sender did not receive write completion must be replayed before any subsequent data or CDC write operations are sent. LLC messages are not retried over the new link, because they are dependent on a known link configuration, which has just changed because of the failure. The initiator of an LLC message exchange that fails will be responsible for retrying once the link group configuration stabilizes.

当每个SMC-R对等机开始将活动TCP连接移动到另一个链路时,必须允许完成所有当前RDMA写入操作。然后,移动的对等方发送一个信号,以验证其对等方是否收到最后一次成功写入。如果验证失败,则必须重置TCP连接。验证完成后,所有失败的写入操作都可以通过新链接按顺序重试。在发送任何后续数据或CDC写入操作之前,必须重放发送方未收到写入完成的任何数据写入或CDC消息。LLC消息不会在新链路上重试,因为它们依赖于已知的链路配置,该配置刚刚因故障而更改。一旦链路组配置稳定,失败的LLC消息交换的发起人将负责重试。
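
The replay rules in the preceding paragraph can be summarized as a planning step that runs after the peer has switched links. This is a simplified sketch: the `Op` record, its fields, and `plan_failover` are hypothetical, and the signal used to verify the last successful write is reduced to a boolean.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Op:
    seq: int         # order in which the sender posted the operation
    kind: str        # "data", "cdc", or "llc"
    completed: bool  # True if the sender received a write completion


def plan_failover(ops: List[Op], last_write_verified: bool) -> Tuple[str, List[Op]]:
    """Decide what to do with a connection moved to a new link."""
    if not last_write_verified:
        # Receipt of the last successful write could not be verified.
        return ("reset", [])
    # Data writes and CDC messages without a completion are replayed in
    # order; LLC messages are left for their initiator to retry later.
    replay = sorted(
        (op for op in ops if not op.completed and op.kind in ("data", "cdc")),
        key=lambda op: op.seq,
    )
    return ("replay", replay)
```

Any new data or CDC writes for the connection would be queued behind the returned replay list, so that ordering on the new link matches the original posting order.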

When a new link becomes available and is re-added to the link group, each peer is free to rebalance its current TCP connections as needed or only assign new TCP connections to the newly added link. Both the server and client are free to manage TCP connections across the link group as needed. TCP connection movement does not have to be stimulated by a link failure.

当新链路可用并重新添加到链路组时,每个对等方都可以根据需要重新平衡其当前TCP连接,或者只向新添加的链路分配新的TCP连接。服务器和客户端都可以根据需要管理链路组中的TCP连接。TCP连接移动不必受到链路故障的刺激。

The SMC-R architecture also defines orderly versus disorderly failover. The type of failover is communicated in the LLC DELETE LINK command and is simply a means to indicate that the link has terminated (disorderly) or link termination is imminent (orderly). The orderly link deletion could be initiated via operator command or programmatically to bring down an idle link. For example, an operator command could initiate orderly shutdown of an adapter for service. Implementation of the two types is based on implementation requirements and is beyond the scope of the SMC-R architecture.

SMC-R体系结构还定义了有序故障切换与无序故障切换。故障转移类型在LLC DELETE LINK命令中进行通信,它只是一种表示链路已终止(无序)或链路即将终止(有序)的方法。有序链接删除可以通过操作员命令启动,也可以通过编程方式关闭空闲链接。例如,操作员命令可以启动适配器的有序关闭以进行服务。这两种类型的实现基于实现需求,超出了SMC-R体系结构的范围。

3. SMC-R Rendezvous Architecture
3. SMC-R交会结构

"Rendezvous" is the process that SMC-R-capable peers use to dynamically discover each other's capabilities, negotiate SMC-R connections, set up SMC-R links and link groups, and manage those link groups. A key aspect of SMC-R Rendezvous is that it occurs dynamically and automatically, without requiring SMC-R link configuration to be defined by an administrator.

“会合”是支持SMC-R的对等方用来动态发现彼此的能力、协商SMC-R连接、建立SMC-R链路和链路组以及管理这些链路组的过程。SMC-R会合的一个关键方面是,它是动态和自动进行的,无需管理员定义SMC-R链路配置。

SMC-R Rendezvous starts with the TCP/IP three-way handshake, during which connection peers use TCP options to announce their SMC-R capabilities. If both endpoints are SMC-R capable, then Connection Layer Control (CLC) messages are exchanged between the peers' SMC-R layers over the newly established TCP connection to negotiate SMC-R credentials. The CLC message mechanism is analogous to the messages exchanged by SSL for its handshake processing.

SMC-R会合从TCP/IP三方握手开始,在此期间,连接对等方使用TCP选项宣布其SMC-R功能。如果两个端点都支持SMC-R,则通过新建立的TCP连接在对等方的SMC-R层之间交换连接层控制(CLC)消息,以协商SMC-R凭据。CLC消息机制类似于SSL交换的用于握手处理的消息。

If a new SMC-R link is being set up, Link Layer Control (LLC) messages are used to confirm RDMA connectivity. LLC messages are also used by the SMC-R layers at each peer to manage the links and link groups.

如果正在建立新的SMC-R链路,则链路层控制(LLC)消息用于确认RDMA连接。LLC消息也被每个对等点的SMC-R层用来管理链路和链路组。

Once an SMC-R link is set up or agreed to by the peers, the TCP sockets are passed to the peer applications, which use them as normal. The SMC-R layer, which resides under the sockets layer, transmits the socket data between peers over RDMA using the SMC-R protocol, bypassing the TCP/IP stack.

一旦建立了SMC-R链路或得到了对等方的同意,TCP套接字就会传递给对等应用程序,这些应用程序将正常使用它们。SMC-R层位于套接字层之下,通过使用SMC-R协议,绕过TCP/IP堆栈,通过RDMA在对等方之间传输套接字数据。

3.1. TCP Options
3.1. TCP选项

During the TCP/IP three-way handshake, the client and server indicate their support for SMC-R by including experimental TCP option 254 on the three-way handshake flows, in accordance with [RFC6994] ("Shared Use of Experimental TCP Options"). The Experiment Identifier (ExID) value used is the string "SMCR" in EBCDIC (IBM-1047) encoding (0xE2D4C3D9). This ExID has been registered in the "TCP Experimental Option Experiment Identifiers (TCP ExIDs)" registry maintained by IANA.

在TCP/IP三向握手过程中,客户机和服务器根据[RFC6994](“实验TCP选项的共享使用”)在三向握手流上包括实验TCP选项254,以表明其对SMC-R的支持。使用的实验标识符(ExID)值是EBCDIC(IBM-1047)编码(0xE2D4C3D9)中的字符串“SMCR”。此ExID已在IANA维护的“TCP实验选项实验标识符(TCP ExID)”注册表中注册。

After completion of the three-way TCP handshake, each peer queries its peer's options. If both peers set the TCP option on the three-way handshake, inline SMC-R negotiation occurs using CLC messages. If neither peer, or only one peer, sets the TCP option, SMC-R cannot be used for the TCP connection, and the TCP connection completes the setup using the IP fabric.

完成三向TCP握手后,每个对等方都会查询其对等方的选项。如果两个对等方都在三方握手上设置了TCP选项,则使用CLC消息进行内联SMC-R协商。如果没有一个对等方或只有一个对等方设置TCP选项,则SMC-R不能用于TCP连接,TCP连接使用IP结构完成设置。
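
To make the option exchange concrete, the sketch below builds the experimental option and scans a raw TCP options field for it. It assumes the [RFC6994] layout of Kind, Length, then a 4-byte ExID with no further option data; the function names are hypothetical, and real stacks would do this inside their TCP option processing.

```python
TCP_OPT_EXPERIMENTAL = 254
SMCR_EXID = bytes.fromhex("E2D4C3D9")  # "SMCR" in EBCDIC, as given in the text


def build_smcr_option() -> bytes:
    # Kind (1 byte) + Length (1 byte) + ExID (4 bytes) = 6 bytes in total.
    return bytes([TCP_OPT_EXPERIMENTAL, 2 + len(SMCR_EXID)]) + SMCR_EXID


def has_smcr_option(options: bytes) -> bool:
    """Scan a raw TCP options field for option 254 carrying the SMC-R ExID."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:  # End of Option List
            break
        if kind == 1:  # No-Operation is a single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break      # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break      # malformed option
        if (kind == TCP_OPT_EXPERIMENTAL and length >= 6
                and options[i + 2:i + 6] == SMCR_EXID):
            return True
        i += length
    return False


def use_smcr(client_options: bytes, server_options: bytes) -> bool:
    # CLC negotiation only starts if both peers set the option.
    return has_smcr_option(client_options) and has_smcr_option(server_options)
```

When `use_smcr` is false, the connection simply continues over the IP fabric, as described above.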

3.2. Connection Layer Control (CLC) Messages
3.2. 连接层控制(CLC)消息

CLC messages are sent as data payload over the IP network using the TCP connection between SMC-R layers at the peers. They are analogous to the messages used to exchange parameters for SSL.

CLC消息作为数据有效负载通过IP网络发送,使用对等点SMC-R层之间的TCP连接。它们类似于用于为SSL交换参数的消息。

The use of CLC messages is detailed in the following sections. The following list provides a summary of the defined CLC messages and their purposes:

以下各节详细介绍了CLC消息的使用。以下列表提供了已定义CLC消息及其用途的摘要:

o SMC Proposal: Sent from the client to propose that this TCP connection is eligible to be moved to SMC-R. The client identifies itself and its subnet to the server and passes the SMC-R elements for a suggested RoCE path via the MAC and GID.

o SMC建议:从客户端发送,建议此TCP连接有资格移动到SMC-R。客户端将自己及其子网标识到服务器,并通过MAC和GID将SMC-R元素传递给建议的RoCE路径。

o SMC Accept: Sent from the server to accept the client's TCP connection SMC Proposal. The server responds to the client's proposal by identifying itself to the client and passing the elements of a RoCE path that the client can use to perform RDMA writes to the server. This consists of such SMC-R link elements as RoCE MAC, GID, and RMB information.

o SMC Accept:从服务器发送以接受客户端的TCP连接SMC建议。服务器通过向客户机标识自己并传递客户机可用于执行RDMA写入的RoCE路径元素来响应客户机的建议。这包括诸如RoCE MAC、GID和RMB信息等SMC-R链路元素。

o SMC Confirm: Sent from the client to confirm the server's acceptance of the SMC connection. The client responds to the server's acceptance by passing the elements of a RoCE path that the server can use to perform RDMA writes to the client. This consists of such SMC-R link elements as RoCE MAC, GID, and RMB information.

o SMC确认:从客户端发送,以确认服务器是否接受SMC连接。客户机通过传递RoCE路径的元素来响应服务器的接受,服务器可以使用RoCE路径向客户机执行RDMA写入。这包括诸如RoCE MAC、GID和RMB信息等SMC-R链路元素。

o SMC Decline: Sent from either the server or the client to reject the SMC connection, indicating the reason the peer must decline the SMC Proposal and allowing the TCP connection to revert to IP connectivity.

o SMC拒绝:从服务器或客户端发送以拒绝SMC连接,表明对等方必须拒绝SMC建议的原因,并允许TCP连接恢复到IP连接。
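
The ordering implied by the list above (Proposal from the client, Accept from the server, Confirm from the client, with Decline possible from either side) can be checked with a few lines of code. This is only a sketch of the message order; the tuple representation and the outcome labels are assumptions, and message contents are not modeled.

```python
from typing import List, Tuple

# Expected (sender, message) order for a successful CLC exchange.
EXPECTED = [
    ("client", "SMC Proposal"),
    ("server", "SMC Accept"),
    ("client", "SMC Confirm"),
]


def clc_outcome(messages: List[Tuple[str, str]]) -> str:
    """Return "smc-r", "ip", or "incomplete" for a sequence of CLC messages."""
    step = 0
    for sender, message in messages:
        if message == "SMC Decline":
            return "ip"  # either peer may decline; the connection stays on IP
        if step >= len(EXPECTED) or (sender, message) != EXPECTED[step]:
            raise ValueError(f"unexpected CLC message: {sender} {message}")
        step += 1
    return "smc-r" if step == len(EXPECTED) else "incomplete"
```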

3.3. LLC Messages
3.3. LLC消息

Link Layer Control (LLC) messages are sent between peer SMC-R layers over an SMC-R link to manage the link or the link group. LLC messages are sent using RoCE SendMsg and are 44 bytes long. The 44-byte size is based on what can fit into a RoCE Work Queue Element (WQE) without requiring the posting of receive buffers.

链路层控制(LLC)消息通过SMC-R链路在对等SMC-R层之间发送,以管理链路或链路组。LLC消息使用RoCE SendMsg发送,长度为44字节。44字节的大小基于什么可以放入RoCE工作队列元素(WQE),而不需要发送接收缓冲区。

LLC messages generally follow a request-reply semantic. Each message has a request flavor and a reply flavor, and each request must be confirmed with a reply, except where otherwise noted. The use of LLC messages is detailed in the following sections. The following list provides a summary of the defined LLC messages and their purposes:

LLC消息通常遵循请求-应答语义。每个消息都有一个请求风格和一个回复风格,除非另有说明,否则每个请求都必须通过回复进行确认。以下章节详细介绍了LLC消息的使用。以下列表提供了已定义LLC消息及其用途的摘要:

o ADD LINK: Used to add a new link to a link group. Sent from the server to the client to initiate addition of a new link to the link group, or from the client to the server to request that the server initiate addition of a new link.

o 添加链接:用于向链接组添加新链接。从服务器发送到客户端以启动向链接组添加新链接,或从客户端发送到服务器以请求服务器启动添加新链接。

o ADD LINK CONTINUATION: A continuation of ADD LINK that allows the ADD LINK to span multiple commands, because not all of the link information can be contained in a single ADD LINK message.

o 添加链接延续:添加链接的延续,允许添加链接跨越多个命令,因为所有链接信息不能包含在单个添加链接消息中。

o CONFIRM LINK: Used to confirm that RoCE connectivity over a newly created SMC-R link is working correctly. Initiated by the server. Both this message and its reply must flow over the SMC-R link being confirmed.

o 确认链路:用于确认新创建的SMC-R链路上的RoCE连接是否正常工作。由服务器启动。此消息及其回复必须通过正在确认的SMC-R链路。

o DELETE LINK: When initiated by the server, deletes a specific link from the link group or deletes the entire link group. When initiated by the client, requests that the server delete a specific link or the entire link group.

o CONFIRM RKEY: Informs the peer on the SMC-R link of the addition of an RMB to the link group.

o CONFIRM RKEY CONTINUATION: A continuation of CONFIRM RKEY that allows the CONFIRM RKEY to span multiple commands, in the event that all of the information cannot be contained in a single CONFIRM RKEY message.

o DELETE RKEY: Informs the peer on the SMC-R link of the deletion of one or more RMBs from the link group.

o TEST LINK: Verifies that an already-active SMC-R link is active and healthy.

o Optional LLC message: Any LLC message in which the two high-order bits of the opcode are b'10'. This optional message must be silently discarded by a receiving peer that does not support the opcode. No such messages are defined in this version of the architecture; however, the concept is defined to allow for toleration of possible advanced, optional functions.
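The optional-message rule in the list above can be sketched as follows. This is a minimal illustration rather than part of the specification: the function names and the `supported_opcodes` set are hypothetical, and the opcode is assumed to occupy a single byte.

```python
LLC_MESSAGE_LENGTH = 44  # every LLC message is 44 bytes long


def is_optional_llc_opcode(opcode: int) -> bool:
    """Return True when the two high-order bits of a 1-byte opcode are b'10'."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in one byte")
    return (opcode >> 6) == 0b10


def disposition(opcode: int, supported_opcodes: set) -> str:
    """Classify a received LLC opcode (hypothetical helper)."""
    if opcode in supported_opcodes:
        return "process"
    if is_optional_llc_opcode(opcode):
        return "discard"      # unsupported optional message: silently discarded
    return "unsupported"      # this case is not covered by the list above
```

With this check, any opcode in the range 0x80 through 0xBF is treated as optional.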

CONFIRM LINK and TEST LINK are sensitive to which link they flow on and must flow on the link being confirmed or tested. The other flows may flow over any active link in the link group. When there are multiple links in a link group, a response to an LLC message must flow over the same link that the original message flowed over, with the following exceptions:

o ADD LINK request from a server in response to an ADD LINK from a client.

o DELETE LINK request from a server in response to a DELETE LINK from a client.
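The link-selection rules above can be summarized in a short sketch. The function name, its parameters, and the use of plain integers as link identifiers are illustrative assumptions, not part of the protocol definition.

```python
LINK_SENSITIVE = {"CONFIRM LINK", "TEST LINK"}
# Requests a server sends in answer to a client's request of the same type;
# these are the two listed exceptions to the same-link rule.
SAME_LINK_EXCEPTIONS = {"ADD LINK", "DELETE LINK"}


def eligible_links(msg_type, active_links, target_link=None,
                   request_link=None, server_answering_client=False):
    """Return the links on which an LLC message may flow.

    target_link: the link being confirmed or tested, if any.
    request_link: the link the original message flowed over, if this
        message is sent in response to another LLC message.
    server_answering_client: True when a server sends an ADD LINK or
        DELETE LINK request in response to the same message from a client.
    """
    if msg_type in LINK_SENSITIVE:
        return [target_link]
    is_exception = msg_type in SAME_LINK_EXCEPTIONS and server_answering_client
    if request_link is not None and not is_exception:
        return [request_link]
    return list(active_links)
```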

3.4. CDC Messages

Connection Data Control (CDC) messages are sent over the RoCE fabric between peers using RoCE SendMsg and are 44 bytes long. The 44-byte size is based on the size that can fit into a RoCE WQE without requiring the posting of receive buffers. CDC messages are used to describe the socket application data passed via RDMA write operations, as well as TCP connection state information, including producer cursors and consumer cursors, RMBE state information, and failover data validation.

3.5. Rendezvous Flows

Rendezvous information for SMC-R is exchanged as TCP options on the TCP three-way handshake flows to indicate capability, followed by inline TCP negotiation messages to actually do the SMC-R setup. Formats of all rendezvous options and messages discussed in this section are detailed in Appendix A.

3.5.1. First Contact

First contact between RoCE peers occurs when a new SMC-R link group is being set up. This could be because no SMC-R links exist yet between the peers or because the server decides to create a new SMC-R link group in parallel with an existing one.

3.5.1.1. Pre-negotiation of TCP Options

The client and server indicate their SMC-R capability to each other using TCP option 254 on the TCP three-way handshake flows.

A client who wishes to do SMC-R will include TCP option 254 using an ExID equal to the EBCDIC (codepage IBM-1047) encoding of "SMCR" on its SYN flow.

A server that supports SMC-R will include TCP option 254 with the ExID value of EBCDIC "SMCR" on its SYN-ACK flow. Because the server is listening for connections and does not know where client connections will come from, the server implementation may choose to unconditionally include this TCP option if it supports SMC-R. This may be required for server implementations where extensions to the TCP stack are not practical. For server implementations that can add code to examine and react to packets during the three-way handshake, the server should only include the SMC-R TCP option on the SYN-ACK if the client included it on its SYN packet.
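As an illustration of the option described above, the following sketch builds and detects TCP option 254 carrying the "SMCR" ExID. It assumes the common kind/length/ExID layout for experimental TCP options with a 4-byte ExID; the authoritative format is the one given in Appendix A. The function names are hypothetical. Python's standard library does not ship an IBM-1047 codec, so cp037 is used as a stand-in; the two code pages agree on the uppercase Latin letters.

```python
TCP_OPT_EXPERIMENTAL = 254
# Uppercase letters have the same code points in cp037 and IBM-1047,
# so cp037 (shipped with Python) stands in for IBM-1047 here.
SMCR_EXID = "SMCR".encode("cp037")


def build_smcr_option() -> bytes:
    """Kind, total length, then the 4-byte ExID (assumed layout)."""
    return bytes([TCP_OPT_EXPERIMENTAL, 2 + len(SMCR_EXID)]) + SMCR_EXID


def has_smcr_option(options: bytes) -> bool:
    """Scan a raw TCP options field for option 254 carrying the SMCR ExID."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:              # End of Option List
            break
        if kind == 1:              # No-Operation, a single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                  # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                  # malformed option
        if kind == TCP_OPT_EXPERIMENTAL and options[i + 2:i + length] == SMCR_EXID:
            return True
        i += length
    return False
```

A server that can inspect the SYN could call `has_smcr_option` on the client's options and add `build_smcr_option()` to its SYN-ACK only when the result is true.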

A client who supports SMC-R and meets the three conditions outlined above may optionally include the TCP option for SMC-R on its ACK flow, regardless of whether or not the server included it on its SYN-ACK flow. Some TCP/IP stacks may have to include it if the SMC-R layer cannot modify the options on the socket until the three-way handshake completes. Proprietary servers should not include this option on the ACK flow, since including it on the SYN flow was sufficient to indicate the client's capabilities.

Once the initial three-way TCP handshake is completed, each peer examines the socket options. SMC-R implementations may do this by examining what was actually provided on the SYN and SYN-ACK packets or by performing a getsockopt() operation to determine the options sent by the peer. If neither peer, or only one peer, specified the TCP option for SMC-R, then SMC-R cannot be used on this connection and it proceeds using normal IP flows and processing.

If both peers specified the TCP option for SMC-R, then the TCP connection is not started yet and the peers proceed to SMC-R negotiation using inline data flows. The socket is not yet turned over to the applications; instead, the respective SMC layers exchange CLC messages over the newly formed TCP connection.
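The decision made once the three-way handshake completes can be sketched as below. The function and parameter names are hypothetical, and treating the option on either the client's SYN or its ACK as "specified by the client" is an assumption of this sketch.

```python
def select_path(client_syn: bool, client_ack: bool, server_syn_ack: bool) -> str:
    """Pick the next step after the three-way handshake has completed."""
    client_specified = client_syn or client_ack
    if client_specified and server_syn_ack:
        return "smc-r negotiation"   # CLC messages flow inline over TCP
    return "tcp/ip"                  # neither peer, or only one, sent the option
```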

3.5.1.2. Client Proposal

If SMC-R is supported by both peers, the client sends an SMC Proposal CLC message to the server. It is not immediately apparent on this flow from client to server whether this is a new or existing SMC-R link, because in clustered environments a single IP address may represent multiple hosts. This type of cluster virtual IP address can be owned by a network-based or host-based Layer 4 load balancer that distributes incoming TCP connections across a cluster of servers/hosts. For purposes of high availability, other clustered environments may also support the movement of a virtual IP address dynamically from one host in the cluster to another. In summary, the client cannot predetermine that a connection is targeting the same host by simply matching the destination IP address for outgoing TCP connections. Therefore, it cannot predetermine the SMC-R link that will be used for a new TCP connection. This information will be dynamically learned, and the appropriate actions will be taken as the SMC-R negotiation handshake unfolds.

In the SMC-R proposal message, the initiator (client) proposes the use of SMC-R by including its peer ID, GID, and MAC addresses, as well as the IP subnet number of the outgoing interface (if IPv4) or the IP prefix list for the network over which the proposal is sent (if IPv6). At this point in the flow, the client makes no local commitments of resources for SMC-R.

When the server receives the SMC Proposal CLC message, it uses the peer ID provided by the client, plus subnet or prefix information provided by the client, to determine if it already has a usable SMC-R link with this SMC-R peer. If there are one or more existing SMC-R links with this SMC-R peer, the server then decides which SMC-R link it will use for this TCP connection. See Sections 3.5.2 and 3.5.3 for the cases of reusing an existing SMC-R link or creating a parallel SMC-R link group between SMC-R peers.

If this is a first contact between SMC-R peers, the server must validate that it is on the same LAN as the client before continuing. For IPv4, the server does this by verifying that it has an interface with an IP subnet number that matches the subnet number sent by the client in the SMC Proposal. For IPv6, it does this by verifying that it is directly attached to at least one IP prefix that was listed by the client in its SMC Proposal message.
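The same-LAN check described above might look like the following sketch, which uses Python's `ipaddress` module. Representing the client's subnet number and prefixes as CIDR strings, and the server's interfaces as "address/prefix-length" strings, is an assumption made for illustration; the function names are hypothetical.

```python
import ipaddress


def same_lan_ipv4(server_interfaces, client_subnet) -> bool:
    """True if some server interface sits in the subnet the client proposed."""
    subnet = ipaddress.ip_network(client_subnet)
    return any(ipaddress.ip_interface(i).network == subnet
               for i in server_interfaces)


def same_lan_ipv6(server_interfaces, client_prefixes) -> bool:
    """True if the server is directly attached to at least one listed prefix."""
    prefixes = {ipaddress.ip_network(p) for p in client_prefixes}
    return any(ipaddress.ip_interface(i).network in prefixes
               for i in server_interfaces)
```

For example, `same_lan_ipv4(["192.0.2.10/24"], "192.0.2.0/24")` succeeds, while a server whose only interface is in 198.51.100.0/24 fails the check.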

If the server agrees to use SMC-R, the server begins the setup of a new SMC-R link by allocating local QP and RMB resources (setting its QP state to INIT) and providing its full SMC-R information in an SMC Accept CLC message to the client over the TCP connection, along with a flag set indicating that this is a first contact flow. While the SMC Accept message could flow over any IP route back to the client depending upon Layer 3 IP routing, the SMC-R credentials provided must be for the common subnet or prefix between the server and client, as determined above. If the server cannot or does not want to do SMC-R with the client, it sends an SMC Decline CLC message to the client, and the connection data may begin flowing using normal TCP/IP flows.

3.5.1.3. Server Acceptance

When the client receives the SMC Accept from the server, it determines whether this is a new or existing SMC-R link, using the combination of the following: the first contact flag, its MAC/GID and the MAC/GID returned by the server, the VLAN over which the connection is setting up, and the QP number provided by the server.

If it is an existing SMC-R link and the client agrees to use that link for the TCP connection, see Section 3.5.2 ("Subsequent Contact") below. If it is a new SMC-R link between peers that already have an SMC-R link, then the server is starting a new SMC-R link group.
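One way to picture the client's new-versus-existing determination is a lookup keyed on the values named above. The `LinkKey` type, the `known_links` set, and the "no match" outcome are hypothetical constructs for this sketch; the text above does not say how an implementation stores its links or handles a mismatch.

```python
from typing import NamedTuple, Optional


class LinkKey(NamedTuple):
    local_mac: str
    local_gid: str
    peer_mac: str
    peer_gid: str
    vlan: Optional[int]
    peer_qp: int


def classify_accept(first_contact: bool, key: LinkKey, known_links: set) -> str:
    """Classify the link described by an SMC Accept (illustrative only)."""
    if first_contact:
        # Either the very first link with this peer or the start of a
        # new link group in parallel with an existing one.
        return "new link"
    return "existing link" if key in known_links else "no match"
```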

Assuming that either (1) this is a first contact between peers or (2) the server is starting a new SMC-R link group, the client now allocates local QP and RMB resources for the SMC-R link (setting the QP state to RTR (ready to receive)), associates them with the server QP as learned from the SMC Accept CLC message, and sends an SMC Confirm CLC message to the server over the TCP connection with its SMC-R link information included. The client also starts a timer to wait for the server to confirm the reliably connected queue pair, as described below.

3.5.1.4. Client Confirmation

Upon receipt of the client's SMC Confirm CLC message, the server associates its QP for this SMC-R link with the client's QP as learned from the SMC Confirm CLC message and sets its QP state to RTS (ready to send). The client and the server now have reliably connected queue pairs.

3.5.1.5. Link (QP) Confirmation

Since setting up the SMC-R link and its QPs did not require any network flows on the RoCE fabric, the client and server must now confirm connectivity over the RoCE fabric. To accomplish this, the server will send a CONFIRM LINK Link Layer Control (LLC) message to the client over the newly created SMC-R link, using the RoCE fabric. The CONFIRM LINK LLC message will provide the server's MAC, GID, and QP information for the connection, allow each partner to communicate the maximum number of links it can tolerate in this link group (the "link limit"), and will additionally provide two link IDs:

o a 1-byte server-assigned link number that is used by both peers to identify the link within the link group and is only unique within a link group.

o a 4-byte link user ID. This opaque value is assigned by the server for the server's local use and is provided to the client for management purposes -- for example, to use in network management displays and products.

When the server sends this message, it will set a timer for receiving confirmation from the client.

When the client receives the server's confirmation in the form of a CONFIRM LINK LLC message, it will cancel the confirmation timer it set when it sent the SMC Confirm message. The client will also advance its QP state to RTS and respond over the RoCE fabric with a CONFIRM LINK response LLC message that (1) provides its MAC, GID, QP number, and link limit, (2) confirms the 1-byte link number sent by the server, and (3) provides its own 4-byte link user ID to the server.
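The QP state changes described in Sections 3.5.1.2 through 3.5.1.5 can be replayed in a small sketch. Only the states named in the text (INIT, RTR, RTS) appear; the function name and the trace format are illustrative assumptions.

```python
def first_contact_qp_states():
    """Replay the QP state changes of a first-contact rendezvous."""
    qp = {"server": None, "client": None}
    trace = []

    def step(event, side, state):
        qp[side] = state
        trace.append((event, qp["server"], qp["client"]))

    step("server sends SMC Accept", "server", "INIT")
    step("client sends SMC Confirm", "client", "RTR")
    step("server receives SMC Confirm", "server", "RTS")
    step("client receives CONFIRM LINK", "client", "RTS")
    return trace
```

Each trace entry records the triggering event followed by the server and client QP states after that event.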

       Host X -- Server                           Host Y -- Client
    +-------------------+                      +-------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1   |
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|                      |RNIC 2|  QP 64     |
    |RToken X|   |MAC MA|                      |MAC MB|   |        |
    |        |   |GID GA|                      |GID GB|   |RToken Y|
    |       \/   +------+      (Subnet S1)     +------+  \/        |
    |+--------+         |                      |        +--------+ |
    || RMB    |         |                      |        | RMB    | |
    |+--------+         |                      |        +--------+ |
    |            +------+                      +------+            |
    |            |RNIC 3|                      |RNIC 4|            |
    |            |MAC MC|                      |MAC MD|            |
    |            |GID GC|                      |GID GD|            |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+
        
                     SYN TCP options(254,"SMCR")
        <---------------------------------------------------------
        
                     SYN-ACK TCP options(254,"SMCR")
        --------------------------------------------------------->
        
                     ACK [TCP options(254,"SMCR")]
        <--------------------------------------------------------
        
                    SMC Proposal(PC1,MB,GB,S1)
        <--------------------------------------------------------
        
    SMC Accept(PS1,first contact,MA,GA,MTU,QP8,RToken=X,RMB elem index)
        --------------------------------------------------------->
        
         SMC Confirm(PC1,MB,GB,MTU,QP64,RToken=Y,RMB element index)
         <--------------------------------------------------------
        
       CONFIRM LINK(MA,GA,QP8, link lim, server link user ID, linknum)
        .........................................................>
        
    CONFIRM LINK rsp(MB,GB,QP64, link lim, client link user ID, linknum)
        <........................................................
        
                           Legend:
                    ------------   TCP/IP and CLC flows
                    ............   RoCE (LLC) flows
           Square brackets ("[ ]") indicate optional information
        

Figure 8: First Contact Rendezvous Flows

Technically, the data for the TCP connection could now flow over the RoCE path. However, if this is a first contact, there is no alternate for this recently established RoCE path. Because the current architecture provides no failover from RoCE to IP once connection data starts flowing, a failure of this path would disrupt the TCP connection, so the level of redundancy and failover would be less than that provided by IP. Even if the network has alternate RoCE paths available, they would not be usable at this point. This situation would be unacceptable.

3.5.1.6. Second SMC-R Link Setup

Because of the unacceptable situation described above, TCP data will not be allowed to flow on the newly established SMC-R link until a second path has been set up, or at least attempted.

If the server has a second RNIC available on the same LAN, it attempts to set up the second SMC-R link over that second RNIC. If it only has one RNIC available on the LAN, it will attempt to set up the second SMC-R link over that one RNIC. In the latter case, the server is attempting to set up an asymmetric link, in case the client does have a second RNIC on the LAN.

In either case, the server allocates a new QP over the RNIC it is attempting to use for the second link and assigns a link number to the new link; the server also creates an RToken for the RMB over this second QP (note that this means that the first and second QP each have their own RToken to represent the same RMB). The server provides this information, as well as the MAC and GID of the RNIC over which it is attempting to set up the second link, in an ADD LINK LLC message that it sends to the client over the SMC-R link that is already set up.

3.5.1.6.1. Client Processing of ADD LINK LLC Message from Server

When the client receives the server's ADD LINK LLC message, it examines the GID and MAC provided by the server to determine whether the server is attempting to use the same server-side RNIC as the existing SMC-R link or a different one.

If the server is attempting to use the same server-side RNIC as the existing SMC-R link, then the client verifies that it has a second RNIC on the same LAN. If it does not, the client rejects the ADD LINK request from the server, because the resulting link would be a parallel link, which is not supported within a link group. If the client does have a second RNIC on the same LAN, it accepts the request, and an asymmetric link will be set up.

If the server is using a different server-side RNIC from the existing SMC-R link, then the client will accept the request and a second SMC-R link will be set up in this SMC-R link group. If the client has a second RNIC on the same LAN, that second RNIC will be used for the second SMC-R link, creating symmetric links. If the client does not have a second RNIC on the same LAN, it will use the same RNIC as was used for the initial SMC-R link, resulting in the setup of an asymmetric link in the SMC-R link group.

In either case, when the client accepts the server's ADD LINK request, it allocates a new QP on the chosen RNIC and creates an RKey over that new QP for the client-side RMB for the SMC-R link group, then sends an ADD LINK reply LLC message to the server providing that information as well as echoing the link number that was sent by the server.

If the client rejects the server's ADD LINK request, it sends an ADD LINK reply LLC message to the server with the reason code for the rejection.
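The client's handling of an ADD LINK request, as described in this section, reduces to a small decision table. The sketch below is illustrative; the function name, parameter names, and return labels are hypothetical.

```python
def process_add_link(server_rnic_same_as_existing: bool,
                     client_has_second_rnic: bool) -> str:
    """Decide how the client answers the server's ADD LINK request."""
    if server_rnic_same_as_existing:
        if not client_has_second_rnic:
            # Both ends would reuse their RNICs: a parallel link, which
            # is not supported within a link group.
            return "reject"
        return "asymmetric"   # server reuses its RNIC, client uses a second one
    if client_has_second_rnic:
        return "symmetric"    # both ends use a second RNIC
    return "asymmetric"       # client reuses the RNIC of the initial link
```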

3.5.1.6.2. Server Processing of ADD LINK Reply LLC Message from Client

If the client sends a negative response to the server or no reply is received, the server frees the RoCE resources it had allocated for the new link. Having a single link in an SMC-R link group is undesirable. The server's recovery is detailed in Appendix C.8 ("Failure to Add Second SMC-R Link to a Link Group").

If the client sends a positive reply to the server with MAC/GID/QP/RKey information, the server associates its QP for the new SMC-R link to the QP that the client provided. Now, the new SMC-R link is in the same situation that the first was in after the client sent its ACK packet -- there is a reliably connected queue pair over the new RoCE path, but there have been no RoCE flows to confirm that it's actually usable. So, at this point, the client and server will exchange CONFIRM LINK LLC messages just like they did on the first SMC-R link.

If either peer receives a failure during this second CONFIRM LINK LLC exchange (either an immediate failure -- which implies that the message did not reach the partner -- or a timeout), it sends a DELETE LINK LLC message to the partner over the first (and now only) link in the link group. This DELETE LINK LLC message must be acknowledged before data can flow on the single link in the link group.

       Host X -- Server                           Host Y -- Client
    +-------------------+                      +-------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1   |
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|      SMC-R Link 1    |RNIC 2|  QP 64     |
    |RToken X|   |MAC MA|<-------------------->|MAC MB|   |        |
    |        |   |GID GA|                      |GID GB|   |RToken Y|
    |       \/   +------+                      +------+  \/        |
    |+--------+         |                      |        +--------+ |
    ||        |         |                      |        |        | |
    || RMB    |         |                      |        | RMB    | |
    ||        |         |                      |        |        | |
    |+--------+         |                      |        +--------+ |
    |       /\   +------+                      +------+  /\        |
    |        |   |RNIC 3|      SMC-R Link 2    |RNIC 4|  |         |
    |RToken Z|   |MAC MC|<-------------------->|MAC MD|  |RToken W |
    |       QP 9 |GID GC|      (being added)   |GID GD| QP 65      |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+
        

           First SMC-R link setup as shown in Figure 8
            <-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.->

            ADD LINK request(QP9,MC,GC, link number = 2)
            ............................................>
        
            ADD LINK response(QP65,MD,GD, link number = 2)
            <............................................
        
            ADD LINK CONTINUATION request(RToken=Z)
            ............................................>
        
           ADD LINK CONTINUATION response(RToken=W)
            <............................................
        
         CONFIRM LINK(MC,GC,QP9, link number = 2, link user ID)
            .............................................>
        
      CONFIRM LINK response(MD,GD,QP65, link number = 2, link user ID)
            <.............................................
        
                          Legend:
                   ------------   TCP/IP and CLC flows
                   ............   RoCE (LLC) flows
        

Figure 9: First Contact, Second Link Setup

3.5.1.6.3. Exchange of RKeys on Second SMC-R Link

Note that in the scenario described here -- first contact -- there is only one RMB RKey to exchange on the second SMC-R link, and it is exchanged in the ADD LINK CONTINUATION request and reply. In scenarios other than first contact -- for example, adding a new SMC-R link to a longstanding link group with multiple RMBs -- additional flows will be required to exchange additional RMB RKeys. See Section 3.5.5.2.3 ("Adding a New SMC-R Link to a Link Group with Multiple RMBs") for more details on these flows.

3.5.1.6.4. Aborting SMC-R and Falling Back to IP

If either partner does not provide the SMC-R TCP option during the three-way TCP handshake, the connection falls back to normal TCP/IP. During the SMC-R negotiation that occurs after the three-way TCP handshake, either partner may break off SMC-R by sending an SMC Decline CLC message. The SMC Decline CLC message may be sent in place of any expected message and may also be sent during the CONFIRM LINK LLC exchange if there is a failure before any application data has flowed over the RoCE fabric. For more details on exactly when an SMC Decline can flow during link group setup, see Appendices C.1 ("SMC Decline during CLC Negotiation") and C.2 ("SMC Decline during LLC Negotiation").

如果在三向TCP握手过程中,两个合作伙伴都不提供SMC-R TCP选项,则连接会退回到正常的TCP/IP。在三方TCP握手后发生的SMC-R协商期间,任何一方均可通过发送SMC拒绝CLC消息中断SMC-R。SMC拒绝CLC消息可以代替任何预期消息发送,并且如果在任何应用程序数据流过RoCE结构之前出现故障,也可以在CONFIRM LINK LLC交换期间发送。有关链路组设置期间SMC拒绝的确切时间,请参见附录C.1(“CLC谈判期间的SMC拒绝”)和C.2(“LLC谈判期间的SMC拒绝”)。
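
The fallback rule above can be summarized as a small decision function. This is only an illustrative sketch: the function name, its parameters, and the returned strings are hypothetical and are not defined by this document.

```python
def negotiate_transport(client_offers_smcr: bool,
                        server_offers_smcr: bool,
                        decline_received: bool = False) -> str:
    """Return the transport a TCP connection ends up using (sketch only)."""
    # The SMC-R TCP option must be present from both partners during the
    # three-way handshake; otherwise the connection is plain TCP/IP.
    if not (client_offers_smcr and server_offers_smcr):
        return "tcp/ip"
    # Either partner may send an SMC Decline in place of any expected message
    # during the negotiation that follows the handshake.
    if decline_received:
        return "tcp/ip"
    return "smc-r"
```

Note that a fallback leaves the TCP connection itself intact; only the use of the RoCE fabric is abandoned.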

If this fallback to IP happens while setting up a new SMC-R link group, the RoCE resources allocated for this SMC-R link group relationship are torn down, and it will be retried as a new SMC-R link group next time a connection starts between these peers with SMC-R proposed. Note that if this happens because one side doesn't support SMC-R, there will be very little to tear down, as the TCP option will have failed to flow on either the initial SYN or the SYN-ACK before either side had reserved any local RoCE resources.

如果在建立新的SMC-R链路组时发生这种IP回退,则分配给此SMC-R链路组关系的RoCE资源将被删除,并将在下一次这些具有SMC-R建议的对等方之间启动连接时作为新的SMC-R链路组重试。请注意,如果发生这种情况是因为一方不支持SMC-R,那么就没有什么可拆除的,因为在任何一方保留任何本地RoCE资源之前,TCP选项将无法在初始SYN或SYN-ACK上流动。

3.5.2. Subsequent Contact
3.5.2. 后续联系

"Subsequent contact" means setting up a new TCP connection between two peers that already have an SMC-R link group between them and reusing the existing SMC-R link group. In this case, it is not necessary to allocate new QPs. However, it is possible that a new RMB has been allocated for this TCP connection, if the previous TCP connection used the last element available in the previously used RMB, or for any other implementation-dependent reason. For this reason, and for convenience and error checking, the same TCP option 254, followed by the inline negotiation method described for initial contact, will be used for subsequent contact, but the processing differs in some ways. That processing is described below.

“后续联系”是指在两个已经有SMC-R链路组的对等方之间建立新的TCP连接,并重用现有的SMC-R链路组。在这种情况下,不需要分配新的QP。但是,如果以前的TCP连接使用了以前使用的RMB中可用的最后一个元素,或者由于任何其他依赖于实现的原因,则可能已为此TCP连接分配了新的RMB。出于这个原因,并且为了方便和错误检查,后续的联系将使用相同的TCP选项254,然后是为初始联系描述的内联协商方法,但处理在某些方面有所不同。下面描述该处理。

3.5.2.1. SMC-R Proposal
3.5.2.1. SMC-R提案

When the client begins the inline negotiation with the server, it does not know if this is a first contact or a subsequent contact. The client cannot know this information until it sees the server's peer ID, to determine whether or not it already has an SMC-R link with this peer that it can use. There are several reasons why it is not sufficient to use the partner IP address, subnet, VLAN, or other IP information to make this determination. The most obvious reason is distributed systems: if the server IP address is actually a virtual IP address representing a distributed cluster, the actual host serving this TCP connection may not be the same as the host that served the last TCP connection to this same IP address.

当客户机开始与服务器进行内联协商时,它不知道这是第一个联系人还是后续联系人。在看到服务器的对等ID之前,客户端无法知道此信息,以确定它是否已经与该对等机建立了SMC-R链接,可以使用该链接。使用合作伙伴IP地址、子网、VLAN或其他IP信息进行此确定是不够的,原因有很多。最明显的原因是分布式系统:如果服务器IP地址实际上是表示分布式集群的虚拟IP地址,则为该TCP连接提供服务的实际主机可能与为该IP地址的上一个TCP连接提供服务的主机不同。

After the TCP three-way handshake, assuming that both partners indicate SMC-R capability, the client builds and sends the SMC Proposal CLC message to the server in exactly the same manner as it does in the "first contact" case, and in fact at this point doesn't know if it's a first contact or a subsequent contact. As in the "first contact" case, the client sends its peer ID value, suggested RNIC MAC/GID, and IP subnet or prefix information.

在TCP三方握手之后,假设双方都表示SMC-R能力,客户机构建SMC建议CLC消息并将其发送到服务器,其方式与“第一次接触”的方式完全相同,事实上,此时不知道是第一次接触还是后续接触。与“第一个联系人”的情况一样,客户端发送其对等ID值、建议的RNIC MAC/GID以及IP子网或前缀信息。

Upon receiving the client's proposal, the server looks up the provided peer ID to determine if it already has a usable SMC-R link group with this peer. If it does already have a usable SMC-R link group, the server then needs to decide whether it will use the existing SMC-R link group or create a new link group. For the case of the new link group, see Section 3.5.3 ("First Contact Variation: Creating a Parallel Link Group") below.

在收到客户机的建议后,服务器将查找提供的对等ID,以确定它是否已经具有与该对等机的可用SMC-R链接组。如果已经有可用的SMC-R链接组,则服务器需要决定是使用现有的SMC-R链接组还是创建新的链接组。关于新连接组的情况,请参见下文第3.5.3节(“首次接触变化:创建并联连接组”)。

For this discussion, assume that the server decides to use the existing SMC-R link group for the TCP connection, which is expected to be the most common case. The server is responsible for making this decision. The server then needs to communicate that information to the client, but it is not necessary to allocate, associate, and confirm QPs for the chosen SMC-R link. All that remains to be done is to set up RMB space for this TCP connection.

在本讨论中,假设服务器决定将现有SMC-R链路组用于TCP连接,这是最常见的情况。服务器负责做出此决定。然后,服务器需要将该信息传达给客户机,但无需为所选SMC-R链路分配、关联和确认QP。剩下要做的就是为这个TCP连接设置RMB空间。

If one of the RMBs already in use for this SMC-R link group has an available element that uses the appropriate buffer size, the server merely chooses one for this TCP connection and then sends an SMC Accept CLC message providing the full RoCE information for the chosen SMC-R link to the client, using the same format as the SMC Accept CLC message described in Section 3.5.1 ("First Contact") above.

如果已用于此SMC-R链路组的其中一个RMB具有使用适当缓冲区大小的可用元素,则服务器仅为此TCP连接选择一个,然后向客户端发送SMC Accept CLC消息,提供所选SMC-R链路的完整RoCE信息,使用与上文第3.5.1节(“第一次接触”)所述SMC接受CLC信息相同的格式。

The server may choose to use the SMC-R link that matches the suggested MAC/GID provided by the client in the SMC Proposal for its RDMA writes but is not obligated to do so. The final decision on which specific SMC-R link to assign a TCP connection to is an independent server and client decision.

服务器可以选择使用SMC-R链路,该链路与SMC提案中客户为其RDMA写入提供的建议MAC/GID相匹配,但没有义务这样做。将TCP连接分配给哪个特定SMC-R链路的最终决定是独立的服务器和客户端决定。

It may be necessary for the server to allocate a new RMB for this connection. The reasons for this are implementation dependent and could include the following:

服务器可能需要为此连接分配新的RMB。其原因取决于实施情况,可能包括:

o no available space in existing RMB or RMBs, or

o 现有RMB或RMB中无可用空间,或

o desire to allocate a new RMB that uses a different buffer size from the ones already created, or

o 希望分配一个新的人民币,使用与已创建的人民币不同的缓冲区大小,或

o any other implementation-dependent reason

o 任何其他依赖于实现的原因

In this case, the server will allocate the new RMB and then perform the flows described in Section 3.5.5.2.1 ("Adding a New RMB to an SMC-R Link Group"). Once that processing is complete, the server then provides the full RoCE information, including the new RKey, for this connection in an SMC Accept CLC message to the client.

在这种情况下,服务器将分配新的RMB,然后执行第3.5.5.2.1节(“向SMC-R链路组添加新RMB”)中描述的流程。处理完成后,服务器将在SMC Accept CLC消息中向客户端提供此连接的完整RoCE信息,包括新的RKey。
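
The server-side choice between reusing an element of an existing RMB and allocating a new RMB might look like the following sketch. The Rmb class, the pick_rmb_element function, and the number of elements per new RMB are assumptions made for illustration; real RMB layouts are implementation dependent.

```python
from dataclasses import dataclass
from typing import List, Tuple

ELEMENTS_PER_NEW_RMB = 4  # hypothetical value, chosen only for this sketch


@dataclass
class Rmb:
    rkey: int
    element_size: int
    in_use: List[bool]  # one flag per RMB element


def pick_rmb_element(rmbs: List[Rmb], wanted_size: int,
                     new_rkey: int) -> Tuple[Rmb, int, bool]:
    """Return (rmb, element index, newly_allocated) for a new TCP connection."""
    # Prefer a free element in an RMB that already uses the wanted buffer size.
    for rmb in rmbs:
        if rmb.element_size != wanted_size:
            continue
        for index, busy in enumerate(rmb.in_use):
            if not busy:
                rmb.in_use[index] = True
                return rmb, index, False
    # Otherwise allocate a new RMB. Its RKey still has to be exchanged with
    # the peer (Section 3.5.5.2.1) before the element is advertised.
    rmb = Rmb(rkey=new_rkey, element_size=wanted_size,
              in_use=[True] + [False] * (ELEMENTS_PER_NEW_RMB - 1))
    rmbs.append(rmb)
    return rmb, 0, True
```

The third element of the returned tuple tells the caller whether the extra RKey exchange is needed before the CLC message for this connection can be sent.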

3.5.2.2. SMC-R Acceptance
3.5.2.2. SMC-R验收

Upon receiving the SMC Accept CLC message from the server, the client examines the RoCE information provided by the server to determine whether this is a first contact for a new SMC-R link group or a subsequent contact for an existing SMC-R link group. It is a subsequent contact if the server-side peer ID, GID, MAC, and QP number provided in the packet match a known SMC-R link, and the first contact flag is not set. If this is not the case -- for example, the GID and MAC match but the QP is new -- then the server is creating a new, parallel SMC-R link group, and this is treated as a first contact.

在从服务器接收到SMC Accept CLC消息后,客户端检查服务器提供的RoCE信息,以确定这是新SMC-R链路组的第一个联系人还是现有SMC-R链路组的后续联系人。如果包中提供的服务器端对等ID、GID、MAC和QP号码与已知SMC-R链路匹配,并且未设置第一个联系人标志,则为后续联系人。如果情况并非如此——例如,GID和MAC匹配,但QP是新的——那么服务器将创建一个新的并行SMC-R链路组,并将其视为第一个联系人。
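
The client's classification of an incoming SMC Accept can be sketched as follows. The LinkId tuple and the function name are hypothetical; the field values stand in for whatever encoding an implementation uses.

```python
from typing import NamedTuple, Set


class LinkId(NamedTuple):
    peer_id: bytes
    gid: bytes
    mac: bytes
    qp_number: int


def is_subsequent_contact(accept_link: LinkId, first_contact_flag: bool,
                          known_links: Set[LinkId]) -> bool:
    """True if the SMC Accept refers to an SMC-R link the client already has."""
    # All four values must match a known link AND the first contact flag must
    # be clear; anything else is treated as a first contact.
    return (not first_contact_flag) and accept_link in known_links
```

A matching GID and MAC with an unknown QP number therefore yields False, which corresponds to the parallel link group case.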

A different RMB RToken does not indicate a first contact, as the server may have allocated a new RMB or may be using several RMBs for this SMC-R link. The client needs the server's RMB information only for its RDMA writes to the server, and since there is no requirement for symmetric RMBs, this information is simply control information for the RDMA writes on this SMC-R link.

不同的RMB RToken并不表示第一次联系,因为服务器可能已经分配了一个新的RMB,或者可能正在为此SMC-R链路使用多个RMB。客户端只需要服务器的RMB信息才能将其RDMA写入服务器,并且由于不需要对称RMB,因此该信息只是此SMC-R链路上RDMA写入的控制信息。

The client must validate that the RMB element being provided by the server is not in use by another TCP connection on this SMC-R link group. This validation must check the new <rtoken, index> against all known <rtoken, index> pairs on this link group. See Section 4.4.2 ("RMB Element Reuse and Conflict Resolution") for the case in which the server tries to use an RMB element that is already in use on this link group.

客户端必须验证服务器提供的RMB元素是否未被此SMC-R链路组上的另一个TCP连接使用。此验证必须将新的<rtoken,index>与此链路组上所有已知的<rtoken,index>进行比对。请参阅第4.4.2节(“RMB元素重用和冲突解决”),了解服务器尝试使用此链路组上已在使用的RMB元素的情况。
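
One way to perform this check is to keep a set of <rtoken, index> pairs per link group. The class below is a minimal sketch under that assumption; its name and methods are hypothetical, and the conflict handling of Section 4.4.2 is not modeled.

```python
class LinkGroupElements:
    """Tracks which peer RMB elements are in use on one link group (sketch)."""

    def __init__(self) -> None:
        self._in_use = set()

    def claim(self, rtoken: int, index: int) -> bool:
        """Record the element for a new connection; False means a conflict."""
        key = (rtoken, index)
        if key in self._in_use:
            return False
        self._in_use.add(key)
        return True

    def release(self, rtoken: int, index: int) -> None:
        """Forget the element once its TCP connection has terminated."""
        self._in_use.discard((rtoken, index))
```

The set is scoped to the link group rather than to a single link, since the check applies across the whole group.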

Once the client has determined that this TCP connection is a subsequent contact over an existing SMC-R link, it performs an RMB allocation process similar to what the server did: it either (1) allocates an element from an RMB already associated with this SMC-R link or (2) allocates a new RMB, associates it with this SMC-R link, and then chooses an element out of it.

一旦客户端确定此TCP连接是通过现有SMC-R链路的后续联系人,它将执行与服务器所做的类似的RMB分配过程:(1)从已与此SMC-R链路关联的RMB分配元素,或(2)分配新RMB,将其与此SMC-R链路关联,然后从中选择一个元素。

If the client allocates a new RMB for this TCP connection, it performs the processing described in Section 3.5.5.2.1 ("Adding a New RMB to an SMC-R Link Group"). Once that processing is complete, the client provides its full RoCE information for this TCP connection in an SMC Confirm CLC message.

如果客户端为此TCP连接分配了新的RMB,它将执行第3.5.5.2.1节(“向SMC-R链路组添加新RMB”)中描述的处理。处理完成后,客户端将在SMC确认CLC消息中提供此TCP连接的完整RoCE信息。

Because an SMC-R link with a verified connected QP already exists and is being reused, there is no need for verification or alternate QP selection flows or timers.

由于具有已验证连接的QP的SMC-R链路已经存在并正在重新使用,因此不需要验证或备用QP选择流或计时器。

3.5.2.3. SMC-R Confirmation
3.5.2.3. SMC-R确认

When the server receives the client's SMC Confirm CLC message on a subsequent contact, it verifies the following:

当服务器在后续联系人上收到客户机的SMC确认CLC消息时,它将验证以下内容:

o The RMB element provided by the client is not already in use by another TCP connection on this SMC-R link group (see Section 4.4.2 ("RMB Element Reuse and Conflict Resolution") for the case in which it is).

o 客户端提供的RMB元素尚未被此SMC-R链路组上的另一个TCP连接使用(请参阅第4.4.2节(“RMB元素重用和冲突解决”)。

o The MAC/GID/QP information provided by the client matches an active link within the link group. The client is free to select any valid/active link. The client is not required to select the same link as the server.

o 客户端提供的MAC/GID/QP信息与链路组内的活动链路相匹配。客户端可以自由选择任何有效/活动链接。客户端无需选择与服务器相同的链接。

If this validation passes, the server stores the client's RMB information for this connection, and the RoCE setup of the TCP connection is complete.

如果此验证通过,服务器将存储此连接的客户端RMB信息,TCP连接的RoCE设置完成。

3.5.2.4. TCP Data Flow Race with SMC Confirm CLC Message
3.5.2.4. TCP数据流与SMC确认CLC消息的竞争

On a subsequent contact TCP/IP connection, a peer may send data as soon as it has received the peer RMB information for the connection. There are no additional RoCE confirmation flows, since the QPs on the SMC-R link are already reliably connected and verified.

在随后的联系人TCP/IP连接上,对等方可以在收到连接的对等方RMB信息后立即发送数据。没有其他RoCE确认流,因为SMC-R链路上的QP已经可靠连接和验证。

In the majority of cases, the first data will flow from the client to the server. The client must send the SMC Confirm CLC message before sending any connection data over the chosen SMC-R link; however, the client need not wait for confirmation of this message, and in fact there will be no such confirmation. Since the server is required to have the RMB fully set up and ready to receive data from the client before sending an SMC Accept CLC message, the client can begin sending data over the SMC-R link immediately upon completing the send of the SMC Confirm CLC message.

在大多数情况下,第一批数据将从客户机流向服务器。在通过所选SMC-R链路发送任何连接数据之前,客户端必须发送SMC确认CLC消息;但是,客户端不需要等待此消息的确认,事实上也不会有这样的确认。由于服务器需要在发送SMC Accept CLC消息之前完全设置RMB并准备好接收来自客户端的数据,因此客户端可以在完成SMC确认CLC消息的发送后立即开始通过SMC-R链路发送数据。

It is possible that data from the client will arrive at the server-side RMB before the SMC Confirm CLC message from the client has been processed. In this case, the server must handle this race condition and not provide the arrived TCP data to the socket application until the SMC Confirm CLC message has been received and fully processed, opening the socket.

在处理来自客户端的SMC确认CLC消息之前,来自客户端的数据可能会到达服务器端RMB。在这种情况下,服务器必须处理此竞争条件,并且在收到SMC确认CLC消息并完全处理该消息之前,不向套接字应用程序提供到达的TCP数据,从而打开套接字。
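
The race can be handled by holding early data back until the SMC Confirm CLC message has been processed. The following is a minimal sketch; the class and method names are hypothetical, and a real implementation would leave the data in the RMB element rather than copy it.

```python
class PendingConnection:
    """Server-side gate for data that arrives before the SMC Confirm (sketch)."""

    def __init__(self) -> None:
        self.confirm_processed = False
        self._held = bytearray()

    def on_rmb_data(self, data: bytes) -> bytes:
        """Return the bytes that may be handed to the socket application now."""
        if not self.confirm_processed:
            self._held += data  # arrived early: hold until the Confirm is done
            return b""
        return data

    def on_smc_confirm(self) -> bytes:
        """Mark the Confirm as fully processed and release any held data."""
        self.confirm_processed = True
        released = bytes(self._held)
        self._held.clear()
        return released
```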

If the server has initial data to send to the client that is not a response to the client (this case should be rare), it can send the data immediately upon receiving and processing the SMC Confirm CLC message from the client. The client must have opened the TCP socket to the client application upon sending the SMC Confirm CLC message so the client will be ready to process data from the server.

如果服务器要向客户端发送的初始数据不是对客户端的响应(这种情况应该很少),则它可以在接收和处理来自客户端的SMC确认CLC消息后立即发送数据。在发送SMC确认CLC消息时,客户端必须已打开客户端应用程序的TCP套接字,以便客户端准备好处理来自服务器的数据。

3.5.3. First Contact Variation: Creating a Parallel Link Group
3.5.3. 第一次接触变体:创建平行链接组

Recall that parallel SMC-R links, that is, multiple SMC-R links within a link group that use the same network path, are not supported. However, multiple SMC-R link groups between the same peers are supported. This means that if multiple SMC-R links over the same RoCE path are desired, it is necessary to use multiple SMC-R link groups. While not a recommended practice, this could be done for platform-specific reasons, like QP separation of different workloads. Only the server can drive the creation of multiple SMC-R link groups between peers.

回想一下,SMC-R链路组内的并行SMC-R链路不受支持。这些是链路组中使用相同网络路径的多个SMC-R链路。但是,支持相同对等方之间的多个SMC-R链路组。这意味着,如果需要在同一RoCE路径上的多个SMC-R链路,则有必要使用多个SMC-R链路组。虽然不是推荐的做法,但这可能是由于平台特定的原因,例如不同工作负载的QP分离。只有服务器可以驱动在对等方之间创建多个SMC-R链路组。

At a high level, when the server decides to create an additional SMC-R link group with a client with which it already has an SMC-R link group, the flows are basically the same as the normal "first contact" case described above. The following text provides more detail and clarification of processing in this case.

在较高级别上,当服务器决定与已具有SMC-R链路组的客户端创建额外的SMC-R链路组时,流程基本上与上述正常“首次接触”情况相同。以下文本提供了本案例中处理的更多细节和说明。

When the server receives the SMC Proposal CLC message from the client and, using the MAC/GID information, determines that it already has an SMC-R link group with this client, the server can either reuse the existing SMC-R link group (detailed in Section 3.5.2 ("Subsequent Contact") above) or create a new SMC-R link group in addition to the existing one.

当服务器从客户端接收到SMC建议CLC消息,并使用MAC/GID信息确定其已与该客户端建立了SMC-R链路组时,服务器可以重用现有的SMC-R链路组(详见上文第3.5.2节(“后续联系人”)或在现有SMC-R链路组的基础上创建新的SMC-R链路组。

If the server decides to create a new SMC-R link group, it does the same processing it would have done for first contact: allocate QP and RMB resources as well as alternate QP resources, and communicate the QP and RMB information to the client in the SMC Accept CLC message with the first contact flag set.

如果服务器决定创建一个新的SMC-R链路组,它将执行与第一个联系人相同的处理:分配QP和RMB资源以及备用QP资源,并在SMC Accept CLC消息中将QP和RMB信息与设置了第一个联系人标志的客户机通信。

When the client receives the server's SMC Accept CLC message with the new QP information and the first contact flag set, it knows that the server is creating a new SMC-R link group even though it already has an SMC-R link group with the server. In this case, the client will also allocate a new QP for this new SMC-R link, allocate an RMB for it, and generate an RKey for it.

当客户端接收到服务器的SMC Accept CLC消息,其中包含新的QP信息和设置的第一个联系人标志时,它知道服务器正在创建一个新的SMC-R链路组,即使它已经与服务器有一个SMC-R链路组。在这种情况下,客户还将为此新SMC-R链路分配新的QP,为其分配RMB,并为其生成RKey。

Note that multiple SMC-R link groups between the same peers must access different RMB resources, so new RMBs will be required. Using the same RMBs that are in use in another SMC-R link group is not permitted.

请注意,同一对等方之间的多个SMC-R链路组必须访问不同的RMB资源,因此需要新的RMB。不允许使用其他SMC-R链路组中使用的相同RMB。

The client then associates its new QP with the server's new QP and sends its SMC Confirm CLC message back to the server providing the new QP/RMB information, and then sets its confirmation timer for the new SMC-R link.

然后,客户端将其新QP与服务器的新QP关联,并将其SMC确认CLC消息发送回提供新QP/RMB信息的服务器,然后为新SMC-R链路设置其确认计时器。

When the server receives the client's SMC Confirm CLC message, it associates its QP with the client's QP as learned from the SMC Confirm CLC message and sends a confirmation LLC message. The rest of the flow, with QP confirmation and the setup of additional SMC-R links, unfolds just like the "first contact" case.

当服务器接收到客户机的SMC确认CLC消息时,它将其QP与从SMC确认CLC消息中获悉的客户机QP相关联,并发送确认LLC消息。流程的其余部分,随着确认QP和额外SMC-R链接的设置,将像“首次接触”案例一样展开。

3.5.4. Normal SMC-R Link Termination
3.5.4. 正常SMC-R链路终端

The normal socket API trigger points are used by the SMC-R layer to initiate SMC-R connection termination flows. The main design point for SMC-R normal connection flows is to use the SMC-R protocol to first shut down the SMC-R connection and free up any SMC-R RDMA resources, and then allow the normal TCP connection termination protocol (i.e., FIN processing) to drive cleanup of the TCP connection that exists on the IP fabric. This design point is very important in ensuring that RDMA resources such as the RMBEs are only freed and reused when both SMC-R endpoints are completely done with their RDMA write operations to the partner's RMBE.

SMC-R层使用普通套接字API触发点来启动SMC-R连接终止流。SMC-R正常连接流的主要设计点是使用SMC-R协议首先关闭SMC-R连接并释放任何SMC-R RDMA资源,然后允许正常TCP连接终止协议(即FIN处理)驱动IP结构上存在的TCP连接清理。此设计点对于确保只有当两个SMC-R端点完全完成了对合作伙伴RMBE的RDMA写入操作时,RDMA资源(如RMBE)才会被释放和重用非常重要。

When the last TCP connection over an SMC-R link group terminates, the link group can be terminated. Similar to creation of SMC-R links and link groups, the primary responsibility for determining that normal termination is needed and initiating it lies with the server.

当SMC-R链路组上的最后一个TCP连接终止时,可以终止该链路组。与创建SMC-R链路和链路组类似,确定是否需要正常终止并启动它的主要责任在于服务器。

Implementations may opt to set timers to keep SMC-R link groups up for a specified time after the last TCP connection ends, to avoid churn in cases where TCP connections come and go regularly.

实现可能会选择设置计时器,以便在最后一次TCP连接结束后将SMC-R链路组保持在指定的时间内,以避免在TCP连接定期进出的情况下发生混乱。

The link or link group may also be terminated as a result of a command initiated by the operator. This command can be entered at either the client or the server. If entered at the client, the client requests that the server perform link or link group termination, and the responsibility for doing so ultimately lies with the server.

链路或链路组也可因操作员启动的命令而终止。可以在客户端或服务器上输入此命令。如果在客户端输入,客户端将请求服务器执行链路或链路组终止,而这样做的责任最终在于服务器。

When the server determines that the SMC-R link group is to be terminated, it sends a DELETE LINK LLC message to the client, with a flag set indicating that all links in the link group are to be terminated. After receiving confirmation from the adapter that the DELETE LINK LLC message has been sent, the server can clean up its end of the link group (QPs, RMBs, etc.). Upon receipt of the DELETE LINK message from the server, the client must immediately comply and clean up its end of the link group. Any TCP connections that the client believes to be active on the link group must be immediately terminated.

当服务器确定要终止SMC-R链路组时,它向客户端发送一条DELETE link LLC消息,并设置一个标志,指示链路组中的所有链路都要终止。从适配器接收到已发送DELETE LINK LLC消息的确认后,服务器可以清理其链路组端(QPs、RMBs等)。从服务器收到删除链接消息后,客户端必须立即遵守并清理其链接组的末端。客户端认为在链路组上处于活动状态的任何TCP连接都必须立即终止。

The client can request that the server delete the link group as well. The client does this by sending a DELETE LINK message to the server, indicating that cleanup of all links is requested. The server must comply by sending a DELETE LINK to the client and processing as described in the previous paragraph. If there are TCP connections active on the link group when the server receives this request, they are immediately terminated by sending a RST flow over the IP fabric.

客户端也可以请求服务器删除链接组。客户端通过向服务器发送删除链接消息来完成此操作,该消息指示请求清理所有链接。服务器必须通过向客户端发送删除链接并按照上一段所述进行处理来遵守。如果服务器收到此请求时链路组上有活动的TCP连接,则会通过在IP结构上发送RST流立即终止这些连接。
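
The server's handling of a client request to delete the whole link group might be sketched as an ordered list of actions. The action names are hypothetical, and the relative order of the RST flows and the DELETE LINK message is an assumption of this sketch; the text above does not prescribe it.

```python
from typing import List, Tuple


def server_on_delete_all_request(
        active_connections: List[str]) -> List[Tuple[str, str]]:
    """Return the actions a server takes for a client 'delete all links' request."""
    # Active TCP connections are terminated with RST flows over the IP fabric.
    actions = [("send_tcp_rst", conn) for conn in active_connections]
    # The server must comply by sending its own DELETE LINK for all links.
    actions.append(("send_llc", "DELETE LINK (all links)"))
    # Local cleanup waits for the adapter to confirm that the message was sent.
    actions.append(("await_send_completion", "DELETE LINK (all links)"))
    actions.append(("free_resources", "QPs and RMBs"))
    return actions
```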

3.5.5. Link Group Management Flows
3.5.5. 链接组管理流

3.5.5.1. Adding and Deleting Links in an SMC-R Link Group
3.5.5.1. 在SMC-R链接组中添加和删除链接

The server has the lead role in managing the composition of the link group. Links are added to the link group by the server. The client may notify the server of new conditions that may result in the server adding a new link, but the server is ultimately responsible. In general, links are deleted from the link group by the server; however, in certain error cases the client may inform the server that a link must be deleted and treat it as deleted without waiting for action from the server. These flows are detailed in the sections that follow.

服务器在管理链接组的组成方面起主导作用。链接由服务器添加到链接组。客户端可能会将可能导致服务器添加新链接的新情况通知服务器,但最终由服务器负责。通常,服务器会从链接组中删除链接;但是,在某些错误情况下,客户端可能会通知服务器必须删除链接,并将其视为已删除,而无需等待服务器的操作。这些流程将在下面的章节中详细介绍。

3.5.5.1.1. Server-Initiated ADD LINK Processing
3.5.5.1.1. 服务器启动的添加链接处理

As described in previous sections, the server initiates an ADD LINK exchange to create redundancy in a newly created link group. Once a link group is established, the server may also initiate ADD LINK for other reasons, including:

如前几节所述,服务器启动添加链接交换以在新创建的链接组中创建冗余。一旦建立了链接组,服务器还可能出于其他原因启动添加链接,包括:

o Availability of additional resources on the server host to support an additional SMC-R link. This may include the provisioning of an additional RNIC, more storage becoming available to support additional QP resources, operator command, or any other implementation-dependent reason. Note that in order to be available for an existing link group a new RNIC must be attached to the same RoCE LAN that the link group is using.

o 服务器主机上额外资源的可用性,以支持额外的SMC-R链路。这可能包括提供额外的RNIC、更多的存储以支持额外的QP资源、操作员命令或任何其他依赖于实现的原因。请注意,为了可用于现有链路组,必须将新RNIC连接到链路组正在使用的同一RoCE LAN。

o Receipt of notification from the client that additional resources on the client are available to support an additional SMC-R link. See Section 3.5.5.1.2 ("Client-Initiated ADD LINK Processing").

o 收到客户通知,客户机上的其他资源可用于支持额外的SMC-R链接。参见第3.5.5.1.2节(“客户发起的添加链接处理”)。

Server-initiated ADD LINK processing in an established SMC-R link group is the same as the ADD LINK processing described in Section 3.5.1.6 ("Second SMC-R Link Setup"), with the following changes:

在已建立的SMC-R链路组中,服务器启动的添加链路处理与第3.5.1.6节(“第二个SMC-R链路设置”)中描述的添加链路处理相同,但有以下更改:

o If an asymmetric SMC-R link already exists in the link group, a second asymmetric link will not be created. Only one asymmetric link is permitted in a link group.

o 如果链路组中已存在非对称SMC-R链路,则不会创建第二个非对称链路。链路组中只允许一个非对称链路。

o TCP data flow on already-existing link(s) in the link group is not halted or otherwise affected during the process of setting up the additional link.

o 在设置附加链路的过程中,链路组中已有链路上的TCP数据流不会停止或受到其他影响。

The server will not initiate ADD LINK processing if the link group already has the maximum number of links negotiated by the partners.

如果链接组已具有合作伙伴协商的最大链接数,服务器将不会启动添加链接处理。

3.5.5.1.2. Client-Initiated ADD LINK Processing
3.5.5.1.2. 客户端启动的添加链接处理

If an additional RNIC becomes available for an existing SMC-R link group on the client's side, the client notifies the server by sending an ADD LINK request LLC message to the server. Unlike an ADD LINK request sent by the server to the client, this ADD LINK request merely informs the server that the client has a new RNIC. If the link group lacks redundancy or has redundancy only on an asymmetric link with a single RNIC on the client side, the server must initiate an ADD LINK exchange in response to this message, to create or improve the link group's redundancy.

如果客户端的现有SMC-R链路组可以使用额外的RNIC,则客户端通过向服务器发送添加链路请求LLC消息来通知服务器。与服务器向客户端发送的添加链接请求不同,此添加链接请求仅通知服务器客户端有一个新的RNIC。如果链路组缺少冗余或仅在客户端具有单个RNIC的非对称链路上具有冗余,则服务器必须启动添加链路交换以响应此消息,以创建或改进链路组的冗余。

If the link group already has symmetric-link redundancy but has fewer than the negotiated maximum number of links, the server may respond by initiating an ADD LINK exchange to create a new link using the client's new resource but is not required to do so.

如果链路组已具有对称链路冗余,但链路数少于协商的最大链路数,则服务器可以通过启动添加链路交换来响应,以使用客户端的新资源创建新链路,但不需要这样做。

If the link group already has the negotiated maximum number of links, the server must ignore the client's ADD LINK request LLC message.

如果链接组已具有协商的最大链接数,则服务器必须忽略客户端的添加链接请求LLC消息。

Because the server is not required to respond to the client's ADD LINK LLC message in all cases, the client must not wait for a response or throw an error if one does not come.

由于服务器在所有情况下都不需要响应客户端的ADD LINK LLC消息,因此客户端不能等待响应或在没有响应时抛出错误。
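
The three cases above can be expressed as a small decision function. The function and its return values are hypothetical. Checking the negotiated maximum first is also an assumption, since the text does not say which rule wins when a link group is at its maximum but lacks symmetric redundancy.

```python
def server_reaction_to_client_add_link(num_links: int, max_links: int,
                                       has_symmetric_redundancy: bool) -> str:
    """Classify the server's obligation after a client ADD LINK request."""
    # At the negotiated maximum, the request is ignored.
    if num_links >= max_links:
        return "ignore"
    # No redundancy, or redundancy only via an asymmetric link: must add.
    if not has_symmetric_redundancy:
        return "must_add"
    # Symmetric redundancy already exists: adding another link is optional.
    return "may_add"
```

Because "ignore" and "may_add" can both result in no reply, a client built on this logic cannot treat silence as an error.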

3.5.5.1.3. Server-Initiated DELETE LINK Processing
3.5.5.1.3. 服务器启动的删除链接处理

Reasons that a server may delete a link include the following:

服务器可能删除链接的原因包括:

o The link has not been used for TCP connections for an implementation-defined time interval, and deleting the link will not cause the link group to lack redundancy.

o 在实现定义的时间间隔内,该链路尚未用于TCP连接,删除该链路不会导致链路组缺少冗余。

o Errors in resources supporting the link occur. These errors may include, but are not limited to, RNIC errors, QP errors, and software errors.

o 支持链接的资源中出现错误。这些错误可能包括但不限于RNIC错误、QP错误和软件错误。

o The RNIC supporting this SMC-R link is being taken down, either because of an error case or because of an operator or software command.

o 支持此SMC-R链路的RNIC因错误情况或操作员或软件命令而被关闭。

If a link being deleted is supporting TCP connections and there are one or more surviving links in the link group, the TCP connections are moved to the surviving links. For more information on this processing, see Section 2.3 ("SMC-R Resilience and Load Balancing").

如果要删除的链接支持TCP连接,并且链接组中有一个或多个尚存链接,则TCP连接将移动到尚存链接。有关此处理的更多信息,请参阅第2.3节(“SMC-R弹性和负载平衡”)。

The server deletes a link from the link group by sending a DELETE LINK request LLC message to the client over any of the usable links in the link group. Because the DELETE LINK LLC message specifies which link is to be deleted, it may flow over any link in the link group. The server must not clean up its RoCE resources for the link until the client responds.

服务器通过链路组中的任何可用链路向客户端发送删除链路请求LLC消息,从链路组中删除链路。由于DELETE LINK LLC消息指定要删除的链接,因此它可能会流过链接组中的任何链接。在客户端响应之前,服务器不得清理其链接的RoCE资源。

The client responds to the server's DELETE LINK request LLC message by sending the server a DELETE LINK response LLC message. The client must respond positively; it cannot decline to delete the link. Once the server has received the client's DELETE LINK response, both sides may clean up their resources for the link.

客户端通过向服务器发送删除链接响应LLC消息来响应服务器的删除链接请求LLC消息。客户必须积极响应;它不能拒绝删除链接。一旦服务器收到客户机的删除链接响应,双方都可以清理他们的链接资源。

Either a positive write completion or some other indication from the RNIC on the client's side is sufficient to indicate to the client that the server has received the DELETE LINK response.

客户端RNIC的肯定写入完成或其他指示足以向客户端指示服务器已收到删除链接响应。

         Host X                                     Host Y
    +-------------------+                      +-------------------+
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
    |RToken X|   |Failed|<--X----X----X----X-->|      |            |
    |        |   |      |                      |      |            |
    |       \/   +------+                      +------+            |
    |+--------+         |                      |                   |
    || Deleted|         |                      |                   |
    || RMB    |         |                      |                   |
    ||        |         |                      |                   |
    |+--------+         |                      |                   |
    |       /\   +------+                      +------+            |
    |RToken Z|   |      |     SMC-R Link 2     |      |            |
    |        |   |RNIC 3|<-------------------->|RNIC 4|            |
    |       QP 64|      |                      |      | QP 65      |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+
        
          DELETE LINK(request, link number = 1,
                ................................................>
                       reason code = RNIC failure)
        
          DELETE LINK(response, link number = 1)
               <................................................
        

(Note: Architecturally, this exchange can flow over either SMC-R link but most likely flows over Link 2, since the RNIC for Link 1 has failed.)

(注意:在体系结构上,此交换可以通过任一SMC-R链路进行,但最有可能通过链路2进行,因为链路1的RNIC发生故障。)

Figure 10: Server-Initiated DELETE LINK Flow

图10:服务器启动的删除链接流
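
The server side of the exchange in Figure 10 can be sketched as a small state holder that refuses to free resources until the response arrives. The class name, the state names, and the dictionary message format are hypothetical; the actual LLC message layout is defined elsewhere in this document.

```python
class ServerLinkDeletion:
    """Tracks one server-initiated DELETE LINK exchange (illustrative sketch)."""

    def __init__(self, link_number: int) -> None:
        self.link_number = link_number
        self.state = "active"

    def start(self, reason: str) -> dict:
        """Build the DELETE LINK request; it may flow over any usable link."""
        self.state = "awaiting_response"
        return {"type": "DELETE LINK", "kind": "request",
                "link_number": self.link_number, "reason": reason}

    def on_response(self, msg: dict) -> bool:
        """Accept the client's response for this link; True if it matched."""
        if (self.state != "awaiting_response"
                or msg.get("link_number") != self.link_number):
            return False
        self.state = "response_received"
        return True

    def may_free_resources(self) -> bool:
        # RoCE resources for the link are kept until the client has responded.
        return self.state == "response_received"
```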

3.5.5.1.4. Client-Initiated DELETE LINK Request
3.5.5.1.4. 客户端启动的删除链接请求

The client may request that the server delete a link for the same reasons that the server may delete a link, except for inactivity timeout.

客户端可能会请求服务器删除链接,原因与服务器删除链接的原因相同,但非活动超时除外。

Because the client depends on the server to delete links, there are two types of delete requests from client to server:

由于客户端依赖于服务器来删除链接,因此从客户端到服务器有两种类型的删除请求:

o Orderly: The client is requesting that the server delete the link when able. This would result from an operator command to bring down the RNIC or some other nonfatal reason. In this case, the server is required to delete the link but may not do it right away.

o 有序:客户端请求服务器在可以删除链接时删除链接。这可能是由于操作员命令关闭RNIC或其他非致命原因造成的。在这种情况下,服务器需要删除链接,但可能不会立即删除。

o Disorderly: The server must delete the link right away, because the client has experienced a fatal error with the link.

o 无序:服务器必须立即删除链接,因为客户端遇到了链接的致命错误。

In either case, the server responds by initiating a DELETE LINK exchange with the client, as described in the previous section. The difference between the two is whether the server must do so immediately or can delay for an opportunity to gracefully delete the link.

在任何一种情况下,服务器都会通过启动与客户端的删除链接交换来响应,如前一节所述。两者之间的区别在于服务器是否必须立即执行此操作,或者是否可以延迟以获得适当删除链接的机会。
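
The difference between the two request types is purely one of timing, as the sketch below shows. The function name, the return values, and the notion of a "convenient moment" are hypothetical; when a server considers itself able to delete the link gracefully is implementation dependent.

```python
def plan_link_deletion(orderly: bool, convenient_now: bool) -> str:
    """Decide when the server starts its DELETE LINK exchange (sketch only)."""
    # Disorderly: the client has hit a fatal error, so there is no waiting.
    if not orderly:
        return "delete_now"
    # Orderly: deletion is still required, but it may be deferred.
    return "delete_now" if convenient_now else "delete_later"
```

Both outcomes end in deletion; an orderly request never gives the server the option of keeping the link.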

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
     |RToken X|   |      |<---X--X--X--X--X--X->|Failed|            |
     |        |   |      |                      |      |            |
     |       \/   +------+                      +------+            |
     |+--------+         |                      |                   |
     || Deleted|         |                      |                   |
     || RMB    |         |                      |                   |
     ||        |         |                      |                   |
     |+--------+         |                      |                   |
     |       /\   +------+                      +------+            |
     |RToken Z|   |      |     SMC-R Link 2     |      |            |
     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
     |       QP 64|      |                      |      | QP 65      |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+
        
           DELETE LINK(request, link number = 1, disorderly,
                <...............................................
                       reason code = RNIC failure)
        
           DELETE LINK(request, link number = 1,
                 ................................................>
                        reason code = RNIC failure)
        
           DELETE LINK(response, link number = 1)
                <................................................
        

(Note: Architecturally, this exchange can flow over either SMC-R link but most likely flows over Link 2, since the RNIC for Link 1 has failed.)

Figure 11: Client-Initiated DELETE LINK Flow
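
The server-side handling of the two request types can be sketched as follows. This is a minimal illustration in Python, not a definitive implementation: the class `LinkGroupServer` and its method names are hypothetical, message sending is modeled by appending to a list, and the client's DELETE LINK response is not modeled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LinkGroupServer:
    """Hypothetical server-side view of one SMC-R link group."""
    links: set = field(default_factory=set)        # active link numbers
    pending: deque = field(default_factory=deque)  # deferred orderly deletes
    sent: list = field(default_factory=list)       # LLC messages "sent" to the client

    def _start_delete_exchange(self, link, reason):
        # The server always drives the actual DELETE LINK exchange.
        self.sent.append(("DELETE LINK request", link, reason))
        self.links.discard(link)

    def on_client_delete_request(self, link, orderly, reason):
        if link not in self.links:
            return
        if orderly:
            # Orderly: deletion is required but may be deferred.
            self.pending.append((link, reason))
        else:
            # Disorderly: the client hit a fatal error, so delete right away.
            self._start_delete_exchange(link, reason)

    def run_pending_deletes(self):
        # Called when the server finds an opportunity to delete gracefully.
        while self.pending:
            link, reason = self.pending.popleft()
            if link in self.links:
                self._start_delete_exchange(link, reason)
```

In this sketch, a disorderly request produces a DELETE LINK request at once, whereas an orderly request only takes effect when `run_pending_deletes` is called.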

3.5.5.2. Managing Multiple RKeys over Multiple SMC-R Links in a Link Group

After the initial contact sequence completes and the number of TCP connections increases, it is possible that the SMC peers could add more RMBs to the link group. Recall that each peer independently manages its RMBs. Also recall that an RMB's RToken is specific to a QP, which means that when there are multiple SMC-R links in a link group, each RMB accessed with the link group requires a separate RToken for each SMC-R link in the group.

Each RMB that is added to a link must be added to all links within the link group. The set of RTokens created for an RMB, one for each link in the link group, is called the "RToken set". The RTokens must be exchanged with the peer. As RMBs are added and deleted, both peers' views of the RToken sets must remain in sync.

3.5.5.2.1. Adding a New RMB to an SMC-R Link Group

A new RMB can be added to an SMC-R link group on either the client side or the server side. When an additional RMB is added to an existing SMC-R link group, that RMB must be associated with the QPs for each link in the link group. Therefore, when an RMB is added to an SMC-R link group, its RMB RToken for each SMC-R link's QP must be communicated to the peer.

The tokens for a new RMB added to an existing SMC-R link group are communicated using CONFIRM RKEY LLC messages, as shown in Figure 12. The RToken set is specified as pairs: an SMC-R link number, paired with the new RMB's RToken over that SMC-R link. To preserve failover capability, any TCP connection that uses a newly added RMB cannot go active until all RTokens for the RMB have been communicated for all of the links in the link group.

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
     |RToken X|   |      |<-------------------->|      |            |
     |        |   |      |                      |      |            |
     |       \/   +------+                      +------+            |
     |+--------+         |                      |                   |
     || New    |         |                      |                   |
     || RMB    |         |                      |                   |
     ||        |         |                      |                   |
     |+--------+         |                      |                   |
     |       /\   +------+                      +------+            |
     |RToken Z|   |      |     SMC-R Link 2     |      |            |
     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
     |       QP 64|      |                      |      | QP 65      |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+
        
           CONFIRM RKEY(request, Add,
                 ................................................>
                      RToken set((Link 1,RToken X),(Link 2,RToken Z)))
        
           CONFIRM RKEY(response, Add,
                <................................................
                      RToken set((Link 1,RToken X),(Link 2,RToken Z)))
        

(Note: This exchange can flow over either SMC-R link.)

Figure 12: Add RMB to Existing Link Group
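
As an illustration of the RToken set carried in Figure 12, the following Python sketch builds the (link number, RToken) pairs for a new RMB and checks whether the RMB may be used by a TCP connection yet. The function names and the dictionary layout are assumptions made for this example; they do not reflect the LLC wire format.

```python
def build_rtoken_set(rmb_rtokens, link_numbers):
    """Return the (link number, RToken) pairs for a CONFIRM RKEY Add request.

    rmb_rtokens maps link number -> the new RMB's RToken on that link.
    Raises ValueError if any link in the group lacks an RToken, since the
    RMB must be associated with every link in the link group.
    """
    missing = [n for n in link_numbers if n not in rmb_rtokens]
    if missing:
        raise ValueError(f"no RToken registered for links {missing}")
    return [(n, rmb_rtokens[n]) for n in sorted(link_numbers)]


def rmb_usable(confirmed_links, link_numbers):
    """A TCP connection may use the RMB only after the peer has been told
    the RMB's RToken for every link in the link group."""
    return set(link_numbers) <= set(confirmed_links)
```

For the configuration in Figure 12, `build_rtoken_set({1: "X", 2: "Z"}, [1, 2])` yields the pairs (1, "X") and (2, "Z").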

Implementations may choose to proactively add RMBs to link groups in anticipation of need. For example, an implementation may add a new RMB when a certain usage threshold (e.g., percentage used) for all of its existing RMBs has been exceeded.
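
One possible form of such a proactive check is sketched below in Python. The function name and the default threshold of 0.8 are assumptions chosen for illustration; the protocol does not define either.

```python
def should_add_rmb(used_fractions, threshold=0.8):
    """Return True when every existing RMB is above the usage threshold.

    used_fractions: fraction (0.0 to 1.0) of each existing RMB in use.
    threshold: hypothetical tuning value; not defined by the protocol.
    """
    if not used_fractions:
        # No RMBs exist yet, so one is needed in any case.
        return True
    return all(used > threshold for used in used_fractions)
```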

A new RMB may also be added to an existing link group on an as-needed basis -- for example, when a new TCP connection is added to the link group but there are no available RMB elements. In this case, the CLC exchange is paused while the peer that requires the new RMB adds it. An example of this is illustrated in Figure 13.

       Host X -- Server                            Host Y -- Client
    +-------------------+                      +--------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1    |
    |            +------+                      +------+             |
    |       QP 8 |RNIC 1|    SMC-R Link 1      |RNIC 2|  QP 64      |
    |RToken X|   |MAC MA|<-------------------->|MAC MB|   |         |
    |        |   |GID GA|                      |GID GB|   |RToken Y2|
    |       \/   +------+                      +------+  \/         |
    |+--------+         |                      |        +--------+  |
    ||        |         |   Subnet S1          |        | New    |  |
    || RMB    |         |                      |        | RMB    |  |
    |+--------+         |                      |        +--------+  |
    |       /\   +------+                      +------+  /\         |
    |        |   |RNIC 3|    SMC-R Link 2      |RNIC 4|   |RToken W2|
    |        |   |MAC MC|<-------------------->|MAC MD|   |         |
    |       QP 9 |GID GC|                      |GID GD|  QP 65      |
    |            +------+                      +------+             |
    +-------------------+                      +--------------------+
        
           SYN / SYN-ACK / ACK TCP three-way handshake with TCP option
        <--------------------------------------------------------->
        
                    SMC Proposal(PC1,MB,GB,S1)
        <--------------------------------------------------------
        
      SMC Accept(PS1,not 1st contact,MA,GA,QP8,RToken=X,RMB elem index)
        --------------------------------------------------------->
        
          CONFIRM RKEY(request, Add,
        <........................................................
                  RToken set((Link 1,RToken Y2),(Link 2,RToken W2)))
        
          CONFIRM RKEY(response, Add,
         ........................................................>
                  RToken set((Link 1,RToken Y2),(Link 2,RToken W2)))
        
          SMC Confirm(PC1,MB,GB,QP64,RToken=Y2, RMB element index)
        <--------------------------------------------------------
        
                         Legend:
                  ------------   TCP/IP and CLC flows
                  ............   RoCE (LLC) flows
        

Figure 13: Client Adds RMB during TCP Connection Setup

3.5.5.2.2. Deleting an RMB from an SMC-R Link Group

Either peer can delete one or more of its RMBs as long as they are not being used for any TCP connections. Ideally, an SMC-R peer would use a timer to avoid freeing an RMB immediately after the last TCP connection stops using it, to keep the RMB available for later TCP connections and avoid thrashing with addition and deletion of RMBs. Once an SMC-R peer decides to delete an RMB, it sends a DELETE RKEY LLC message to its peer. It can then free the RMB once it receives a response from the peer. Multiple RMBs can be deleted in a DELETE RKEY exchange.

Note that in a DELETE RKEY message, it is not necessary to specify the full RToken for a deleted RMB. The RMB's RKey over one link in the link group is sufficient to specify which RMB is being deleted.
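
The deletion rules above might be sketched as follows in Python. The data layout (a dictionary per RMB with `connections`, `idle_since`, and `rkeys` entries), the function names, and the 30-second grace period are all assumptions for this example. The receiver-side lookup also assumes that the receiver knows which link the supplied RKey belongs to.

```python
def deletable_rmbs(rmbs, now, idle_grace=30.0):
    """Return ids of RMBs with no TCP connections that have stayed idle
    for at least idle_grace seconds (the timer that avoids thrashing)."""
    return [rid for rid, rmb in rmbs.items()
            if rmb["connections"] == 0
            and now - rmb["idle_since"] >= idle_grace]


def delete_rkey_list(rmbs, rmb_ids, link):
    """Build the RKey list for a DELETE RKEY request: one RKey per RMB,
    taken from a single link in the link group."""
    return [rmbs[rid]["rkeys"][link] for rid in rmb_ids]


def find_rmb_by_rkey(rmbs, link, rkey):
    """Receiver side: identify the RMB from its RKey on one link."""
    for rid, rmb in rmbs.items():
        if rmb["rkeys"].get(link) == rkey:
            return rid
    return None
```

Several RMBs can be named in one request by passing more than one id to `delete_rkey_list`.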

          Host X                                     Host Y
     +-------------------+                      +-------------------+
     |            +------+                      +------+            |
     |       QP 8 |RNIC 1|     SMC-R Link 1     |RNIC 2| QP 9       |
     |RToken X|   |      |<-------------------->|      |            |
     |        |   |      |                      |      |            |
     |       \/   +------+                      +------+            |
     |+--------+         |                      |                   |
     || Deleted|         |                      |                   |
     || RMB    |         |                      |                   |
     ||        |         |                      |                   |
     |+--------+         |                      |                   |
     |       /\   +------+                      +------+            |
     |RToken Z|   |      |     SMC-R Link 2     |      |            |
     |        |   |RNIC 3|<-------------------->|RNIC 4|            |
     |       QP 9 |      |                      |      |            |
     |            +------+                      +------+            |
     +-------------------+                      +-------------------+
        
           DELETE RKEY(request, RKey list(RKey X))
                 ................................................>
        
           DELETE RKEY(response, RKey list(RKey X))
                <................................................
        

(Note: This exchange can flow over either SMC-R link.)

Figure 14: Delete RMB from SMC-R Link Group

3.5.5.2.3. Adding a New SMC-R Link to a Link Group with Multiple RMBs

When a new SMC-R link is added to an existing link group, there could be multiple RMBs on each side already associated with the link group. There could also be a different number of RMBs on one side than on the other, because each peer manages its RMBs independently. Each of these RMBs will require a new RToken to be used on the new SMC-R link, and those new RTokens must then be communicated to the peer. This requires two-way communication, as the server will have to communicate its RTokens to the client and vice versa.

RTokens are communicated between peers in pairs. Each RToken pair consists of:

o The RToken for the RMB, as is already known on an existing SMC-R link in the link group.

o The RToken for the same RMB, to be used on the new SMC-R link.

These pairs are required to ensure that each peer knows which RTokens across QPs are equivalent.

The ADD LINK request and response LLC messages do not have enough space to contain any RToken pairs. ADD LINK CONTINUATION LLC messages are used to communicate these pairs, as shown in Figure 15. The ADD LINK CONTINUATION LLC messages are sent on the same SMC-R link that the ADD LINK LLC messages were sent over, and in both the ADD LINK and ADD LINK CONTINUATION LLC messages the first RToken in each RToken pair will be the RToken for the RMB as known on the SMC-R link over which the LLC message is being sent.

       Host X -- Server                           Host Y -- Client
    +-------------------+                      +-------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1   |
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|    SMC-R Link 1      |RNIC 2|  QP 64     |
    |RKey set|   |MAC MA|<-------------------->|MAC MB|   |RKey set|
    |X,Y,Z   |   |GID GA|                      |GID GB|   |Q,R,S,T |
    |       \/   +------+                      +------+  \/        |
    |+--------+         |                      |        +--------+ |
    || 3 RMBs |         |                      |        | 4 RMBs | |
    |+--------+         |                      |        +--------+ |
    |       /\   +------+                      +------+  /\        |
    |RKey set|   |RNIC 3|    SMC-R Link 2      |RNIC 4|  | RKey set|
    |U,V,W   |   |MAC MC|<-------------------->|MAC MD|  | L,M,N,P |
    |       QP 9 |GID GC|    (being added)     |GID GD| QP 65      |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+
        
            ADD LINK request (QP9,MC,GC, link number = 2)
            ............................................>
        
            ADD LINK response (QP65,MD,GD, link number = 2)
            <............................................
        
    ADD LINK CONTINUATION req(RToken pairs=((X,U),(Y,V),(Z,W)))
             ............................................>
        
    ADD LINK CONTINUATION rsp(RToken pairs=((Q,L),(R,M),(S,N),(T,P)))
             <.............................................
        
           CONFIRM LINK req/rsp exchange on Link 2
            <.............................................>
        
                          Legend:
                   ------------   TCP/IP and CLC flows
                   ............   RoCE (LLC) flows
        

Figure 15: Exchanging RKeys when a New Link Is Added to a Link Group
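
The RToken pairs shown in Figure 15 can be derived as in the following Python sketch. The function name and the use of an RMB identifier as a dictionary key are assumptions for illustration only; how the pairs are packed into ADD LINK CONTINUATION messages is not modeled.

```python
def rtoken_pairs(known_link_tokens, new_link_tokens):
    """Pair each RMB's RToken on the existing link with its RToken on the
    new link.

    Both arguments map a (hypothetical) RMB id -> RToken. The first element
    of each pair is the RToken as known on the link carrying the message.
    """
    if known_link_tokens.keys() != new_link_tokens.keys():
        raise ValueError("every RMB needs an RToken on both links")
    return [(known_link_tokens[rmb], new_link_tokens[rmb])
            for rmb in known_link_tokens]
```

Each peer computes its own list independently, which is why the server in Figure 15 sends three pairs while the client sends four.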

3.5.5.3. Serialization of LLC Exchanges, and Collisions

LLC flows can be divided into two main groups for serialization considerations.

The first group is LLC messages that are independent and can flow at any time. These are one-time, unsolicited messages that either do not have a required response or have a simple response that does not interfere with the operations of another group of messages. These messages are as follows:

o TEST LINK from either the client or the server: This message requires a TEST LINK response to be returned but does not affect the configuration of the link group or the RKeys.

o ADD LINK from the client to the server: This message is provided as an "FYI" to the server to let it know that the client has an additional RNIC available. The server is not required to act upon or respond to this message.

o DELETE LINK from the client to the server: This message informs the server that either (1) the client has experienced an error or problem that requires a link or link group to be terminated or (2) an operator has commanded that a link or link group be terminated. The server does not respond directly to the message; rather, it initiates a DELETE LINK exchange as a result of receiving it.

o DELETE LINK from the server to the client, with the "delete entire link group" flag set: This message informs the client that the entire link group is being deleted.

The second group is LLC messages that are part of an exchange of LLC messages that affects link group configuration; this exchange must complete before another exchange of LLC messages that affects link group configuration can be processed. When a peer knows that one of these exchanges is in progress, it must not start another exchange. These exchanges are as follows:

o ADD LINK / ADD LINK response / ADD LINK CONTINUATION / ADD LINK CONTINUATION response / CONFIRM LINK / CONFIRM LINK response: This exchange, by adding a new link, changes the configuration of the link group.

o DELETE LINK / DELETE LINK response initiated by the server, without the "delete entire link group" flag set: This exchange, by deleting a link, changes the configuration of the link group.

o CONFIRM RKEY / CONFIRM RKEY response or DELETE RKEY / DELETE RKEY response: This exchange changes the RMB configuration of the link group. RKeys cannot change while links are being added or deleted (while an ADD LINK or DELETE LINK is in progress). However, CONFIRM RKEY and DELETE RKEY are unique in that both the client and server can independently manage (add or remove) their own RMBs. This allows each peer to concurrently change their RKeys and therefore concurrently send CONFIRM RKEY or DELETE RKEY requests. The concurrent CONFIRM RKEY or DELETE RKEY requests can be independently processed and do not represent a collision.

Because the server is in control of the configuration of the link group, many timing windows and collisions are avoided, but there are still some that must be handled.
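
The grouping described in this section can be summarized in a short Python sketch. The names `classify` and `may_start` and the three class labels are hypothetical. As a simplification, the sketch treats a CONFIRM RKEY or DELETE RKEY exchange as compatible with any other RKEY exchange already in progress, and blocks it only while a link-changing exchange is underway.

```python
INDEPENDENT = "independent"  # may flow at any time
SERIALIZED = "serialized"    # ADD LINK / DELETE LINK exchanges
RKEY = "rkey"                # CONFIRM RKEY / DELETE RKEY exchanges


def classify(msg_type, sender, delete_group=False):
    """Map an LLC request to its serialization class."""
    if msg_type == "TEST LINK":
        return INDEPENDENT
    if msg_type == "ADD LINK":
        # From the client this is only an "FYI" to the server.
        return INDEPENDENT if sender == "client" else SERIALIZED
    if msg_type == "DELETE LINK":
        if sender == "client" or delete_group:
            return INDEPENDENT
        return SERIALIZED
    if msg_type in ("CONFIRM RKEY", "DELETE RKEY"):
        return RKEY
    raise ValueError(f"unknown LLC message type: {msg_type}")


def may_start(msg_class, active):
    """Decide whether a peer may start a message of msg_class, given the
    set of exchange classes it knows to be in progress."""
    if msg_class == INDEPENDENT:
        return True
    if msg_class == SERIALIZED:
        return not active
    # RKeys cannot change while links are being added or deleted.
    return SERIALIZED not in active
```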

3.5.5.3.1. Collisions with ADD LINK / CONFIRM LINK Exchange

Colliding LLC message: TEST LINK

Action to resolve: Send immediate TEST LINK reply.

Colliding LLC message: ADD LINK from client to server

Action to resolve: Server ignores the ADD LINK message. When the client receives the server's ADD LINK, the client treats that message as the response to its own ADD LINK message, and the flow proceeds normally. Since both client and server know not to start this exchange if an ADD LINK operation is already underway, this collision can only occur if the client sends its message before receiving the server's ADD LINK and the two messages cross; therefore, the server's ADD LINK arrives at the client immediately after the client has sent its message.

Colliding LLC message: DELETE LINK from client to server, specific link specified

Action to resolve: Server queues the DELETE LINK message and processes it after the ADD LINK exchange completes. If it is an orderly link termination, it can wait until after this exchange completes. If it is disorderly and the link affected is the one that the current exchange is using, the server will discover the outage when a message in this exchange fails.

Colliding LLC message: DELETE LINK from client to server, entire link group to be deleted

Action to resolve: Immediately clean up the link group.

Colliding LLC message: CONFIRM RKEY from client

Action to resolve: Send a negative CONFIRM RKEY response to the client. Once the current exchange finishes, client will have to recompute its RKey set to include the new link and then start a new CONFIRM RKEY exchange.
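
The resolutions listed in this section can be collected into a single dispatch function, sketched below in Python from the server's point of view. The function name and the returned strings are illustrative labels chosen for this example, not protocol fields.

```python
from collections import deque


def resolve_add_link_collision(msg_type, sender, delete_group, queue):
    """Return the server's action for a message that collides with an
    in-progress ADD LINK / CONFIRM LINK exchange.

    queue is a deque of deferred messages, to be processed after the
    exchange completes.
    """
    if msg_type == "TEST LINK":
        return "send TEST LINK reply now"
    if sender != "client":
        raise ValueError("only client-originated collisions are modeled")
    if msg_type == "ADD LINK":
        # The client will treat the server's ADD LINK as the answer.
        return "ignore"
    if msg_type == "DELETE LINK":
        if delete_group:
            return "clean up link group now"
        queue.append((msg_type, sender))
        return "queued"
    if msg_type == "CONFIRM RKEY":
        return "send negative CONFIRM RKEY response"
    raise ValueError(f"unexpected LLC message: {msg_type}")
```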

3.5.5.3.2. Collisions during DELETE LINK Exchange

Colliding LLC message: TEST LINK from either peer

Action to resolve: Send immediate TEST LINK response.

Colliding LLC message: ADD LINK from client to server

Action to resolve: Server queues the ADD LINK and processes it after the current exchange completes.

Colliding LLC message: DELETE LINK from client to server (specific link)

Action to resolve: Server queues the DELETE LINK message and processes it after the current exchange completes. If it is an orderly link termination, it can wait until after this exchange completes. If it is disorderly and the link affected is the one that the current exchange is using, the server will discover the outage when a message in this exchange fails.

Colliding LLC message: DELETE LINK from either client or server, deleting the entire link group

Action to resolve: Immediately clean up the link group.

Colliding LLC message: CONFIRM RKEY from client to server

Action to resolve: Send a negative CONFIRM RKEY response to the client. Once the current exchange finishes, client will have to recompute its RKey set to reflect the deleted link and then start a new CONFIRM RKEY exchange.
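
The client's recovery from a negative CONFIRM RKEY response might look like the following Python sketch. The function and its four callable parameters are hypothetical hooks supplied by the caller, and the limit of three attempts is an assumption; the protocol does not specify a retry count.

```python
def confirm_rkey_with_retry(send, wait_until_idle, get_links, rtoken_on,
                            max_attempts=3):
    """Send CONFIRM RKEY; on a negative response, wait for the colliding
    exchange to finish, recompute the RToken set, and send again.

    send(rtoken_set) -> True for a positive response, False for negative.
    wait_until_idle() returns once the colliding exchange has finished.
    get_links() -> link numbers currently in the link group.
    rtoken_on(link) -> the RMB's RToken for that link.
    """
    for _ in range(max_attempts):
        # Recompute on every attempt: the set of links may have changed.
        rtoken_set = [(n, rtoken_on(n)) for n in sorted(get_links())]
        if send(rtoken_set):
            return rtoken_set
        wait_until_idle()
    raise RuntimeError("CONFIRM RKEY was rejected on every attempt")
```

Recomputing the set inside the loop is the point of the sketch: after a DELETE LINK exchange the retried request names only the links that remain.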

3.5.5.3.3. Collisions during CONFIRM RKEY Exchange

Colliding LLC message: TEST LINK

Action to resolve: Send immediate TEST LINK reply.

Colliding LLC message: ADD LINK from client to server

Action to resolve: Queue the ADD LINK, and process it after the current exchange completes.

Colliding LLC message: ADD LINK from server to client (CONFIRM RKEY exchange was initiated by the client, and it crossed with the server initiating an ADD LINK exchange)

Action to resolve: Process the ADD LINK. Client will receive a negative CONFIRM RKEY from the server and will have to redo this CONFIRM RKEY exchange after the ADD LINK exchange completes.

Colliding LLC message: DELETE LINK from client to server, specific link to be deleted (CONFIRM RKEY exchange was initiated by the server, and it crossed with the client's DELETE LINK request)

Action to resolve: Server queues the DELETE LINK message and processes it after the CONFIRM RKEY exchange completes. If it is an orderly link termination, it can wait until after this exchange completes. If it is disorderly and the link affected is the one that the current exchange is using, the server will discover the outage when a message in this exchange fails.

Colliding LLC message: DELETE LINK from server to client, specific link deleted (CONFIRM RKEY exchange was initiated by the client, and it crossed with the server's DELETE LINK)

Action to resolve: Process the DELETE LINK. Client will receive a negative CONFIRM RKEY from the server and will have to redo this CONFIRM RKEY exchange after the DELETE LINK exchange completes.

Colliding LLC message: DELETE LINK from either client or server, entire link group deleted

Action to resolve: Immediately clean up the link group.

Colliding LLC message: CONFIRM RKEY from the peer that did not start the current CONFIRM RKEY exchange

Action to resolve: Queue the request, and process it after the current exchange completes.
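
The collision rules in Sections 3.5.5.3.2 and 3.5.5.3.3 amount to a lookup from the exchange in progress, the colliding message, and its direction to a resolving action. The following is a minimal Python sketch of such a table. The names Action, RULES, and resolve are hypothetical and not part of the protocol, and the table covers only the rules above that name an explicit message and direction.

```python
from enum import Enum


class Action(Enum):
    REPLY_NOW = "send immediate TEST LINK response"
    QUEUE = "queue and process after the current exchange completes"
    PROCESS_NOW = "process now; the client redoes CONFIRM RKEY later"
    NEGATIVE_RSP = "send negative CONFIRM RKEY response"
    CLEAN_UP_GROUP = "immediately clean up the link group"


# (exchange in progress, colliding message, direction) -> action.
# The direction "any" marks rules that apply to either peer.
RULES = {
    ("DELETE LINK", "TEST LINK", "any"): Action.REPLY_NOW,
    ("DELETE LINK", "ADD LINK", "client->server"): Action.QUEUE,
    ("DELETE LINK", "DELETE LINK (link)", "client->server"): Action.QUEUE,
    ("DELETE LINK", "DELETE LINK (group)", "any"): Action.CLEAN_UP_GROUP,
    ("DELETE LINK", "CONFIRM RKEY", "client->server"): Action.NEGATIVE_RSP,
    ("CONFIRM RKEY", "TEST LINK", "any"): Action.REPLY_NOW,
    ("CONFIRM RKEY", "ADD LINK", "client->server"): Action.QUEUE,
    ("CONFIRM RKEY", "ADD LINK", "server->client"): Action.PROCESS_NOW,
    ("CONFIRM RKEY", "DELETE LINK (link)", "client->server"): Action.QUEUE,
    ("CONFIRM RKEY", "DELETE LINK (link)", "server->client"): Action.PROCESS_NOW,
    ("CONFIRM RKEY", "DELETE LINK (group)", "any"): Action.CLEAN_UP_GROUP,
}


def resolve(exchange, message, direction):
    """Look up the action, falling back to a direction-independent rule."""
    for key in ((exchange, message, direction), (exchange, message, "any")):
        if key in RULES:
            return RULES[key]
    raise KeyError(f"no collision rule for {message} during {exchange}")
```

Keeping the rules in data rather than in nested conditionals makes it easier to compare an implementation against the text rule by rule.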

4. SMC-R Memory-Sharing Architecture
4.1. RMB Element Allocation Considerations

Each TCP connection using SMC-R must be allocated an RMBE by each SMC-R peer. This allocation is performed by each endpoint independently to allow each endpoint to select an RMBE that best matches the characteristics of its TCP socket endpoint. The RMBE associated with a TCP socket endpoint must have a receive buffer that is at least as large as the TCP receive buffer size in effect for that connection. The receive buffer size can be determined by what is specified explicitly by the application using setsockopt() or implicitly via the system-configured default value. This allows the SMC-R peer to RDMA-write enough data to fill an entire receive buffer on a given data flow. Given that each RMB must have fixed-length RMBEs, this implies that an SMC-R endpoint may need to maintain multiple RMBs of various sizes for SMC-R connections on a given SMC-R link and can then select an RMBE that most closely fits a connection.
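
As an illustration of the selection rule, the following Python sketch picks the smallest available RMBE receive buffer size that is at least as large as the connection's TCP receive buffer. The function name pick_rmbe_size and the example sizes are assumptions made for illustration; the text does not prescribe any particular set of RMB sizes.

```python
import bisect


def pick_rmbe_size(rcvbuf_bytes, rmbe_sizes):
    """Return the smallest RMBE receive buffer size >= rcvbuf_bytes.

    rmbe_sizes holds the element sizes of the RMBs that this endpoint
    maintains (hypothetical values; each RMB has fixed-length RMBEs).
    """
    sizes = sorted(rmbe_sizes)
    i = bisect.bisect_left(sizes, rcvbuf_bytes)
    if i == len(sizes):
        raise ValueError("no RMB has elements large enough for this connection")
    return sizes[i]
```

On a socket API host, rcvbuf_bytes would typically come from the SO_RCVBUF value in effect for the connection.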

4.2. RMB and RMBE Format

An RMB is a virtual memory buffer whose backing real memory is pinned. The RMB is subdivided into a whole number of equal-sized RMB Elements (RMBEs). Each RMBE begins with a 4-byte eye catcher for diagnostic and service purposes, followed by the receive data buffer. The contents of this diagnostic eye catcher are implementation dependent and should be used by the local SMC-R peer to check for overlay errors by verifying an intact eye catcher with every RMBE access.

The RMBE is a wrapping receive buffer for receiving RDMA writes from the peer. Cursors, as described below, are exchanged between peers to manage and track RDMA writes and local data reads from the RMBE for a TCP connection.
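
The layout described above can be sketched as a small offset calculation. This Python example assumes that the element length passed in includes the 4-byte eye catcher; the function name rmbe_offsets is hypothetical.

```python
EYE_CATCHER_LEN = 4  # bytes, as stated in the text above


def rmbe_offsets(rmb_len, rmbe_len):
    """Return (element_start, data_start, element_end) for each RMBE.

    Assumes rmbe_len counts the eye catcher plus the receive data buffer.
    """
    if rmbe_len <= EYE_CATCHER_LEN or rmb_len % rmbe_len != 0:
        raise ValueError("RMB must hold a whole number of equal-sized RMBEs")
    return [
        (start, start + EYE_CATCHER_LEN, start + rmbe_len)
        for start in range(0, rmb_len, rmbe_len)
    ]
```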

4.3. RMBE Control Information

RMBE control information consists of consumer cursors, producer cursors, wrap counts, CDC message sequence numbers, control flags such as urgent data and "writer blocked" indicators, and TCP connection information such as termination flags. This information is exchanged between SMC-R peers using CDC messages, which are passed using RoCE SendMsg. A TCP/IP stack implementing SMC-R must receive and store this information in its internal data structures, as it is used to manage the RMBE and its data buffer.

The format and contents of the CDC message are described in detail in Appendix A.4 ("Connection Data Control (CDC) Message Format"). The following is a high-level description of what this control information contains.

o Connection state flags such as sending done, connection closed, failover data validation, and abnormal close.

o A sequence number that is managed by the sender. This sequence number starts at 1, is incremented on each send, and wraps to 0. It tracks the CDC messages sent and is not related to the number of bytes sent. It is used for failover data validation.

o Producer cursor: a wrapping offset into the receiver's RMBE data area. Set by the peer that is writing into the RMBE, it points to where the writing peer will write the next byte of data into an RMBE. This cursor is accompanied by a wrap sequence number to help the RMBE owner (the receiver) identify full window size wrapping writes. Note that this cursor must account for (i.e., skip over) the RMBE eye catcher that is in the beginning of the data area.

o Consumer cursor: a wrapping offset into the receiver's RMBE data area. Set by the owner of the RMBE (the peer that is reading from it), this cursor points to the offset of the next byte of data to be consumed by the peer in its own RMBE. The sender cannot write beyond this cursor into the receiver's RMBE without causing data loss. Like the producer cursor, this is accompanied by a wrap count to help the writer identify full window size wrapping reads. Note that this cursor must account for (i.e., skip over) the RMBE eye catcher that is in the beginning of the data area.

o Data flags such as urgent data, writer blocked indicator, and cursor update requests.
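
The cursor and sequence number behavior listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the wire format: the Cursor class and the function names are hypothetical, the RMBE length is assumed to include the eye catcher, the wrap count is left unbounded, and the 16-bit sequence number width is an assumption that should be checked against Appendix A.4.

```python
from dataclasses import dataclass

EYE_CATCHER_LEN = 4  # cursors skip the eye catcher at the start of the RMBE


@dataclass
class Cursor:
    offset: int = EYE_CATCHER_LEN  # first usable byte follows the eye catcher
    wrap: int = 0                  # wrap count (width not modeled here)

    def advance(self, nbytes, rmbe_len):
        """Move forward by nbytes, wrapping to just past the eye catcher."""
        data_len = rmbe_len - EYE_CATCHER_LEN
        pos = (self.offset - EYE_CATCHER_LEN) + nbytes
        self.wrap += pos // data_len
        self.offset = EYE_CATCHER_LEN + pos % data_len


def next_cdc_seq(seq, width_bits=16):
    """CDC sequence number: incremented per send, wrapping to 0.

    width_bits is an assumption for this sketch.
    """
    return (seq + 1) % (1 << width_bits)
```

The wrap count is what lets a peer distinguish an empty buffer from a completely full one when both cursors show the same offset.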

4.4. Use of RMBEs
4.4.1. Initializing and Accessing RMBEs

The RMBE eye catcher is initialized by the RMB owner prior to assigning it to a specific TCP connection and communicating its RMB index to the SMC-R partner. After an RMBE index is communicated to the SMC-R partner, the RMBE can only be referenced in "read-only mode" by the owner, and all updates to it are performed by the remote SMC-R partner via RDMA write operations.

Initialization of an RMBE must include the following:

o Zeroing out the entire RMBE receive buffer, which helps minimize data integrity issues (e.g., data from a previous connection somehow being presented to the current connection).

o Setting the beginning RMBE eye catcher. This eye catcher plays an important role in helping detect accidental overlays of the RMBE. The RMB owner should always validate these eye catchers before each new reference to the RMBE. If the eye catchers are found to be corrupted, the local host must reset the TCP connection associated with this RMBE and log the appropriate diagnostic information.
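
A minimal Python sketch of the two initialization steps and the validation check follows. The eye catcher value b"RMBE" is a placeholder chosen for this example, since the contents are implementation dependent, and the function names are hypothetical.

```python
EYE_CATCHER = b"RMBE"  # hypothetical value; contents are implementation dependent


def init_rmbe(rmb, start, rmbe_len):
    """Zero one RMBE inside the RMB bytearray, then stamp its eye catcher."""
    rmb[start:start + rmbe_len] = bytes(rmbe_len)
    rmb[start:start + len(EYE_CATCHER)] = EYE_CATCHER


def eye_catcher_intact(rmb, start):
    """Check to run before each new reference to the RMBE."""
    return bytes(rmb[start:start + len(EYE_CATCHER)]) == EYE_CATCHER
```

If eye_catcher_intact() returns False, the owner would reset the associated TCP connection and log diagnostics, as required above.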

4.4.2. RMB Element Reuse and Conflict Resolution

RMB elements can be reused once their associated TCP and SMC-R connections are terminated. Under normal and abnormal SMC-R connection termination processing, both SMC-R peers must explicitly acknowledge that they are done using an RMBE before that element can be freed and reassigned to another SMC-R connection instance. For more details on SMC-R connection termination, refer to Section 4.8.

However, there are some error scenarios where this two-way explicit acknowledgment may not be completed. In these scenarios, an RMBE owner may choose to reassign this RMBE to a new SMC-R connection instance on this SMC-R link group. When this occurs, the partner SMC-R peer must detect this condition during SMC-R Rendezvous processing when presented with an RMBE that it believes is already in use for a different SMC-R connection. In this case, the SMC-R peer must abort the existing SMC-R connection associated with this RMBE. The abort processing resets the TCP connection (if it is still active), but it must not attempt to perform any RDMA writes to this RMBE and must also ignore any data sitting in the local RMBE associated with the existing connection. It then proceeds to free up the local RMBE and notify the local application that the connection is being abnormally reset.

The remote SMC-R peer then proceeds to normal processing for this new SMC-R connection.

4.5. SMC-R Protocol Considerations

The following sections describe considerations for the SMC-R protocol as compared to TCP.

4.5.1. SMC-R Protocol Optimized Window Size Updates

An SMC-R receiver host sends its consumer cursor information to the sender to convey the progress that the receiving application has made in consuming the sent data. The difference between the writer's producer cursor and the associated receiver's consumer cursor indicates the window size available for the sender to write into. This is somewhat similar to TCP window update processing and therefore raises similar considerations, such as silly window syndrome avoidance: TCP minimizes the overhead of very small, unproductive window size updates caused by suboptimal socket applications that consume very small amounts of data on every receive() invocation. For SMC-R, the receiver only updates its consumer cursor via a unique CDC message under the following conditions:

o The current window size (from a sender's perspective) is less than half of the receive buffer space, and the consumer cursor update will result in a minimum increase in the window size of 10% of the receive buffer space. Some examples:

a. Receive buffer size: 64K, current window size (from a sender's perspective): 50K. No need to update the consumer cursor. Plenty of space is available for the sender.

b. Receive buffer size: 64K, current window size (from a sender's perspective): 30K, current window size from a receiver's perspective: 31K. No need to update the consumer cursor; even though the sender's window size is < 1/2 of the 64K, the window update would only increase that by 1K, which is < 1/10th of the 64K buffer size.

c. Receive buffer size: 64K, current window size (from a sender's perspective): 30K, current window size from a receiver's perspective: 64K. The receiver updates the consumer cursor (sender's window size is < 1/2 of the 64K; the window update would increase that by > 6.4K).

o The receiver must always include a consumer cursor update whenever it sends a CDC message to the partner for another flow (i.e., send flow in the opposite direction). This allows the window size update to be delivered with no additional overhead. This is somewhat similar to TCP DelayAck processing and quite effective for request/response data patterns.

o If a peer has set the B-bit in a CDC message, then any consumption of data by the receiver causes a CDC message to be sent, updating the consumer cursor until a CDC message with that bit cleared is received from the peer.

o The optimized window size updates are overridden when the sender sets the Consumer Cursor Update Requested flag in a CDC message to the receiver. When this indicator is on, the consumer must send a consumer cursor update immediately when data is consumed by the local application or if the cursor has not been updated for a while (i.e., local copy of the consumer cursor does not match the last consumer cursor value sent to the partner). This allows the sender to perform optional diagnostics for detecting a stalled receiver application (data has been sent but not consumed). It is recommended that the Consumer Cursor Update Requested flag only be sent for diagnostic procedures, as it may result in non-optimal data path performance.
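
The conditions above for sending a standalone consumer cursor update can be sketched as a single predicate. In this Python illustration, the function name and parameters are hypothetical; writer_blocked stands for the B-bit set by the peer, and update_requested stands for the Consumer Cursor Update Requested flag. The piggybacked update on CDC messages for the opposite flow is not modeled, since it adds no extra message.

```python
def must_send_cursor_update(buf_size, sender_window, receiver_window,
                            writer_blocked=False, update_requested=False):
    """Decide whether a standalone consumer cursor update is warranted.

    sender_window:   window size as the sender currently sees it
    receiver_window: window size the sender would see after the update
    """
    gain = receiver_window - sender_window  # space freed but not yet advertised
    if gain <= 0:
        return False  # nothing new to report
    if writer_blocked or update_requested:
        return True   # either flag overrides the optimization
    # Window below half the buffer, and the update adds at least 10% of it.
    return sender_window < buf_size / 2 and gain >= buf_size / 10
```

With a 64K buffer, this reproduces examples a, b, and c above: no update at a sender window of 50K, no update when 30K would grow only to 31K, and an update when 30K would grow to 64K.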

4.5.2. Small Data Sends

The SMC-R protocol makes no special provisions for handling small data segments sent across a stream socket. Data is always sent if sufficient window space is available. In contrast to the TCP Nagle algorithm, there are no special provisions in SMC-R for coalescing small data segments.

An implementation of SMC-R can be configured to optimize its sending processing by coalescing outbound data for a given SMC-R connection so that it can reduce the number of RDMA write operations it performs, in a fashion similar to Nagle's algorithm. However, any such coalescing would require a timer on the sending host that would ensure that data was eventually sent. Also, the sending host would have to opt out of this processing if Nagle's algorithm had been disabled (programmatically or via system configuration).

4.5.3. TCP Keepalive Processing

TCP keepalive processing allows applications to direct the local TCP/IP host to periodically "test" the viability of an idle TCP connection. Since SMC-R connections have a TCP representation along with an SMC-R representation, there are unique keepalive processing considerations:

o SMC-R-layer keepalive processing: If keepalive is enabled for an SMC-R connection, the local host maintains a keepalive timer that reflects how long an SMC-R connection has been idle. The local host also maintains a timestamp of last activity for each SMC-R link (for any SMC-R connection on that link). When it is determined that an SMC-R connection has been idle longer than the keepalive interval, the host checks to see whether or not the SMC-R link has been idle for a duration longer than the keepalive timeout. If both conditions are met, the local host then performs a TEST LINK LLC command to test the viability of the SMC-R link over the RoCE fabric (RC-QPs). If a TEST LINK LLC command response is received within a reasonable amount of time, then the link is considered viable, and all connections using this link are considered viable as well. If, however, a response is not received in a reasonable amount of time or there's a failure in sending the TEST LINK LLC command, then this is considered a failure in the SMC-R link, and failover processing to an alternate SMC-R link must be triggered. If no alternate SMC-R link exists in the SMC-R link group, then all of the SMC-R connections on this link are abnormally terminated by resetting the TCP connections represented by these SMC-R connections. Given that multiple SMC-R connections can share the same SMC-R link, implementing an SMC-R link-level probe using the TEST LINK LLC command will help reduce the amount of unproductive keepalive traffic for SMC-R connections; as long as some SMC-R connections on a given SMC-R link are active (i.e., have had I/O activity within the keepalive interval), then there is no need to perform additional link viability testing.

o TCP-layer keepalive processing: Traditional TCP "keepalive" packets are not as relevant for SMC-R connections, given that the TCP path is not used for these connections once the SMC-R Rendezvous processing is completed. All SMC-R connections by default have associated TCP connections that are idle. Are TCP keepalive probes still needed for these connections? There are two main scenarios to consider:

1. TCP keepalives that are used to determine whether or not the peer TCP endpoint is still active. This is not needed for SMC-R connections, as the SMC-R-level keepalives mentioned above will determine whether or not the remote endpoint connections are still active.

2. TCP keepalives that are used to ensure that TCP connections traversing an intermediate proxy maintain an active state. For example, stateful firewalls typically maintain state representing every valid TCP connection that traverses the firewall. These types of firewalls are known to expire idle connections by removing their state in the firewall to conserve memory. TCP keepalives are often used in this scenario to prevent firewalls from timing out otherwise idle connections. When using SMC-R, both endpoints must reside in the same Layer 2 network (i.e., the same subnet). As a result, firewalls cannot be injected in the path between two SMC-R endpoints. However, other intermediate proxies, such as TCP/IP-layer load balancers, may be injected in the path of two SMC-R endpoints. These types of load balancers also maintain connection state so that they can forward TCP connection traffic to the appropriate cluster endpoint. When using SMC-R, these TCP connections will appear to be completely idle, making them susceptible to potential timeouts at the load-balancing proxy. As a result, for this scenario, TCP keepalives may still be relevant.
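
The SMC-R-layer keepalive decision described above can be sketched as two small functions. This Python example is an illustration only: the function names and return strings are hypothetical, timestamps are plain numbers in seconds, and a single interval value is assumed for both the connection idle check and the link idle check.

```python
def keepalive_action(now, conn_last_activity, link_last_activity, interval):
    """Decide what to do when a connection's keepalive timer is examined."""
    if now - conn_last_activity <= interval:
        return "none"            # the connection itself is not idle
    if now - link_last_activity <= interval:
        return "none"            # other connections show the link is alive
    return "send TEST LINK"      # both idle: probe the SMC-R link


def on_test_link_result(response_received, alternate_link_available):
    """Handle the outcome of a TEST LINK LLC command."""
    if response_received:
        return "link viable"
    if alternate_link_available:
        return "fail over"
    return "reset TCP connections"
```

Because the probe is issued per link rather than per connection, many idle connections sharing one active link generate no probe traffic at all.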

The following are the TCP-level keepalive processing requirements for SMC-R-enabled hosts:

o SMC-R peers should allow TCP keepalives to flow on the TCP path of SMC-R connections based on existing TCP keepalive configuration and programming options. However, it is strongly recommended that platforms provide the ability to specify very granular keepalive timers (for example, single-digit-second timers) and that they consider providing a configuration option that limits the minimum keepalive timer that will be used for TCP-layer keepalives on SMC-R connections. This is important to minimize the number of TCP keepalive packets transmitted in the network for SMC-R connections.

o SMC-R peers must always respond to inbound TCP-layer keepalives (by sending ACKs for these packets) even if the connection is using SMC-R. Typically, once a TCP connection has completed the SMC-R Rendezvous processing and is using SMC-R for data flows, no new inbound TCP segments are expected on that TCP connection, other than TCP termination segments (FIN, RST, etc.). TCP keepalives are the one exception that must be supported. Also, since TCP keepalive probes do not carry any application-layer data, this has no adverse impact on the application's inbound data stream.

4.6. TCP Connection Failover between SMC-R Links

A peer may change which SMC-R link within a link group it sends its writes over in the event of a link failure. Since each peer independently chooses which link to send writes over for a specific TCP connection, this process is done independently by each peer.

4.6.1. Validating Data Integrity

Even though RoCE is a reliable transport, there is a small subset of failure modes that could cause unrecoverable loss of data. When an RNIC acknowledges receipt of an RDMA write to its peer, that creates a write completion event to the sending peer, which allows the sender to release any buffers it is holding for that write. In normal operation and in most failures, this operation is reliable.

However, there are failure modes possible in which a receiving RNIC has acknowledged an RDMA write but then was not able to place the received data into its host memory -- for example, a sudden, disorderly failure of the interface between the RNIC and the host. While rare, these types of events must be guarded against to ensure data integrity. The process for switching SMC-R links during failover, as described in this section, guards against this possibility and is mandatory.

Each peer must track the current state of the CDC sequence numbers for a TCP connection. The sender must keep track of the sequence number of the CDC message that described the last write acknowledged by the peer RNIC, or Sequence Sent (SS). In other words, SS describes the last write that the sender believes its peer has successfully received. The receiver must keep track of the sequence number of the CDC message that described the last write that it has successfully received (i.e., the data has been successfully placed into an RMBE), or Sequence Received (SR).

When an RNIC fails and the sender changes SMC-R links, the sender must first send a CDC message with the F-bit (failover validation indicator; see Appendix A.4) set over the new SMC-R link. This is the failover data validation message. The sequence number in this CDC message is equal to SS. The CDC message key, the length, and the SMC-R alert token are the only other fields in this CDC message that are significant. No reply is expected from this validation message, and once the sender has sent it, the sender may resume sending on the new SMC-R link as described in Section 4.6.2.

Upon receipt of the failover validation message, the receiver must verify that its SR value for the TCP connection is equal to or greater than the sequence number in the failover validation message. If so, no further action is required, and the TCP connection resumes on the new SMC-R link. If SR is less than the sequence number value in the validation message, data has been lost, and the receiver must immediately reset the TCP connection.
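
The validation exchange can be sketched in Python as follows. The dictionary layout and function names are hypothetical and do not reflect the CDC wire format of Appendix A.4. The comparison is the plain one stated above; sequence number wraparound is not handled in this sketch.

```python
def failover_validation_message(ss, alert_token):
    """Sender side: CDC message with the F-bit set, carrying SS."""
    return {"f_bit": True, "seq": ss, "alert_token": alert_token}


def validate_failover(sr, message):
    """Receiver side: compare SR with the sequence number in the message."""
    if not message["f_bit"]:
        raise ValueError("not a failover validation message")
    # SR >= sequence number: everything the sender believes was delivered
    # really was placed into the RMBE, so the connection can resume.
    return "resume" if sr >= message["seq"] else "reset"
```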

4.6.2. Resuming the TCP Connection on a New SMC-R Link

When a connection is moved to a new SMC-R link and the failover validation message has been sent, the sender can immediately resume normal transmission. In order to preserve the application message stream, the sender must replay any RDMA writes (and their associated CDC messages) that were in progress or failed when the previous SMC-R link failed, before sending new data on the new SMC-R link. The sender has two options for accomplishing this:

o Preserve the sequence numbers "as is": Retry all failed and pending operations as they were originally done, including reposting all associated RDMA write operations and their associated CDC messages without making any changes. Then resume sending new data using new sequence numbers.

o Combine pending messages and possibly add new data: Combine failed and pending messages into a single new write with a new sequence number. This allows the sender to combine pending messages into fewer operations. As a further optimization, this write can also include new data, as long as all failed and pending data are also included. If this approach is taken, the sequence number must be increased beyond the last failed or pending sequence number.
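
The two replay options can be contrasted with a short Python sketch. Pending work is modeled here as a list of (sequence number, data) pairs in original send order; this representation, the function names, and the 16-bit sequence number modulus are assumptions made for illustration.

```python
def replay_as_is(pending):
    """Option 1: repost each (seq, data) pair unchanged, in original order."""
    return list(pending)


def replay_combined(pending, new_data=b"", seq_mod=1 << 16):
    """Option 2: one write carrying all pending data, optionally followed
    by new data, under a sequence number beyond the last pending one."""
    if not pending:
        raise ValueError("nothing to replay")
    last_seq = pending[-1][0]
    data = b"".join(chunk for _, chunk in pending) + new_data
    return [((last_seq + 1) % seq_mod, data)]
```

Option 2 trades fewer RDMA operations for the need to rebuild the write; in both cases, all failed and pending data reaches the peer before any data that was never sent.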

4.7. RMB Data Flows

The following sections describe the RDMA wire flows for the SMC-R protocol after a TCP connection has switched into SMC-R mode (i.e., SMC-R Rendezvous processing is complete and a pair of RMB elements has been assigned and communicated by the SMC-R peers). The ladder diagrams below include the following:

o RMBE control information kept by each peer. Only a subset of the information is depicted, specifically only the fields that reflect the stream of data written by Host A and read by Host B.

o Time line 0-x, which shows the wire flows in a time-relative fashion.

o Note that RMBE control information is only shown in a time interval if its value changed (otherwise, assume that the value is unchanged from the previously depicted value).

o The local copy of the producer cursors and consumer cursors that is maintained by each host is not depicted in these figures. Note that the cursor values in the diagram reflect the necessity of skipping over the eye catcher in the RMBE data area. They start and wrap at 4, not 0.

4.7.1. Scenario 1: Send Flow, Window Size Unconstrained
            SMC Host A                             SMC Host B
           RMBE A Info                            RMBE B Info
       (Consumer Cursors)                      (Producer Cursors)
   Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
   4        0         0                  0    4        0          0
   4        0         1 ---------------> 1    4        0          0
                        RDMA-WR Data
                          (4:1003)
   4        0         2 ...............> 2    1004     0          0
                        CDC Message

Figure 16: Scenario 1: Send Flow, Window Size Unconstrained

Scenario assumptions:

o Kernel implementation.

o New SMC-R connection; no data has been sent on the connection.

o Host A: The application issues a send() for 1000 bytes to Host B.

o Host B: RMBE receive buffer size is 10,000; application has issued a recv for 10,000 bytes.

Flow description:

1. The application issues a send() for 1000 bytes; the SMC-R layer copies data into a kernel send buffer. It then schedules an RDMA write operation to move the data into the peer's RMBE receive buffer, at relative position 4-1003 (to skip the 4-byte eye catcher in the RMBE data area). Note that no immediate data or alert (i.e., interrupt) is provided to Host B for this RDMA operation.

2. Host A sends a CDC message to update the producer cursor to byte 1004. This CDC message will deliver an interrupt to Host B. At this point, the SMC-R layer can return control back to the application. Host B, once notified of the completion of the previous RDMA operation, locates the RMBE associated with the RMBE alert token that was included in the message and proceeds to perform normal receive-side processing, waking up the suspended application read thread, copying the data into the application's receive buffer, etc. It will use the producer cursor as an indicator of how much data is available to be delivered to the local application. After this processing is complete, the SMC-R layer will also update its local consumer cursor to match the producer cursor (i.e., indicating that all data has been consumed). Note that a message to the peer updating the consumer cursor is not needed at this time, as the window size is unconstrained (> 1/2 of the receive buffer size). The window size is calculated by taking the difference between the producer cursor and the consumer cursor in the RMBEs (10,000 - 1004 = 8996).
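The cursor arithmetic in this flow can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the protocol definition: the helper name place_write and the constant EYE_CATCHER_LEN are hypothetical, and writes that reach the end of the RMBE (wrap-around) are deliberately left out.

```python
EYE_CATCHER_LEN = 4  # cursors start and wrap at 4, not 0


def place_write(producer_cursor: int, length: int, rmbe_size: int):
    """Return ((first, last), new_cursor) for a write that stays inside the RMBE.

    Hypothetical helper; wrap-around is out of scope for this sketch.
    """
    first = producer_cursor
    last = first + length - 1
    if last + 1 >= rmbe_size:
        raise ValueError("write reaches the end of the RMBE; wrap not handled here")
    return (first, last), last + 1


span, cursor = place_write(EYE_CATCHER_LEN, 1000, 10_000)
print(span, cursor)       # (4, 1003) 1004
print(10_000 - cursor)    # 8996, the figure quoted in step 2
```

The new producer cursor is one byte past the last byte written, which is why the CDC message in step 2 carries 1004 rather than 1003.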

4.7.2. Scenario 2: Send/Receive Flow, Window Size Unconstrained
             SMC Host A                             SMC Host B
            RMBE A Info                            RMBE B Info
        (Consumer Cursors)                      (Producer Cursors)
    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
    4        0         0                  0    4        0          0
    0        0         1 ---------------> 1    0        0          0
                         RDMA-WR Data
                           (4:1003)
    4        0         2 ...............> 2    1004     0          0
                         CDC Message
        
    0        0         3 <--------------  3    1004     0          0
                         RDMA-WR Data
                           (4:503)
    1004     0         4 <..............  4    1004     0          0
                          CDC Message
        
Figure 17: Scenario 2: Send/Receive Flow, Window Size Unconstrained

Scenario assumptions:

o New SMC-R connection; no data has been sent on the connection.

o Host A: Application issues send for 1000 bytes to Host B.

o Host B: RMBE receive buffer size is 10,000; application has already issued a recv for 10,000 bytes. Once the receive is completed, the application sends a 500-byte response to Host A.

Flow description:

1. The application issues a send() for 1000 bytes; the SMC-R layer copies data into a kernel send buffer. It then schedules an RDMA write operation to move the data into the peer's RMBE receive buffer, at relative position 4-1003. Note that no immediate data or alert (i.e., interrupt) is provided to Host B for this RDMA operation.

2. Host A sends a CDC message to update the producer cursor to byte 1004. This CDC message will deliver an interrupt to Host B. At this point, the SMC-R layer can return control back to the application.

3. Host B, once notified of the receipt of the previous CDC message, locates the RMBE associated with the RMBE alert token and proceeds to perform normal receive-side processing, waking up the suspended application read thread, copying the data into the application's receive buffer, etc. After this processing is complete, the SMC-R layer will also update its local consumer cursor to match the producer cursor (i.e., indicating that all data has been consumed). Note that an update of the consumer cursor to the peer is not needed at this time, as the window size is unconstrained (> 1/2 of the receive buffer size). The application then performs a send() for 500 bytes to Host A. The SMC-R layer will copy the data into a kernel buffer and then schedule an RDMA write into the partner's RMBE receive buffer. Note that this RDMA write operation includes no immediate data or notification to Host A.

4. Host B sends a CDC message to update the partner's RMBE control information with the latest producer cursor (set to 504 and not shown in the diagram above) and to also inform the peer that the consumer cursor value is now 1004. It also updates the local current consumer cursor and the last sent consumer cursor to 1004. This CDC message includes notification, since we are updating our producer cursor; this requires attention by the peer host.
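The distinction between the local consumer cursor and the last sent consumer cursor in steps 3 and 4 can be illustrated as follows. This is a sketch under stated assumptions: the class ReceiveSide, its attribute names, and the dictionary it returns are hypothetical and do not reflect the CDC wire format.

```python
class ReceiveSide:
    """Hypothetical per-connection state kept by the receiving host."""

    def __init__(self, start_cursor: int = 4):
        self.consumer_cursor = start_cursor            # how far the local application has read
        self.last_sent_consumer_cursor = start_cursor  # value the peer last heard from us

    def consume_up_to(self, producer_cursor: int) -> None:
        # Step 3: all data up to the producer cursor was copied to the application.
        self.consumer_cursor = producer_cursor

    def build_cdc_fields(self) -> dict:
        # Step 4: an outgoing CDC message carries the current consumer cursor,
        # so the "last sent" copy catches up at the same time.
        self.last_sent_consumer_cursor = self.consumer_cursor
        return {"consumer_cursor": self.consumer_cursor}


host_b = ReceiveSide()
host_b.consume_up_to(1004)
print(host_b.last_sent_consumer_cursor)   # 4: the peer has not been told yet
fields = host_b.build_cdc_fields()
print(fields["consumer_cursor"], host_b.last_sent_consumer_cursor)   # 1004 1004
```

Because Host B has to send a CDC message for its own producer cursor anyway, the consumer cursor update rides along without requiring a separate message.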

4.7.3. Scenario 3: Send Flow, Window Size Constrained
             SMC Host A                             SMC Host B
            RMBE A Info                            RMBE B Info
        (Consumer Cursors)                      (Producer Cursors)
    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
    4        0         0                  0    4        0          0
    4        0         1 ---------------> 1    4        0          0
                         RDMA-WR Data
                           (4:3003)
    4        0         2 ...............> 2    3004     0          0
                         CDC Message
    4        0         3                  3    3004     0          0
    4        0         4 ---------------> 4    3004     0          0
                         RDMA-WR Data
                           (3004:7003)
    4        0         5 ................> 5   7004     0          0
                         CDC Message
    7004     0         6 <................ 6   7004     0          0
                         CDC Message
        

Figure 18: Scenario 3: Send Flow, Window Size Constrained

Scenario assumptions:

o New SMC-R connection; no data has been sent on this connection.

o Host A: Application issues send for 3000 bytes to Host B and then another send for 4000 bytes.

o Host B: RMBE receive buffer size is 10,000. Application has already issued a recv for 10,000 bytes.

Flow description:

1. The application issues a send() for 3000 bytes; the SMC-R layer copies data into a kernel send buffer. It then schedules an RDMA write operation to move the data into the peer's RMBE receive buffer, at relative position 4-3003. Note that no immediate data or alert (i.e., interrupt) is provided to Host B for this RDMA operation.

2. Host A sends a CDC message to update its producer cursor to byte 3004. This CDC message will deliver an interrupt to Host B. At this point, the SMC-R layer can return control back to the application.

3. Host B, once notified of the receipt of the previous CDC message, locates the RMBE associated with the RMBE alert token and proceeds to perform normal receive-side processing, waking up the suspended application read thread, copying the data into the application's receive buffer, etc. After this processing is complete, the SMC-R layer will also update its local consumer cursor to match the producer cursor (i.e., indicating that all data has been consumed). It will not, however, update the partner with this information, as the window size is not constrained (10,000 - 3000 = 7000 bytes of available space). The application on Host B also issues a new recv() for 10,000 bytes.

4. On Host A, the application issues a send() for 4000 bytes. The SMC-R layer copies the data into a kernel buffer and schedules an async RDMA write into the peer's RMBE receive buffer at relative position 3004-7003. Note that no alert is provided to Host B for this flow.

5. Host A sends a CDC message to update the producer cursor to byte 7004. This CDC message will deliver an interrupt to Host B. At this point, the SMC-R layer can return control back to the application.

6. Host B, once notified of the receipt of the previous CDC message, locates the RMBE associated with the RMBE alert token and proceeds to perform normal receive-side processing, waking up the suspended application read thread, copying the data into the application's receive buffer, etc. After this processing is complete, the SMC-R layer will also update its local consumer cursor to match the producer cursor (i.e., indicating that all data has been consumed). It will then determine whether or not it needs to update the consumer cursor to the peer. The available window size is now 3000 (10,000 - (producer cursor - last sent consumer cursor)), which is < 1/2 of the receive buffer size (10,000/2 = 5000), and the advance of the window size is > 10% of the window size (1000). Therefore, a CDC message is issued to update the consumer cursor to Peer A.
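The decision made in steps 3 and 6 can be sketched as a small predicate. This is a minimal sketch, assuming no cursor wrap and reading "10% of the window size (1000)" as 10% of the receive buffer size; the function and parameter names are hypothetical.

```python
def consumer_update_needed(rmbe_size: int, producer_cursor: int,
                           consumer_cursor: int,
                           last_sent_consumer_cursor: int) -> bool:
    # Free space as the peer currently believes it to be.
    peer_view_window = rmbe_size - (producer_cursor - last_sent_consumer_cursor)
    # How far the window would open if the peer learned the real consumer cursor.
    advance = consumer_cursor - last_sent_consumer_cursor
    constrained = 2 * peer_view_window < rmbe_size   # window < 1/2 of the buffer
    worthwhile = 10 * advance > rmbe_size            # advance > 10% of the buffer
    return constrained and worthwhile


print(consumer_update_needed(10_000, 3004, 3004, 4))   # False (step 3: window 7000)
print(consumer_update_needed(10_000, 7004, 7004, 4))   # True  (step 6: window 3000)
```

Both conditions must hold: the first avoids CDC traffic while the sender still has ample room, and the second avoids sending updates that would open the window only marginally.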

4.7.4. Scenario 4: Large Send, Flow Control, Full Window Size Writes
             SMC Host A                             SMC Host B
            RMBE A Info                            RMBE B Info
        (Consumer Cursors)                      (Producer Cursors)
    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flags
    1004     1         0                  0    1004     1          0
    1004     1         1 ---------------> 1    1004     1          0
                         RDMA-WR Data
                           (1004:9999)
    1004     1         2 ---------------> 2    1004     1          0
                         RDMA-WR Data
                           (4:1003)
    1004     1         3 ...............> 3    1004     2          Wrt
                         CDC Message                               Blk
        
    1004     2         4 <............... 4    1004     2          Wrt
                         CDC Message                               Blk
        
    1004     2         5 ---------------> 5    1004     2          Wrt
                         RDMA-WR Data                              Blk
                           (1004:9999)
    1004     2         6 ---------------> 6    1004     2          Wrt
                         RDMA-WR Data                              Blk
                          (4:1003)
    1004     2         7 ...............> 7    1004     3          Wrt
                         CDC Message                               Blk
        
    1004     3         8 <............... 8    1004     3          Wrt
                         CDC Message                               Blk
        

Figure 19: Scenario 4: Large Send, Flow Control, Full Window Size Writes

Scenario assumptions:

o Kernel implementation.

o Existing SMC-R connection; Host B's receive window size is fully open (peer consumer cursor = peer producer cursor).

o Host A: Application issues send for 20,000 bytes to Host B.

o Host B: RMBE receive buffer size is 10,000; application has issued a recv for 10,000 bytes.

Flow description:

1. The application issues a send() for 20,000 bytes; the SMC-R layer copies data into a kernel send buffer (assumes that send buffer space of 20,000 is available for this connection). It then schedules an RDMA write operation to move the data into the peer's RMBE receive buffer, at relative position 1004-9999. Note that no immediate data or alert (i.e., interrupt) is provided to Host B for this RDMA operation.

2. Host A then schedules an RDMA write operation to fill the remaining 1000 bytes of available space in the peer's RMBE receive buffer, at relative position 4-1003. Note that no immediate data or alert (i.e., interrupt) is provided to Host B for this RDMA operation. Also note that an implementation of SMC-R may optimize this processing by combining steps 1 and 2 into a single RDMA write operation (with two different data sources).

3. Host A sends a CDC message to update the producer cursor to byte 1004. Since the entire receive buffer space is filled, the producer writer blocked flag (the "Wrt Blk" indicator (flag) in Figure 19) is set and the producer cursor wrap sequence number (the producer "Wrap Seq#" in Figure 19) is incremented. This CDC message will deliver an interrupt to Host B. At this point, the SMC-R layer can return control back to the application.

4. Host B, once notified of the receipt of the previous CDC message, locates the RMBE associated with the RMBE alert token and proceeds to perform normal receive-side processing, waking up the suspended application read thread, copying the data into the application's receive buffer, etc. In this scenario, Host B notices that the producer cursor has not been advanced (same value as the consumer cursor); however, it notices that the producer cursor wrap sequence number is different from its local value (1), indicating that a full window of new data is available. All of the data in the receive buffer can be processed, with the first segment (1004-9999) followed by the second segment (4-1003). Because the producer writer blocked indicator was set, Host B schedules a CDC message to update its latest information to the peer: consumer cursor (1004), consumer cursor wrap sequence number (the current value of 2 is used).

5. Host A, upon receipt of the CDC message, locates the TCP connection associated with the alert token and, upon examining the control information provided, notices that Host B has consumed all of the data (based on the consumer cursor and the consumer cursor wrap sequence number) and initiates the next RDMA write to fill the receive buffer at offset 1004-9999.

6. Host A then moves the next 1000 bytes into the beginning of the receive buffer (4-1003) by scheduling an RDMA write operation. Note that at this point there are still 8 bytes remaining to be written.

7. Host A then sends a CDC message to set the producer writer blocked indicator and to increment the producer cursor wrap sequence number (3).

8. Host B, upon notification, completes the same processing as step 4 above, including sending a CDC message to update the peer to indicate that all data has been consumed. At this point, Host A can write the final 8 bytes to Host B's RMBE into positions 1004-1011 (not shown).
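The byte accounting in this scenario (two full-window rounds of 9996 usable bytes each, leaving 8 bytes) can be reproduced with a short simulation. This is a sketch only: the function names and event tuples are hypothetical, and it assumes the peer consumes a full window and replies before each following round.

```python
EYE_CATCHER_LEN = 4  # cursors start and wrap at 4, not 0


def full_window_spans(cursor: int, rmbe_size: int):
    """RDMA write spans (first, last) that fill every usable byte, starting at cursor."""
    spans = [(cursor, rmbe_size - 1)]                # from the cursor to the end of the RMBE
    if cursor > EYE_CATCHER_LEN:
        spans.append((EYE_CATCHER_LEN, cursor - 1))  # wrapped part, skipping the eye catcher
    return spans


def simulate_large_send(total: int, cursor: int, wrap_seq: int, rmbe_size: int):
    """Return (events, bytes_left, wrap_seq) after all full-window rounds."""
    usable = rmbe_size - EYE_CATCHER_LEN
    events = []
    remaining = total
    while remaining >= usable:
        for first, last in full_window_spans(cursor, rmbe_size):
            events.append(("RDMA-WR", first, last))
        # The cursor ends where it started; only the wrap sequence number
        # tells the peer that a full window of new data is present.
        wrap_seq += 1
        events.append(("CDC", cursor, wrap_seq, "WrtBlk"))
        remaining -= usable
        # Assumed: the peer consumes everything and replies before the next round.
    return events, remaining, wrap_seq


events, left, wrap = simulate_large_send(20_000, 1004, 1, 10_000)
for event in events:
    print(event)
print(left, wrap)   # 8 3
```

The 8 leftover bytes correspond to the final write into positions 1004-1011 mentioned in step 8.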

4.7.5. Scenario 5: Send Flow, Urgent Data, Window Size Unconstrained
             SMC Host A                             SMC Host B
            RMBE A Info                            RMBE B Info
        (Consumer Cursors)                      (Producer Cursors)
    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flag
    1000     1         0                  0    1000     1          0
    1000     1         1 ---------------> 1    1000     1          0
                         RDMA-WR Data
                           (1000:1499)
    1000     1         2 ...............> 2    1500     1          UrgP
                         CDC Message                               UrgA
        
    1500     1         3 <............... 3    1500     1          UrgP
                         CDC Message                               UrgA
        
    1500     1         4 ---------------> 4    1500     1          UrgP
                         RDMA-WR Data                              UrgA
                           (1500:2499)
    1500     1         5 ...............> 5    2500     1          0
                         CDC Message
        

Figure 20: Scenario 5: Send Flow, Urgent Data, Window Size Open

Scenario assumptions:

o Kernel implementation.

o Existing SMC-R connection; window size open (unconstrained); all data has been consumed by receiver.

o Host A: Application issues send for 500 bytes with urgent data indicator (out of band) to Host B, then sends 1000 bytes of normal data.

o Host B: RMBE receive buffer size is 10,000; application has issued a recv for 10,000 bytes and is also monitoring the socket for urgent data.

Flow description:

1. The application issues a send() for 500 bytes of urgent data; the SMC-R layer copies data into a kernel send buffer. It then schedules an RDMA write operation to move the data into the peer's RMBE receive buffer, at relative position 1000-1499. Note that no immediate data or alert (i.e., interrupt) is provided to Host B for this RDMA operation.

2. Host A sends a CDC message to update its producer cursor to byte 1500 and to turn on the producer Urgent Data Pending (UrgP) and Urgent Data Present (UrgA) flags. This CDC message will deliver an interrupt to Host B. At this point, the SMC-R layer can return control back to the application.

3. Host B, once notified of the receipt of the previous CDC message, locates the RMBE associated with the RMBE alert token, notices that the Urgent Data Pending flag is on, and proceeds with out-of-band socket API notification -- for example, satisfying any outstanding select() or poll() requests on the socket by indicating that urgent data is pending (i.e., by setting the exception bit on). The urgent data present indicator allows Host B to also determine the position of the urgent data (the producer cursor points 1 byte beyond the last byte of urgent data). Host B can then perform normal receive-side processing (including specific urgent data processing), copying the data into the application's receive buffer, etc. Host B then sends a CDC message to update the partner's RMBE control area with its latest consumer cursor (1500). Note that this CDC message must occur, regardless of the current local window size that is available. The partner host (Host A) cannot initiate any additional RDMA writes until it receives acknowledgment that the urgent data has been processed (or at least processed/remembered at the SMC-R layer).

4. Upon receipt of the message, Host A wakes up, sees that the peer consumed all data up to and including the last byte of urgent data, and now resumes sending any pending data. In this case, the application had previously issued a send for 1000 bytes of normal data, which would have been copied in the send buffer, and control would have been returned to the application. Host A now initiates an RDMA write to move that data to the peer's receive buffer at position 1500-2499.

5. Host A then sends a CDC message to update its producer cursor value (2500) and to turn off the Urgent Data Pending and Urgent Data Present flags. Host B wakes up, processes the new data (resumes application, copies data into the application receive buffer), and then proceeds to update the local current consumer cursor (2500). Given that the window size is unconstrained, there is no need for a consumer cursor update in the peer's RMBE.
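The urgent data handling above can be illustrated with a small sketch. The flag names, bit values, and helper functions below are hypothetical and are not the CDC wire encoding; the wrap case in urgent_byte_offset is an assumption that is not exercised by this scenario.

```python
from enum import Flag

EYE_CATCHER_LEN = 4  # cursors start and wrap at 4, not 0


class ProducerFlags(Flag):
    NONE = 0
    WRITER_BLOCKED = 1   # "Wrt Blk" in the figures
    URGENT_PENDING = 2   # "UrgP"
    URGENT_PRESENT = 4   # "UrgA"


def urgent_byte_offset(producer_cursor: int, flags: ProducerFlags, rmbe_size: int):
    """Offset of the last urgent byte, or None if no urgent data is present."""
    if ProducerFlags.URGENT_PRESENT not in flags:
        return None
    if producer_cursor == EYE_CATCHER_LEN:   # assumed handling if the cursor just wrapped
        return rmbe_size - 1
    return producer_cursor - 1               # cursor points 1 byte beyond the urgent data


def sender_may_write(flags: ProducerFlags, producer, peer_consumer) -> bool:
    """producer and peer_consumer are (wrap_seq, cursor) pairs."""
    if ProducerFlags.URGENT_PRESENT in flags:
        # Step 3: no further RDMA writes until the peer acknowledges the urgent data.
        return peer_consumer == producer
    return True


flags = ProducerFlags.URGENT_PENDING | ProducerFlags.URGENT_PRESENT
print(urgent_byte_offset(1500, flags, 10_000))        # 1499
print(sender_may_write(flags, (1, 1500), (1, 1000)))  # False
print(sender_may_write(flags, (1, 1500), (1, 1500)))  # True
```

The last two calls mirror times 2 and 3 in Figure 20: Host A stays idle until the consumer cursor reported by Host B reaches 1500.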

4.7.6. Scenario 6: Send Flow, Urgent Data, Window Size Closed

             SMC Host A                             SMC Host B
            RMBE A Info                            RMBE B Info
        (Consumer Cursors)                      (Producer Cursors)
    Cursor   Wrap Seq# Time               Time Cursor   Wrap Seq#  Flag
    1000     1         0                  0    1000     2          Wrt
                                                                   Blk
    1000     1         1 ...............> 1    1000     2          Wrt
                         CDC Message                               Blk
                                                                   UrgP
        
    1000     2         2 <............... 2    1000     2          Wrt
                         CDC Message                               Blk
                                                                   UrgP
        
    1000     2         3 ---------------> 3    1000     2          Wrt
                         RDMA-WR Data                              Blk
                           (1000:1499)                             UrgP
        
    1000     2         4 ...............> 4    1500     2          UrgP
                         CDC Message                               UrgA
        
    1500     2         5 <............... 5    1500     2          UrgP
                         CDC Message                               UrgA
        
    1500     2         6 ---------------> 6    1500     2          UrgP
                         RDMA-WR Data                              UrgA
                           (1500:2499)
    1500     2         7 ...............> 7    2500     2          0
                         CDC Message
        

Figure 21: Scenario 6: Send Flow, Urgent Data, Window Size Closed

Scenario assumptions:

o Kernel implementation.

o Existing SMC-R connection; window size closed; writer is blocked.

o Host A: Application issues send for 500 bytes with urgent data indicator (out of band) to Host B, then sends 1000 bytes of normal data.

o Host B: RMBE receive buffer size is 10,000; application has no outstanding recv() (for normal data) and is monitoring the socket for urgent data.

Flow description:

1. The application issues a send() for 500 bytes of urgent data; the SMC-R layer copies data into a kernel send buffer (if available). Since the writer is blocked (window size closed), it cannot send the data immediately. It then sends a CDC message to notify the peer of the Urgent Data Pending (UrgP) indicator (the writer blocked indicator remains on as well). This serves as a signal to Host B that urgent data is pending in the stream. Control is also returned to the application at this point.

1. 应用程序为500字节的紧急数据发出send();SMC-R层将数据复制到内核发送缓冲区(如果可用)。由于写入程序被阻止(窗口大小关闭),因此无法立即发送数据。然后,它发送一条CDC消息,将紧急数据挂起(UrgP)指示符通知对等方(写入程序阻塞指示符也保持打开)。这用作向主机B发出的信号,表示流中有紧急数据挂起。此时,控制权也返回给应用程序。

2. Host B, once notified of the receipt of the previous CDC message, locates the RMBE associated with the RMBE alert token, notices that the Urgent Data Pending flag is on, and proceeds with out-of-band socket API notification -- for example, satisfying any outstanding select() or poll() requests on the socket by indicating that urgent data is pending (i.e., by setting the exception bit on). At this point, it is expected that the application will enter urgent data mode processing, expeditiously processing all normal data (by issuing recv API calls) so that it can get to the urgent data byte. Whether the application has this urgent mode processing or not, at some point, the application will consume some or all of the pending data in the receive buffer. When this occurs, Host B will also send a CDC message to update its consumer cursor and consumer cursor wrap sequence number to the peer. In the example above, a full window's worth of data was consumed.

2. 主机B在收到之前的CDC消息的通知后,会定位与RMBE警报令牌关联的RMBE,注意到紧急数据挂起标志处于打开状态,并继续执行带外套接字API通知——例如,满足任何未完成的select()或poll()通过指示紧急数据处于挂起状态(即,通过将异常位设置为on)来请求套接字。此时,预计应用程序将进入紧急数据处理模式,快速处理所有正常数据(通过发出recv API调用),以便能够访问紧急数据字节。无论应用程序是否具有此紧急模式处理,在某个时刻,应用程序都将使用接收缓冲区中的部分或全部挂起数据。发生这种情况时,主机B还将向对等机发送CDC消息,以更新其使用者游标和使用者游标包装序列号。在上面的示例中,消耗了整个窗口的数据量。

3. Host A, once awakened by the message, will notice that the window size is now open on this connection (based on the consumer cursor and the consumer cursor wrap sequence number, which now matches the producer cursor wrap sequence number) and resume sending of the urgent data segment by scheduling an RDMA write into relative position 1000-1499.

3. 主机A一旦被消息唤醒,将注意到此连接上的窗口大小现在打开(基于消费者光标和消费者光标包裹序列号,该序列号现在与生产者光标包裹序列号匹配),并通过将RDMA写入调度到相对位置1000-1499来恢复紧急数据段的发送。

4. Host A then sends a CDC message to advance its producer cursor (1500) and to also notify Host B of the Urgent Data Present (UrgA) indicator (and turn off the writer blocked indicator). This signals to Host B that the urgent data is now in the local receive buffer and that the producer cursor points to the last byte of urgent data.

4. 然后,主机A发送一条CDC消息,使其生产者游标(1500)前进,并将紧急数据存在(UrgA)指示器通知主机B(并关闭写入程序阻塞指示器)。这向主机B发出信号,表示紧急数据现在位于本地接收缓冲区中,并且生产者光标指向紧急数据的最后一个字节。

5. Host B wakes up, processes the urgent data, and, once the urgent data is consumed, sends a CDC message to update its consumer cursor (1500).

5. 主机B唤醒,处理紧急数据,一旦紧急数据被消耗,发送CDC消息以更新其消费者光标(1500)。

6. Host A wakes up, sees that Host B has consumed the sequence number associated with the urgent data, and then initiates the next RDMA write operation to move the 1000 bytes associated with the next send() of normal data into the peer's receive buffer at position 1500-2499. Note that the send API would have likely completed earlier in the process by copying the 1000 bytes into a send buffer and returning back to the application, even though we could not send any new data until the urgent data was processed and acknowledged by Host B.

6. 主机A唤醒,发现主机B已使用与紧急数据相关联的序列号,然后启动下一个RDMA写入操作,将与正常数据的下一个send()相关联的1000字节移动到位置1500-2499的对等方接收缓冲区中。请注意,发送API可能在过程的早期完成,将1000字节复制到发送缓冲区并返回到应用程序,即使在主机B处理并确认紧急数据之前,我们无法发送任何新数据。

7. Host A sends a CDC message to advance its producer cursor to 2500 and to reset the Urgent Data Pending and Urgent Data Present flags. Host B wakes up and processes the inbound data.

7. 主机A发送CDC消息,将其生产者光标提前至2500,并重置紧急数据挂起和紧急数据显示标志。主机B唤醒并处理入站数据。
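The sender side of the flow above (steps 1, 3, 4, 6, and 7) can be sketched as a small simulation. This is only an illustrative model, not part of the protocol definition: the class name `UrgentSender`, its method names, and the tuple layout of the log are all hypothetical, and only the case from this scenario (urgent send while the window is closed) is modeled. As in Figure 21, the sketch keeps UrgP on while UrgA is set.

```python
from dataclasses import dataclass, field


@dataclass
class UrgentSender:
    """Hypothetical model of Host A in Scenario 6 (window initially closed)."""
    producer_cursor: int = 1000
    window_open: bool = False
    urgent_queued: int = 0      # urgent bytes held in the send buffer
    normal_queued: int = 0      # normal bytes held in the send buffer
    log: list = field(default_factory=list)

    def _cdc(self, blocked=False, urgp=False, urga=False):
        # Record a CDC message: (kind, producer cursor, WrtBlk, UrgP, UrgA).
        self.log.append(("CDC", self.producer_cursor, blocked, urgp, urga))

    def _rdma_write(self, nbytes):
        # Record an RDMA write as (kind, first offset, last offset).
        first = self.producer_cursor
        self.producer_cursor += nbytes
        self.log.append(("RDMA-WR", first, self.producer_cursor - 1))

    def send(self, nbytes, urgent=False):
        # The send call returns right after buffering the data.
        if urgent:
            self.urgent_queued += nbytes
            if not self.window_open:
                # Step 1: cannot write yet, so signal UrgP (still blocked).
                self._cdc(blocked=True, urgp=True)
        else:
            self.normal_queued += nbytes

    def on_window_open(self):
        # Step 3: the peer's consumer cursor update reopened the window.
        self.window_open = True
        if self.urgent_queued:
            self._rdma_write(self.urgent_queued)
            self.urgent_queued = 0
            # Step 4: advance the cursor, set UrgA, clear writer blocked.
            self._cdc(urgp=True, urga=True)

    def on_urgent_consumed(self):
        # Step 6: the peer consumed the urgent data; move the normal data.
        if self.normal_queued:
            self._rdma_write(self.normal_queued)
            self.normal_queued = 0
        # Step 7: advance the cursor and reset both urgent flags.
        self._cdc()
```

Driving the model with `send(500, urgent=True)`, `send(1000)`, `on_window_open()`, and `on_urgent_consumed()` reproduces the Host A side of the figure: one CDC message carrying UrgP, a write to 1000-1499, a CDC message at cursor 1500 with UrgA, a write to 1500-2499, and a final CDC message at cursor 2500 with the flags cleared.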

4.8. Connection Termination
4.8. 连接终止

Just as SMC-R connections are established using a combination of TCP connection establishment flows and SMC-R protocol flows, the termination of SMC-R connections also uses a similar combination of SMC-R protocol termination flows and normal TCP connection termination flows. The following sections describe the SMC-R protocol normal and abnormal connection termination flows.

正如SMC-R连接使用TCP连接建立流和SMC-R协议流的组合建立一样,SMC-R连接的终止也使用SMC-R协议终止流和正常TCP连接终止流的类似组合。以下各节描述SMC-R协议正常和异常连接终止流。

4.8.1. Normal SMC-R Connection Termination Flows
4.8.1. 正常SMC-R连接终止流

Normal SMC-R connection termination flows are triggered via the normal stream socket API semantics, namely by the application issuing a close() or shutdown() API. Most applications, after consuming all incoming data and after sending any outbound data, will then issue a close() API to indicate that they are done both sending and receiving data. Some applications, typically a small percentage, make use of the shutdown() API that allows them to indicate that the application is done sending data, receiving data, or both sending and receiving data. The main use of this API is in scenarios where a TCP application wants to alert its partner endpoint that it is done sending data but is still receiving data on its socket (shutdown for write). Issuing shutdown() for both sending and receiving data is really no different from issuing a close() and can therefore be treated in a similar fashion. Shutdown for read is typically not a very useful operation and in normal circumstances does not trigger any network flows to notify the partner TCP endpoint of this operation.

正常的SMC-R连接流是通过正常的流套接字API语义触发的,即通过应用程序发出close()或shutdown()API触发的。大多数应用程序在使用所有传入数据和发送任何出站数据后,都会发出close()API以指示它们已完成发送和接收数据。一些应用程序(通常为一小部分)使用shutdown()API,允许它们指示应用程序已完成发送数据、接收数据或发送和接收数据。此API的主要用途是,TCP应用程序希望向其合作伙伴端点发出警报,告知其已完成发送数据,但仍在其套接字上接收数据(为写入而关机)。为发送和接收数据发出shutdown()实际上与发出close()没有什么不同,因此可以以类似的方式处理。Shutdown for read通常不是一个非常有用的操作,在正常情况下不会触发任何网络流来通知伙伴TCP端点此操作。

These same trigger points will be used by the SMC-R layer to initiate SMC-R connection termination flows. The main design point for SMC-R normal connection flows is to use the SMC-R protocol to first shut down the SMC-R connection and free up any SMC-R RDMA resources, and then allow the normal TCP connection termination protocol (i.e., FIN processing) to drive cleanup of the TCP connection. This design point is very important in ensuring that RDMA resources such as the RMBEs are only freed and reused when both SMC-R endpoints are completely done with their RDMA write operations to the partner's RMBE.

SMC-R层将使用这些相同的触发点来启动SMC-R连接终止流。SMC-R正常连接流的主要设计点是使用SMC-R协议首先关闭SMC-R连接并释放任何SMC-R RDMA资源,然后允许正常TCP连接终止协议(即FIN处理)驱动TCP连接的清理。这一设计点对于确保RDMA资源(如RMBE)只有在两个SMC-R端点完全完成对合作伙伴的RMBE的RDMA写入操作时才被释放和重用非常重要。

                                      1
                            +-----------------+
            |-------------->|     CLOSED      |<-------------|
        3D  |               |                 |              |  4D
            |               +-----------------+              |
            |                       |                        |
            |                     2 |                        |
            |                       V                        |
    +----------------+     +-----------------+     +----------------+
    |AppFinCloseWait |     |     ACTIVE      |     |PeerFinCloseWait|
    |                |     |                 |     |                |
    +----------------+     +-----------------+     +----------------+
            |                   |         |                   |
            |     Active Close  | 3A | 4A |  Passive Close    |
            |                   V    |    V                   |
            |       +--------------+ | +-------------+        |
            |--<----|PeerCloseWait1| | |AppCloseWait1|--->----|
        3C  |       |              | | |             |        |  4C
            |       +--------------+ | +-------------+        |
            |             |          |         |              |
            |             | 3B       |     4B  |              |
            |             V          |         V              |
            |       +--------------+ | +-------------+        |
            |--<----|PeerCloseWait2| | |AppCloseWait2|--->----|
                    |              | | |             |
                    +--------------+ | +-------------+
                                     |
                                     |
        

Figure 22: SMC-R Connection States

图22:SMC-R连接状态

Figure 22 describes the states that an SMC-R connection typically goes through. Note that there are variations to these states that can occur when an SMC-R connection is abnormally terminated, similar in a way to when a TCP connection is reset. The following are the high-level state transitions for an SMC-R connection:

图22描述了SMC-R连接通常经历的状态。请注意,当SMC-R连接异常终止时,这些状态可能会发生变化,类似于TCP连接重置时的情况。以下是SMC-R连接的高级状态转换:

1. An SMC-R connection begins in the Closed state. This state is meant to reflect an RMBE that is not currently in use (was previously in use but no longer is, or was never allocated).

1. SMC-R连接从关闭状态开始。此状态旨在反映当前未使用的RMBE(以前已使用但不再使用或从未分配)。

2. An SMC-R connection progresses to the Active state once the SMC-R Rendezvous processing has successfully completed, RMB element indices have been exchanged, and SMC-R links have been activated. In this state, the TCP connection is fully established, rendezvous processing has been completed, and SMC-R peers can begin the exchange of data via RDMA.

2. 一旦SMC-R交会处理成功完成,RMB元素索引交换,SMC-R链路激活,SMC-R连接将进入激活状态。在此状态下,TCP连接已完全建立,集合处理已完成,SMC-R对等方可以通过RDMA开始数据交换。

3. Active close processing (on the SMC-R peer that is initiating the connection termination).

3. 主动关闭处理(在启动连接终止的SMC-R对等机上)。

A. When an application on one of the SMC-R connection peers issues a close(), a shutdown() for write, or a shutdown() for both read and write, the SMC-R layer on that host will initiate SMC-R connection termination processing. First, if a close() or shutdown(both) is issued, it will check to see that there's no data in the local RMB element that has not been read by the application. If unread data is detected, the SMC-R connection must be abnormally reset; for more details on this, refer to Section 4.8.2 ("Abnormal SMC-R Connection Termination Flows"). If no unread data is pending, it then checks to see whether or not any outstanding data is waiting to be written to the peer, or if any outstanding RDMA writes for this SMC-R connection have not yet completed. If either of these two scenarios is true, an indicator that this connection is in a pending close state is saved in internal data structures representing this SMC-R connection, and control is returned to the application. If all data to be written to the partner has completed, this peer will send a CDC message to notify the peer of either the PeerConnectionClosed indicator (close or shutdown for both was issued) or the PeerDoneWriting indicator. This will provide an interrupt to inform the partner SMC-R peer that the connection is terminating. At this point, the local side of the SMC-R connection transitions to the PeerCloseWait1 state, and control can be returned to the application. If this process could not be completed synchronously (the pending close condition mentioned above), it is completed when all RDMA writes for data and control cursors have been completed.

A.当一个SMC-R连接对等方上的应用程序发出close()、shutdown()进行写操作或shutdown()进行读写操作时,该主机上的SMC-R层将启动SMC-R连接终止处理。首先,如果发出close()或shutdown(两者),它将检查本地RMB元素中是否没有应用程序未读取的数据。如果检测到未读数据,则必须异常重置SMC-R连接;有关这方面的更多详细信息,请参阅第4.8.2节(“异常SMC-R连接终止流”)。如果没有未读数据挂起,则会检查是否有任何未读数据正在等待写入对等方,或者此SMC-R连接的任何未读RDMA写入尚未完成。如果这两种情况中的任何一种为真,则表示此连接处于挂起关闭状态的指示器将保存在表示此SMC-R连接的内部数据结构中,并将控制权返回给应用程序。如果要写入合作伙伴的所有数据都已完成,则该对等方将发送一条CDC消息,通知对等方PeerConnectionClosed指示器(两者均已发出close或shutdown)或PeerDoneWriting指示器。这将提供一个中断,通知合作伙伴SMC-R对等方连接正在终止。此时,SMC-R连接的本地端转换为PeerCloseWait1状态,控制权可以返回给应用程序。如果无法同步完成此过程(上述挂起关闭条件),则在完成数据和控制游标的所有RDMA写入后,此过程即告完成。

B. At some point, the SMC-R peer application (passive close) will consume all incoming data, realize that the partner is done sending data on this connection, and proceed to initiate its own close of the connection once it has completed sending all data from its end. The partner application can initiate this connection termination processing via close() or shutdown() APIs. If the application does so by issuing a shutdown() for write, then the partner SMC-R layer will send a CDC message to notify the peer (the active close side) of the PeerDoneWriting indicator. When the "active close" SMC-R peer wakes up as a result of the previous CDC message, it will notice that the PeerDoneWriting indicator is now on and transition to the PeerCloseWait2 state. This state indicates that the peer is done sending data and may still be reading data. At this point, the "active close" peer will also need to ensure that any outstanding recv() calls for this socket are woken up and remember that no more data is forthcoming on this connection (in case the local connection was shutdown() for write only).

B.在某一点上,SMC-R对等应用程序(被动关闭)将消耗所有传入数据,意识到合作伙伴已完成此连接上的数据发送,并在完成从其端发送所有数据后继续启动其自己的连接关闭。合作伙伴应用程序可以通过close()或shutdown()API启动此连接终止处理。如果应用程序通过发出shutdown()进行写入,那么合作伙伴SMC-R层将发送一条CDC消息,通知对等方(主动关闭端)PeerDoneWriting指示符。当“主动关闭”SMC-R对等机因上一条CDC消息而被唤醒时,它将注意到PeerDoneWriting指示符现在已打开,并转换到PeerCloseWait2状态。此状态表示对等方已完成发送数据,并且可能仍在读取数据。此时,“主动关闭”对等方还需要确保唤醒此套接字的任何未完成的recv()调用,并记住此连接上不会再有数据到达(以防本地连接仅为写入而执行了shutdown())。

C. This flow is a common transition from 3A or 3B above. When the SMC-R peer (passive close) consumes all data and updates all necessary cursors to the peer, and the application closes its socket (close or shutdown for both), it will send a CDC message to the peer (the active close side) with the PeerConnectionClosed indicator set. At this point, the connection can transition back to the Closed state if the local application has already closed (or issued shutdown for both) the socket. Once in the Closed state, the RMBE can now be safely reused for a new SMC-R connection. When the PeerConnectionClosed indicator is turned on, the SMC-R peer is indicating that it is done updating the partner's RMBE.

C.此流程是3A或3B以上的常见过渡。当SMC-R对等机(被动关闭)使用所有数据并向对等机更新所有必要的游标时,应用程序关闭其套接字(关闭或关闭两者),它将向对等机(主动关闭端)发送一条CDC消息,并设置PeerConnectionClosed指示灯。此时,如果本地应用程序已经关闭(或同时发出关闭)套接字,则连接可以转换回关闭状态。一旦处于关闭状态,RMBE现在可以安全地重新用于新的SMC-R连接。当PeerConnectionClosed(对等连接关闭)指示灯打开时,SMC-R对等机表示已完成更新合作伙伴的RMBE。

D. Conditional state: If the local application has not yet issued a close() or shutdown(both), we need to wait until the application does so. Once it does, the local host will send a CDC message to notify the peer of the PeerConnectionClosed indicator and then transition to the Closed state.

D.条件状态:如果本地应用程序尚未发出close()或shutdown(两者都发出),我们需要等待应用程序发出close()或shutdown。一旦这样做,本地主机将发送一条CDC消息,通知对等方PeerConnectionClosed指示符,然后转换到Closed状态。

4. Passive close processing (on the SMC-R peer that receives an indication that the partner is closing the connection).

4. 被动关闭处理(在接收到伙伴正在关闭连接指示的SMC-R对等机上)。

A. Upon receipt of a CDC message, the SMC-R layer will detect that the PeerConnectionClosed indicator or PeerDoneWriting indicator is on. If any outstanding recv() calls are pending, they are completed with an indicator that the partner has closed the connection (zero-length data presented to the application). If there is any pending data to be written and PeerConnectionClosed is on, then an SMC-R connection reset must be performed. The connection then enters the AppCloseWait1 state on the passive close side waiting for the local application to initiate its own close processing.

A.收到CDC消息后,SMC-R层将检测到PeerConnectionClosed指示器或PeerDoneWriting指示器打开。如果有任何未完成的recv()调用处于挂起状态,则会使用一个指示器来完成这些调用,该指示器指示合作伙伴已关闭连接(向应用程序提供的长度为零的数据)。如果有任何待写入的数据且PeerConnectionClosed处于打开状态,则必须执行SMC-R连接重置。然后,连接在被动关闭端进入AppCloseWait1状态,等待本地应用程序启动自己的关闭处理。

B. If the local application issues a shutdown() for writing, then the SMC-R layer will send a CDC message to notify the partner of the PeerDoneWriting indicator and then transition the local side of the SMC-R connection to the AppCloseWait2 state.

B.如果本地应用程序发出一个shutdown()进行写入,则SMC-R层将发送一条CDC消息,通知合作伙伴PeerDoneWriting指示符,然后将SMC-R连接的本地端转换为AppCloseWait2状态。

C. When the application issues a close() or shutdown() for both, the local SMC-R peer will send a message informing the peer of the PeerConnectionClosed indicator and transition to the Closed state if the remote peer has also sent the local peer the PeerConnectionClosed indicator. If the peer has not sent the PeerConnectionClosed indicator, we transition into the PeerFinCloseWait state.

C.当应用程序发出close()或针对读写两者的shutdown()时,本地SMC-R对等方将发送一条消息,通知对等方PeerConnectionClosed指示符,如果远程对等方也已向本地对等方发送了PeerConnectionClosed指示符,则转换到Closed状态。如果对等方尚未发送PeerConnectionClosed指示符,我们将转换为PeerFinCloseWait状态。

D. The local SMC-R connection stays in this state until the peer sends the PeerConnectionClosed indicator in a CDC message. When the indicator is sent, we transition to the Closed state and are then free to reuse this RMBE.

D.本地SMC-R连接保持此状态,直到对等方在CDC消息中发送PeerConnectionClosed指示器。当发送指示器时,我们将转换到关闭状态,然后可以自由地重用此RMBE。
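The transitions in items 3 and 4 above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `St`, `Conn`, `local()`, and `peer()` are hypothetical, the pending close condition of step 3A is not modeled, and the behavior when the application closes while already in a PeerCloseWait state (after a shutdown for write) is an assumption, since the text does not spell it out.

```python
from enum import Enum, auto


class St(Enum):
    CLOSED = auto()
    ACTIVE = auto()
    PEER_CLOSE_WAIT1 = auto()
    PEER_CLOSE_WAIT2 = auto()
    APP_FIN_CLOSE_WAIT = auto()
    APP_CLOSE_WAIT1 = auto()
    APP_CLOSE_WAIT2 = auto()
    PEER_FIN_CLOSE_WAIT = auto()


ACTIVE_SIDE = {St.PEER_CLOSE_WAIT1, St.PEER_CLOSE_WAIT2}
PASSIVE_SIDE = {St.APP_CLOSE_WAIT1, St.APP_CLOSE_WAIT2}


class Conn:
    """One endpoint's view of an SMC-R connection (hypothetical model)."""

    def __init__(self):
        self.state = St.ACTIVE
        self.local_closed = False   # app issued close() or shutdown(both)
        self.peer_closed = False    # PeerConnectionClosed was received
        self.sent = []              # indicators sent in CDC messages

    def local(self, op):
        """op: 'close' (close or shutdown for both) or 'shut_wr'."""
        if op == "shut_wr":
            if self.state is St.ACTIVE:
                self.sent.append("PeerDoneWriting")
                self.state = St.PEER_CLOSE_WAIT1          # 3A
            elif self.state is St.APP_CLOSE_WAIT1:
                self.sent.append("PeerDoneWriting")
                self.state = St.APP_CLOSE_WAIT2           # 4B
            return
        if op != "close":
            raise ValueError(op)
        if self.local_closed or self.state is St.CLOSED:
            return
        self.local_closed = True
        self.sent.append("PeerConnectionClosed")
        if self.state is St.ACTIVE:
            self.state = St.PEER_CLOSE_WAIT1              # 3A
        elif self.state is St.APP_FIN_CLOSE_WAIT:
            self.state = St.CLOSED                        # 3D
        elif self.state in PASSIVE_SIDE:                  # 4C
            self.state = (St.CLOSED if self.peer_closed
                          else St.PEER_FIN_CLOSE_WAIT)
        # Assumption: in PeerCloseWait1/2 we stay and wait for the peer.

    def peer(self, indicator):
        """indicator: 'PeerDoneWriting' or 'PeerConnectionClosed'."""
        if indicator == "PeerConnectionClosed":
            self.peer_closed = True
        if self.state is St.ACTIVE:
            self.state = St.APP_CLOSE_WAIT1               # 4A
        elif self.state in ACTIVE_SIDE:
            if indicator == "PeerDoneWriting":
                self.state = St.PEER_CLOSE_WAIT2          # 3B
            else:                                         # 3C, else wait (3D)
                self.state = (St.CLOSED if self.local_closed
                              else St.APP_FIN_CLOSE_WAIT)
        elif self.state is St.PEER_FIN_CLOSE_WAIT and self.peer_closed:
            self.state = St.CLOSED                        # 4D
```

For example, if endpoint A calls `local("close")` and endpoint B then receives `peer("PeerConnectionClosed")` and calls `local("close")`, B reaches the Closed state directly (4A, 4C), and A reaches it once it receives B's PeerConnectionClosed indicator (3C).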

Note that each SMC-R peer needs to provide some logic that will prevent being stranded in a termination state indefinitely. For example, if an Active Close SMC-R peer is in a PeerCloseWait (1 or 2) state waiting for the remote SMC-R peer to update its connection termination status, it needs to provide a timer that will prevent it from waiting in that state indefinitely should the remote SMC-R peer not respond to this termination request. This could occur in error scenarios -- for example, if the remote SMC-R peer suffered a failure prior to being able to respond to the termination request or the remote application is not responding to this connection termination request by closing its own socket. This latter scenario is similar to the TCP FINWAIT2 state, which has been known to sometimes cause issues when remote TCP/IP hosts lose track of established connections and neglect to close them. Even though the TCP standards do not mandate a timeout from the TCP FINWAIT2 state, most TCP/IP implementations assign a timeout for this state. A similar timeout will be required for SMC-R connections. When this timeout occurs, the local SMC-R peer performs TCP reset processing for this connection. However, no additional RDMA writes to the partner RMBE can occur at this point (we have already indicated that we are done updating the peer's RMBE). After the TCP connection is reset, the RMBE can be returned to the free pool for reallocation. See Section 4.4.2 for more details.

请注意,每个SMC-R对等机都需要提供一些逻辑,以防止无限期地陷入终止状态。例如,如果活动关闭SMC-R对等方处于PeerCloseWait(1或2)状态,等待远程SMC-R对等方更新其连接终止状态,则需要提供一个计时器,以防止远程SMC-R对等方不响应此终止请求时,它在该状态下无限期等待。这可能发生在错误场景中——例如,如果远程SMC-R对等机在能够响应终止请求之前发生故障,或者远程应用程序没有通过关闭自己的套接字来响应此连接终止请求。后一种情况类似于TCP FINWAIT2状态,众所周知,当远程TCP/IP主机失去对已建立连接的跟踪而忽略关闭连接时,该状态有时会导致问题。尽管TCP标准并不要求TCP FINWAIT2状态超时,但大多数TCP/IP实现都会为此状态分配超时。SMC-R连接也需要类似的超时。发生此超时时,本地SMC-R对等方将对此连接执行TCP重置处理。但是,此时不会发生对合作伙伴RMBE的额外RDMA写入(我们已经表示,我们已经完成了对对等方RMBE的更新)。TCP连接重置后,RMBE可以返回到空闲池进行重新分配。详见第4.4.2节。
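The timer described above might be sketched as follows. This is a hedged illustration only: the class `CloseGuard`, the function `on_tick`, the action labels, and the 30-second value used in the usage note are all hypothetical, since this document does not specify a timeout value. Time is passed in explicitly so the logic stays deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloseGuard:
    """Bounds the time spent in PeerCloseWait1/2 (hypothetical helper)."""
    timeout_s: float                      # assumed value, not from the RFC
    entered_at: Optional[float] = None    # None means "not waiting"

    def enter_wait(self, now: float) -> None:
        # Called when the connection enters a PeerCloseWait state.
        self.entered_at = now

    def leave_wait(self) -> None:
        # Called when the peer responds or the guard has fired.
        self.entered_at = None

    def expired(self, now: float) -> bool:
        return (self.entered_at is not None
                and now - self.entered_at >= self.timeout_s)


def on_tick(guard: CloseGuard, now: float, actions: list) -> None:
    """Periodic check: on expiry, reset TCP first, then free the RMBE."""
    if guard.expired(now):
        # No further RDMA writes to the partner RMBE are issued here.
        actions.append("tcp_reset")
        actions.append("free_rmbe")
        guard.leave_wait()
```

With `CloseGuard(timeout_s=30.0)` entered at time 100.0, a tick at 110.0 does nothing, while a tick at 130.0 records the TCP reset followed by the release of the RMBE, in that order.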

Also note that it is possible to have two SMC-R endpoints initiate an Active close concurrently. In that scenario, the flows above still apply; however, both endpoints follow the active close path (path 3).

还请注意,可以让两个SMC-R端点同时启动活动关闭。在这种情况下,上述流程仍然适用;但是,两个端点都遵循活动关闭路径(路径3)。
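The order of the checks made in step 3A when the application starts an active close can be summarized as a single decision function. This is a sketch under assumptions: the function name, its parameters, and the returned action labels are hypothetical and chosen only for illustration.

```python
def active_close_action(op, unread_rx_bytes, unsent_tx_bytes,
                        rdma_writes_in_flight):
    """Return the first action of the active-close side (step 3A).

    op is one of "close", "shutdown_both", or "shutdown_write"
    (hypothetical labels for the socket API calls).
    """
    if op not in ("close", "shutdown_both", "shutdown_write"):
        raise ValueError(f"unsupported operation: {op}")
    full_close = op in ("close", "shutdown_both")
    # Unread inbound data at close time forces an abnormal reset.
    if full_close and unread_rx_bytes > 0:
        return "abnormal_reset"
    # Unsent data or incomplete RDMA writes: remember the pending close
    # and return to the application; finish once the writes complete.
    if unsent_tx_bytes > 0 or rdma_writes_in_flight > 0:
        return "pending_close"
    # Everything is flushed: tell the peer which indicator applies.
    if full_close:
        return "send_PeerConnectionClosed"
    return "send_PeerDoneWriting"
```

Note that the unread-data check applies only to close() and shutdown for both; a shutdown for write leaves the receive side usable, so unread inbound data does not trigger a reset in that case.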

4.8.2. Abnormal SMC-R Connection Termination Flows
4.8.2. 异常SMC-R连接终止流

Abnormal SMC-R connection termination can occur for a variety of reasons, including the following:

SMC-R连接异常终止可能有多种原因,包括以下原因:

o The TCP connection associated with an SMC-R connection is reset. In TCP, either endpoint can send a RST segment to abort an existing TCP connection when error conditions are detected for the connection or the application overtly requests that the connection be reset.

o 与SMC-R连接关联的TCP连接已重置。在TCP中,当检测到连接的错误条件或应用程序公开请求重置连接时,任一端点都可以发送RST段以中止现有TCP连接。

o Normal SMC-R connection termination processing has unexpectedly stalled for a given connection. When the stall is detected (connection termination timeout condition), an abnormal SMC-R connection termination flow is initiated.

o 给定连接的正常SMC-R连接终止处理意外停滞。当检测到停滞(连接终止超时条件)时,启动异常SMC-R连接终止流。

In these scenarios, it is very important that resources associated with the affected SMC-R connections are properly cleaned up to ensure that there are no orphaned resources and that resources can reliably be reused for new SMC-R connections. Given that SMC-R relies heavily on the RDMA write processing, special care needs to be taken to ensure that an RMBE is no longer being used by an SMC-R peer before logically reassigning that RMBE to a new SMC-R connection.

在这些场景中,正确清理与受影响的SMC-R连接相关的资源非常重要,以确保没有孤立资源,并且资源可以可靠地重新用于新的SMC-R连接。鉴于SMC-R严重依赖RDMA写处理,因此需要特别注意确保SMC-R对等方不再使用RMBE,然后才能将该RMBE逻辑重新分配给新的SMC-R连接。

When an SMC-R peer initiates a TCP connection reset, it also initiates an SMC-R abnormal connection flow at the same time. The SMC-R peers explicitly signal their intent to abnormally terminate an SMC-R connection and await explicit acknowledgment that the peer has received this notification and has also completed abnormal connection termination on its end. Note that TCP connection reset processing can occur in parallel to these flows.

当SMC-R对等方启动TCP连接重置时,它同时也启动SMC-R异常连接流。SMC-R对等方明确表示其异常终止SMC-R连接的意图,并等待对等方已收到此通知并已在其端完成异常连接终止的明确确认。请注意,TCP连接重置处理可以与这些流并行进行。

                            +-----------------+
            |-------------->|     CLOSED      |<-------------|
            |               |                 |              |
            |               +-----------------+              |
            |                                                |
            |                                                |
            |                                                |
            |           +-----------------------+            |
            |           |     Any state         |            |
            |1B         | (before setting       |          2B|
            |           |  PeerConnectionClosed |            |
            |           |  indicator in         |            |
            |           |  peer's RMBE)         |            |
            |           +-----------------------+            |
            |         1A        |         |      2A          |
            |     Active Abort  |         |  Passive Abort   |
            |                   V         V                  |
            |       +--------------+   +--------------+      |
            |-------|PeerAbortWait |   | Process Abort|------|
                    |              |   |              |
                    +--------------+   +--------------+
        

Figure 23: SMC-R Abnormal Connection Termination State Diagram

图23:SMC-R异常连接终止状态图

Figure 23 above shows the SMC-R abnormal connection termination state diagram:

上图23显示了SMC-R异常连接终止状态图:

1. Active abort designates the SMC-R peer that is initiating the TCP RST processing. At the time that the TCP RST is sent, the active abort side must also do the following:

1. 主动中止指定启动TCP RST处理的SMC-R对等方。发送TCP RST时,活动中止端还必须执行以下操作:

A. Send the PeerConnAbort indicator to the partner in a CDC message, and then transition to the PeerAbortWait state. During this state, it will monitor this SMC-R connection waiting for the peer to send its corresponding PeerConnAbort indicator but will ignore any other activity in this connection (i.e., new incoming data). It will also generate an appropriate error to any socket API calls issued against this socket (e.g., ECONNABORTED, ECONNRESET).

A.在CDC消息中将PeerConnAbort指示符发送给合作伙伴,然后转换到PeerAbortWait状态。在此状态期间,它将监视此SMC-R连接,等待对等方发送其相应的PeerConnAbort指示符,但将忽略此连接中的任何其他活动(即新传入数据)。它还将对针对该套接字发出的任何套接字API调用生成适当的错误(例如,ECONNABORTED、ECONNRESET)。

B. Once the peer sends the PeerConnAbort indicator to the local host, the local host can transition this SMC-R connection to the Closed state and reuse this RMBE. Note that the SMC-R peer that goes into the active abort state must provide some protection against staying in that state indefinitely should the remote SMC-R peer not respond by sending its own PeerConnAbort indicator to the local host. While this should be a rare scenario, it could occur if the remote SMC-R peer (passive abort) suffered a failure right after the local SMC-R peer (active abort) sent the PeerConnAbort indicator. To protect against these types of failures, a timer can be set after entering the PeerAbortWait state, and if that timer pops before the peer has sent its local PeerConnAbort indicator (to the active abort side), this RMBE can be returned to the free pool for possible reallocation. See Section 4.4.2 for more details.

B.一旦对等方向本地主机发送PeerConnAbort指示符,本地主机可以将此SMC-R连接转换为Closed状态并重用此RMBE。请注意,进入主动中止状态的SMC-R对等机必须提供一些保护,以防远程SMC-R对等机未通过向本地主机发送其自己的PeerConnAbort指示符作出响应时,本端无限期地停留在该状态。虽然这应该是一种罕见的情况,但如果远程SMC-R对等机(被动中止)在本地SMC-R对等机(主动中止)发送PeerConnAbort指示符后立即发生故障,就可能出现这种情况。为了防止这些类型的故障,可以在进入PeerAbortWait状态后设置一个计时器,如果该计时器在对等方(向主动中止端)发送其本地PeerConnAbort指示符之前到期,则可以将该RMBE返回到空闲池以进行可能的重新分配。详见第4.4.2节。

2. Passive abort designates the SMC-R peer that receives an SMC-R abort from its partner, signaled by the PeerConnAbort indicator that the partner sends in a CDC message. Upon receiving this request, the local peer must do the following:

2. 被动中止指接收到伙伴发来的SMC-R中止的SMC-R对等方,该中止由伙伴在CDC消息中发送的PeerConnAbort指示符表示。收到此请求后,本地对等方必须执行以下操作:

A. Using the appropriate error codes, indicate to the socket application that this connection has been aborted, and then purge all in-flight data for this connection that is waiting to be read or waiting to be sent.

A.使用适当的错误代码,向套接字应用程序指示此连接已中止,然后清除此连接等待读取或等待发送的所有飞行中数据。

B. Send a CDC message to notify the peer of the PeerConnAbort indicator and, once that is completed, transition this RMBE to the Closed state.

B.发送CDC消息,通知对等方PeerConnAbort指示符,完成后,将此RMBE转换为Closed状态。

If an SMC-R peer receives a TCP RST for a given SMC-R connection, it also initiates SMC-R abnormal connection termination processing if it has not already been notified (via the PeerConnAbort indicator) that the partner is severing the connection. It is possible to have two SMC-R endpoints concurrently be in an active abort role for a given connection. In that scenario, the flows above still apply but both endpoints take the active abort path (path 1).

如果SMC-R对等方接收到给定SMC-R连接的TCP RST,并且尚未(通过PeerConnAbort指示符)获知其伙伴正在断开连接,它也会启动SMC-R异常连接终止处理。对于给定的连接,两个SMC-R端点可以同时处于主动中止角色。在该场景中,上述流仍然适用,但两个端点都采用主动中止路径(路径1)。
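The abnormal termination paths of Figure 23 can be sketched as follows. This is an illustrative model under assumptions: the class `AbortConn`, its methods, the string state labels, and the timeout value are hypothetical, and normal data processing is left out. The timer follows the protection suggested in step 1B.

```python
class AbortConn:
    """One endpoint's abort handling (hypothetical model of Figure 23)."""

    def __init__(self, timeout_s=10.0):
        self.state = "Active"          # stands in for "any state"
        self.timeout_s = timeout_s     # assumed value, not from the RFC
        self.deadline = None
        self.sent = []                 # indicators sent in CDC messages
        self.purged = False            # in-flight data discarded (2A)

    def local_abort(self, now):
        # Path 1A: sent together with the TCP RST.
        if self.state in ("PeerAbortWait", "Closed"):
            return
        self.sent.append("PeerConnAbort")
        self.state = "PeerAbortWait"
        self.deadline = now + self.timeout_s

    def on_cdc(self, indicator):
        if indicator != "PeerConnAbort":
            return                     # other activity is ignored here
        if self.state == "PeerAbortWait":
            self.state = "Closed"      # path 1B: RMBE may be reused
        elif self.state != "Closed":
            self.purged = True         # path 2A: purge in-flight data
            self.sent.append("PeerConnAbort")
            self.state = "Closed"      # path 2B

    def on_timer(self, now):
        # Guard against a peer that never answers the abort.
        if self.state == "PeerAbortWait" and now >= self.deadline:
            self.state = "Closed"
```

When both endpoints call `local_abort()` concurrently, each one sends exactly one PeerConnAbort indicator and moves to Closed on receiving the other's indicator, which matches the statement that both endpoints take path 1.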

4.8.3. Other SMC-R Connection Termination Conditions
4.8.3. 其他SMC-R连接终止条件

The following are additional conditions that have implications for SMC-R connection termination:

以下是对SMC-R连接终止有影响的附加条件:

o An SMC-R peer being gracefully shut down. If an SMC-R peer supports a graceful shutdown operation, it should attempt to terminate all SMC-R connections as part of shutdown processing. This could be accomplished via LLC DELETE LINK requests on all active SMC-R links.

o SMC-R对等机被正常关闭。如果SMC-R对等机支持正常关闭操作,则应在关闭处理过程中尝试终止所有SMC-R连接。这可以通过在所有活动SMC-R链路上发出LLC DELETE LINK请求来实现。

o Abnormal termination of an SMC-R peer. In this case, there may be no opportunity for the host to perform any SMC-R cleanup processing. In this scenario, it is up to the remote peer to detect a RoCE communications failure with the failing host. This could trigger SMC-R link switchover, but that would also generate RoCE errors, causing the remote host to eventually terminate all existing SMC-R connections to this peer.

o SMC-R对等端异常终止。在这种情况下,主机可能没有机会执行任何SMC-R清理处理,因此由远程对等方检测与故障主机的RoCE通信故障。这可能触发SMC-R链路切换,但切换同样会产生RoCE错误,导致远程主机最终终止与该对等机的所有现有SMC-R连接。

o Loss of RoCE connectivity between two SMC-R peers. If two peers are no longer reachable across any links in their SMC-R link group, then both peers perform a TCP reset for the connections, generate an error to the local applications, and free up all QP resources associated with the link group.

o 两个SMC-R对等点之间的RoCE连接丢失。如果两个对等点在其SMC-R链路组中的任何链路上都无法再访问,则两个对等点都会对连接执行TCP重置,向本地应用程序生成错误,并释放与链路组关联的所有QP资源。

5. Security Considerations
5. 安全考虑

5.1. VLAN Considerations
5.1. VLAN注意事项

The concepts and access control of virtual LANs (VLANs) must be extended to also cover the RoCE network traffic flowing across the Ethernet.

必须扩展虚拟局域网(VLAN)的概念和访问控制,以涵盖流经以太网的RoCE网络流量。

The RoCE VLAN configuration and access permissions must mirror the IP VLAN configuration and access permissions over the Converged Enhanced Ethernet fabric. This means that hosts, routers, and switches that have access to specific VLANs on the IP fabric must also have the same VLAN access across the RoCE fabric. In other words, the SMC-R connectivity will follow the same virtual network access permissions as normal TCP/IP traffic.

RoCE VLAN配置和访问权限必须镜像聚合增强型以太网结构上的IP VLAN配置和访问权限。这意味着可以访问IP结构上特定VLAN的主机、路由器和交换机也必须在RoCE结构上具有相同的VLAN访问权限。换句话说,SMC-R连接将遵循与正常TCP/IP通信相同的虚拟网络访问权限。

5.2. Firewall Considerations
5.2. 防火墙注意事项

As mentioned above, the RoCE fabric inherits the same VLAN topology/access as the IP fabric. RoCE is a Layer 2 protocol that requires both endpoints to reside in the same Layer 2 network (i.e., VLAN). RoCE traffic cannot traverse multiple VLANs, as there is no support for routing RoCE traffic beyond a single VLAN. As a result, SMC-R communications will also be confined to peers that are members of the same VLAN. IP-based firewalls are typically inserted between VLANs (or physical LANs) and rely on normal IP routing to insert themselves in the data path. Since RoCE (and by extension SMC-R) is not routable beyond the local VLAN, there is no ability to insert a firewall in the network path of two SMC-R peers.

5.3. Host-Based IP Filters

Because SMC-R maintains the TCP three-way handshake for connection setup before switching to RoCE out of band, existing IP filters that control connection setup flows remain effective in an SMC-R environment. IP filters that operate on traffic flowing in an active TCP connection are not supported, because the connection data does not flow over IP.

5.4. Intrusion Detection Services

Similar to IP filters, intrusion detection services that operate on TCP connection setups are compatible with SMC-R with no changes required. However, once the TCP connection has switched to RoCE out of band, packets are not available for examination.

5.5. IP Security (IPsec)

IP security is not compatible with SMC-R, because there are no IP packets on which to operate. TCP connections that require IP security must opt out of SMC-R.

5.6. TLS/SSL

Transport Layer Security/Secure Sockets Layer (TLS/SSL) is preserved in an SMC-R environment. The TLS/SSL layer resides above the SMC-R layer, and outgoing connection data is encrypted before being passed down to the SMC-R layer for RDMA write. Similarly, incoming connection data goes through the SMC-R layer encrypted and is decrypted by the TLS/SSL layer as it is today.

The TLS/SSL handshake messages flow over the TCP connection after the connection has switched to SMC-R, and so they are exchanged using RDMA writes by the SMC-R layer, transparently to the TLS/SSL layer.

6. IANA Considerations

The scarcity of TCP option codes available for assignment is understood, and this architecture uses experimental TCP options following the conventions of [RFC6994] ("Shared Use of Experimental TCP Options").

TCP ExID 0xE2D4C3D9 has been registered with IANA as a TCP Experiment Identifier. See Section 3.1.

If this protocol achieves wide acceptance, a discrete option code may be requested by subsequent versions of this protocol.

7. Normative References

[RFC793] Postel, J., "Transmission Control Protocol", STD 7, RFC 793, DOI 10.17487/RFC0793, September 1981, <http://www.rfc-editor.org/info/rfc793>.

[RFC6994] Touch, J., "Shared Use of Experimental TCP Options", RFC 6994, DOI 10.17487/RFC6994, August 2013, <http://www.rfc-editor.org/info/rfc6994>.

[RoCE] InfiniBand, "RDMA over Converged Ethernet specification", <https://cw.infinibandta.org/wg/Members/documentRevision/download/7149>.

Appendix A. Formats
A.1. TCP Option

The SMC-R TCP option is formatted in accordance with [RFC6994] ("Shared Use of Experimental TCP Options"). The ExID value is IBM-1047 (EBCDIC) encoding for "SMCR".

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   Kind = 254  | Length = 6    |   x'E2'       |   x'D4'       |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |    x'C3'      |    x'D9'      |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        

Figure 24: SMC-R TCP Option Format
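
The option layout in Figure 24 can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the architecture: the function names are hypothetical, and Python's "cp037" codec is used as a stand-in for IBM-1047 on the assumption that the two code pages agree on uppercase Latin letters.

```python
import struct

TCP_OPT_KIND_EXPERIMENTAL = 254  # Kind value shown in Figure 24
# Assumption: cp037 stands in for IBM-1047; both encode "SMCR" identically.
SMCR_EXID = "SMCR".encode("cp037")


def build_smcr_option() -> bytes:
    """Return the 6-byte SMC-R experimental TCP option."""
    length = 2 + len(SMCR_EXID)  # Kind byte + Length byte + 4-byte ExID
    return struct.pack("!BB", TCP_OPT_KIND_EXPERIMENTAL, length) + SMCR_EXID


def is_smcr_option(option: bytes) -> bool:
    """Check the Kind, Length, and ExID of one raw TCP option."""
    return (
        len(option) == 6
        and option[0] == TCP_OPT_KIND_EXPERIMENTAL
        and option[1] == 6
        and option[2:6] == SMCR_EXID
    )


option = build_smcr_option()
print(option.hex())  # fe06e2d4c3d9
```

The four ExID bytes are the same value that is registered as TCP ExID 0xE2D4C3D9.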

A.2. CLC Messages

The following rules apply to all CLC messages:

General rules on formats:

o Reserved fields must be set to zero by the sender and must not be validated by the receiver.

o Each message has an eye catcher at the start and another eye catcher at the end. These must both be validated by the receiver.

o SMC version indicator: The only SMC-R version defined in this architecture is version 1. In the future, if peers have a mismatch of versions, the lowest common version number is used.
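
A receiver-side check of these general rules might look like the following sketch. It assumes the common header layout used by the CLC message figures in this appendix (Type in byte 4, Length in bytes 5-6, Version in the high 4 bits of byte 7) and assumes network byte order for the Length field. The function name and the minimum-length constant are hypothetical.

```python
EYE_CATCHER = bytes([0xE2, 0xD4, 0xC3, 0xD9])  # "SMCR" in IBM-1047 (EBCDIC)
MIN_CLC_LENGTH = 12  # assumption: 8-byte header plus 4-byte trailing eye catcher


def check_clc_framing(msg: bytes) -> tuple:
    """Validate both eye catchers and the length; return (type, version).

    Reserved bits are deliberately not inspected, per the rules above.
    """
    if len(msg) < MIN_CLC_LENGTH:
        raise ValueError("message too short")
    if msg[:4] != EYE_CATCHER:
        raise ValueError("bad leading eye catcher")
    if msg[-4:] != EYE_CATCHER:
        raise ValueError("bad trailing eye catcher")
    length = int.from_bytes(msg[5:7], "big")
    if length != len(msg):
        raise ValueError("length field does not match message size")
    msg_type = msg[4]
    version = msg[7] >> 4  # low 4 bits hold flags and reserved bits
    return msg_type, version
```

The sketch returns the version rather than rejecting unknown values, since version selection between mismatched peers is a separate decision.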

A.2.1. Peer ID Format

All CLC messages contain a peer ID that uniquely identifies an instance of a TCP/IP stack. This peer ID is required to be universally unique across TCP/IP stacks and instances (including restarts) of TCP/IP stacks.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |          Instance ID          |    RoCE MAC (first 2 bytes)   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                    RoCE MAC (last 4 bytes)                    |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        

Figure 25: Peer ID Format

Instance ID

A 2-byte instance count that ensures that if the same RNIC MAC is later used in the peer ID for a different TCP/IP stack -- for example, if an RNIC is redeployed to another stack -- the values are unique. It also ensures that if a TCP/IP stack is restarted, the instance ID changes. The value is implementation defined, with one suggestion being 2 bytes of the system clock.

RoCE MAC

The RoCE MAC address for one of the peer's RNICs. Note that in a virtualized environment this will be the virtual MAC of one of the peer's RNICs.
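
As a rough illustration of Figure 25, the peer ID can be assembled from a 2-byte instance ID and a 6-byte MAC. The helper names below are hypothetical, the MAC value is made up, and taking the low 16 bits of a nanosecond clock is only one possible reading of the "2 bytes of the system clock" suggestion.

```python
import struct
import time
from typing import Optional


def make_peer_id(roce_mac: bytes, instance_id: Optional[int] = None) -> bytes:
    """Build the 8-byte peer ID: 2-byte instance ID, then 6-byte RoCE MAC."""
    if len(roce_mac) != 6:
        raise ValueError("RoCE MAC must be 6 bytes")
    if instance_id is None:
        # Assumption: low 16 bits of the clock serve as the instance count.
        instance_id = time.time_ns() & 0xFFFF
    return struct.pack("!H", instance_id & 0xFFFF) + roce_mac


def split_peer_id(peer_id: bytes) -> tuple:
    """Return (instance_id, roce_mac) from an 8-byte peer ID."""
    if len(peer_id) != 8:
        raise ValueError("peer ID must be 8 bytes")
    (instance_id,) = struct.unpack("!H", peer_id[:2])
    return instance_id, peer_id[2:]


mac = bytes.fromhex("02005e100001")  # made-up MAC for illustration
peer_id = make_peer_id(mac, instance_id=0x1234)
print(peer_id.hex())  # 123402005e100001
```

A real stack would need a stronger guarantee than a clock sample that the instance ID changes across restarts; the sketch does not attempt that.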

A.2.2. SMC Proposal CLC Message Format
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 1     |           Length              |Version| Rsrvd |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                       Client's Peer ID                      -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                Client's preferred GID                       -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Client's preferred RoCE                                      |
     +- MAC address                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |Offset to mask/prefix area (0) |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     .                                                               .
     .                  Area for future growth                       .
     .                                                               .
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                         IPv4 Subnet Mask                      |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | IPv4 Mask Lgth|           Reserved            |Num IPv6 prfx  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :                                                               :
     :           Array of IPv6 prefixes (variable length)            :
     :                                                               :
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        

Figure 26: SMC Proposal CLC Message Format

The fields present in the SMC Proposal CLC message are:

Eye catchers

Like all CLC messages, the SMC Proposal has beginning and ending eye catchers to aid with verification and parsing. The hex digits spell "SMCR" in IBM-1047 (EBCDIC).

Type

CLC message Type 1 indicates SMC Proposal.

Length

The length of this CLC message. If this is an IPv4 flow, this value is 52. Otherwise, it is variable, depending upon how many prefixes are listed.

Version

Version of the SMC-R protocol. Version 1 is the only currently defined value.

Client's Peer ID

As described in Appendix A.2.1 above.

Client's preferred RoCE GID

The IPv6 address of the client's preferred RNIC on the RoCE fabric.

Client's preferred RoCE MAC address

The MAC address of the client's preferred RNIC on the RoCE fabric. It is required, as some operating systems do not have neighbor discovery or ARP support for RoCE RNICs.

Offset to mask/prefix area

Provides the number of bytes that must be skipped after this field, to access the IPv4 Subnet Mask field and the fields that follow it. Allows for future growth of this signal. In this version of the architecture, this value is always zero.

Area for future growth

In this version of the architecture, this field does not exist. This indicates where additional information may be inserted into the signal in the future. The "Offset to mask/prefix area" field must be used to skip over this area.

IPv4 Subnet Mask

If this message is flowing over an IPv4 TCP connection, this field holds the subnet mask associated with the interface over which the client sent this message. If this is an IPv6 flow, this field is all zeros.

This field, along with all fields that follow it in this signal, must be accessed by skipping the number of bytes listed in the "Offset to mask/prefix area" field after the end of that field.

IPv4 Mask Lgth

If this message is flowing over an IPv4 TCP connection, this field holds the number of significant bits in the IPv4 Subnet Mask field. If this is an IPv6 flow, this field is zero.

Num IPv6 prfx

If this message is flowing over an IPv6 TCP connection, this field holds the number of IPv6 prefixes that follow, with a maximum value of 8. If this is an IPv4 flow, this field is zero and is immediately followed by the ending eye catcher.

Array of IPv6 prefixes

For IPv6 TCP connections, a list of the IPv6 prefixes associated with the network over which the client sent this message, up to a maximum of eight prefixes.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +                                                               +
     |                                                               |
     +                  IPv6 prefix value                            +
     |                                                               |
     +                                                               +
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | Prefix Length |
     +-+-+-+-+-+-+-+-+
        

Figure 27: Format for IPv6 Prefix Array Element
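
The following sketch builds an SMC Proposal from the layouts in Figures 26 and 27 and then locates the mask/prefix area through the offset field. It is a hedged illustration: function and parameter names are hypothetical, multi-byte fields are assumed to be in network byte order, and the IPv6 length of 52 + 17 * n bytes is derived here from the figures (each array element is a 16-byte prefix plus a 1-byte prefix length) rather than quoted from the text.

```python
import ipaddress
import struct

EYE = bytes([0xE2, 0xD4, 0xC3, 0xD9])  # "SMCR" in IBM-1047 (EBCDIC)
TYPE_PROPOSAL = 1
VERSION = 1


def build_proposal(peer_id, gid, mac, ipv4_net=None, ipv6_prefixes=()):
    """Build an SMC Proposal; pass ipv4_net (IPv4) or ipv6_prefixes (IPv6)."""
    if len(peer_id) != 8 or len(gid) != 16 or len(mac) != 6:
        raise ValueError("peer_id/gid/mac must be 8/16/6 bytes")
    if len(ipv6_prefixes) > 8:
        raise ValueError("at most 8 IPv6 prefixes")
    if ipv4_net is not None:
        mask, mask_len = ipv4_net.netmask.packed, ipv4_net.prefixlen
    else:
        mask, mask_len = bytes(4), 0  # all zeros on an IPv6 flow
    # Each element: 16-byte prefix value followed by 1-byte prefix length.
    array = b"".join(
        net.network_address.packed + bytes([net.prefixlen])
        for net in ipv6_prefixes
    )
    offset = 0  # always zero in this version; no growth area is present
    length = 52 + len(array)
    header = EYE + struct.pack("!BHB", TYPE_PROPOSAL, length, VERSION << 4)
    body = peer_id + gid + mac + struct.pack("!H", offset)
    mask_area = mask + struct.pack("!B2xB", mask_len, len(ipv6_prefixes))
    return header + body + mask_area + array + EYE


def parse_mask_area(msg):
    """Return (ipv4_mask, ipv4_mask_len, [(prefix, prefix_len), ...])."""
    (offset,) = struct.unpack_from("!H", msg, 38)
    pos = 40 + offset  # skip the growth area, if any
    mask = msg[pos:pos + 4]
    mask_len, count = struct.unpack_from("!B2xB", msg, pos + 4)
    pos += 8
    prefixes = []
    for _ in range(count):
        prefixes.append((msg[pos:pos + 16], msg[pos + 16]))
        pos += 17
    return mask, mask_len, prefixes


v4 = build_proposal(bytes(8), bytes(16), bytes(6),
                    ipv4_net=ipaddress.IPv4Network("192.0.2.0/24"))
v6 = build_proposal(bytes(8), bytes(16), bytes(6),
                    ipv6_prefixes=[ipaddress.IPv6Network("2001:db8::/32")])
print(len(v4), len(v6))  # 52 69
```

Reading the mask area through the offset field, instead of a fixed position, is what lets a parser written this way tolerate a future growth area.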

A.2.3. SMC Accept CLC Message Format
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 2     |    Length = 68                |Version|F|Rsrvd|
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                       Server's Peer ID                      -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                Server's RoCE GID                            -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Server's RoCE                                                |
     +- MAC address                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |     Server QP (bytes 1-2)     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+---+
     |Srvr QP byte 3 |         Server RMB RKey (bytes 1-3)           |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |Srvr RMB byte 4|Server RMB indx| Srvr RMB alert tkn (bytes 1-2)|
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | Srvr RMB alert tkn (bytes 3-4)|Bsize  | MTU   |   Reserved    |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                     Server's RMB virtual address            -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | Reserved      |    Server's initial packet sequence number    |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        

Figure 28: SMC Accept CLC Message Format

The fields present in the SMC Accept CLC message are:

Eye catchers

Like all CLC messages, the SMC Accept has beginning and ending eye catchers to aid with verification and parsing. The hex digits spell "SMCR" in IBM-1047 (EBCDIC).

Type

CLC message Type 2 indicates SMC Accept.

Length

The SMC Accept CLC message is 68 bytes long.

Version

Version of the SMC-R protocol. Version 1 is the only currently defined value.

F-bit

First contact flag: A 1-bit flag that indicates that the server believes this TCP connection is the first SMC-R contact for this link group.

Server's Peer ID

As described in Appendix A.2.1 above.

Server's RoCE GID

The IPv6 address of the RNIC that the server chose for this SMC-R link.

Server's RoCE MAC address

The MAC address of the server's RNIC for the SMC-R link. It is required, as some operating systems do not have neighbor discovery or ARP support for RoCE RNICs.

Server's QP number

The number for the reliably connected queue pair that the server created for this SMC-R link.

Server's RMB RKey

The RDMA RKey for the RMB that the server created or chose for this TCP connection.

Server's RMB element index

The index of the element within the server's RMB that will represent this TCP connection.

Server's RMB element alert token

A platform-defined, architecturally opaque token that identifies this TCP connection. The client adds it as immediate data on RDMA writes to the server, to inform the server that there is data for this connection to retrieve from the RMB element.

Bsize

Server's RMB element buffer size in 4-bit compressed notation: x = 4 bits. Actual buffer size value is (2^(x + 4)) * 1K. Smallest possible value is 16K. Largest size supported by this architecture is 512K.
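
The compressed notation can be worked through with a small sketch. The function names are hypothetical, "1K" is assumed to mean 1024 bytes, and the upper bound x = 5 is derived from the stated 512K maximum, since (2^(5 + 4)) * 1K = 512K.

```python
def bsize_to_bytes(x: int) -> int:
    """Expand the 4-bit compressed Bsize value into a byte count."""
    if not 0 <= x <= 5:
        # x = 0 gives 16K; x = 5 gives 512K, the largest supported size.
        raise ValueError("Bsize outside the 16K..512K range")
    return (2 ** (x + 4)) * 1024


def bytes_to_bsize(size: int) -> int:
    """Return the smallest compressed value whose buffer holds `size` bytes."""
    for x in range(6):
        if bsize_to_bytes(x) >= size:
            return x
    raise ValueError("requested size exceeds 512K")


for x in range(6):
    print(x, bsize_to_bytes(x) // 1024, "K")
```

Because only powers of two are representable, a request such as 20,000 bytes rounds up to the 32K buffer (x = 1).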

MTU

An enumerated value indicating this peer's QP MTU size. The two peers exchange their MTU values, and whichever value is smaller will be used for the QP. This field should only be validated in the first contact exchange.

The enumerated MTU values are:

0: reserved

1: 256

2: 512

3: 1024

4: 2048

5: 4096

6-15: reserved
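
The MTU exchange reduces to a table lookup followed by a minimum. A minimal sketch, with hypothetical names, assuming that a reserved value is treated as an error:

```python
# Enumerated QP MTU values from the list above; 0 and 6-15 are reserved.
QP_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def decode_mtu(code: int) -> int:
    """Map an enumerated MTU value to its size in bytes."""
    if code not in QP_MTU_BYTES:
        raise ValueError(f"reserved MTU value {code}")
    return QP_MTU_BYTES[code]


def negotiate_mtu(local_code: int, peer_code: int) -> int:
    """Return the QP MTU in bytes: the smaller of the two advertised values."""
    return min(decode_mtu(local_code), decode_mtu(peer_code))


print(negotiate_mtu(5, 3))  # 1024
```

How an implementation should react to a reserved value is not stated in this section; raising an error is simply the choice made for the sketch.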

Server's RMB virtual address

The virtual address of the server's RMB as assigned by the server's RNIC.

Server's initial packet sequence number

The starting packet sequence number that this peer will use when sending to the other peer, so that the other peer can prepare its QP for the sequence number to expect.
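
To show how the fields above map onto the 68 bytes of Figure 28, the following sketch decodes an SMC Accept. The class and function names are hypothetical, byte offsets are counted from the figure, and multi-byte integers are assumed to be in network byte order.

```python
import struct
from typing import NamedTuple

EYE = bytes([0xE2, 0xD4, 0xC3, 0xD9])  # "SMCR" in IBM-1047 (EBCDIC)


class Accept(NamedTuple):
    version: int
    first_contact: bool
    peer_id: bytes
    gid: bytes
    mac: bytes
    qp_number: int
    rmb_rkey: int
    rmb_index: int
    alert_token: int
    bsize: int
    mtu: int
    rmb_vaddr: int
    initial_psn: int


def parse_accept(msg: bytes) -> Accept:
    """Decode a 68-byte SMC Accept; offsets follow Figure 28."""
    if len(msg) != 68 or msg[:4] != EYE or msg[64:] != EYE:
        raise ValueError("bad size or eye catcher")
    msg_type, length, flags = struct.unpack_from("!BHB", msg, 4)
    if msg_type != 2 or length != 68:
        raise ValueError("not an SMC Accept")
    # RKey (4 bytes), RMB index (1), alert token (4), Bsize/MTU nibbles (1).
    rkey, index, token, sizes = struct.unpack_from("!IBIB", msg, 41)
    (vaddr,) = struct.unpack_from("!Q", msg, 52)
    return Accept(
        version=flags >> 4,
        first_contact=bool(flags & 0x08),  # F-bit follows the 4-bit version
        peer_id=msg[8:16],
        gid=msg[16:32],
        mac=msg[32:38],
        qp_number=int.from_bytes(msg[38:41], "big"),  # 3-byte QP number
        rmb_rkey=rkey,
        rmb_index=index,
        alert_token=token,
        bsize=sizes >> 4,
        mtu=sizes & 0x0F,
        rmb_vaddr=vaddr,
        initial_psn=int.from_bytes(msg[61:64], "big"),  # 3-byte sequence number
    )
```

Reserved bytes (offsets 51 and 60) and the reserved flag bits are skipped without inspection, in line with the general rule that reserved fields are not validated.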

A.2.4. SMC Confirm CLC Message Format
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 3     |    Length = 68                |Version| Rsrvd |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                       Client's Peer ID                      -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                Client's RoCE GID                            -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Client's RoCE                                                |
     +- MAC address                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |     Client QP (bytes 1-2)     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+---+
     |Clnt QP byte 3 |         Client RMB RKey (bytes 1-3)           |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |Clnt RMB byte 4|Client RMB indx| Clnt RMB alert tkn (bytes 1-2)|
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | Clnt RMB alert tkn (bytes 3-4)|Bsize  | MTU   |   Reserved    |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                  Client's RMB Virtual Address               -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | Reserved      |    Client's initial packet sequence number    |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        

Figure 29: SMC Confirm CLC Message Format

图29:SMC确认CLC消息格式

The SMC Confirm CLC message is nearly identical to the SMC Accept, except that it contains client information and lacks a first contact flag.

SMC确认CLC消息与SMC接受消息几乎相同,只是它包含客户信息且缺少第一个联系人标志。

The fields present in the SMC Confirm CLC message are:

SMC确认CLC消息中的字段为:

Eye catchers

引人注目的

Like all CLC messages, the SMC Confirm has beginning and ending eye catchers to aid with verification and parsing. The hex digits spell "SMCR" in IBM-1047 (EBCDIC).

与所有CLC消息一样,SMC Confirm具有开头和结尾醒目的标记,以帮助进行验证和解析。十六进制数字在IBM-1047(EBCDIC)中拼写为“SMCR”。

Type

类型

CLC message Type 3 indicates SMC Confirm.

CLC消息类型3表示SMC确认。

Length

The SMC Confirm CLC message is 68 bytes long.

SMC确认CLC消息的长度为68字节。

Version

版本

Version of the SMC-R protocol. Version 1 is the only currently defined value.

SMC-R协议的版本。版本1是当前唯一定义的值。

Client's Peer ID

客户端的对等ID

As described in Appendix A.2.1 above.

如上述附录A.2.1所述。

Client's RoCE GID

客户的RoCE GID

The IPv6 address of the RNIC that the client chose for this SMC-R link.

客户端为此SMC-R链路选择的RNIC的IPv6地址。

Client's RoCE MAC address

客户端的RoCE MAC地址

The MAC address of the client's RNIC for the SMC-R link. It is required, as some operating systems do not have neighbor discovery or ARP support for RoCE RNICs.

SMC-R链路的客户端RNIC的MAC地址。这是必需的,因为某些操作系统不支持RoCE RNICs的邻居发现或ARP。

Client's QP number

客户的QP编号

The number for the reliably connected queue pair that the client created for this SMC-R link.

客户端为此SMC-R链路创建的可靠连接队列对的编号。

Client's RMB RKey

客户人民币RKey

The RDMA RKey for the RMB that the client created or chose for this TCP connection.

客户端为此TCP连接创建或选择的RMB的RDMA RKey。

Client's RMB element index

客户人民币要素指数

The index of the element within the client's RMB that will represent this TCP connection.

索引客户端RMB中表示此TCP连接的元素。

Client's RMB element alert token

客户端的人民币元素警报令牌

A platform-defined, architecturally opaque token that identifies this TCP connection. The server adds it as immediate data on its RDMA writes to the client, informing the client that there is data for this connection to retrieve from the RMB element.

平台定义的、体系结构不透明的令牌,用于标识此TCP连接。由服务器添加为RDMA上的即时数据,从服务器写入客户机,通知客户机有此连接的数据要从RMB元素检索。

Bsize

b尺寸:

The client's RMB element buffer size in 4-bit compressed notation, where x is the value of the 4-bit field. The actual buffer size is (2^(x + 4)) * 1K, so x = 0 gives the smallest possible size, 16K. The largest size supported by this architecture is 512K (x = 5).

客户端的RMB元素缓冲区大小(4位压缩表示法):x=4位。实际缓冲区大小值为(2^(x+4))*1K。最小可能值为16K。此体系结构支持的最大大小为512K。

MTU

MTU

An enumerated value indicating this peer's QP MTU size. The two peers exchange their MTU values, and whichever value is smaller will be used for the QP. The values are enumerated in Appendix A.2.3. This value should only be validated in the first contact exchange.

指示此对等方的QP MTU大小的枚举值。两个对等方交换其MTU值,较小的值将用于QP。附录A.2.3中列举了这些值。此值仅应在第一次联系人交换中验证。

Client's RMB Virtual Address

客户的人民币虚拟地址

The virtual address of the client's RMB as assigned by the client's RNIC.

由服务器的RNIC分配的客户端RMB的虚拟地址。

Client's initial packet sequence number

客户端的初始数据包序列号

The starting packet sequence number that this peer will use when sending to the other peer, so that the other peer can prepare its QP for the sequence number to expect.

当发送到另一个对等方时,该对等方将使用的起始分组序列号,以便另一个对等方可以为预期的序列号准备其QP。
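
The field layout above can be illustrated with a short Python sketch that splits a 68-byte SMC Confirm into its fields and expands the compressed Bsize value. It is a reading aid under stated assumptions, not a reference implementation: the byte offsets are derived from Figure 29, the first two words are assumed to use the same eye catcher, Type, Length, and Version layout as the other CLC messages, multi-byte fields are assumed to be in network byte order, and the names `parse_smc_confirm` and `decode_bsize` are hypothetical.

```python
import struct

EYE_CATCHER = bytes([0xE2, 0xD4, 0xC3, 0xD9])  # "SMCR" in EBCDIC
SMC_CONFIRM_TYPE = 3
SMC_CONFIRM_LEN = 68


def decode_bsize(x: int) -> int:
    """Expand the 4-bit compressed Bsize value into bytes: (2^(x+4)) * 1K."""
    if not 0 <= x <= 5:
        # x = 5 is 512K, the largest size the text says is supported.
        raise ValueError("Bsize outside the supported range 0..5")
    return (1 << (x + 4)) * 1024


def parse_smc_confirm(buf: bytes) -> dict:
    """Split a 68-byte SMC Confirm CLC message into named fields."""
    if len(buf) != SMC_CONFIRM_LEN:
        raise ValueError("SMC Confirm must be 68 bytes long")
    if buf[:4] != EYE_CATCHER or buf[-4:] != EYE_CATCHER:
        raise ValueError("missing beginning or ending eye catcher")
    # Assumed header layout: Type (1 byte), Length (2 bytes), Version + flags.
    msg_type, length, ver_flags = struct.unpack_from("!BHB", buf, 4)
    if msg_type != SMC_CONFIRM_TYPE or length != SMC_CONFIRM_LEN:
        raise ValueError("not an SMC Confirm message")
    return {
        "version": ver_flags >> 4,
        "peer_id": buf[8:16],
        "gid": buf[16:32],
        "mac": buf[32:38],
        "qp_number": int.from_bytes(buf[38:41], "big"),
        "rmb_rkey": int.from_bytes(buf[41:45], "big"),
        "rmb_index": buf[45],
        "alert_token": int.from_bytes(buf[46:50], "big"),
        "rmb_size": decode_bsize(buf[50] >> 4),   # high nibble: Bsize
        "mtu_enum": buf[50] & 0x0F,               # low nibble: MTU
        "rmb_vaddr": int.from_bytes(buf[52:60], "big"),
        "initial_psn": int.from_bytes(buf[61:64], "big"),
    }
```

With this sketch, a Bsize field of x = 2 decodes to 64K, and x = 5 decodes to the 512K architectural maximum.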

A.2.5. SMC Decline CLC Message Format
A.2.5. SMC拒绝CLC消息格式
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 4     |    Length = 28                |Version|S|Rsrvd|
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                       Sender's Peer ID                      -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |              Peer Diagnosis Information                       |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   x'E2'       |   x'D4'       |     x'C3'     |     x'D9'     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        

Figure 30: SMC Decline CLC Message Format

图30:SMC拒绝CLC消息格式

The fields present in the SMC Decline CLC message are:

SMC拒绝CLC消息中的字段为:

Eye catchers

引人注目的

Like all CLC messages, the SMC Decline has beginning and ending eye catchers to aid with verification and parsing. The hex digits spell "SMCR" in IBM-1047 (EBCDIC).

与所有CLC消息一样,SMC拒绝也有开头和结尾的引人注目之处,以帮助验证和解析。十六进制数字在IBM-1047(EBCDIC)中拼写为“SMCR”。

Type

类型

CLC message Type 4 indicates SMC Decline.

CLC消息类型4表示SMC拒绝。

Length

The SMC Decline CLC message is 28 bytes long.

SMC拒绝CLC消息的长度为28字节。

Version

版本

Version of the SMC-R protocol. Version 1 is the only currently defined value.

SMC-R协议的版本。版本1是当前唯一定义的值。

S-bit

S位

Sync Bit. Indicates that the link group is out of sync and the receiving peer must clean up its representation of the link group.

同步位。指示链接组不同步,接收对等方必须清除其对链接组的表示。

Sender's Peer ID

发送方的对等ID

As described in Appendix A.2.1 above.

如上述附录A.2.1所述。

Peer Diagnosis Information

同级诊断信息

4 bytes of diagnosis information provided by the peer. These values are defined by the individual peers, and it is necessary to consult the peer's system documentation to interpret the results.

对等方提供的4字节诊断信息。这些值由各个对等方定义,有必要查阅对等方的系统文档来解释结果。
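
As a rough illustration of Figure 30, the following Python sketch builds and parses a 28-byte SMC Decline. It assumes network byte order, takes the S bit to be the bit immediately after the 4-bit Version field as drawn in the figure, and treats the unlabeled word after the diagnosis information as reserved zeros. The function names are hypothetical, and the diagnosis value is opaque to the sketch because the text leaves its meaning to each peer.

```python
import struct

EYE_CATCHER = bytes([0xE2, 0xD4, 0xC3, 0xD9])  # "SMCR" in EBCDIC
SMC_DECLINE_TYPE = 4
SMC_DECLINE_LEN = 28
S_BIT = 0x08  # assumed position: first bit after the Version nibble


def build_smc_decline(peer_id: bytes, diagnosis: int,
                      out_of_sync: bool = False) -> bytes:
    """Assemble an SMC Decline for protocol version 1."""
    if len(peer_id) != 8:
        raise ValueError("peer ID must be 8 bytes")
    ver_flags = (1 << 4) | (S_BIT if out_of_sync else 0)
    # Type, Length, Version/S, Peer ID, diagnosis, 4 reserved bytes.
    body = struct.pack("!BHB8sI4x", SMC_DECLINE_TYPE, SMC_DECLINE_LEN,
                       ver_flags, peer_id, diagnosis)
    return EYE_CATCHER + body + EYE_CATCHER


def parse_smc_decline(buf: bytes) -> dict:
    """Validate the framing of an SMC Decline and return its fields."""
    if len(buf) != SMC_DECLINE_LEN:
        raise ValueError("SMC Decline must be 28 bytes long")
    if buf[:4] != EYE_CATCHER or buf[-4:] != EYE_CATCHER:
        raise ValueError("missing beginning or ending eye catcher")
    msg_type, length, ver_flags, peer_id, diagnosis = struct.unpack_from(
        "!BHB8sI", buf, 4)
    if msg_type != SMC_DECLINE_TYPE or length != SMC_DECLINE_LEN:
        raise ValueError("not an SMC Decline message")
    return {
        "version": ver_flags >> 4,
        "out_of_sync": bool(ver_flags & S_BIT),
        "peer_id": peer_id,
        "diagnosis": diagnosis,
    }
```

A receiver that sees `out_of_sync` set would then clean up its representation of the link group, as the S-bit description requires.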

A.3. LLC Messages
A.3. LLC消息

LLC messages are sent over an existing SMC-R link using RoCE SendMsg and are always 44 bytes long so that they fit into the space available in a single WQE without requiring the receiver to post receive buffers. If a message does not need all 44 bytes, the remainder is padded with zeros. LLC messages use a request/response format: the message type is the same for a request and its response, and a flag indicates which of the two a given message is.

LLC消息通过使用RoCE SendMsg的现有SMC-R链路发送,长度始终为44字节,因此它们适合单个WQE中的可用空间,而无需接收器发布接收缓冲区。如果不需要全部44个字节,则用零填充。LLC消息采用请求/响应格式。请求和响应的消息类型相同,并且标志指示消息是作为请求还是响应流动。

The two high-order bits of an LLC message opcode indicate how it is to be handled by a peer that does not support the opcode.

LLC消息操作码的两个高位指示不支持该操作码的对等方如何处理该操作码。

If the high-order bits of the opcode are b'00', then the peer must support the LLC message and indicate a protocol error if it does not.

如果操作码的高阶位为b'00',则对等方必须支持LLC消息,如果不支持,则指示协议错误。

If the high-order bits of the opcode are b'10', then the peer must silently discard the LLC message if it does not support the opcode. This requirement is included to allow for toleration of advanced, but optional, functionality.

如果操作码的高阶位为b'10',则对等方必须在不支持操作码的情况下默默放弃LLC消息。包含此要求是为了允许高级但可选的功能。

High-order bits of b'11' indicate a Connection Data Control (CDC) message as described in Appendix A.4.

b'11'的高位表示附录a.4中所述的连接数据控制(CDC)消息。
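
The padding rule and the opcode rules above can be sketched in a few lines of Python. The sketch assumes the opcode is the one-byte Type field shown at the start of each LLC message diagram. The function names and the returned strings are hypothetical labels, and the b'01' case is reported as unspecified because the text above does not describe it.

```python
LLC_LEN = 44


def pad_llc(message: bytes) -> bytes:
    """Zero-pad an LLC message to the fixed 44-byte size."""
    if len(message) > LLC_LEN:
        raise ValueError("LLC messages cannot exceed 44 bytes")
    return message.ljust(LLC_LEN, b"\x00")


def unsupported_opcode_action(opcode: int) -> str:
    """Decide how to treat an opcode this peer does not implement."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode is assumed to be a single byte")
    top = opcode >> 6  # the two high-order bits
    if top == 0b00:
        return "protocol_error"   # support is mandatory
    if top == 0b10:
        return "discard"          # optional function, ignore silently
    if top == 0b11:
        return "cdc"              # Connection Data Control message
    return "unspecified"          # b'01' is not covered by the text above
```

For instance, an unknown opcode of 0x81 would be dropped silently, while an unknown opcode of 0x01 would be treated as a protocol error.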

A.3.1. CONFIRM LINK LLC Message Format
A.3.1. 确认链接LLC消息格式
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 1     |  Length = 44  |   Reserved    |R|  Reserved   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Sender's RoCE                                                |
     +-   MAC address                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
     |                                                               |
     +-                                                             -+
     |                 Sender's RoCE GID                             |
     +-                                                             -+
     |                                                               |
     +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |Sender's QP number, bytes 1-2  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |Sender QP byte3| Link number   |Sender's link userID, bytes 1-2|
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |Sender's link userID, bytes 3-4| Max links     |  Reserved     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                         Reserved                            -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        

Figure 31: CONFIRM LINK LLC Message Format

图31:确认链接LLC消息格式

The CONFIRM LINK LLC message is required to be exchanged between the server and client over a newly created SMC-R link to complete the setup of an SMC-R link. Its purpose is to confirm that the RoCE path is actually usable.

需要通过新创建的SMC-R链路在服务器和客户端之间交换确认链路LLC消息,以完成SMC-R链路的设置。其目的是确认RoCE路径实际可用。

On first contact, this message flows after the server receives the SMC Confirm CLC message from the client over the IP connection. For additional links added to an SMC-R link group, it flows after the ADD LINK and ADD LINK CONTINUATION exchange. This flow provides confirmation that the queue pair is in fact usable. Each peer echoes its RoCE information back to the other.

在第一次联系时,此消息在服务器通过IP连接从客户端接收到SMC确认CLC消息后流动。对于添加到SMC-R链接组的其他链接,它在添加链接和添加链接继续交换之后流动。此流确认队列对实际上是可用的。每个对等体将其RoCE信息回传给另一个对等体。

The contents of the CONFIRM LINK LLC message are:

确认链接LLC消息的内容如下:

Type

类型

Type 1 indicates CONFIRM LINK.

类型1表示确认链接。

Length

The CONFIRM LINK LLC message is 44 bytes long.

确认链接LLC消息的长度为44字节。

R

R

Reply flag. When set, indicates that this is a CONFIRM LINK reply.

回复标志。设置后,表示这是确认链接回复。

Sender's RoCE MAC address

发件人的RoCE MAC地址

The MAC address of the sender's RNIC for the SMC-R link. It is required, as some operating systems do not have neighbor discovery or ARP support for RoCE RNICs.

SMC-R链路的发送方RNIC的MAC地址。这是必需的,因为某些操作系统不支持RoCE RNICs的邻居发现或ARP。

Sender's RoCE GID

发件人的RoCE GID

The IPv6 address of the RNIC that the sender is using for this SMC-R link.

发送方用于此SMC-R链路的RNIC的IPv6地址。

Sender's QP number

发件人的QP号码

The number for the reliably connected queue pair that the sender created for this SMC-R link.

发送方为此SMC-R链路创建的可靠连接队列对的编号。

Link number

链接号

An identifier assigned by the server that uniquely identifies the link within the link group. This identifier is ONLY unique within a link group. Provided by the server and echoed back by the client.

服务器分配的唯一标识链接组中链接的标识符。此标识符仅在链接组中是唯一的。由服务器提供,并由客户端回显。

Link user ID

链接用户ID

An opaque, implementation-defined identifier assigned by the sender and provided to the receiver solely for purposes of display, diagnosis, network management, etc. The link user ID should be unique across the sender's entire software space, including all other link groups.

由发送方分配并提供给接收方的不透明、实现定义的标识符,仅用于显示、诊断、网络管理等目的。链路用户ID在发送方的整个软件空间(包括所有其他链路组)中应是唯一的。

Max links

最大链接

The maximum number of links the sender can support in a link group. The maximum for this link group is the smaller of the values provided by the two peers.

发件人在链接组中可以支持的最大链接数。此链路组的最大值是两个对等方提供的值中的较小值。
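
To make the layout in Figure 31 concrete, here is a minimal Python sketch that packs and unpacks a CONFIRM LINK message and applies the Max links rule. The byte offsets are read off the figure, network byte order is assumed, the R flag is assumed to be the high-order bit of the fourth byte, and all function names are hypothetical.

```python
import struct

# Type, Length, reserved, flags, MAC, GID, QP, link number,
# link user ID, Max links, 9 reserved bytes = 44 bytes in total.
CONFIRM_LINK_FMT = "!BBxB6s16s3sB4sB9x"
REPLY_FLAG = 0x80  # assumed position of the R bit


def build_confirm_link(mac: bytes, gid: bytes, qp_number: int,
                       link_number: int, link_user_id: bytes,
                       max_links: int, reply: bool = False) -> bytes:
    """Pack a 44-byte CONFIRM LINK request or reply."""
    if len(mac) != 6 or len(gid) != 16 or len(link_user_id) != 4:
        raise ValueError("MAC, GID, and link user ID must be 6, 16, and 4 bytes")
    flags = REPLY_FLAG if reply else 0
    return struct.pack(CONFIRM_LINK_FMT, 1, 44, flags, mac, gid,
                       qp_number.to_bytes(3, "big"), link_number,
                       link_user_id, max_links)


def parse_confirm_link(buf: bytes) -> dict:
    """Unpack a CONFIRM LINK message into named fields."""
    (msg_type, length, flags, mac, gid, qp,
     link_number, link_user_id, max_links) = struct.unpack(CONFIRM_LINK_FMT, buf)
    if msg_type != 1 or length != 44:
        raise ValueError("not a CONFIRM LINK message")
    return {
        "reply": bool(flags & REPLY_FLAG),
        "mac": mac,
        "gid": gid,
        "qp_number": int.from_bytes(qp, "big"),
        "link_number": link_number,
        "link_user_id": link_user_id,
        "max_links": max_links,
    }


def negotiated_max_links(local: int, peer: int) -> int:
    """The link group maximum is the smaller of the two peers' values."""
    return min(local, peer)
```

A client answering the server's request would echo the server-assigned link number and set `reply=True`.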

A.3.2. ADD LINK LLC Message Format
A.3.2. 添加链接LLC消息格式
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 2     |  Length = 44  | Rsrvd |RsnCode|R|Z| Reserved  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Sender's RoCE                                                |
     +-   MAC address                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
     |                                                               |
     +-                                                             -+
     |                 Sender's RoCE GID                             |
     +-                                                             -+
     |                                                               |
     +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |Sender's QP number, bytes 1-2  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |Sender QP byte3| Link number   |Rsrvd  |  MTU  |Initial PSN    |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Initial PSN (continued)      |                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                              -+
     |                          Reserved                             |
     +-                                                             -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        

Figure 32: ADD LINK LLC Message Format

图32:添加链接LLC消息格式

The ADD LINK LLC message is sent over an existing link in the link group when a peer wishes to add an SMC-R link to an existing SMC-R link group. It is sent by the server to add a new SMC-R link to the group, or by the client to request that the server add a new link -- for example, when a new RNIC becomes active. When sent from the client to the server, it represents a request that the server initiate an ADD LINK exchange.

当对等方希望将SMC-R链路添加到现有SMC-R链路组时,添加链路LLC消息通过链路组中的现有链路发送。它由服务器发送以向组添加新的SMC-R链接,或由客户端发送以请求服务器添加新链接——例如,当新的RNIC变为活动时。从客户端发送到服务器时,它表示服务器发起添加链接交换的请求。

This message is sent immediately after the initial SMC-R link in the group completes, as described in Section 3.5.1 ("First Contact"). It can also be sent over an existing SMC-R link group at any time as new RNICs are added and become available. Therefore, there can be as few as one new RMB RToken to be communicated, or several. RTokens will be communicated using ADD LINK CONTINUATION messages.

如第3.5.1节(“第一次接触”)所述,该消息在组中的初始SMC-R链接完成后立即发送。随着新RNIC的添加和可用,它也可以随时通过现有SMC-R链路组发送。因此,可以只沟通一个或多个新的RMB项目。RTOKEN将使用添加链接延续消息进行通信。

The contents of the ADD LINK LLC message are:

ADD LINK LLC消息的内容包括:

Type

类型

Type 2 indicates ADD LINK.

类型2表示添加链接。

Length

The ADD LINK LLC message is 44 bytes long.

添加链接LLC消息的长度为44字节。

RsnCode

RsnCode

If the Z (rejection) flag is set, this field provides the reason code. Values can be:

如果设置了Z(拒绝)标志,则此字段提供原因代码。值可以是:

X'1' - no alternate path available: set when the server provides the same MAC/GID as an existing SMC-R link in the group, and the client does not have any additional RNICs available (i.e., the server is attempting to set up an asymmetric link but none is available).

X'1'-无备用路径可用:当服务器提供与组中现有SMC-R链路相同的MAC/GID,并且客户端没有任何其他可用的RNIC(即,服务器尝试设置非对称链路,但没有可用的)时设置。

X'2' - Invalid MTU value specified.

X'2'-指定的MTU值无效。

R

R

Reply flag. When set, indicates that this is an ADD LINK reply.

回复标志。设置后,表示这是添加链接回复。

Z

Z

Rejection flag. When set on reply, indicates that the server's ADD LINK was rejected by the client. When this flag is set, the reason code will also be set.

拒绝标志。当在回复时设置时,表示服务器的添加链接被客户端拒绝。设置此标志时,还将设置原因代码。

Sender's RoCE MAC address

发件人的RoCE MAC地址

The MAC address of the sender's RNIC for the new SMC-R link. It is required, as some operating systems do not have neighbor discovery or ARP support for RoCE RNICs.

新SMC-R链路的发送方RNIC的MAC地址。这是必需的,因为某些操作系统不支持RoCE RNICs的邻居发现或ARP。

Sender's RoCE GID

发件人的RoCE GID

The IPv6 address of the RNIC that the sender is using for the new SMC-R link.

发送方用于新SMC-R链路的RNIC的IPv6地址。

Sender's QP number

发件人的QP号码

The number for the reliably connected queue pair that the sender created for the new SMC-R link.

发送方为新SMC-R链路创建的可靠连接队列对的编号。

Link number

链接号

An identifier for the new SMC-R link. This is assigned by the server and uniquely identifies the link within the link group. This identifier is ONLY unique within a link group. Provided by the server and echoed back by the client.

新SMC-R链路的标识符。这由服务器分配,并唯一标识链接组中的链接。此标识符仅在链接组中是唯一的。由服务器提供,并由客户端回显。

MTU

MTU

An enumerated value indicating this peer's QP MTU size. The two peers exchange their MTU values, and whichever value is smaller will be used for the QP. The values are enumerated in Appendix A.2.3.

指示此对等方的QP MTU大小的枚举值。两个对等方交换其MTU值,较小的值将用于QP。附录A.2.3中列举了这些值。

Initial PSN

初始PSN

The starting packet sequence number (PSN) that this peer will use when sending to the other peer, so that the other peer can prepare its QP for the sequence number to expect.

当发送到另一个对等方时,该对等方将使用的起始分组序列号(PSN),以便另一个对等方可以为预期的序列号准备其QP。
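
The following Python sketch shows one way to read Figure 32, including the rejection flag and reason code. It assumes network byte order, takes R and Z to be the two high-order bits of the fourth byte, and takes RsnCode and MTU to be the low-order four bits of their bytes, as the figure suggests. The function names and the `REASONS` table wording are hypothetical conveniences.

```python
import struct

# Type, Length, RsnCode byte, flags, MAC, GID, QP, link number,
# MTU byte, initial PSN, 10 reserved bytes = 44 bytes in total.
ADD_LINK_FMT = "!BBBB6s16s3sBB3s10x"
REPLY_FLAG = 0x80   # assumed position of the R bit
REJECT_FLAG = 0x40  # assumed position of the Z bit
REASONS = {
    0x1: "no alternate path available",
    0x2: "invalid MTU value specified",
}


def build_add_link(mac: bytes, gid: bytes, qp_number: int, link_number: int,
                   mtu_enum: int, initial_psn: int, reply: bool = False,
                   reject_reason=None) -> bytes:
    """Pack an ADD LINK request, reply, or rejecting reply."""
    if len(mac) != 6 or len(gid) != 16:
        raise ValueError("MAC and GID must be 6 and 16 bytes")
    flags = REPLY_FLAG if reply else 0
    reason = 0
    if reject_reason is not None:
        if not reply:
            raise ValueError("the rejection flag is only set on a reply")
        flags |= REJECT_FLAG
        reason = reject_reason & 0x0F
    return struct.pack(ADD_LINK_FMT, 2, 44, reason, flags, mac, gid,
                       qp_number.to_bytes(3, "big"), link_number,
                       mtu_enum & 0x0F, initial_psn.to_bytes(3, "big"))


def parse_add_link(buf: bytes) -> dict:
    """Unpack an ADD LINK message and decode its flags."""
    (msg_type, length, reason, flags, mac, gid, qp,
     link_number, mtu, psn) = struct.unpack(ADD_LINK_FMT, buf)
    if msg_type != 2 or length != 44:
        raise ValueError("not an ADD LINK message")
    rejected = bool(flags & REJECT_FLAG)
    return {
        "reply": bool(flags & REPLY_FLAG),
        "rejected": rejected,
        "reason": REASONS.get(reason & 0x0F, "unknown") if rejected else None,
        "mac": mac,
        "gid": gid,
        "qp_number": int.from_bytes(qp, "big"),
        "link_number": link_number,
        "mtu_enum": mtu & 0x0F,
        "initial_psn": int.from_bytes(psn, "big"),
    }
```

The reason code is only interpreted when the Z flag is set, which mirrors the RsnCode description above.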

A.3.3. ADD LINK CONTINUATION LLC Message Format
A.3.3. 添加链接延续LLC消息格式
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 3     |  Length = 44  |  Reserved     |R|  Reserved   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   Linknum     | NumRTokens    |         Reserved              |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                  RKey/RToken pair                           -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                  RKey/RToken pair or zeros                  -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                        Reserved                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        

Figure 33: ADD LINK CONTINUATION LLC Message Format

图33:添加链接延续LLC消息格式

When a new SMC-R link is added to an SMC-R link group, it is necessary to communicate the new link's RTokens for the RMBs that the SMC-R link group can access. This message follows the ADD LINK and provides the RTokens.

当新的SMC-R链路添加到SMC-R链路组时,有必要为SMC-R链路组可以访问的RMBs传送新链路的RTokens。此消息位于添加链接之后,并提供RTokens。

The server kicks off this exchange by sending the first ADD LINK CONTINUATION LLC message, and the server controls the exchange as described below.

服务器通过发送第一条ADD LINK CONTINUATION LLC消息启动此交换,并且服务器控制交换,如下所述。

o If the client and the server require the same number of ADD LINK CONTINUATION messages to communicate their RTokens, the server starts the exchange by sending the first ADD LINK CONTINUATION request to the client with its (the server's) RTokens. The client then responds with an ADD LINK CONTINUATION response with its RTokens, and so on until the exchange is completed.

o 如果客户端和服务器需要相同数量的添加链接延续消息来通信其RTOKEN,则服务器将通过向客户端及其(服务器的)RTOKEN发送第一个添加链接延续请求来启动exchange。然后,客户机用其RTOKEN的添加链接继续响应,依此类推,直到交换完成。

o If the server requires more ADD LINK CONTINUATION messages than the client, then after the client has communicated all of its RTokens, the server continues to send ADD LINK CONTINUATION request messages to the client. The client continues to respond, using empty (number of RTokens to be communicated = 0) ADD LINK CONTINUATION response messages.

o 如果服务器需要比客户端更多的添加链接延续消息,则在客户端通信其所有RTOKEN后,服务器将继续向客户端发送添加链接延续请求消息。客户端继续响应,使用空(要通信的RTOKEN数=0)添加链接继续响应消息。

o If the client requires more ADD LINK CONTINUATION messages than the server, then after communicating all of its RTokens, the server will continue to send empty ADD LINK CONTINUATION messages to the client to solicit replies with the client's RTokens, until all have been communicated.

o 如果客户端需要的添加链接延续消息多于服务器,则在通信其所有RTOKEN后,服务器将继续向客户端发送空的添加链接延续消息,以请求与客户端RTOKEN的回复,直到所有RTOKEN都已通信。
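
The three cases above reduce to a single rule: the server keeps sending requests, and the client keeps replying, until neither side has RTokens left, with either side sending empty messages once its own tokens are exhausted. The Python sketch below simulates that rule and returns the remaining-token counts that each request and response pair would carry. It relies on the limits of two RTokens per message and 255 RTokens per peer from the NumRTokens field description in this section, and the function name is hypothetical.

```python
MAX_RTOKENS = 255   # maximum value of the NumRTokens field
PER_MESSAGE = 2     # RKey/RToken pairs carried by one message


def continuation_rounds(server_tokens: int, client_tokens: int) -> list:
    """Return (request count, response count) for each exchange round.

    Each tuple holds the number of RTokens still to be communicated by the
    server and by the client at the start of that round; 0 means the
    message is empty.
    """
    for count in (server_tokens, client_tokens):
        if not 0 <= count <= MAX_RTOKENS:
            raise ValueError("RToken count must be in 0..255")
    rounds = []
    s, c = server_tokens, client_tokens
    while True:
        rounds.append((s, c))          # server request, then client response
        s = max(s - PER_MESSAGE, 0)
        c = max(c - PER_MESSAGE, 0)
        if s == 0 and c == 0:
            return rounds
```

For example, a server with five RTokens and a client with one need three rounds, and the client's second and third responses are empty.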

The contents of the ADD LINK CONTINUATION LLC message are:

添加链接继续LLC消息的内容如下:

Type

类型

Type 3 indicates ADD LINK CONTINUATION.

类型3表示添加链接延续。

Length

The ADD LINK CONTINUATION LLC message is 44 bytes long.

ADD LINK CONTINUATION LLC消息的长度为44字节。

R

R

Reply flag. When set, indicates that this is an ADD LINK CONTINUATION reply.

回复标志。设置后,表示这是添加链接继续回复。

LinkNum

林克纳姆

The link number of the new link within the SMC-R link group for which RKeys are being communicated.

SMC-R链路组内新链路的链路号,RKEY正在为其进行通信。

NumRTokens

numr代币

Number of RTokens remaining to be communicated (including the ones in this message). If the value is less than or equal to 2, this is the last message. If it is greater than 2, another continuation message will be required, and its value will be the value in this message minus 2, and so on until all RKeys are communicated. The maximum value for this field is 255.

剩余要通信的RTOken数(包括此消息中的RTOken)。如果该值小于或等于2,则这是最后一条消息。如果大于2,则需要另一条延续消息,其值将是该消息中的值减去2,依此类推,直到所有RKEY都被传送。此字段的最大值为255。

RKey/RToken pairs (two or fewer)

RKey/RToken对(两对或更少)

Each pair consists of an RKey for an RMB that is known on the SMC-R link over which this message was sent (the reference RKey), paired with the same RMB's RToken over the new SMC-R link. A full RToken is not required for the reference, because it is only used to identify which RMB the pair applies to, not to address it.

这些包括发送此消息的SMC-R链路上已知的RMB的RKey(参考RKey),与新SMC-R链路上相同RMB的RToken配对。参考文件不需要完整的RToken,因为它仅用于区分其适用于哪种人民币,而不是对其进行处理。

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                         Reference RKey                        |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                            New RKey                           |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                       New Virtual Address                   -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        

Figure 34: RKey/RToken Pair Format

图34:RKey/RToken对格式

The contents of the RKey/RToken pair are:

RKey/RToken对的内容包括:

Reference RKey

参考文献

The RKey of the RMB as it is already known on the SMC-R link over which this message is being sent. Required so that the peer knows with which RMB to associate the new RToken.

人民币的RKey,在发送此消息的SMC-R链路上已知。要求,以便对等方知道与新RToken关联的RMB。

New RKey

新罗基

The RKey of this RMB as it is known over the new SMC-R link.

新SMC-R链路上的人民币RKey。

New Virtual Address

新虚拟地址

The virtual address of this RMB as it is known over the new SMC-R link.

新SMC-R链路上已知的人民币虚拟地址。
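
Putting Figures 33 and 34 together, the Python sketch below splits a list of RKey/RToken pairs into 44-byte ADD LINK CONTINUATION messages and fills in NumRTokens as the count still to be communicated, including the pairs in the current message. Network byte order and the R flag position (high-order bit of the fourth byte) are assumptions, and the function name and tuple representation of a pair are hypothetical.

```python
import struct

REPLY_FLAG = 0x80  # assumed position of the R bit
PAIR_FMT = "!IIQ"  # reference RKey, new RKey, new virtual address (16 bytes)


def build_add_link_continuations(link_num: int, pairs: list,
                                 reply: bool = False) -> list:
    """Return the 44-byte messages needed to communicate all pairs.

    Each element of pairs is (reference_rkey, new_rkey, new_virtual_address).
    An empty list yields one empty message with NumRTokens = 0.
    """
    if len(pairs) > 255:
        raise ValueError("NumRTokens cannot exceed 255")
    flags = REPLY_FLAG if reply else 0
    messages = []
    remaining = len(pairs)
    index = 0
    while True:
        chunk = pairs[index:index + 2]
        # Type, Length, reserved, flags, LinkNum, NumRTokens, 2 reserved bytes.
        header = struct.pack("!BBxBBB2x", 3, 44, flags, link_num, remaining)
        # Up to two pairs; a missing second pair is sent as zeros.
        body = b"".join(struct.pack(PAIR_FMT, *p) for p in chunk)
        messages.append(header + body.ljust(32, b"\x00") + bytes(4))
        index += 2
        remaining -= len(chunk)
        if remaining <= 0:
            return messages
```

With three pairs, the first message carries NumRTokens = 3 and two pairs, and the second carries NumRTokens = 1 with its second pair slot zeroed.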

A.3.4. DELETE LINK LLC Message Format
A.3.4. 删除链接LLC消息格式
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 4     |  Length = 44  |  Reserved     |R|A|O| Rsrvd   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   Linknum     |         reason code (bytes 1-3)               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |RsnCode byte 4 |                                               |
     +-+-+-+-+-+-+-+-+                                              -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                          Reserved                           -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        
Figure 35: DELETE LINK LLC Message Format

When the client or server detects that a QP or SMC-R link has gone down or needs to be brought down, it sends this message over one of the other links in the link group.

When the DELETE LINK is sent from the client, it only serves as a notification, and the client expects the server to respond by sending a DELETE LINK request. To avoid races, only the server will initiate the actual DELETE LINK request and response sequence that results from notification from the client.

The server can also initiate the DELETE LINK without notification from the client if it detects an error or if orderly link termination was initiated.

The client may also request termination of the entire link group, and the server may terminate the entire link group using this message.

The contents of the DELETE LINK LLC message are:

Type

Type 4 indicates DELETE LINK.

Length

The DELETE LINK LLC message is 44 bytes long.

R

Reply flag. When set, indicates that this is a DELETE LINK reply.

A

"All" flag. When set, indicates that all links in the link group are to be terminated. This terminates the link group.

O

Orderly flag. Indicates orderly termination. Orderly termination is generally caused by an operator command rather than an error on the link. When the client requests orderly termination, the server may wait to complete other work before terminating.

LinkNum

The link number of the link to be terminated. If the A flag is set, this field has no meaning and is set to 0.

RsnCode

The termination reason code. Currently defined reason codes are:

Request reason codes:

X'00010000' = Lost path

X'00020000' = Operator initiated termination

X'00030000' = Program initiated termination (link inactivity)

X'00040000' = LLC protocol violation

X'00050000' = Asymmetric link no longer needed

Response reason code:

X'00100000' = Unknown link ID (no link)
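
As a non-normative illustration, the DELETE LINK layout in Figure 35 can be sketched in Python. The sketch assumes network byte order and assumes that the R, A, and O flags occupy the three high-order bits of the fourth byte, as the figure suggests. The function and constant names are hypothetical and are not part of this specification.

```python
import struct

LLC_LEN = 44           # every LLC message is 44 bytes long
LLC_DELETE_LINK = 4    # Type = 4

# Assumed flag positions in the fourth byte (bit 0 is the high-order bit).
FLAG_REPLY, FLAG_ALL, FLAG_ORDERLY = 0x80, 0x40, 0x20

REASON_CODES = {
    0x00010000: "Lost path",
    0x00020000: "Operator initiated termination",
    0x00030000: "Program initiated termination (link inactivity)",
    0x00040000: "LLC protocol violation",
    0x00050000: "Asymmetric link no longer needed",
    0x00100000: "Unknown link ID (no link)",
}

# Type, Length, reserved byte, flags, LinkNum, RsnCode, 35 reserved bytes.
_FORMAT = "!BBBBBI35x"


def pack_delete_link(link_num, reason, reply=False, all_links=False, orderly=False):
    """Build a 44-byte DELETE LINK LLC message."""
    flags = 0
    if reply:
        flags |= FLAG_REPLY
    if all_links:
        flags |= FLAG_ALL
        link_num = 0  # LinkNum has no meaning when the A flag is set
    if orderly:
        flags |= FLAG_ORDERLY
    return struct.pack(_FORMAT, LLC_DELETE_LINK, LLC_LEN, 0, flags, link_num, reason)


def unpack_delete_link(msg):
    """Parse a DELETE LINK LLC message into a dictionary."""
    if len(msg) != LLC_LEN:
        raise ValueError("LLC messages are 44 bytes long")
    msg_type, length, _reserved, flags, link_num, reason = struct.unpack(_FORMAT, msg)
    if msg_type != LLC_DELETE_LINK or length != LLC_LEN:
        raise ValueError("not a DELETE LINK LLC message")
    return {
        "reply": bool(flags & FLAG_REPLY),
        "all": bool(flags & FLAG_ALL),
        "orderly": bool(flags & FLAG_ORDERLY),
        "link_num": link_num,
        "reason": reason,
        "reason_text": REASON_CODES.get(reason, "unknown"),
    }
```

For example, pack_delete_link(2, 0x00010000) would describe a request to terminate link 2 because of a lost path, and unpack_delete_link applied to the result recovers the same fields.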

A.3.5. CONFIRM RKEY LLC Message Format
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 6     |  Length = 44  |   Reserved    |R|0|Z|C|Rsrvd  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |   NumTkns     |  New RMB RKey for this link (bytes 1-3)       |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |ThisLink byte 4|                                               |
     +-+-+-+-+-+-+-+-+                                              -+
     |           New RMB virtual address for this link               |
     +-              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |               |                                               |
     +-+-+-+-+-+-+-+-+                                              -+
     |                                                               |
     +-   Other link RMB specification or zeros                     -+
     |                                                               |
     +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                              -+
     |                                                               |
     +-                                                             -+
     |      Other link RMB specification or zeros                    |
     +-                                              +-+-+-+-+-+-+-+-+
     |                                               |  Reserved     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        
Figure 36: CONFIRM RKEY LLC Message Format

The CONFIRM RKEY flow can be sent at any time from either the client or the server, to inform the peer that an RMB has been created or deleted. The creator of a new RMB must inform its peer of the new RMB's RToken for all SMC-R links in the SMC-R link group.

For RMB creation, the creator sends this message over the SMC-R link used by the first TCP connection that uses the new RMB. This message contains the new RMB's RToken for the SMC-R link over which the message is sent. It then lists the sender's other SMC-R links in the link group, each paired with the new RMB's RToken for that link. This message can communicate the new RTokens for three QPs: the QP for the link over which this message is sent, and two others. If there are more than three links in the SMC-R link group, a CONFIRM RKEY CONTINUATION message will be required.

The peer responds by simply echoing the message with the reply flag (R) set. If the response is a negative response (Z flag set), the sender must recalculate the RToken set and start a new CONFIRM RKEY exchange from the beginning. The timing of this retry is controlled by the C flag, as described below.

The contents of the CONFIRM RKEY LLC message are:

Type

Type 6 indicates CONFIRM RKEY.

Length

The CONFIRM RKEY LLC message is 44 bytes long.

R

Reply flag. When set, indicates that this is a CONFIRM RKEY reply.

0

Reserved bit.

Z

Negative response flag.

C

Configuration Retry bit. If this is a negative response and this flag is set, the originator should recalculate the RKey set and retry this exchange as soon as the current configuration change is completed. If this flag is not set on a negative response, the originator must wait for the next natural stimulus (for example, a new TCP connection started that requires a new RMB) before retrying.

NumTkns

The number of other link/RToken pairs, including those provided in this message, to be communicated. Note that this value does not include the RToken for the link on which this message was sent, and that at most two of the other pairs fit in this message. If this value is 2 or less, this is the only message in the exchange. If this value is greater than 2, a CONFIRM RKEY CONTINUATION message will be required.

Note: In this version of the architecture, eight is the maximum number of links supported in a link group.

New RMB RKey for this link

The new RMB's RKey as assigned on the link over which this message is being sent.

New RMB virtual address for this link

The new RMB's virtual address as assigned on the link over which this message is being sent.

Other link RMB specification

The new RMB's specification on the other links in the link group, as shown in Figure 37.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | Link number   | RMB's RKey for the specified link (bytes 1-3) |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |New RKey byte 4|                                               |
     +-+-+-+-+-+-+-+-+                                              -+
     |           RMB's virtual address for the specified link        |
     +-              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |               |
     +-+-+-+-+-+-+-+-+
        
Figure 37: Format of Link Number/RKey Pairs

Link number

The link number for a link in the link group.

RMB's RKey for the specified link

The RKey used to reach the RMB over the link whose number was specified in the Link number field.

RMB's virtual address for the specified link

The virtual address used to reach the RMB over the link whose number was specified in the Link number field.
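
As a non-normative illustration, the CONFIRM RKEY layout in Figures 36 and 37 and the handling of the Z and C flags can be sketched in Python. The sketch assumes network byte order and assumes that R, Z, and C sit at the bit positions shown in Figure 36. All names are hypothetical, and the caller supplies the NumTkns value when more pairs remain than fit in this message.

```python
import struct

LLC_LEN = 44
LLC_CONFIRM_RKEY = 6   # Type = 6

# Assumed flag positions in the fourth byte: |R|0|Z|C|Rsrvd|.
FLAG_REPLY, FLAG_NEGATIVE, FLAG_CONFIG_RETRY = 0x80, 0x20, 0x10
MAX_OTHER_LINKS_PER_MSG = 2


def pack_confirm_rkey(this_rkey, this_vaddr, other_links, num_tkns=None):
    """Build a CONFIRM RKEY request.

    other_links holds the (link_number, rkey, vaddr) tuples carried in this
    message; num_tkns defaults to the number of tuples carried here.
    """
    if len(other_links) > MAX_OTHER_LINKS_PER_MSG:
        raise ValueError("only two other link specifications fit in this message")
    if num_tkns is None:
        num_tkns = len(other_links)
    msg = struct.pack("!BBBB", LLC_CONFIRM_RKEY, LLC_LEN, 0, 0)
    # NumTkns, then the RKey and virtual address for the sending link.
    msg += struct.pack("!BIQ", num_tkns, this_rkey, this_vaddr)
    for link_num, rkey, vaddr in other_links:
        msg += struct.pack("!BIQ", link_num, rkey, vaddr)
    # Unused link specifications and the final reserved byte are zeros.
    return msg.ljust(LLC_LEN, b"\x00")


def make_reply(request, negative=False, config_retry=False):
    """Echo the request with the reply flag set, optionally as a negative response."""
    flags = request[3] | FLAG_REPLY
    if negative:
        flags |= FLAG_NEGATIVE
    if config_retry:
        flags |= FLAG_CONFIG_RETRY
    return request[:3] + bytes([flags]) + request[4:]


def retry_action(reply):
    """Decide what the originator does after receiving a reply."""
    flags = reply[3]
    if not (flags & FLAG_REPLY and flags & FLAG_NEGATIVE):
        return "none"
    if flags & FLAG_CONFIG_RETRY:
        return "retry_after_config_change"
    return "wait_for_next_stimulus"
```

The three return values of retry_action correspond to a positive reply, a negative reply with the C flag set, and a negative reply without it.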

A.3.6. CONFIRM RKEY CONTINUATION LLC Message Format
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 8     |  Length = 44  |   Reserved    |R|0|Z|  Rsrvd  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  NumTknsLeft  |                                               |
     +-+-+-+-+-+-+-+-+                                              -+
     |                                                               |
     +-          Other link RMB specification                       -+
     |                                                               |
     +-              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |               |                                               |
     +-+-+-+-+-+-+-+-+                                              -+
     |                                                               |
     +-   Other link RMB specification or zeros                     -+
     |                                                               |
     +-                              +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                              -+
     |                                                               |
     +-                                                             -+
     |      Other link RMB specification or zeros                    |
     +-                                              +-+-+-+-+-+-+-+-+
     |                                               |  Reserved     |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        
Figure 38: CONFIRM RKEY CONTINUATION LLC Message Format

The CONFIRM RKEY CONTINUATION LLC message is used to communicate any additional RMB RTokens that did not fit into the CONFIRM RKEY message. Each of these messages can hold up to three RMB RTokens. The NumTknsLeft field indicates how many RMB RTokens are to be communicated, including the ones in this message. If the value is 3 or less, this is the last message of the group. If the value is 4 or higher, additional CONFIRM RKEY CONTINUATION messages will follow, and the NumTknsLeft value will be a countdown until all are communicated.

As with the CONFIRM RKEY message, the peer responds by echoing the message back with the reply flag set.

The contents of the CONFIRM RKEY CONTINUATION LLC message are:

Type

Type 8 indicates CONFIRM RKEY CONTINUATION.

Length

The CONFIRM RKEY CONTINUATION LLC message is 44 bytes long.

R

Reply flag. When set, indicates that this is a CONFIRM RKEY CONTINUATION reply.

0

Reserved bit.

Z

Negative response flag.

NumTknsLeft

The number of link/RToken pairs, including those provided in this message, that are remaining to be communicated. If this value is 3 or less, this is the last message in the exchange. If this value is greater than 3, another CONFIRM RKEY CONTINUATION message will be required. Note that in this version of the architecture, eight is the maximum number of links supported in a link group.

Other link RMB specification

The new RMB's specification on other links in the link group, as shown in Figure 37.
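
As a non-normative illustration, the NumTknsLeft countdown can be sketched in Python. The sketch assumes that each CONFIRM RKEY CONTINUATION message carries as many of the remaining link/RToken pairs as fit (up to three) and that NumTknsLeft decreases by the number of pairs carried. The function name is hypothetical.

```python
TOKENS_PER_CONTINUATION = 3  # each CONTINUATION message holds up to three RTokens


def continuation_plan(remaining):
    """Return (NumTknsLeft, pairs_carried) for each CONTINUATION message.

    remaining is the number of link/RToken pairs still to be communicated
    when the first CONTINUATION message is built.
    """
    if remaining < 1:
        raise ValueError("nothing left to communicate")
    plan = []
    while remaining > 0:
        carried = min(remaining, TOKENS_PER_CONTINUATION)
        # NumTknsLeft includes the pairs carried in this message.
        plan.append((remaining, carried))
        remaining -= carried
    return plan
```

With five pairs remaining, the plan is a message with NumTknsLeft = 5 carrying three pairs, followed by a final message with NumTknsLeft = 2 carrying two pairs.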

A.3.7. DELETE RKEY LLC Message Format
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 9     |  Length = 44  |   Reserved    |R|0|Z|  Rsrvd  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |     Count     | Error Mask    |        Reserved               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                First deleted RKey                             |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |            Second deleted RKey or zeros                       |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |            Third deleted RKey or zeros                        |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |            Fourth deleted RKey or zeros                       |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |            Fifth deleted RKey or zeros                        |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |            Sixth deleted RKey or zeros                        |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |            Seventh deleted RKey or zeros                      |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |            Eighth deleted RKey or zeros                       |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                       Reserved                                |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        
Figure 39: DELETE RKEY LLC Message Format

The DELETE RKEY flow can be sent at any time from either the client or the server, to inform the peer that one or more RMBs have been deleted. Because the peer already knows every RMB's RKey on each link in the link group, this message only specifies one RKey for each RMB being deleted. The RKey provided for each deleted RMB will be its RKey as known on the SMC-R link over which this message is sent.

It is not necessary to provide the entire RToken. The RKey alone is sufficient for identifying an existing RMB.

The peer responds by simply echoing the message with the reply flag (R) set. If the peer did not recognize an RKey, the negative response flag (Z) will also be set; however, no aggressive recovery action beyond logging the error will be taken.

The contents of the DELETE RKEY LLC message are:

Type

Type 9 indicates DELETE RKEY.

Length

The DELETE RKEY LLC message is 44 bytes long.

R

Reply flag. When set, indicates that this is a DELETE RKEY reply.

0

Reserved bit.

Z

Negative response flag.

Count

Number of RMBs being deleted by this message. Maximum value is 8.

Error Mask

If this is a negative response, this field indicates which RMBs were not successfully deleted. Each bit corresponds to a listed RMB, starting with the leftmost bit for the first RKey; for example, b'01010000' indicates that the second and fourth RKeys were not successfully deleted.

Deleted RKeys

A list of Count RKeys. Provided on the request flow and echoed back on the response flow. Each RKey is valid on the link over which this message is sent and represents a deleted RMB. Up to eight RMBs can be deleted in this message.
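
As a non-normative illustration, the DELETE RKEY layout in Figure 39 and the interpretation of the Error Mask can be sketched in Python. The sketch assumes network byte order and assumes that the leftmost bit of the Error Mask corresponds to the first listed RKey, consistent with the b'01010000' example. The names are hypothetical.

```python
import struct

LLC_LEN = 44
LLC_DELETE_RKEY = 9    # Type = 9
MAX_RKEYS = 8          # up to eight RMBs can be deleted per message


def pack_delete_rkey(rkeys):
    """Build a DELETE RKEY request listing the RKeys of the deleted RMBs."""
    if not 1 <= len(rkeys) <= MAX_RKEYS:
        raise ValueError("Count must be between 1 and 8")
    padded = list(rkeys) + [0] * (MAX_RKEYS - len(rkeys))
    # Type, Length, reserved, flags, Count, Error Mask, 2 reserved bytes,
    # eight RKey slots (unused slots are zeros), 4 reserved bytes.
    return struct.pack("!BBBBBB2x8I4x", LLC_DELETE_RKEY, LLC_LEN, 0, 0,
                       len(rkeys), 0, *padded)


def failed_positions(error_mask, count):
    """Return the 1-based positions of RKeys flagged in the Error Mask."""
    return [i + 1 for i in range(count) if error_mask & (0x80 >> i)]
```

Applied to the example above, failed_positions(0b01010000, 4) yields positions 2 and 4.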

A.3.8. TEST LINK LLC Message Format
      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  Type = 7     |  Length = 44  |   Reserved    |R|  Reserved   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                         User Data                           -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                                                             -+
     |                          Reserved                             |
     +-                                                             -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-                                                             -+
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        
Figure 40: TEST LINK LLC Message Format

The TEST LINK request can be sent from either peer to the other on an existing SMC-R link at any time to test that the SMC-R link is active and healthy at the software level. A peer that receives a TEST LINK LLC message immediately sends back a TEST LINK reply, echoing back the user data. Refer also to Section 4.5.3 ("TCP Keepalive Processing").

The contents of the TEST LINK LLC message are:

Type

Type 7 indicates TEST LINK.

Length

The TEST LINK LLC message is 44 bytes long.

R

Reply flag. When set, indicates that this is a TEST LINK reply.

User Data

The receiver of this message echoes the sender's data back in a TEST LINK response LLC message.
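
As a non-normative illustration, the TEST LINK echo can be sketched in Python. The sketch assumes that the User Data field is 16 bytes long, as drawn in Figure 40, and that the R flag is the high-order bit of the fourth byte. The names are hypothetical.

```python
import struct

LLC_LEN = 44
LLC_TEST_LINK = 7      # Type = 7
FLAG_REPLY = 0x80      # assumed position of the R flag
USER_DATA_LEN = 16     # four 32-bit words in Figure 40


def pack_test_link(user_data):
    """Build a TEST LINK request; short user data is padded with zeros."""
    if len(user_data) > USER_DATA_LEN:
        raise ValueError("user data is limited to 16 bytes")
    # Type, Length, reserved, flags, 16 bytes of user data, 24 reserved bytes.
    return struct.pack("!BBBB16s24x", LLC_TEST_LINK, LLC_LEN, 0, 0, user_data)


def make_test_link_reply(request):
    """Echo a TEST LINK request with the reply flag set."""
    if len(request) != LLC_LEN or request[0] != LLC_TEST_LINK:
        raise ValueError("not a TEST LINK LLC message")
    if request[3] & FLAG_REPLY:
        raise ValueError("message is already a reply")
    return request[:3] + bytes([request[3] | FLAG_REPLY]) + request[4:]
```

A sender could compare the user data in the reply with the data it sent to match replies to outstanding requests.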

A.4. Connection Data Control (CDC) Message Format

The RMBE control data is communicated using Connection Data Control (CDC) messages, which use RoCE SendMsg, similar to LLC messages. Also, as with LLC messages, CDC messages are 44 bytes long to ensure that they can fit into private data areas of receive WQEs without requiring the receiver to post receive buffers.

Unlike LLC messages, this data is integral to the data path, so its processing must be prioritized and optimized similarly to other data path processing. While LLC messages may be processed on a slower path than data, CDC messages cannot be.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   0  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Type = x'FE'  | Length = 44   |      Sequence number          |
   4  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                       SMC-R alert token                       |
   8  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |         Reserved              | Producer cursor wrap seqno    |
   12 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                       Producer Cursor                         |
   16 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |         Reserved              | Consumer cursor wrap seqno    |
   20 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                       Consumer Cursor                         |
   24 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |B|P|U|R|F|Rsrvd|D|C|A|             Reserved                    |
   28 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
   32 +-                                                             -+
      |                                                               |
   36 +-                         Reserved                            -+
      |                                                               |
   40 +-                                                             -+
      |                                                               |
   44 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        

Figure 41: Connection Data Control (CDC) Message Format
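
The layout in Figure 41 can be expressed as a fixed 44-byte structure. The following Python sketch is illustrative only: the names `pack_cdc` and `unpack_cdc`, the dictionary keys, and the choice to carry the two flag bytes at offsets 24 and 25 as raw integers are assumptions made for this example. Multi-byte fields are assumed to be in network byte order.

```python
import struct

CDC_TYPE = 0xFE    # Type field value from Figure 41
CDC_LENGTH = 44    # fixed CDC message length in bytes

# Field order: type, length, sequence number, alert token, 2 reserved bytes,
# producer wrap seqno, producer cursor, 2 reserved bytes, consumer wrap seqno,
# consumer cursor, flag bytes at offsets 24 and 25, 18 reserved bytes.
_CDC_STRUCT = struct.Struct("!BBHI2xHI2xHIBB18x")


def pack_cdc(seqno, token, prod_wrap, prod_cursor, cons_wrap, cons_cursor,
             flags24=0, flags25=0):
    """Build a 44-byte CDC message (hypothetical helper)."""
    return _CDC_STRUCT.pack(CDC_TYPE, CDC_LENGTH, seqno, token,
                            prod_wrap, prod_cursor,
                            cons_wrap, cons_cursor,
                            flags24, flags25)


def unpack_cdc(buf):
    """Parse a 44-byte CDC message into a dict (hypothetical helper)."""
    if len(buf) != CDC_LENGTH:
        raise ValueError("CDC message must be exactly 44 bytes")
    (msg_type, length, seqno, token, prod_wrap, prod_cursor,
     cons_wrap, cons_cursor, flags24, flags25) = _CDC_STRUCT.unpack(buf)
    if msg_type != CDC_TYPE or length != CDC_LENGTH:
        raise ValueError("not a CDC message")
    return {
        "seqno": seqno,
        "token": token,
        "prod_wrap": prod_wrap,
        "prod_cursor": prod_cursor,
        "cons_wrap": cons_wrap,
        "cons_cursor": cons_cursor,
        "flags24": flags24,
        "flags25": flags25,
    }
```

The reserved areas are emitted as zero bytes by the `x` pad codes and are ignored on input.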

Type = x'FE'

This type number has the two high-order bits turned on to enable processing to quickly distinguish it from an LLC message.

Length = 44

The length of inline data that does not require the posting of a receive buffer.

Sequence number

A 2-byte unsigned integer that represents a wrapping sequence number. The initial value is 1, and this value can wrap to 0. Incremented with every control message sent, except for the failover data validation message, and used to guard against processing an old control message out of sequence. Also used in failover data validation. In normal usage, if this number is less than the last received value, discard this message. If greater, process this message. Old control messages can be lost with no ill effect but cannot be processed after newer ones.

If this is a failover validation CDC message (F flag set), then the receiver must verify that it has received and fully processed the RDMA write that was described by the CDC message with the sequence number in this message. If not, the TCP connection must be reset to guard against data loss. Details of this processing are provided in Section 4.6.1.
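
Because the sequence number wraps, a plain numeric comparison misorders messages around the wrap point. The text above does not spell out how the comparison treats a wrap, so the following Python sketch assumes one common approach: interpret the 16-bit difference as a signed value, so that a number up to half the sequence space ahead counts as newer. The function names are hypothetical.

```python
SEQ_MASK = 0xFFFF   # the sequence number is a 2-byte unsigned integer
HALF_RANGE = 0x8000


def next_seq(seq):
    """Return the sequence number that follows seq, wrapping 0xFFFF to 0."""
    return (seq + 1) & SEQ_MASK


def seq_is_newer(seq, last_seq):
    """True if seq is ahead of last_seq under wrap-aware comparison.

    Assumption: differences below half the sequence space mean "newer";
    anything else (including an equal value) is treated as old.
    """
    diff = (seq - last_seq) & SEQ_MASK
    return 0 < diff < HALF_RANGE


def should_process(seq, last_seq):
    """Discard old control messages; process newer ones."""
    return seq_is_newer(seq, last_seq)
```

With this rule, a message numbered 0 that arrives after one numbered 0xFFFF is processed, while a late message numbered 0xFFFF that arrives after 0 is discarded.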

SMC-R alert token

The endpoint-assigned alert token that identifies to which TCP connection on the link group this control message refers.

Producer cursor wrap seqno

A 2-byte unsigned integer that represents a wrapping counter incremented by the producer whenever the data written into this RMBE receive buffer causes a wrap (i.e., the producer cursor wraps). This is used by the receiver to determine when new data is available even though the cursors appear unchanged, such as when a full window size write is completed (producer cursor of this RMBE sent by peer = local consumer cursor) or in scenarios where the producer cursor sent for this RMBE < local consumer cursor.

Producer Cursor

A 4-byte unsigned integer that is a wrapping offset into the RMBE data area. Points to the next byte of data to be written by the sender. Can advance up to the receiver's consumer cursor as known by the sender. When the urgent data present indicator is on, points 1 byte beyond the last byte of urgent data. When computing this cursor, the presence of the eye catcher in the RMBE data area must be accounted for. The first writable data location in the RMBE is at offset 4, so this cursor begins at 4 and wraps to 4.

Consumer cursor wrap seqno

A 2-byte unsigned integer that mirrors the value of the producer cursor wrap sequence number when the last read from this RMBE occurred. Used as an indicator of how far along the consumer is in reading data (i.e., processed last wrap point or not). The producer side can use this indicator to detect whether or not more data can be written to the partner in full window write scenarios (where the producer cursor = consumer cursor as known on the remote RMBE). In this scenario, if the consumer sequence number equals the local producer sequence number, the producer knows that more data can be written.

Consumer Cursor

A 4-byte unsigned integer that is a wrapping offset into the sender's RMBE data area. Points to the offset of the next byte of data to be consumed by the peer in its own RMBE. When computing this cursor, the presence of the eye catcher in the RMBE data area must be accounted for. The first writable data location in the RMBE is at offset 4, so this cursor begins at 4 and wraps to 4. The sender cannot write beyond this cursor into the peer's RMBE without causing data loss.
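
The interplay of the two cursors and their wrap sequence numbers can be illustrated by mapping each (wrap seqno, cursor) pair onto a single linear position. The Python sketch below is one possible interpretation, not a normative algorithm: it assumes `rmbe_size` is the total RMBE size including the 4-byte eye catcher, and all function names are hypothetical.

```python
EYE_CATCHER_LEN = 4      # first writable offset in the RMBE
WRAP_MODULUS = 1 << 16   # wrap seqnos are 2-byte unsigned integers


def _linear(wrap, cursor, usable):
    """Map a (wrap seqno, cursor) pair to a linear byte position."""
    return wrap * usable + (cursor - EYE_CATCHER_LEN)


def bytes_unread(prod_wrap, prod_cursor, cons_wrap, cons_cursor, rmbe_size):
    """Bytes written by the producer but not yet consumed."""
    usable = rmbe_size - EYE_CATCHER_LEN
    span = WRAP_MODULUS * usable
    return (_linear(prod_wrap, prod_cursor, usable)
            - _linear(cons_wrap, cons_cursor, usable)) % span


def bytes_writable(prod_wrap, prod_cursor, cons_wrap, cons_cursor, rmbe_size):
    """Bytes the producer may still write without overrunning the consumer."""
    usable = rmbe_size - EYE_CATCHER_LEN
    return usable - bytes_unread(prod_wrap, prod_cursor,
                                 cons_wrap, cons_cursor, rmbe_size)


def advance(wrap, cursor, count, rmbe_size):
    """Move a cursor forward by count bytes; it begins at 4 and wraps to 4."""
    usable = rmbe_size - EYE_CATCHER_LEN
    offset = (cursor - EYE_CATCHER_LEN) + count
    new_wrap = (wrap + offset // usable) % WRAP_MODULUS
    return new_wrap, EYE_CATCHER_LEN + offset % usable
```

In the full window write case, the cursors are equal but the producer wrap seqno is one ahead, so `bytes_unread` returns the whole usable area and `bytes_writable` returns 0. Once the consumer wrap seqno catches up to the producer's, the same equal cursors yield an empty buffer.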

B-bit

Writer blocked indicator: Sender is blocked for writing. If this bit is set, sender will require explicit notification when receive buffer space is available.

P-bit

Urgent data pending: Sender has urgent data pending for this connection.

U-bit

Urgent data present: Indicates that urgent data is present in the RMBE data area, and the producer cursor points to 1 byte beyond the last byte of urgent data.

R-bit

Request for consumer cursor update: Indicates that an immediate consumer cursor update is requested, regardless of whether or not one is warranted according to the window size optimization algorithm described in Section 4.5.1.

F-bit

Failover validation indicator: Sent by a peer to guard against data loss during failover when the TCP connection is being moved to another SMC-R link in the link group. When this bit is set, the only other fields in the CDC message that are significant are the Type, Length, SMC-R alert token, and Sequence number fields. The receiver must validate that it has fully processed the RDMA write described by the previous CDC message bearing the same sequence number as this validation message. If it has, no further action is required. If it has not, the TCP connection must be reset. This processing is described in detail in Section 4.6.1.

D-bit

Sending done indicator: Sent by a peer when it is done writing new data into the receiver's RMBE data area.

C-bit

PeerConnectionClosed indicator: Sent by a peer when it is completely done with this connection and will no longer be making any updates to the receiver's RMBE or sending any more control messages.

A-bit

Abnormal close indicator: Sent by a peer when the connection is abnormally terminated (for example, the TCP connection was reset). When sent, it indicates that the peer is completely done with this connection and will no longer be making any updates to this RMBE or sending any more control messages. It also indicates that the RMBE owner must flush any remaining data on this connection and generate an error return code to any outstanding socket APIs on this connection (same processing as receiving a RST segment on a TCP connection).
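
Reading the bit positions off Figure 41, the B, P, U, R, and F bits occupy the five high-order bits of the byte at offset 24 (followed by three reserved bits), and the D, C, and A bits occupy the three high-order bits of the byte at offset 25. The following Python sketch decodes and encodes the flags on that reading. The helper names and the dictionary representation are assumptions made for illustration.

```python
# Bit masks for the byte at offset 24 (bit 0 in Figure 41 is the MSB).
BYTE24_FLAGS = {"B": 0x80, "P": 0x40, "U": 0x20, "R": 0x10, "F": 0x08}
# Bit masks for the byte at offset 25.
BYTE25_FLAGS = {"D": 0x80, "C": 0x40, "A": 0x20}


def decode_cdc_flags(byte24, byte25):
    """Return a dict mapping each flag letter to True or False."""
    flags = {name: bool(byte24 & mask) for name, mask in BYTE24_FLAGS.items()}
    flags.update({name: bool(byte25 & mask)
                  for name, mask in BYTE25_FLAGS.items()})
    return flags


def encode_cdc_flags(names):
    """Return (byte24, byte25) with the named flags set, reserved bits zero."""
    byte24 = 0
    byte25 = 0
    for name in names:
        if name in BYTE24_FLAGS:
            byte24 |= BYTE24_FLAGS[name]
        elif name in BYTE25_FLAGS:
            byte25 |= BYTE25_FLAGS[name]
        else:
            raise ValueError("unknown CDC flag: %r" % (name,))
    return byte24, byte25
```

For example, a failover validation message from a blocked writer would carry `encode_cdc_flags(["B", "F"])`, which yields `(0x88, 0x00)`.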

Appendix B. Socket API Considerations

A key design goal for SMC-R is to require no application changes for exploitation. It is confined to socket applications using stream (i.e., TCP) sockets over IPv4 or IPv6. By virtue of the fact that the switch to the SMC-R protocol occurs after a TCP connection is established, no changes are required in a socket address family or in the IP addresses and ports that the socket applications are using. Existing socket APIs that allow applications to retrieve local and remote socket address structures for an established TCP connection (for example, getsockname() and getpeername()) will continue to function as they have before. Existing DNS setup and APIs for resolving hostnames to IP addresses and vice versa also continue to function without any changes. In general, all of the usual socket APIs that are used for TCP communications (send APIs, recv APIs, etc.) will continue to function as they do today, even if SMC-R is used as the underlying protocol.

Each SMC-R-enabled implementation does, however, need to pay special attention to any socket APIs that have a reliance on the underlying TCP and IP protocols and also ensure that their behavior in an SMC-R environment is reasonable and minimizes impact on the application. While the basic socket API set is fairly similar across different operating systems, there is more variability when it comes to advanced socket API options. Each implementation needs to perform a detailed analysis of its API options, any possible impact that SMC-R may have, and any resultant implications. As part of that step, a discussion or review with other implementations supporting SMC-R would be useful to ensure consistent implementation.

B.1. setsockopt() / getsockopt() Considerations

These APIs allow socket applications to manipulate socket, transport (TCP/UDP), and IP-level options associated with a given socket. Typically, a platform restricts the number of IP options available to stream (TCP) socket applications, given their connection-oriented nature. The general guideline here is to continue processing these APIs in a manner that allows for application compatibility. Some options will be relevant to the SMC-R protocol and will require special processing "under the covers". For example, the ability to manipulate TCP send and receive buffer sizes is still valid for SMC-R. However, other options may have no meaning for SMC-R. For example, if an application enabled the TCP_NODELAY socket option to disable Nagle's algorithm, it should have no real effect on SMC-R communications, as there is no notion of Nagle's algorithm with this new protocol. But the implementation must accept the TCP_NODELAY option as it does today and save it so that it can be later extracted via getsockopt() processing. Note that any TCP or IP-level options will still have an effect on any TCP/IP packets flowing for an SMC-R connection (i.e., as part of TCP/IP connection establishment and TCP/IP connection termination packet flows).
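
From the application's point of view, the compatibility requirement means that an option set through setsockopt() remains readable through getsockopt(), whether or not it influences SMC-R traffic. The short Python sketch below shows that application-visible contract using the standard `socket` module on an ordinary TCP socket. It does not exercise SMC-R itself, and the function name is hypothetical.

```python
import socket


def nodelay_round_trip():
    """Set TCP_NODELAY on a stream socket and read it back."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # The application disables Nagle's algorithm as it would today.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        # The stack is expected to remember the setting and return it.
        return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

An SMC-R-enabled stack that follows the guideline above would return the same result from this function as a stack without SMC-R.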

Under the covers, manipulation of the TCP options will also include the SMC-layer setting, as well as reading the SMC-R experimental option before and after completion of the three-way TCP handshake.

Appendix C. Rendezvous Error Scenarios

This section discusses error scenarios for setting up and managing SMC-R links.

C.1. SMC Decline during CLC Negotiation

A peer to the SMC-R CLC negotiation can send an SMC Decline in lieu of any expected CLC message to decline SMC and force the TCP connection back to the IP fabric. There can be several reasons for an SMC Decline during the CLC negotiation, including the following:

o RNIC went down

o SMC-R forbidden by local policy

o subnet (IPv4) or prefix (IPv6) doesn't match

o lack of resources to perform SMC-R

In all cases, when an SMC Decline is sent in lieu of an expected CLC message, no confirmation is required, and the TCP connection immediately falls back to using the IP fabric.

To prevent ambiguity between CLC messages and application data, an SMC Decline cannot "chase" another CLC message. An SMC Decline can only be sent in lieu of an expected CLC message. For example, if the client sends an SMC Proposal and then its RNIC goes down, it must wait for the SMC Accept from the server and then reply to the SMC Accept with an SMC Decline.

This "no chase" rule means that if this TCP connection is not a first contact between RoCE peers, a server cannot send an SMC Decline after sending an SMC Accept -- it can only either break the TCP connection or fail over if a problem arises in the RoCE fabric after it has sent the SMC Accept. Similarly, once the client sends an SMC Confirm on a TCP connection that isn't a first contact, it is committed to SMC-R for this TCP connection and cannot fall back to IP.

C.2. SMC Decline during LLC Negotiation

For a TCP connection that represents a first contact between RoCE pairs, it is possible for SMC to fall back to IP during the LLC negotiation. This is possible until the first contact SMC-R link is confirmed. For example, see Figure 42. After a first contact SMC-R link is confirmed, fallback to IP is no longer possible. This translates to the following rule: a first contact peer can send an SMC Decline at any time during LLC negotiation until it has successfully sent its CONFIRM LINK (request or response) flow. After that point, it cannot fall back to IP.

       Host X -- Server                           Host Y -- Client
    +-------------------+                      +-------------------+
    | Peer ID = PS1     |                      |   Peer ID = PC1   |
    |            +------+                      +------+            |
    |       QP 8 |RNIC 1|    SMC-R Link 1      |RNIC 2|  QP 64     |
    | RKey X |   |MAC MA|<-------------------->|MAC MB|   |        |
    |        |   |GID GA|   attempted setup    |GID GB|   | RKey Y2|
    |       \/   +------+                      +------+  \/        |
    |+--------+         |                      |        +--------+ |
    || RMB    |         |                      |        | RMB    | |
    |+--------+         |                      |        +--------+ |
    |       /\   +------+                      +------+  /\        |
    |        |   |RNIC 3|                      |RNIC 4|   | RKey W2|
    |        |   |MAC MC|                      |MAC MD|   |        |
    |       QP 9 |GID GC|                      |GID GD|  QP 65     |
    |            +------+                      +------+            |
    +-------------------+                      +-------------------+
        
          SYN / SYN-ACK / ACK TCP three-way handshake with TCP option
         <--------------------------------------------------------->
        
            SMC Proposal / SMC Accept / SMC Confirm exchange
         <-------------------------------------------------------->
        
           CONFIRM LINK(request, Link 1)
         .........................................................>
        
                           CONFIRM LINK(response, Link 1)
                              X...................................
                                :
                                : RoCE write failure
                                :.................................>
        
           SMC Decline(PC1, reason code)
          <--------------------------------------------------------
        
              Connection data flows over IP fabric
          <------------------------------------------------------->
        
                          Legend:
                   ------------   TCP/IP and CLC flows
                   ............   RoCE (LLC) flows
        

Figure 42: SMC Decline during LLC Negotiation

C.3. The SMC Decline Window

Because SMC-R does not support fallback to IP for a TCP connection that is already using RDMA, there are specific rules on when the SMC Decline CLC message, which signals a fallback to IP because of an error or problem with the RoCE fabric, can be sent during TCP connection setup. There is a "point of no return" after which a connection cannot fall back to IP, and RoCE errors that occur after this point require the connection to be broken with a RST flow in the IP fabric.

For a first contact, that point of no return is after the ADD LINK LLC message has been successfully sent for the second SMC-R link. Specifically, the server cannot fall back to IP after receiving either (1) a positive write completion indication for the ADD LINK request or (2) the ADD LINK response from the client, whichever comes first. The client cannot fall back to IP after sending a negative ADD LINK response, receiving a positive write complete on a positive ADD LINK response, or receiving a CONFIRM LINK for the second SMC-R link from the server, whichever comes first.

For a subsequent contact, that point of no return is after the last send of the CLC negotiation completes. This, in combination with the rule that error "chasers" are not allowed during CLC negotiation, means that the server cannot send an SMC Decline after sending an SMC Accept, and the client cannot send an SMC Decline after sending an SMC Confirm.
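
The "point of no return" rules in this section can be summarized as a lookup from role and contact type to the set of events that close the fallback window. The Python sketch below is one way to express that table. The event names, the `can_fall_back` helper, and the set-based bookkeeping are assumptions for illustration, and the sketch does not model the separate rule that an SMC Decline may only be sent in lieu of an expected CLC message.

```python
from enum import Enum, auto


class Event(Enum):
    ADD_LINK_REQ_WRITE_COMPLETE = auto()           # server, first contact
    ADD_LINK_RSP_RECEIVED = auto()                 # server, first contact
    NEGATIVE_ADD_LINK_RSP_SENT = auto()            # client, first contact
    POSITIVE_ADD_LINK_RSP_WRITE_COMPLETE = auto()  # client, first contact
    CONFIRM_LINK_2_RECEIVED = auto()               # client, first contact
    SMC_ACCEPT_SENT = auto()                       # server, subsequent contact
    SMC_CONFIRM_SENT = auto()                      # client, subsequent contact


# (role, first_contact) -> events after which fallback to IP is not allowed.
POINT_OF_NO_RETURN = {
    ("server", True): {Event.ADD_LINK_REQ_WRITE_COMPLETE,
                       Event.ADD_LINK_RSP_RECEIVED},
    ("client", True): {Event.NEGATIVE_ADD_LINK_RSP_SENT,
                       Event.POSITIVE_ADD_LINK_RSP_WRITE_COMPLETE,
                       Event.CONFIRM_LINK_2_RECEIVED},
    ("server", False): {Event.SMC_ACCEPT_SENT},
    ("client", False): {Event.SMC_CONFIRM_SENT},
}


def can_fall_back(role, first_contact, seen_events):
    """True while none of the closing events has been observed."""
    closing = POINT_OF_NO_RETURN[(role, first_contact)]
    return closing.isdisjoint(seen_events)
```

Once `can_fall_back` returns False, a RoCE error on the connection has to be handled by breaking the TCP connection rather than by sending an SMC Decline.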

C.4. Out-of-Sync Conditions during SMC-R Negotiation

The SMC Accept CLC message contains a first contact flag that indicates to the client whether the server believes it is setting up a new link group or using an existing link group. This flag is used to detect an out-of-sync condition between the client and the server. The scenario for such a condition is as follows: there is a single existing SMC-R link between the peers. After the client sends the SMC Proposal CLC message, the existing SMC-R link between the client and the server fails. The client cannot chase the SMC Proposal CLC message with an SMC Decline CLC message in this case, because the client does not yet know that the server would have wanted to choose the SMC-R link that just crashed. The QP that failed recovers before the server returns its SMC Accept CLC message. This means that there is a QP but no SMC-R link. Since the server had not yet learned of the SMC-R link failure when it sent the SMC Accept CLC message, it attempts to reuse the SMC-R link that just failed. This means that the server would not set the first contact flag, indicating to the client that the server thinks it is reusing an SMC-R link. However, the client does not have an SMC-R link that matches the server's specification. Because the first contact flag is off, the client realizes it is out of sync with the server and sends an SMC Decline to cause the connection to fall back to IP.

C.5. Timeouts during CLC Negotiation

Because the SMC-R negotiation flows as TCP data, there are built-in timeouts and retransmits at the TCP layer for individual messages. Implementations also must protect the overall TCP/CLC handshake with a timer or timers to prevent connections from hanging indefinitely due to SMC-R processing. This can be done with individual timers for individual CLC messages or an overall timer for the entire exchange, which may include the TCP handshake and the CLC handshake under one timer or separate timers. This decision is implementation dependent.

If the TCP and/or CLC handshakes time out, the TCP connection must be terminated as it would be in a legacy IP environment when connection setup doesn't complete in a timely manner. Because the CLC flows are TCP messages, if they cannot be sent and received in a timely fashion, the TCP connection is not healthy and would not work if fallback to IP were attempted.

C.6. Protocol Errors during CLC Negotiation

Protocol errors occur during CLC negotiation when a message is received that is not expected. For example, a peer that is expecting a CLC message but instead receives application data has experienced a protocol error; this also indicates a likely software error, as the two sides are out of sync. When application data is expected, this data is not parsed to ensure that it's not a CLC message.

When a peer is expecting a CLC negotiation message, any parsing error except a bad enumerated value in that message must be treated as application data. The CLC negotiation messages are designed with beginning and ending eye catchers to help verify that a CLC negotiation message is actually the expected message. If other parsing errors in an expected CLC message occur, such as incorrect length fields or incorrectly formatted fields, the message must be treated as application data.

All protocol errors, with the exception of bad enumerated values, must result in termination of the TCP connection. No fallback to IP is allowed in the case of a protocol error, because if the protocols are out of sync, mismatched, or corrupted, then data and security integrity cannot be ensured.

The exception to this rule is enumerated values -- for example, the QP MTU values on SMC Accept and SMC Confirm. If a reserved value is received, the proper error response is to send an SMC Decline and fall back to IP; this is because the use of a reserved enumerated value indicates that the other partner likely has additional support that the receiving partner does not have. This indicated mismatch of SMC-R capabilities is not an integrity problem but indicates that SMC-R cannot be used for this connection.
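
The error handling rules of this section reduce to a three-way decision on the outcome of parsing an expected CLC message. The Python sketch below captures that decision. The enumeration names and the `clc_action` function are hypothetical, and how a parser arrives at each outcome is left out.

```python
from enum import Enum, auto


class ParseOutcome(Enum):
    OK = auto()                   # well-formed, expected CLC message
    RESERVED_ENUM_VALUE = auto()  # e.g., a reserved QP MTU value
    OTHER_ERROR = auto()          # bad eye catcher, length, or field format


class Action(Enum):
    PROCEED = auto()
    DECLINE_AND_FALL_BACK = auto()     # send SMC Decline, continue over IP
    TERMINATE_TCP_CONNECTION = auto()  # no fallback to IP


def clc_action(outcome):
    """Map the parse outcome of an expected CLC message to an action."""
    if outcome is ParseOutcome.OK:
        return Action.PROCEED
    if outcome is ParseOutcome.RESERVED_ENUM_VALUE:
        # Capability mismatch, not an integrity problem.
        return Action.DECLINE_AND_FALL_BACK
    # Anything else is treated as unexpected application data, which is a
    # protocol error when a CLC message was expected.
    return Action.TERMINATE_TCP_CONNECTION
```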

C.7. Timeouts during LLC Negotiation

Whenever a peer sends an LLC message to which a reply is expected, it sets a timer after the send posts to wait for the reply. An expected response may be a reply flavor of the LLC message (for example, a CONFIRM LINK reply) or a new LLC message (for example, an ADD LINK CONTINUATION expected from the server by the client if there are more RKeys to be communicated).

On LLC flows that are part of a first contact setup of a link group, the value of the timer is implementation dependent but should be long enough to allow the other peer to have a write complete timeout and 2-3 retransmits of an SMC Decline on the TCP fabric. For LLC flows that are maintaining the link group and are not part of a first contact setup of a link group, the timers may be shorter. Upon receipt of an expected reply, the timer is cancelled. If a timer pops without a reply having been received, the sender must initiate a recovery action.
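
The timer lifecycle described above (armed after the send posts, cancelled on the expected reply, recovery when it pops) can be sketched with a small deadline object. In this Python sketch the class name and its methods are hypothetical, and the caller supplies the current time, for example from `time.monotonic()`. The timeout value itself remains implementation dependent.

```python
class LlcReplyTimer:
    """Deadline tracker for one LLC message that expects a reply."""

    def __init__(self, timeout_s, now):
        # Armed once the send has been posted.
        self.deadline = now + timeout_s
        self.active = True

    def on_reply(self):
        """Cancel the timer when the expected reply arrives."""
        self.active = False

    def expired(self, now):
        """True if the timer popped with no reply; recovery must start."""
        return self.active and now >= self.deadline
```

A sender would poll `expired()` (or schedule an equivalent callback) and start the appropriate recovery action when it returns True.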

During first contact processing, failure of an LLC verification timer is a "should-not-occur" that indicates a problem with one of the endpoints; this is because if there is a "routine" failure in the RoCE fabric that causes an LLC verification send to fail, the sender will get a write completion failure and will then send an SMC Decline to the partner. The only time an LLC verification timer will expire on a first contact is when the sender thinks the send succeeded but it actually didn't. Because of the reliably connected nature of QP connections on the RoCE fabric, this indicates a problem with one of the peers, not with the RoCE fabric.

After the reliably connected queue pair for the first SMC-R link in a link group is set up on initial contact, the client sets a timer to wait for a RoCE verification message from the server that the QP is actually connected and usable. If the server experiences a failure sending its QP confirmation message, it will send an SMC Decline, which should arrive at the client before the client's verification timer expires. If the client's timer expires without receiving either an SMC Decline or a RoCE message confirmation from the server, there is a problem with either the server or the TCP fabric. In either case, the client must break the TCP connection and clean up the SMC-R link.

There are two scenarios in which the client's response to the QP verification message fails to reach the server. The main difference is whether or not the client has successfully completed the send of the CONFIRM LINK response.

In the normal case of a problem with the RoCE path, the client will learn of the failure by getting a write completion failure, before the server's timer expires. In this case, the client sends an SMC Decline CLC message to the server, and the TCP connection falls back to IP.

If the client's send of the confirmation message receives a positive return code but for some reason still does not reach the server, or the client's SMC Decline CLC message fails to reach the server after the client fails to send its RoCE confirmation message, then the server's timer will time out and the server must break the TCP connection by sending a RST. This is expected to be a very rare case, because if the client cannot send its CONFIRM LINK response LLC message, the client should get a negative return code and initiate fallback to IP. A client receiving a positive return code on a send that fails to reach the server should also be an extremely rare case.
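
The two failure scenarios above reduce to a small decision function. The sketch below is illustrative only: the names Outcome and first_contact_outcome and the boolean parameters are hypothetical and do not appear in the protocol.

```python
from enum import Enum


class Outcome(Enum):
    LINK_CONFIRMED = "link confirmed"
    FALLBACK_TO_IP = "TCP connection falls back to IP"
    SERVER_SENDS_RST = "server timer expires; server breaks the TCP connection with a RST"


def first_contact_outcome(send_completed_ok: bool,
                          response_reached_server: bool,
                          decline_reached_server: bool = True) -> Outcome:
    """Outcome of the client's CONFIRM LINK response on first contact."""
    if send_completed_ok:
        if response_reached_server:
            return Outcome.LINK_CONFIRMED
        # Positive return code but the response was lost: the client sees
        # nothing wrong, so only the server's timer can catch the failure.
        return Outcome.SERVER_SENDS_RST
    # Write completion failure: the client sends an SMC Decline CLC message.
    if decline_reached_server:
        return Outcome.FALLBACK_TO_IP
    # The SMC Decline was lost as well: the server's timer expires.
    return Outcome.SERVER_SENDS_RST
```

For example, first_contact_outcome(False, False) yields the fallback to IP, while first_contact_outcome(True, False) yields the server RST that the text describes as a very rare case.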

C.7.1. Recovery Actions for LLC Timeouts and Failures

The following list describes recovery actions for LLC timeouts. A write completion failure or other indication of send failure for an LLC command is treated the same as a timeout.

LLC message: CONFIRM LINK from server (first contact, first link in the link group)

Timer waits for: CONFIRM LINK reply from client.

Recovery action: Break the TCP connection by sending a RST, and clean up the link. The server should have received an SMC Decline from the client by now if the client had an LLC send failure.

LLC message: CONFIRM LINK from server (first contact, second link in the link group)

Timer waits for: CONFIRM LINK reply from client.

Recovery action: The second link was not successfully set up. Send a DELETE LINK to the client. Connection data cannot flow in the first link in the link group, until the reply to this DELETE LINK is received, to prevent the peers from being out of sync on the state of the link group.

LLC message: CONFIRM LINK from server (not first contact)

Timer waits for: CONFIRM LINK reply from client.

Recovery action: Clean up the new link, and set a timer to retry. Send a DELETE LINK to the client, in case the client has a longer timer interval, so the client can stop waiting.

LLC message: CONFIRM LINK reply from client (first contact)

Timer waits for: ADD LINK from server.

Recovery action: Clean up the SMC-R link, and break the TCP connection by sending a RST over the IP fabric. There is a problem with the server. If the server had a send failure, it should have sent an SMC Decline by now.

LLC message: ADD LINK from server (first contact)

Timer waits for: ADD LINK reply from client.

Recovery action: Break the TCP connection with a RST, and clean up RoCE resources. The connection is past the point where the server can fall back to IP, and if the client had a send problem it should have sent an SMC Decline by now.

LLC message: ADD LINK from server (not first contact)

Timer waits for: ADD LINK reply from client.

Recovery action: Clean up resources (QP, RKeys, etc.) for the new link, and treat the link over which the ADD LINK was sent as if it had failed. If there is another link available to resend the ADD LINK and the link group still needs another link, retry the ADD LINK over another link in the link group.

LLC message: ADD LINK reply from client (and there are more RKeys to be communicated)

Timer waits for: ADD LINK CONTINUATION from server.

Recovery action: Treat the same as ADD LINK timer failure.

LLC message: ADD LINK reply or ADD LINK CONTINUATION reply from client (and there are no more RKeys to be communicated, for the second link in a first contact scenario)

Timer waits for: CONFIRM LINK from the server, over the new link.

Recovery action: The setup of the new link failed. Send a DELETE LINK to the server. Do not consider the socket opened to the client application until receiving confirmation from the server in the form of a DELETE LINK request for this link and sending the reply (to prevent the partners from being out of sync on the state of the link group). Set a timer to send another ADD LINK to the server if there is still an unused RNIC on the client side.

LLC message: ADD LINK reply or ADD LINK CONTINUATION reply from client (and there are no more RKeys to be communicated)

Timer waits for: CONFIRM LINK from the server, over the new link.

Recovery action: Send a DELETE LINK to the server for the new link, then clean up any resource allocated for the new link and set a timer to send an ADD LINK to the server if there is still an unused RNIC on the client side. The setup of the new link failed, but the link over which the ADD LINK exchange occurred is unaffected.

LLC message: ADD LINK CONTINUATION from server

Timer waits for: ADD LINK CONTINUATION reply from client.

Recovery action: Treat the same as ADD LINK timer failure.

LLC message: ADD LINK CONTINUATION reply from client (first contact, and RMB count fields indicate that the server owes more ADD LINK CONTINUATION messages)

Timer waits for: ADD LINK CONTINUATION from server.

Recovery action: Clean up the SMC-R link, and break the TCP connection by sending a RST. There is a problem with the server. If the server had a send failure, it should have sent an SMC Decline by now.

LLC message: ADD LINK CONTINUATION reply from client (not first contact, and RMB count fields indicate that the server owes more ADD LINK CONTINUATION messages)

Timer waits for: ADD LINK CONTINUATION from server.

Recovery action: Treat as if client detected link failure on the link that the ADD LINK exchange is using. Send a DELETE LINK to the server over another active link if one exists; otherwise, clean up the link group.

LLC message: DELETE LINK from client

Timer waits for: DELETE LINK request from server.

Recovery action: If the scope of the request is to delete a single link, the surviving link over which the client sent the DELETE LINK is no longer usable either. If this is the last link in the link group, end TCP connections over the link group by sending RST packets. If there are other surviving links in the link group, resend over a surviving link. Also send a DELETE LINK over a surviving link for the link over which the client attempted to send the initial DELETE LINK message. If the scope of the request is to delete the entire link group, try resending on other links in the link group until success is achieved. If all sends fail, tear down the link group and any TCP connections that exist on it.

LLC message: DELETE LINK from server (scope: entire link group)

Timer waits for: Confirmation from the adapter that the message was delivered.

Recovery action: Tear down the link group and any TCP connections that exist on it.

LLC message: DELETE LINK from server (scope: single link)

Timer waits for: DELETE LINK reply from client.

Recovery action: The link over which the server sent the DELETE LINK is no longer usable either. If this is the last link in the link group, end TCP connections over the link group by sending RST packets. If there are other surviving links in the link group, resend over a surviving link. Also send a DELETE LINK over a surviving link for the link over which the server attempted to send the initial DELETE LINK message. If the scope of the request is to delete the entire link group, try resending on other links in the link group until success is achieved. If all sends fail, tear down the link group and any TCP connections that exist on it.

LLC message: CONFIRM RKEY from client

Timer waits for: CONFIRM RKEY reply from server.

Recovery action: Perform normal client procedures for detection of failed link. The link over which the message was sent has failed.

LLC message: CONFIRM RKEY from server

Timer waits for: CONFIRM RKEY reply from client.

Recovery action: Perform normal server procedures for detection of failed link. The link over which the message was sent has failed.

LLC message: TEST LINK from client

Timer waits for: TEST LINK reply from server.

Recovery action: Perform normal client procedures for detection of failed link. The link over which the message was sent has failed.

LLC message: TEST LINK from server

Timer waits for: TEST LINK reply from client.

Recovery action: Perform normal server procedures for detection of failed link. The link over which the message was sent has failed.
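
An implementation might hold the list above as a lookup table keyed by the LLC message, its sender, and the setup context. The sketch below encodes only a subset of the entries, with abbreviated action text. The names Recovery, RECOVERY, and recovery_for, and the context strings, are hypothetical. A send failure is looked up exactly like a timeout, as stated at the start of this list.

```python
from typing import NamedTuple


class Recovery(NamedTuple):
    waits_for: str
    action: str


# Keys are (LLC message, sender, context). Subset of the list, abbreviated.
RECOVERY = {
    ("CONFIRM LINK", "server", "first contact, first link"): Recovery(
        "CONFIRM LINK reply from client",
        "break the TCP connection with a RST and clean up the link"),
    ("CONFIRM LINK", "server", "first contact, second link"): Recovery(
        "CONFIRM LINK reply from client",
        "send DELETE LINK to the client; hold connection data until its reply"),
    ("CONFIRM LINK", "server", "not first contact"): Recovery(
        "CONFIRM LINK reply from client",
        "clean up the new link, set a retry timer, send DELETE LINK to the client"),
    ("ADD LINK", "server", "first contact"): Recovery(
        "ADD LINK reply from client",
        "break the TCP connection with a RST and clean up RoCE resources"),
    ("CONFIRM RKEY", "client", "any"): Recovery(
        "CONFIRM RKEY reply from server",
        "perform normal client procedures for detection of failed link"),
    ("TEST LINK", "client", "any"): Recovery(
        "TEST LINK reply from server",
        "perform normal client procedures for detection of failed link"),
    ("TEST LINK", "server", "any"): Recovery(
        "TEST LINK reply from client",
        "perform normal server procedures for detection of failed link"),
}


def recovery_for(message: str, sender: str, context: str,
                 event: str = "timeout") -> Recovery:
    """Look up the recovery entry; a send failure is treated as a timeout."""
    if event not in ("timeout", "send_failure"):
        raise ValueError(f"unexpected event: {event}")
    return RECOVERY[(message, sender, context)]
```

A table of this kind keeps the per-message policy in one place, so the timer expiry path and the write completion failure path can share a single handler.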

The following list describes recovery actions for invalid LLC messages. These could be misformatted or contain out-of-sync data.

LLC message received: CONFIRM LINK from server

What it indicates: Incorrect link information.

Recovery action: Protocol error. The link must be brought down by sending a DELETE LINK for the link over another link in the link group if one exists. If this is a first contact, fall back to IP by sending an SMC Decline to the server.

LLC message received: ADD LINK

What it indicates: Undefined enumerated MTU value.

Recovery action: Send a negative ADD LINK reply with reason code x'2'.

LLC message received: ADD LINK reply from client

What it indicates: Client-side link information that would result in a parallel link being set up.

Recovery action: Parallel links are not permitted. Delete the link by sending a DELETE LINK to the client over another link in the link group.

LLC message received: Any link group command from the server, except DELETE LINK for the entire link group

What it indicates: Client has sent a DELETE LINK for the link on which the message was received.

Recovery action: Ignore the LLC message. Worst case: the server will time out. Best case: the DELETE LINK crosses with the command from the server, and the server realizes it failed.

LLC message received: ADD LINK CONTINUATION from server or ADD LINK CONTINUATION reply from client

What it indicates: Number of RMBs provided doesn't match count given on initial ADD LINK or ADD LINK reply message.

Recovery action: Protocol error. Treat as if detected link outage.

LLC message received: DELETE LINK from client

What it indicates: Link indicated doesn't exist.

Recovery action: If the link is in the process of being cleaned up, assume a timing window and ignore the message. Otherwise, send a DELETE LINK reply with reason code 1.

LLC message received: DELETE LINK from server

What it indicates: Link indicated doesn't exist.

Recovery action: Send a DELETE LINK reply with reason code 1.

LLC message received: CONFIRM RKEY from either client or server

What it indicates: No RKey provided for one or more of the links in the link group.

Recovery action: Treat as if detected failure of the link(s) for which no RKey was provided.

LLC message received: DELETE RKEY

What it indicates: Specified RKey doesn't exist.

Recovery action: Send a negative DELETE RKEY response.

LLC message received: TEST LINK reply

What it indicates: User data doesn't match what was sent in the TEST LINK request.

Recovery action: Treat as if detected that the link has gone down. This is a protocol error.

LLC message received: Unknown LLC type with high-order bits of opcode equal to b'10'

What it indicates: This is an optional LLC message that the receiver does not support.

Recovery action: Ignore (silently discard) the message.

LLC message received: Any unambiguously incorrect or out-of-sync LLC message

What it indicates: Link is out of sync.

Recovery action: Treat as if detected that the link has gone down. Note that an unsupported or unknown LLC opcode whose two high-order bits are b'10' is not an error and must be silently discarded. Any other unknown or unsupported LLC opcode is an error.
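
The rule for unknown opcodes can be sketched as a bit test. The example assumes a one-byte LLC opcode. The function name classify_llc_opcode, the returned labels, and the opcode values in SUPPORTED are placeholders for illustration, not values taken from this document.

```python
def classify_llc_opcode(opcode: int, supported: frozenset) -> str:
    """Decide how a receiver handles an incoming LLC opcode."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode is assumed to fit in one byte")
    if opcode in supported:
        return "process"
    if (opcode >> 6) == 0b10:
        # Optional message the receiver does not support: not an error.
        return "discard"
    # Any other unknown or unsupported opcode is a protocol error and is
    # treated as if the link had gone down.
    return "link_down"


# Placeholder set of opcodes this hypothetical receiver implements.
SUPPORTED = frozenset({0x01, 0x02, 0x04})
```

With this placeholder set, opcode 0x85 (binary 1000 0101) is silently discarded because its two high-order bits are b'10', while 0xC1 and 0x3F are treated as link failures.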

C.8. Failure to Add Second SMC-R Link to a Link Group

When there is any failure in setting up the second SMC-R link in an SMC-R link group, including confirmation timer expiration, the SMC-R link group is allowed to continue without available failover. However, this situation is extremely undesirable, and the server must endeavor to correct it as soon as it can.

The server peer in the SMC-R link group must set a timer to drive it to retry setup of a failed additional SMC-R link. The server will immediately retry the SMC-R link setup when the first of the following events occurs:

o The retry timer expires.

o A new RNIC becomes available to the server, on the same LAN as the SMC-R link group.

o An ADD LINK LLC request message is received from the client; this indicates the availability of a new RNIC on the client side.
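
The three retry triggers listed above can be sketched as a small server-side state holder in which whichever event occurs first consumes the pending retry. The class SecondLinkRetry, its method names, and the 60-second interval used in the example are assumptions for illustration. The actual link setup is left out.

```python
from typing import List, Optional


class SecondLinkRetry:
    """Server-side retry state for a failed additional SMC-R link (sketch)."""

    def __init__(self, retry_interval: float) -> None:
        self.retry_interval = retry_interval
        self.deadline: Optional[float] = None  # None: no retry pending
        self.attempts: List[str] = []

    def setup_failed(self, now: float) -> None:
        # Arm the retry timer after a failed second-link setup.
        self.deadline = now + self.retry_interval

    def _retry(self, reason: str) -> bool:
        if self.deadline is None:
            return False  # nothing to retry, or an earlier event already did
        self.deadline = None
        self.attempts.append(reason)  # real code would start link setup here
        return True

    def tick(self, now: float) -> bool:
        # Trigger 1: the retry timer expires.
        if self.deadline is not None and now >= self.deadline:
            return self._retry("retry timer expired")
        return False

    def rnic_available(self, same_lan: bool) -> bool:
        # Trigger 2: a new RNIC on the same LAN as the link group.
        return self._retry("new server RNIC") if same_lan else False

    def add_link_request_from_client(self) -> bool:
        # Trigger 3: the client signals a new RNIC on its side.
        return self._retry("ADD LINK request from client")


retry = SecondLinkRetry(retry_interval=60.0)
retry.setup_failed(now=0.0)
retry.add_link_request_from_client()  # first event wins
retry.tick(now=100.0)                 # timer no longer pending, no second retry
```

In this run, the client's ADD LINK request arrives before the timer expires, so the later timer check does not trigger a second retry.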

Authors' Addresses

   Mike Fox
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC 27709
   United States

   Email: mjfox@us.ibm.com

   Constantinos (Gus) Kassimis
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC 27709
   United States

   Email: kassimis@us.ibm.com

   Jerry Stevens
   IBM
   3039 Cornwallis Rd.
   Research Triangle Park, NC 27709
   United States

   Email: sjerry@us.ibm.com