Network Working Group                                         V. Kashyap
Request for Comments: 4392                                           IBM
Category: Informational                                       April 2006
        

IP over InfiniBand (IPoIB) Architecture

Status of This Memo

This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2006).

Abstract

InfiniBand is a high-speed, channel-based interconnect between systems and devices.

This document presents an overview of the InfiniBand architecture. It further describes the requirements and guidelines for the transmission of IP over InfiniBand. Discussions in this document are applicable to both IPv4 and IPv6 unless explicitly specified. The encapsulation of IP over InfiniBand and the mechanism for IP address resolution on IB fabrics are covered in other documents.

Table of Contents

   1. Introduction to InfiniBand
      1.1. InfiniBand Architecture Specification
      1.2. Overview of InfiniBand Architecture
           1.2.1. InfiniBand Addresses
                  1.2.1.1. Unicast GIDs
                  1.2.1.2. Multicast GIDs
      1.3. InfiniBand Multicast Group Management
           1.3.1. Multicast Member Record
                  1.3.1.1. JoinState
           1.3.2. Join and Leave Operations
                  1.3.2.1. Creating a Multicast Group
                  1.3.2.2. Deleting a Multicast Group
                  1.3.2.3. Multicast Group Create/Delete Traps
   2. Management of InfiniBand Subnet
   3. IP over IB
      3.1. InfiniBand as Datalink
      3.2. Multicast Support
           3.2.1. Mapping IP Multicast to IB Multicast
           3.2.2. Transient Flag in IB MGIDs
      3.3. IP Subnets Across IB Subnets
   4. IP Subnets in InfiniBand Fabrics
      4.1. IPoIB VLANs
      4.2. Multicast in IPoIB subnets
           4.2.1. Sending IP Multicast Datagrams
           4.2.2. Receiving Multicast Packets
           4.2.3. Router Considerations for IPoIB
           4.2.4. Impact of InfiniBand Architecture Limits
           4.2.5. Leaving/Deleting a Multicast Group
      4.3. Transmission of IPoIB Packets
      4.4. Reverse Address Resolution Protocol (RARP) and
           Static ARP Entries
      4.5. DHCPv4 and IPoIB
   5. QoS and Related Issues
   6. Security Considerations
   7. Acknowledgements
   8. References
      8.1. Normative References
      8.2. Informative References
        
1. Introduction to InfiniBand

The InfiniBand Trade Association (IBTA) was formed to develop an I/O specification to deliver a channel-based, switched-fabric technology. The InfiniBand standard is aimed at meeting the requirements of scalability, reliability, availability, and performance of servers in data centers.

1.1. InfiniBand Architecture Specification

The InfiniBand Trade Association specification is available for download from http://www.infinibandta.org.

1.2. Overview of InfiniBand Architecture

For a more complete overview, the reader is referred to chapter 3 of the InfiniBand specification.

InfiniBand Architecture (IBA) defines a System Area Network (SAN) for connecting multiple independent processor platforms, I/O platforms, and I/O devices. The IBA SAN is a communications and management infrastructure supporting both I/O and inter-processor communications for one or more computer systems.

An IBA SAN consists of processor nodes and I/O units connected through an IBA fabric made up of cascaded switches and IB routers (connecting IB subnets). I/O units can range in complexity from a single Application-specific Integrated Circuit (ASIC) IBA-attached device (such as a LAN adapter) to a large, memory-rich Redundant Array of Independent Disks (RAID) subsystem.

An IBA network may be subdivided into subnets interconnected by routers. These are IB routers and IB subnets and not IP routers or IP subnets. This document will refer to InfiniBand routers and subnets as 'IB routers' and 'IB subnets' respectively. The IP routers and IP subnets will be referred to as 'routers' and 'subnets', respectively.

Each IB node or switch may attach to one or more switches, or directly to another node or switch. Each IB unit interfaces with the link by way of channel adapters (CAs). The architecture supports multiple CAs per unit, with each CA providing one or more ports that connect to the fabric. Each CA appears as a node to the fabric.

The ports are the endpoints to which the data is sent. However, each of the ports may include multiple QPs (Queue Pairs) that may be directly addressed from a remote peer. From the point of view of data transfer the QP number (QPN) is part of the address.

IBA supports both connection-oriented and datagram service between the ports. The peers are identified by QPN and the port identifier. There are two exceptions. QPNs are not used when packets are multicast. QPNs are also not used in the Raw Datagram mode.

A port, in a data packet, is identified by a Local Identifier (LID) and optionally a Global Identifier (GID). The GID in the packet is needed only when communicating across an IB subnet, though it may always be included.

The GID is 128 bits long and is formed by the concatenation of a 64-bit IB subnet prefix and a 64-bit EUI-64-compliant portion. The EUI-64 portion of a GID is referred to as the Global Unique Identifier (GUID; EUI stands for Extended Unique Identifier). The LID is a 16-bit value that is assigned when the port becomes active. The GUID is the only persistent identifier of a port. However, it cannot be used as an address in a packet. If the prefix is modified, then the GID may change. The subnet manager may attempt to keep the LID values constant across reboots, but that is not a requirement.
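As an illustration of the GID layout just described, the following minimal Python sketch concatenates a 64-bit subnet prefix and a 64-bit GUID into a 128-bit value and prints it in IPv6 notation. The function name make_gid and the GUID value are hypothetical and chosen only for this example; they do not come from the InfiniBand specification.

```python
import ipaddress


def make_gid(subnet_prefix: int, guid: int) -> ipaddress.IPv6Address:
    """Concatenate a 64-bit IB subnet prefix and a 64-bit GUID into a GID."""
    if not 0 <= subnet_prefix < 2**64:
        raise ValueError("subnet prefix must fit in 64 bits")
    if not 0 <= guid < 2**64:
        raise ValueError("GUID must fit in 64 bits")
    # Prefix occupies the upper 64 bits, GUID the lower 64 bits.
    return ipaddress.IPv6Address((subnet_prefix << 64) | guid)


# Hypothetical example: link-local prefix plus a made-up port GUID.
gid = make_gid(0xFE80000000000000, 0x0002C90300001234)
print(gid)
```

Because only the prefix half depends on the subnet, changing the prefix yields a different GID for the same port while the GUID half stays the same.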

The assignment of the GID and the LID is done by the subnet manager. Every IB subnet has at least one subnet manager component that controls the fabric. It assigns the LIDs and GIDs. The subnet manager also programs the switches so that they route packets between destinations. The subnet manager (SM) and a related component, the subnet administrator (SA), are the central repository of all information that is required to set up and bring up the fabric.

IB routers are components that route packets between IB subnets based on the GIDs. Thus, within an IB subnet a packet may or may not include a GID but when going across an IB subnet the GID must be included. A LID is always needed in a packet since the destination within a subnet is determined by it.

A CA and a switch may have multiple ports. Each CA port is assigned its own LID or a range of LIDs. The ports of a switch are not addressable by LIDs/GIDs or, in other words, are transparent to other end nodes. Each port has its own set of buffers. The buffering is channeled through virtual lanes (VL) where each VL has its own flow control. There may be up to 16 VLs.

VLs provide a mechanism for creating multiple virtual links within a single physical link. All ports must support VL15, which is reserved exclusively for subnet management datagrams and hence does not concern the IP over InfiniBand (IPoIB) discussions. The actual VL that a packet uses is configured by the SM in the switch/channel adapter tables and is determined based on the Service Level (SL) specified in every packet. There are 16 possible SLs.

In addition to the features described above (QPs, SLs, and addressing by GID/LID), IBA also defines the following:

Partitioning:

Every packet, but for the raw datagrams, carries the partition key (P_Key). These values are used for isolation in the fabric. A switch (this is an optional feature) may be programmed by the SM to drop packets not having a certain key. The CA ports always check for the P_Keys. A CA port may belong to multiple partitions. P_Key checking is optional at IB routers.

A P_Key may be described as having 'limited membership' or 'full membership'. For a packet to be accepted, at least one of the P_Keys (i.e., the P_Key in the packet or the P_Key in the port) must be 'full membership' P_Keys.

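The acceptance rule above can be sketched as follows. The bit layout is an assumption that this document does not state: the sketch treats the P_Key as a 16-bit value whose most significant bit marks full membership and whose remaining 15 bits identify the partition, which is how the InfiniBand specification is generally understood to encode it. The function name pkey_accepts is hypothetical.

```python
FULL_MEMBER_BIT = 0x8000  # assumption: top bit set means 'full membership'
KEY_MASK = 0x7FFF         # assumption: low 15 bits identify the partition


def pkey_accepts(packet_pkey: int, port_pkey: int) -> bool:
    """Return True if a packet's P_Key is acceptable against a port's P_Key."""
    same_partition = (packet_pkey & KEY_MASK) == (port_pkey & KEY_MASK)
    # At least one of the two P_Keys must be a full-membership key.
    one_is_full = bool((packet_pkey | port_pkey) & FULL_MEMBER_BIT)
    return same_partition and one_is_full


print(pkey_accepts(0x8001, 0x0001))  # full sender, limited receiver
print(pkey_accepts(0x0001, 0x0001))  # both limited
```

Under this rule, two limited members of the same partition cannot exchange packets with each other, but each can communicate with a full member.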

Q_Keys:

Q_Keys are used to enforce access rights for reliable and unreliable IB datagram services. Raw datagram services do not use Q_Keys. At communication establishment, the endpoints exchange the Q_Keys and must always use the relevant Q_Keys when communicating with one another. Multicast packets use the Q_Key associated with the multicast group.

Q_Keys with the most significant bit set are considered controlled Q_Keys (such as the General Service Interface (GSI) Q_Key [IB_ARCH]) and a Host Channel Adapter (HCA) does not allow a consumer to arbitrarily specify a controlled Q_Key. An attempt to send a controlled Q_Key results in using the Q_Key in the QP context. Thus, the Operating System maintains control since it can configure the QP context for the controlled Q_Key for privileged consumers. It must be noted that though the notion of a 'controlled Q_Key' is suggested by IB specification, it does not require its use or implementation.

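A minimal sketch of the controlled Q_Key behaviour described above, assuming a 32-bit Q_Key. The function names are hypothetical, and a real HCA enforces this in hardware or firmware rather than in consumer code.

```python
CONTROLLED_BIT = 0x80000000  # most significant bit of an (assumed) 32-bit Q_Key


def is_controlled(qkey: int) -> bool:
    """A Q_Key with the most significant bit set is a controlled Q_Key."""
    return bool(qkey & CONTROLLED_BIT)


def qkey_on_wire(requested_qkey: int, qp_context_qkey: int) -> int:
    """Q_Key that ends up in the outgoing packet.

    If the consumer asks for a controlled Q_Key, the value configured in the
    QP context is used instead of the requested one.
    """
    if is_controlled(requested_qkey):
        return qp_context_qkey
    return requested_qkey


print(hex(qkey_on_wire(0x80010000, 0x00001234)))
print(hex(qkey_on_wire(0x00000ABC, 0x00001234)))
```

Since only the operating system can write the QP context, an unprivileged consumer cannot choose which controlled Q_Key is sent.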

Multicast support:

A switch may support multicasting, that is, replication of packets across multiple output ports. This is an optional feature. Similarly, support for sending/receiving multicast packets is optional in CAs. A multicast group is identified by a GID. The GID format is as defined in RFC 2373 on IPv6 addressing [IB_ARCH]. Thus, from an IPv6-over-InfiniBand point of view, the data link multicast address looks like the network address. An IB port must explicitly join a multicast group by sending a request to the SM to receive multicast packets. A port may send packets to any multicast group. In both cases, the multicast LID to be used in the packets is received from the SM.

There are six methods for data transfer in IB architecture:

1. Unreliable Datagram (unacknowledged - connectionless)

The Unreliable Datagram (UD) service is connectionless and unacknowledged. It allows the QP to communicate with any unreliable datagram QP on any node.

The switches, and hence each link, can support only a certain MTU. The permitted MTU values are 256 octets, 512 octets, 1024 octets, 2048 octets, and 4096 octets. A UD packet cannot be larger than the link MTU between the two peers.
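The MTU constraint can be illustrated with a short sketch. It assumes that the usable MTU between two peers is the smallest MTU of any link on the path, which is a modelling assumption of this example; the function names are hypothetical.

```python
IB_MTUS = (256, 512, 1024, 2048, 4096)  # octets, the values listed above


def path_mtu(link_mtus):
    """Smallest MTU among the links between two peers (assumed model)."""
    if not link_mtus:
        raise ValueError("at least one link is required")
    for mtu in link_mtus:
        if mtu not in IB_MTUS:
            raise ValueError(f"{mtu} is not a valid IB MTU")
    return min(link_mtus)


def fits_in_ud_packet(length, link_mtus):
    """True if a datagram of 'length' octets fits in a single UD packet."""
    return length <= path_mtu(link_mtus)


# Hypothetical path: host link at 4096, inter-switch link at 2048.
print(path_mtu([4096, 2048, 4096]))
print(fits_in_ud_packet(3000, [4096, 2048, 4096]))
```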

2. Reliable Datagram (acknowledged - multiplexed)

The Reliable Datagram (RD) service is multiplexed over connections between nodes called End-to-End Contexts (EEC), which allows each RD QP to communicate with any RD QP on any node with an established EEC. Multiple QPs can use the same EEC and a single QP can use multiple EECs (one for each remote node per reliable datagram domain).

3. Reliable Connected (acknowledged - connection oriented)

The Reliable Connected (RC) service associates a local QP with one and only one remote QP. The message sizes may be as large as 2^31 octets in length. The CA implementation takes care of segmentation and assembly.

4. Unreliable Connected (unacknowledged - connection oriented)

The Unreliable Connected (UC) service associates one local QP with one and only one remote QP. There is no acknowledgement and hence no resend of lost or corrupted packets. Such packets are therefore simply dropped. It is similar to RC otherwise.

5. Raw Ethertype (unacknowledged - connectionless)

The Ethertype raw datagram packet contains a generic transport header that is not interpreted by the CA but it specifies the protocol type. The values for ethertype are the same as defined by Internet Assigned Numbers Authority (IANA) [IANA] for ethertype.

6. Raw IPv6 (unacknowledged - connectionless)

Using IPv6 raw datagram service, the IBA CA can support standard protocol layers atop IPv6 (such as TCP/UDP). Thus, native IPv6 packets can be bridged into the IBA SAN and delivered directly to a port and to its IPv6 raw datagram QP.

The first four types are referred to as IB transports. The latter two are classified as raw datagrams. There is no indication of the QP number in the raw datagram packets. The raw datagram packets are limited by the link MTU in size.

The two connected modes and the Reliable Datagram mode may also support Automatic Path Migration (APM). This is an optional facility that provides for a hardware-based path failover. An alternate path is associated with the QP when the connection/EE context is first created. If unrecoverable errors are encountered, the connection switches to using the alternate path.

1.2.1. InfiniBand Addresses

The InfiniBand architecture borrows heavily from the IPv6 architecture in terms of the InfiniBand subnet structure and GIDs.

The InfiniBand architecture defines the GID associated with a port as a 128-bit unicast or multicast identifier. IBA derives the GID format from the IPv6 address format defined in RFC 2373 [IB_ARCH], with some additional properties and restrictions defined to facilitate efficient discovery, communication, and routing.

Note: The IBA explicitly refers to RFC 2373, which is obsolete [RFC3513]. It must be noted that IBA is therefore unaffected by any further changes that are introduced in IPv6 addressing architecture.

IBA defines two types of GIDs: unicast and multicast.

1.2.1.1. Unicast GIDs

The unicast GIDs are defined, as in IPv6, with three scopes. The IB specification states the following:

a. link local: FE80/10.

The IB routers will not forward packets with a link-local address in source or destination beyond the IB subnet.

b. site local: FEC0/10

A unicast GID used within a collection of subnets that is unique within that collection (e.g., a data center or campus) but is not necessarily globally unique. IB routers must not forward any packets with either a site-local Source GID or a site-local Destination GID outside of the site.

c. global:

A unicast GID with a global prefix; an IB router may use this GID to route packets throughout an enterprise or internet.

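The three unicast scopes listed above can be told apart by prefix alone. The following sketch classifies a unicast GID and applies the link-local forwarding rule; the function names are hypothetical, and treating every other unicast prefix as global is a simplification made for this example.

```python
import ipaddress

LINK_LOCAL = ipaddress.ip_network("fe80::/10")
SITE_LOCAL = ipaddress.ip_network("fec0::/10")
MULTICAST = ipaddress.ip_network("ff00::/8")


def unicast_gid_scope(gid: str) -> str:
    """Classify a unicast GID as link-local, site-local, or global."""
    addr = ipaddress.IPv6Address(gid)
    if addr in MULTICAST:
        raise ValueError("not a unicast GID")
    if addr in LINK_LOCAL:
        return "link-local"
    if addr in SITE_LOCAL:
        return "site-local"
    return "global"  # simplification: everything else is treated as global


def may_leave_ib_subnet(src_gid: str, dst_gid: str) -> bool:
    """IB routers do not forward packets with a link-local source or destination."""
    return "link-local" not in (unicast_gid_scope(src_gid), unicast_gid_scope(dst_gid))


print(unicast_gid_scope("fe80::2:c903:0:1234"))
print(may_leave_ib_subnet("fec0::1", "fec0:0:0:1::2"))
```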

1.2.1.2. Multicast GIDs

The multicast GIDs also parallel the IPv6 multicast addresses. The IB specification defines the multicast GIDs as follows:

      FFxy:<112 bits>
        

Flag bits:

The nibble denoted by x above holds the 4 flag bits: 000T.

The first 3 bits are reserved and are set to zero. The last bit is defined as follows:

            T=0: denotes a permanently assigned, that is, well-known GID
            T=1: denotes a transient group
        

Scope bits:

The 4 bits, denoted by y in the GID above, are the scope bits. These scope values are described in Table 1.

      scope value       Address value

         0              Reserved
         1              Unassigned
         2              Link-local
         3              Unassigned
         4              Unassigned
         5              Site-local
         6              Unassigned
         7              Unassigned
         8              Organization-local
         9              Unassigned
         0xA            Unassigned
         0xB            Unassigned
         0xC            Unassigned
         0xD            Unassigned
         0xE            Global
         0xF            Reserved

                          Table 1
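The FFxy layout and the scope values of Table 1 can be decoded mechanically. The sketch below extracts the transient flag and the scope from a multicast GID written in IPv6 notation; the function name parse_mgid and the returned dictionary keys are hypothetical.

```python
import ipaddress

# Assigned scope values from Table 1; 0 and 0xF are reserved, the rest unassigned.
SCOPES = {0x2: "link-local", 0x5: "site-local",
          0x8: "organization-local", 0xE: "global"}


def parse_mgid(mgid: str) -> dict:
    """Decode the flag and scope nibbles of a multicast GID (FFxy:<112 bits>)."""
    top16 = int(ipaddress.IPv6Address(mgid)) >> 112
    if top16 >> 8 != 0xFF:
        raise ValueError("not a multicast GID")
    flags = (top16 >> 4) & 0xF   # the nibble x: 000T
    scope = top16 & 0xF          # the nibble y
    if flags & 0xE:
        raise ValueError("reserved flag bits must be zero")
    if scope in (0x0, 0xF):
        name = "reserved"
    else:
        name = SCOPES.get(scope, "unassigned")
    return {"transient": bool(flags & 0x1), "scope": scope, "scope_name": name}


print(parse_mgid("ff12::1"))   # T=1, scope 2
print(parse_mgid("ff0e::1"))   # T=0, scope 0xE
```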

The IB specification further refers to RFC 2373 and RFC 2375 while defining the well-known multicast addresses. However, it then states that the well-known addresses apply to IB raw IPv6 datagrams only. It must be noted though that a multicast group can be associated with only a single Multicast Global Identifier (MGID). Thus the same MGID cannot be associated with the UD mode and the Raw Datagram mode.

1.3. InfiniBand Multicast Group Management

IB multicast groups, identified by MGIDs, are managed by the SM. The SM explicitly programs the IB switches in the fabric to ensure that the packets are received by all the members of the multicast group that request the reception of packets. The SM also needs to program the switches such that packets transmitted to the group by any group member reach all receivers in the multicast group.

IBA distinguishes between multicast senders and receivers. Though all members of a multicast group can transmit to the group (and expect their packets to be correctly forwarded), not all members of the group are receivers. A port needs to explicitly request that multicast packets addressed to the group be forwarded to it.

A multicast group is created by sending a join request to the SM. As will be explained later, IBA defines multiple modes for joining a multicast group. The subnet manager records the group's multicast GID and the associated characteristics. The group characteristics are defined by the group path MTU, whether the group will be used for raw datagrams or unreliable datagrams, the service level, the partition key associated with the group, the Local Identifier (LID) associated with the group, and so on. These characteristics are defined at the time of the group creation. The interested reader may look up the 'MCMemberRecord' attribute in the IB architecture specification [IB_ARCH] for the complete list of characteristics that define a group.

A LID is associated with the multicast group by the SM at the time of the multicast group creation. The SM determines the multicast tree based on all the group members and programs the relevant switches. The Multicast LID (MLID) is used by the switches to route the packets.

Any member IB port wanting to participate in the multicast group must join the group. As part of the join operation, the node receives the group characteristics from the SM. At the same time, the subnet manager ensures that the requester can indeed participate in the group by verifying that it can support the group MTU and its accessibility to the rest of the group members. Other group characteristics may need verification too.

The SM, for groups that span IB subnet boundaries, must interact with IB routers to determine the presence of this group in other IB subnets. If present, the MTU must match across the IB subnets.

P_Key is another characteristic that must match across IB subnets since the P_Key inserted into a packet is not modified by the IB switches or IB routers. Thus, if the P_Keys did not match, the IB routers themselves might drop the packets, or destinations on other subnets might drop them.

A join operation may cause the SM to reprogram the fabric so that the new member can participate in the multicast group. By the same token, a leave may cause the SM to reprogram the fabric to stop forwarding the packets to the requester.

1.3.1. Multicast Member Record

The multicast group is maintained by the SM with each of the group members represented by an MCMemberRecord [IB_ARCH]. Some of its components are the following:

   MGID      - Multicast GID for this multicast group
   PortGID   - Valid GID of the port joining this multicast group
   Q_Key     - Q_Key to be used by this multicast group
   MLID      - Multicast LID for this multicast group
   MTU       - MTU for this multicast group
   P_Key     - Partition key for this multicast group
   SL        - Service level for this multicast group
   Scope     - Same as MGID address scope
   JoinState - Join/Leave status requested by the port:
               bit 0: FullMember
               bit 1: NonMember
               bit 2: SendOnlyNonMember
        
1.3.1.1. JoinState

The JoinState indicates the membership qualities a port wishes to add when joining or creating a group, or to delete when leaving a group. The meanings of the JoinState bits are as follows:

FullMember: Messages destined for the group are routed to and from the port. A group may be deleted by the SM if there are no FullMembers in the group.

NonMember: Messages destined for the group are routed to and from the port. The port is not considered a member for purposes of group creation/deletion.

SendOnlyNonMember: Group messages are only routed from the port but not to the port. The port is not considered a member for purposes of group creation/deletion.

A port may have multiple bits set in its record. In such a case, the membership qualities are a union of the JoinStates. A port may leave the multicast group for each of the JoinStates individually or in any combination of JoinState bits [IB_ARCH].

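The JoinState bits behave like a small flag set: joins add bits and leaves clear them. The sketch below models this with Python's IntFlag, using the bit positions given in the MCMemberRecord description; the class and helper names are hypothetical.

```python
from enum import IntFlag


class JoinState(IntFlag):
    FULL_MEMBER = 1 << 0            # bit 0
    NON_MEMBER = 1 << 1             # bit 1
    SEND_ONLY_NON_MEMBER = 1 << 2   # bit 2


def join(current: JoinState, requested: JoinState) -> JoinState:
    """A join adds the requested membership qualities (union of bits)."""
    return current | requested


def leave(current: JoinState, released: JoinState) -> JoinState:
    """A leave removes any combination of JoinState bits."""
    return current & ~released


def receives(state: JoinState) -> bool:
    """Group messages are routed to FullMembers and NonMembers only."""
    return bool(state & (JoinState.FULL_MEMBER | JoinState.NON_MEMBER))


state = join(JoinState(0), JoinState.FULL_MEMBER | JoinState.SEND_ONLY_NON_MEMBER)
print(receives(state))
state = leave(state, JoinState.FULL_MEMBER)
print(state == JoinState.SEND_ONLY_NON_MEMBER, receives(state))
```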

1.3.2. Join and Leave Operations

An IB port joins a multicast group by sending a join request (SubnAdmSet() method) and leaves a multicast group by sending a leave message (SubnAdmDelete() method) to the SM. The IBA specification [IB_ARCH] describes the methods and attributes to be used when sending these messages.

1.3.2.1. Creating a Multicast Group

There is no 'create' command to form a new multicast group. The FullMember bit in the JoinState must be set to create a multicast group. In other words, the first FullMember join request will cause the group to be created as a side effect of the join request. Subsequent join or leave requests may contain any combination of the JoinState bits.

The creator of the group specifies the Q_Key, MTU, P_Key, SL, FlowLabel, TClass, and the Scope value. A creator may request that a suitable MGID be created for it. Alternatively, the request can specify the desired MGID. In both cases, the MLID is assigned by the SM.

Thus, a group will be created with the specified values when the requester sets the FullMember bit and no such group already exists in the subnet.

1.3.2.2. Deleting a Multicast Group

When the last FullMember leaves the multicast group, the SM may delete the group and release all resources associated with it, including any that exist in the fabric itself.

Note that a special 'delete' message does not exist. It is a side effect of the last FullMember 'leave' operation.
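
The create and delete side effects described in sections 1.3.2.1 and 1.3.2.2 can be sketched as a toy model of the SM's group table. This is only an illustration: the class and method names, the JoinState bit values, and the starting MLID are assumptions, and the model deletes a group immediately, whereas the SM is merely permitted to do so.

```python
from dataclasses import dataclass, field

# JoinState bits; the numeric values are illustrative assumptions.
FULL_MEMBER = 0x1
NON_MEMBER = 0x2
SEND_ONLY_NON_MEMBER = 0x4


@dataclass
class Group:
    mlid: int
    members: dict = field(default_factory=dict)  # port -> JoinState bits


class SubnetManagerModel:
    """Hypothetical stand-in for the SM's multicast group table."""

    def __init__(self, first_mlid=0xC000):  # starting MLID is illustrative
        self.groups = {}  # MGID -> Group
        self._next_mlid = first_mlid

    def join(self, port, mgid, join_state):
        group = self.groups.get(mgid)
        if group is None:
            # There is no 'create' command: only a FullMember join
            # creates the group, as a side effect of the join.
            if not join_state & FULL_MEMBER:
                raise LookupError("group does not exist")
            group = Group(mlid=self._next_mlid)  # MLID is assigned by the SM
            self._next_mlid += 1
            self.groups[mgid] = group
        # Several bits on one port form a union of membership qualities.
        group.members[port] = group.members.get(port, 0) | join_state
        return group.mlid

    def leave(self, port, mgid, join_state):
        group = self.groups[mgid]
        remaining = group.members.get(port, 0) & ~join_state
        if remaining:
            group.members[port] = remaining
        else:
            group.members.pop(port, None)
        # There is no 'delete' message: the group goes away as a side
        # effect of the last FullMember leaving (done eagerly here).
        if not any(bits & FULL_MEMBER for bits in group.members.values()):
            del self.groups[mgid]
```

In this model a SendOnlyNonMember join to a non-existent group fails, and a group with only non-member ports left is removed as soon as its last FullMember leaves.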

1.3.2.3. Multicast Group Create/Delete Traps

A port may request the SA to generate a report whenever a multicast group is created or deleted. The port can specify the multicast group(s) it is interested in by MGID or by submitting a wildcard request. The SA reports these events using traps 66 (for creates) and 67 (for deletes) [IB_ARCH].

Therefore, a port that wishes to join a group without creating it may request a create notification. A port might even request notification for all groups that are created (a wildcard request). The SA informs such ports of each creation using trap 66. The requester can then join the multicast group indicated. Similarly, a SendOnlyNonMember or a NonMember might request the SA to inform it of group deletions. The endnode, on receiving a delete report, can safely release the resources associated with the group. The associated MLID is no longer valid for the group and may be reassigned to a new multicast group by the SM.

2. Management of InfiniBand Subnet

To aid in the monitoring and configuration of InfiniBand subnet components, a set of MIB modules needs to be defined. MIB modules are needed for the channel adapters, InfiniBand interfaces, InfiniBand subnet manager, and InfiniBand subnet management agents and to allow the management of specific device properties. It must be noted that the management objects addressed in the IPoIB documents are for all of the IB subnet components and are not limited to IP (over IB). The relevant MIB modules are described in separate documents and are not covered here.

3. IP over IB

As described in section 1.0, the InfiniBand architecture provides a broad set of capabilities to choose from when implementing IP over InfiniBand networks.

The IPoIB specification must not, and does not, require changes in IP and higher-layer protocols. Nor does it require IP stacks to implement special user-level programs. It is an aim of the IPoIB specification that the IPoIB changes be amenable to modularization and incorporation into existing implementations at the same level as other media types.

3.1. InfiniBand as Datalink

As noted above, the InfiniBand architecture provides multiple methods of data exchange between two endpoints. These are the following:

   Reliable Connected (RC)
   Reliable Datagram (RD)
   Unreliable Connected (UC)
   Unreliable Datagram (UD)
   Raw Datagram : Raw IPv6 (R6)
                : Raw Ethertype (RE)

IPoIB can be implemented over any, multiple, or all of these services. A case can be made for support on any of the transport methods depending on the desired features.

The IB specification requires Unreliable Datagram mode to be supported by all IB nodes. The host channel adapters (HCAs) are specifically required to support Reliable Connected (RC) and Unreliable Connected (UC) modes, but the same is not the case with target channel adapters (TCAs). Support for the two Raw Datagram modes is entirely optional. The Raw Datagram mode supports a 16-bit Cyclic Redundancy Check (CRC), as compared to the better protection provided by the 32-bit CRC used in other modes.

For the sake of simplicity, ease of implementation, and integration with existing stacks, it is desirable that the fabric support multicasting. This is possible only in Unreliable Datagram (UD) and IB's Raw Datagram modes.

Thus, it is only the UD mode that is universal, supports multicast, and supports a robust CRC. Given these conditions it is the obvious choice for IP over InfiniBand [RFC4391].
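
The selection argument above can be restated as a filter over the properties this section attributes to each transport. The sketch below is only a restatement of the text: the type and field names are hypothetical, and RD is marked as not universal solely because the text names UD as the only universally required mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transport:
    name: str
    required_on_all_nodes: bool  # must every IB node support it?
    multicast: bool              # can it carry IB multicast?
    crc_bits: int                # width of the CRC protecting the packet


# Properties as stated in section 3.1 (names here are illustrative).
TRANSPORTS = [
    Transport("RC", False, False, 32),   # required on HCAs, not on TCAs
    Transport("RD", False, False, 32),
    Transport("UC", False, False, 32),   # required on HCAs, not on TCAs
    Transport("UD", True, True, 32),
    Transport("Raw IPv6", False, True, 16),       # optional
    Transport("Raw Ethertype", False, True, 16),  # optional
]


def ipoib_candidates(transports):
    """Keep transports that are universal, multicast-capable, and 32-bit CRC."""
    return [
        t.name
        for t in transports
        if t.required_on_all_nodes and t.multicast and t.crc_bits == 32
    ]
```

Applied to the list above, the filter leaves UD as the only candidate.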

Future documents might consider the connected modes. In contrast to the limited link MTU offered by UD mode, the connected modes can offer significant benefit in terms of performance by utilizing a larger MTU. Reliability is also enhanced if the underlying feature of automatic path migration of connected modes is utilized.

3.2. Multicast Support

The InfiniBand specification makes support of multicasting in the switches optional. Multicast, however, is a basic requirement in IP networks. Therefore, IPoIB requires that multicast-capable InfiniBand fabrics be used to implement IPoIB subnets.

3.2.1. Mapping IP Multicast to IB Multicast

Well-known IP multicast groups are defined for both IPv4 and IPv6 [IANA, RFC3513]. Multicast groups may also be dynamically created at any time. To avoid creating unnecessary duplicates of multicast packets in the fabric, and to avoid unnecessary handling of such packets at the hosts, each of the IP multicast groups needs to be associated with a different IB multicast group as far as possible. A process is defined in [RFC4391] for mapping the IP multicast addresses to unique IB multicast addresses.

3.2.2. Transient Flag in IB MGIDs

The IB specification describes the flag bits as discussed in section 1.2. The IB specification also defines some well-known IB MGIDs. These well-known MGIDs are reserved for IB's Raw Datagram mode, which is incompatible with the other IB transports. Any mapping that is defined from IP multicast addresses must therefore not fall into IB's definition of a well-known address.

Therefore, all IPoIB-related multicast GIDs set the transient bit.

3.3. IP Subnets Across IB Subnets

Some implementations may wish to support multiple clusters of machines in their own IB subnets but otherwise be part of a common IP subnet. For such a solution, the IB specification needs multiple upgrades. Some of the required enhancements are as follows:

1) A method for creating IB multicast GIDs that span multiple IB subnets. The partition keys and other parameters need to be consistent across IB subnets.

2) An IB routing protocol to determine the IB topology across IB subnets.

3) A definition of the process and protocols needed between IB nodes and IB routers.

Until the above conditions are met, it is not possible to implement IPoIB subnets that span IB subnets. The IPoIB standards have, however, been defined with this possibility in mind.

4. IP Subnets in InfiniBand Fabrics

The IPoIB subnet is overlaid over the IB subnet. The IPoIB subnet is brought up in the following steps:

Note: the join/leave operations at the IP level will be referred to as IP_join/IP_leave and the join/leave operations at the IB level will be referred to as IB_join/IB_leave in this document.

1. The all-IPoIB nodes IB multicast group is created

The fabric administrator creates an IB multicast group (henceforth called 'broadcast group') when the IP subnet is set up. The 'broadcast group' is defined in [RFC4391]. The method by which the broadcast group is set up is not defined by IPoIB. The group may be set up at the SM by the administrator or by the first IB_join.

As noted earlier, at the time of creating an IB multicast group, multiple values such as the P_Key, Q_Key, Service Level, Hop Limit, Flow ID, TClass, MTU, etc. have to be specified. These values should be such that all potential members of the IB multicast group are able to communicate with one another when using them. In the future, as the IB specification associates more meaning with the various parameters and defines IB Quality of Service (QoS), different values for IP multicast traffic may be possible. All unicast packets also need to use the P_Key and Q_Key specified in the broadcast group [RFC4391]. A well-thought-out configuration is therefore required for a successful setup of the IPoIB subnet.

2. All IPoIB interfaces IB_join the broadcast group

The broadcast group defines the span and the members of the IPoIB link. This link gets built up as IPoIB nodes IB_join the broadcast group.

The IB_join to the broadcast group has the additional benefit of distributing the above-mentioned multicast group parameters to all the members of the subnet.

Note that this IB_join to the broadcast group is a FullMember join. If any of the ports or the switches linking the port to the rest of the IPoIB subnet cannot support the parameters (e.g., path MTU or P_Key) associated with the broadcast group, then the IB_join request will fail and the requesting port will not become part of the IPoIB subnet.

3. Configuration Parameters

As noted above, parameters such as Q_Key and Path MTU, which are needed for all IPoIB communication, are returned to the IPoIB node on IB_joining the 'broadcast group'. [RFC4391] also notes that the parameters used in the broadcast group are used when creating other multicast groups.

However, the P_Key must still be known to the IPoIB endnode before it can join the broadcast group. The P_Key is included in the mapping of the broadcast group [RFC4391]. Another parameter, the scope of the broadcast group, also needs to be known to the endnode before it can join the broadcast group. How the P_Key and the scope bits related to the IPoIB subnet are determined is an implementation choice. These could be configuration parameters initialized by some means by the administrator.

The methods employed by an implementation to determine the P_Key and scope bits are not specified by IPoIB.

4.1. IPoIB VLANs

The endpoints in an IB subnet must have compatible P_Keys to communicate with one another. Thus, the administrator when setting up an IP subnet over an IB subnet must ensure that all the members have compatible P_Keys. An IP subnet can have only one P_Key associated with it to ensure that all IP nodes in it can talk to one another. An endpoint may, however, have multiple P_Keys.

The IB architecture specifies that there can be only one MGID associated with a multicast group in the IB subnet. The P_Key is included in the MGID mappings from the IP multicast addresses [RFC4391]. Since the P_Key is unique in the IB subnet, the inclusion of the P_Key in the IB MGIDs ensures that unique MGID mappings are created. Every unique broadcast group MGID so formed creates a separate abstract IPoIB link and hence an IPoIB VLAN.
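
The effect of embedding the P_Key in the MGID can be illustrated by constructing the IPv4 broadcast MGID for two partitions. This is a sketch under stated assumptions: the field layout (0xFF prefix, 4-bit flags, 4-bit scope, 16-bit IPoIB signature, 16-bit P_Key, 80-bit group ID), the IPv4 signature 0x401B, and the all-ones low 32 bits for broadcast are taken from the mapping in [RFC4391] and should be checked against that document. The function name and the example P_Key values are hypothetical.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B   # IPoIB signature for IPv4, per [RFC4391] (assumed)
TRANSIENT_FLAG = 0x1      # IPoIB MGIDs always set the transient bit
LINK_LOCAL_SCOPE = 0x2    # example scope; the real value is configured


def broadcast_mgid(p_key, scope=LINK_LOCAL_SCOPE):
    """Build the IPv4 broadcast MGID for the partition named by p_key."""
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("P_Key is a 16-bit value")
    if not 0 <= scope <= 0xF:
        raise ValueError("scope is a 4-bit value")
    value = (
        (0xFF << 120)
        | (TRANSIENT_FLAG << 116)
        | (scope << 112)
        | (IPV4_SIGNATURE << 96)
        | (p_key << 80)
        | 0xFFFFFFFF          # broadcast group ID in the low 32 bits
    )
    return ipaddress.IPv6Address(value)


# Two partitions yield two distinct broadcast groups, hence two IPoIB links.
vlan_a = broadcast_mgid(0x8001)
vlan_b = broadcast_mgid(0x8002)
```

Because the P_Key occupies its own field, two IPoIB subnets configured with different P_Keys can never collide on the same broadcast MGID.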

4.2. Multicast in IPoIB subnets

IP multicast on InfiniBand subnets follows the same concepts and rules as on any other media. However, unlike most other media, multicast over InfiniBand requires interaction with another entity, the IB subnet manager. This section outlines the process and suggests some guidelines.

IB architecture specifies the following format for IB multicast packets when used over Unreliable Datagram (UD) mode:

   +--------+-------+---------+---------+-------+---------+---------+
   |Local   |Global |Base     |Datagram |Packet |Invariant| Variant |
   |Routing |Routing|Transport|Extended |Payload| CRC     |  CRC    |
   |Header  |Header |Header   |Transport| (IP)  |         |         |
   |        |       |         |Header   |       |         |         |
   +--------+-------+---------+---------+-------+---------+---------+
        

For details about the various headers please refer to InfiniBand Architecture Specification [IB_ARCH].

The Global Routing Header (GRH) includes the IB multicast group GID. The Local Routing Header (LRH) includes the Local Identifier (LID). The IB switches in the fabric route the packet based on the LID.

The GID is made available to the receiving IB user (the IPoIB interface driver for example). The driver can therefore determine the IB group the packet belongs to.

IPv4 defines three levels of multicast conformance [RFC1112].

Level 0: No support for IP multicasting

Level 1: Support for sending but not receiving multicasts

Level 2: Full support for IP multicasting

In IPv6, there is no such distinction. Full multicast support is mandatory. In addition, all IPv4 subnets support broadcast (255.255.255.255). IPv4 broadcast can always be sent/received by all IPv4 interfaces.

Every IPoIB subnet requires the broadcast GID to be defined. Thus, a packet can always be broadcast.

4.2.1. Sending IP Multicast Datagrams

An IP host may send a multicast packet at any time to any multicast address.

The IP layer conveys the multicast packet to the IPoIB interface driver/module. This module attempts to IB_join the relevant IB multicast group. This is required since otherwise InfiniBand architecture does not guarantee that the packet will reach its destinations.

A pure sender may choose to join the multicast group as a FullMember. In such a case, the sender will receive all the multicast packets transmitted to the IB group. In addition, the IB group will not be deleted until the sender leaves the group.

Alternatively, a sender might IB_join as a SendOnlyNonMember. In such a case, the packets are not routed to the sender though packets transmitted by it can reach the other group members. In addition, the group can be deleted when all FullMembers have left the group. The sender can further request delete updates from the SM.

If the sender does not find the group in existence, it is recommended in [RFC4391] that the packets be sent to the MGID corresponding to the all-IP routers address. A sender could also send the packets to the broadcast group. The sender might also choose to request 'creation' reports from the SM.

4.2.2. Receiving Multicast Packets

The IP host must join the IB multicast group corresponding to the IP address. This follows from the IBA requirement that the receiver must join the relevant IB multicast group. The group is automatically created if it does not exist [IB_ARCH].

The IP receivers must IB_leave the IB group when the IP layer stops listening on the corresponding IP multicast address. The SM can then choose to delete the group.

4.2.3. Router Considerations for IPoIB

IP routers know of the new IP groups created in the subnet by the use of protocols such as Internet Group Management Protocol (IGMPv3) / Multicast Listener Discovery (MLD) [RFC3376, RFC2710]. However, this is not enough for IPoIB since the router needs to IB_join the relevant IB groups to be able to receive and transmit the packets. There is no promiscuous mode for listening to all packets.

The IPoIB routers therefore need to request the SM to report all creations of IB groups in the fabric. The IPoIB router can then IB_join the reported group. It is not desirable that the router's IB_joining of a multicast group be considered the same as the IB_join from a receiver -- the router's IB_join should not disallow the group's deletion when all receivers leave. To overcome just this type of situation, IBA provides the NonMember IB_join mode.

The NonMember IB_join mode can be used by IP routers when they join in response to the create reports. A router should ideally request the delete reports too so that it can release all the resources associated with the group. The MLID associated with a deleted MGID can be reassigned by the SM, and therefore there is a possibility of erroneous transmissions if the MLID is cached. A router that does not request delete reports will still work correctly since it will receive the correct MLID, and purge any old cached value, when it IB_joins the IB group in response to a create report.

It is reasonable for a router to IB_join as a FullMember if it is joining the IB group in response to an application/routing daemon request. In such a case, the router might end up controlling the existence of the IB group (since it is a FullMember of the group).
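
The report handling described in this section can be sketched as a small MLID cache driven by the create and delete traps. The trap numbers 66 and 67 come from section 1.3.2.3. Everything else is an assumption for illustration: the class name, the `ib_join` callable that stands in for the join request to the SA, and the NonMember bit value.

```python
TRAP_GROUP_CREATED = 66
TRAP_GROUP_DELETED = 67
NON_MEMBER = 0x2  # JoinState bit; the numeric value is an assumption


class RouterGroupCache:
    """Hypothetical MGID -> MLID cache kept by an IPoIB router."""

    def __init__(self, ib_join):
        # ib_join(mgid, join_state) -> mlid is a placeholder for the
        # real join request; it is not an API of any library.
        self._ib_join = ib_join
        self.mlids = {}

    def on_report(self, trap, mgid):
        if trap == TRAP_GROUP_CREATED:
            # A NonMember join does not keep the group alive, and the
            # returned MLID overwrites any stale cached value.
            self.mlids[mgid] = self._ib_join(mgid, NON_MEMBER)
        elif trap == TRAP_GROUP_DELETED:
            # The MLID may be reassigned by the SM, so forget it.
            self.mlids.pop(mgid, None)
```

A router that never receives delete reports still ends up with the right MLID in this sketch, because each create report triggers a fresh join that replaces the cached entry.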

4.2.4. Impact of InfiniBand Architecture Limits

An HCA or TCA may have a limit on the number of MGIDs it can support. Thus, even though the groups may not be limited at the subnet manager and in the subnet as such, they may be limited at a particular interface. It is advisable to choose an adequately provisioned HCA/TCA when setting up an IPoIB subnet.

4.2.5. Leaving/Deleting a Multicast Group

An IPv4 sender (level 1 compliance) IB_joins the IB multicast group only because that is the only way to guarantee reception of the packets by all the group recipients. The sender must, however, IB_leave the group at some time. A sender could, when not a receiver on the group, start a timer per multicast group sent to. The sender leaves the IB group when the timer goes off. It restarts the timer if another message is sent.

This suggestion does not apply to the IB broadcast group. It also does not apply to the IB group corresponding to the all-hosts multicast group. An IPv4 host must always remain a member of the broadcast group.
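
The per-group timer suggested above might look like the following sketch. It is one possible arrangement, not a prescribed one: the class name, the `ib_join` and `ib_leave` callables, the 60-second default, and the polling-style `expire()` method are all assumptions. Groups listed in `exempt` (for example, the broadcast group) are never timed out and are assumed to be joined elsewhere.

```python
import time


class SenderGroupTimers:
    """Hypothetical idle timers for groups a host only sends to."""

    def __init__(self, ib_join, ib_leave, idle_seconds=60.0,
                 exempt=frozenset(), clock=time.monotonic):
        self._ib_join = ib_join      # placeholder for the IB_join request
        self._ib_leave = ib_leave    # placeholder for the IB_leave request
        self._idle = idle_seconds
        self._exempt = exempt
        self._clock = clock
        self._deadline = {}          # MGID -> time at which to IB_leave

    def on_send(self, mgid):
        """Call before transmitting a multicast packet to mgid."""
        if mgid in self._exempt:
            return
        if mgid not in self._deadline:
            self._ib_join(mgid)
        # Each send (re)starts the timer for that group.
        self._deadline[mgid] = self._clock() + self._idle

    def expire(self):
        """IB_leave every group whose timer has gone off."""
        now = self._clock()
        expired = [g for g, t in self._deadline.items() if t <= now]
        for mgid in expired:
            del self._deadline[mgid]
            self._ib_leave(mgid)
```

Passing the clock in as a parameter keeps the timer logic testable without waiting for real time to elapse.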

An IP multicast receiver IB_leaves the corresponding IB multicast group when it IP_leaves the IP multicast group. In the case of IPv4 implementation, the receiver may choose to continue to be a sender (level 1 compliance), in which case it may choose not to IB_leave the IB group but start a timer as explained above.

As noted elsewhere, the SM can choose to free up the resources (e.g., routing entries in the switches) associated with the IB group when the last FullMember IB_leaves the group. The MLID therefore becomes invalid for the group. The MLID can be reassigned when a new group is created.

SendOnlyNonMember/NonMember ports that cache the MLID need to avoid transmitting to an MLID that has become invalid or has been reassigned. The way out is for them to request group delete reports. An IP router requesting create reports for all groups need not request the delete reports, since an IB_join in response to a create report will return the new MLID association to it.

A router might prefer to IB_leave the IB multicast group when there are no members of the IP multicast address in the subnet and it has no explicit knowledge of any need to forward such packets.

4.3. Transmission of IPoIB Packets

The encapsulation of IP packets in InfiniBand is described in [RFC4391].

It specifies the use of an 'Ethertype' value [IANA] in all IPoIB communication packets. The link-layer address is comprised of the GID and the Queue Pair Number (QPN) [RFC4391].

To enable IPoIB subnets to span multiple IB subnets, the specification utilizes the GID as part of the link-layer address. Since all packets in IB have to use the Local Identifier (LID), the address resolution process has the additional step of resolving the destination GID, returned in response to an Address Resolution Protocol (ARP) / Neighbor Discovery (ND) request, to the LID [RFC4391]. This phase of address resolution might also be used to determine other essential parameters (e.g., the SL, path rate, etc.) for successful IB communication between two peers.
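
The two-step resolution described above can be sketched with plain dictionaries standing in for the neighbor cache and for the SA's path information. The 20-octet layout used here (one reserved octet, a 24-bit QPN, then the 128-bit GID) follows the link-layer address format in [RFC4391] as understood at the time of writing and should be verified there. The function names and the shape of the lookup tables are hypothetical.

```python
import ipaddress


def pack_link_address(qpn, gid):
    """Encode QPN and GID as a 20-octet IPoIB link-layer address."""
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QPN is a 24-bit value")
    reserved = bytes([0])
    return reserved + qpn.to_bytes(3, "big") + ipaddress.IPv6Address(gid).packed


def unpack_link_address(addr):
    """Split a 20-octet link-layer address into (QPN, GID)."""
    if len(addr) != 20:
        raise ValueError("IPoIB link-layer address must be 20 octets")
    qpn = int.from_bytes(addr[1:4], "big")
    gid = ipaddress.IPv6Address(addr[4:])
    return qpn, gid


def resolve(ip, neighbor_table, path_records):
    """Resolve an IP address to the values needed to address a UD packet.

    Step 1 models ARP/ND: IP address -> link-layer address (QPN + GID).
    Step 2 models the extra IB step: GID -> LID, plus other path
    parameters such as the SL.
    """
    qpn, gid = unpack_link_address(neighbor_table[ip])
    path = path_records[gid]
    return {"qpn": qpn, "gid": gid, "lid": path["lid"], "sl": path["sl"]}
```

Keeping the GID rather than the LID in the link-layer address is what leaves room for peers in other IB subnets, at the cost of the second lookup.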

As noted earlier, all communication in the IPoIB subnet derives the Q_Key to use from the Q_Key specified in the broadcast group.

4.4. Reverse Address Resolution Protocol (RARP) and Static ARP Entries

RARP entries or static ARP entries are based on invariant link addresses. In the case of IPoIB, the link address includes the QPN, which might not be constant across reboots or even across network interface resets. Therefore, static ARP entries or RARP server entries will only work if the implementation(s) using these options can ensure that the QPN associated with an interface is invariant across reboots/network resets [RFC4391].

4.5. DHCPv4 and IPoIB

DHCPv4 [RFC2131] provides a 'client hardware address' (chaddr) field of 16 octets to hold the link-layer address. The link-layer address in the case of IPoIB is 20 octets and does not fit in this field. To get around this problem, IPoIB specifies [RFC4390] that the 'broadcast flag' be used by the client when requesting an IP address.

5. QoS and Related Issues

The IB specification suggests the use of service levels for load balancing, QoS, and deadlock avoidance within an IB subnet. But the IB specification leaves the usage and mode of determination of the SL for the application to decide. The SL and list of SLs are available in the SA, but it is up to the endnode's application to choose the 'right' value.

Every IPoIB implementation will determine the relevant SL value based on its own policy. No method or process for choosing the SL has been defined by the IPoIB standards.

6. Security Considerations

This document describes the IB architecture as relevant to IPoIB. It further restates issues specified in other documents. It does not itself specify any requirements. There are no security issues introduced by this document. IPoIB-related security issues are described in [RFC4391] and [RFC4390].

7. Acknowledgements

This document has benefited from the comments and suggestions of the members of the IPoIB working group and the members of the InfiniBand(SM) Trade Association.

8. References

8.1. Normative References

[IB_ARCH] InfiniBand Architecture Specification, Volume 1, Release 1.2, October, 2004.

[RFC4391] Chu, J. and V. Kashyap, "Transmission of IP over InfiniBand (IPoIB)", RFC 4391, April 2006.

[RFC4390] Kashyap, V., "Dynamic Host Configuration Protocol (DHCP) over InfiniBand", RFC 4390, April 2006.

[RFC2131] Droms, R., "Dynamic Host Configuration Protocol", RFC 2131, March 1997.

8.2. Informative References

[RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6 (IPv6) Addressing Architecture", RFC 3513, April 2003.

[RFC2375] Hinden, R. and S. Deering, "IPv6 Multicast Address Assignments", RFC 2375, July 1998.

[IANA] Internet Assigned Numbers Authority, URL http://www.iana.org

[RFC1112] Deering, S., "Host extensions for IP multicasting", STD 5, RFC 1112, August 1989.

[RFC3376] Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. Thyagarajan, "Internet Group Management Protocol, Version 3", RFC 3376, October 2002.

[RFC2710] Deering, S., Fenner, W., and B. Haberman, "Multicast Listener Discovery (MLD) for IPv6", RFC 2710, October 1999.

Author's Address

   Vivek Kashyap
   IBM
   15450, SW Koll Parkway
   Beaverton, OR 97006

   Phone: +1 503 578 3422
   EMail: vivk@us.ibm.com

Full Copyright Statement

Copyright (C) The Internet Society (2006).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA).
