Internet Engineering Task Force (IETF)                         T. Narten
Request for Comments: 6820                               IBM Corporation
Category: Informational                                         M. Karir
ISSN: 2070-1721                                       Merit Network Inc.
                                                                  I. Foo
                                                     Huawei Technologies
                                                            January 2013
        

Address Resolution Problems in Large Data Center Networks

Abstract

This document examines address resolution issues related to the scaling of data centers with a very large number of hosts. The scope of this document is relatively narrow, focusing on address resolution (the Address Resolution Protocol (ARP) in IPv4 and Neighbor Discovery (ND) in IPv6) within a data center.

Status of This Memo

This document is not an Internet Standards Track specification; it is published for informational purposes.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Not all documents approved by the IESG are a candidate for any level of Internet Standard; see Section 2 of RFC 5741.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc6820.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Terminology
   3. Background
   4. Address Resolution in IPv4
   5. Address Resolution in IPv6
   6. Generalized Data Center Design
      6.1. Access Layer
      6.2. Aggregation Layer
      6.3. Core
      6.4. L3/L2 Topological Variations
           6.4.1. L3 to Access Switches
           6.4.2. L3 to Aggregation Switches
           6.4.3. L3 in the Core Only
           6.4.4. Overlays
      6.5. Factors That Affect Data Center Design
           6.5.1. Traffic Patterns
           6.5.2. Virtualization
           6.5.3. Summary
   7. Problem Itemization
      7.1. ARP Processing on Routers
      7.2. IPv6 Neighbor Discovery
      7.3. MAC Address Table Size Limitations in Switches
   8. Summary
   9. Acknowledgments
   10. Security Considerations
   11. Informative References
        
1. Introduction

This document examines issues related to the scaling of large data centers. Specifically, this document focuses on address resolution (ARP in IPv4 and Neighbor Discovery in IPv6) within the data center. Although strictly speaking the scope of address resolution is confined to a single L2 broadcast domain (i.e., ARP runs at the L2 layer below IP), the issue is complicated by routers having many interfaces on which address resolution must be performed or with the presence of IEEE 802.1Q domains, where individual VLANs effectively form their own L2 broadcast domains. Thus, the scope of address resolution spans both the L2 link and the devices attached to those links.

This document identifies potential issues associated with address resolution in data centers with a large number of hosts. The scope of this document is intentionally relatively narrow, as it mirrors the Address Resolution for Massive numbers of hosts in the Data center (ARMD) WG charter. This document lists "pain points" that are being experienced in current data centers. The goal of this document is to focus on address resolution issues and not other broader issues that might arise in data centers.

2. Terminology

Address Resolution: The process of determining the link-layer address corresponding to a given IP address. In IPv4, address resolution is performed by ARP [RFC0826]; in IPv6, it is provided by Neighbor Discovery (ND) [RFC4861].

Application: Software that runs on either a physical or virtual machine, providing a service (e.g., web server, database server, etc.).

L2 Broadcast Domain: The set of all links, repeaters, and switches that are traversed to reach all nodes that are members of a given L2 broadcast domain. In IEEE 802.1Q networks, a broadcast domain corresponds to a single VLAN.

Host (or server): A computer system on the network.

Hypervisor: Software running on a host that allows multiple VMs to run on the same host.

Virtual Machine (VM): A software implementation of a physical machine that runs programs as if they were executing on a physical, non-virtualized machine. Applications (generally) do not know they are running on a VM as opposed to running on a "bare" host or server, though some systems provide a paravirtualization environment that allows an operating system or application to be aware of the presence of virtualization for optimization purposes.

ToR: Top-of-Rack Switch. A switch placed in a single rack to aggregate network connectivity to and from hosts in that rack.

EoR: End-of-Row Switch. A switch used to aggregate network connectivity from multiple racks. EoR switches are the next level of switching above ToR switches.

3. Background

Large, flat L2 networks have long been known to have scaling problems. As the size of an L2 broadcast domain increases, the level of broadcast traffic from protocols like ARP increases. Large amounts of broadcast traffic pose a particular burden because every device (switch, host, and router) must process and possibly act on such traffic. In extreme cases, "broadcast storms" can occur where the quantity of broadcast traffic reaches a level that effectively brings down part or all of a network. For example, poor implementations of loop detection and prevention or misconfiguration errors can create conditions that lead to broadcast storms as network conditions change.

The conventional wisdom for addressing such problems has been to say "don't do that". That is, split large L2 networks into multiple smaller L2 networks, each operating as its own L3/IP subnet. Numerous data center networks have been designed with this principle, e.g., with each rack placed within its own L3 IP subnet. By doing so, the broadcast domain (and address resolution) is confined to one ToR switch, which works well from a scaling perspective. Unfortunately, this conflicts in some ways with the current trend towards dynamic workload shifting in data centers and increased virtualization, as discussed below.

Workload placement has become a challenging task within data centers. Ideally, it is desirable to be able to dynamically reassign workloads within a data center in order to optimize server utilization, add more servers in response to increased demand, etc. However, servers are often pre-configured to run with a given set of IP addresses. Placement of such servers is then subject to constraints of the IP addressing restrictions of the data center. For example, servers configured with addresses from a particular subnet could only be placed where they connect to the IP subnet corresponding to their IP addresses. If each ToR switch is acting as a gateway for its own subnet, a server can only be connected to the one ToR switch. This gateway switch represents the L2/L3 boundary. A similar constraint occurs in virtualized environments, as discussed next.

Server virtualization is fast becoming the norm in data centers. With server virtualization, each physical server supports multiple virtual machines, each running its own operating system, middleware, and applications. Virtualization is a key enabler of workload agility, i.e., allowing any server to host any application (on its own VM) and providing the flexibility of adding, shrinking, or moving VMs within the physical infrastructure. Server virtualization provides numerous benefits, including higher utilization, increased data security, reduced user downtime, and even significant power conservation, along with the promise of a more flexible and dynamic computing environment.

The discussion below focuses on VM placement and migration. Keep in mind, however, that even in a non-virtualized environment, many of the same issues apply to individual workloads running on standalone machines. For example, when increasing the number of servers running a particular workload to meet demand, placement of those workloads may be constrained by IP subnet numbering considerations, as discussed earlier.

The greatest flexibility in VM and workload management occurs when it is possible to place a VM (or workload) anywhere in the data center regardless of what IP addresses the VM uses and how the physical network is laid out. In practice, movement of VMs within a data center is easiest when VM placement and movement do not conflict with the IP subnet boundaries of the data center's network, so that the VM's IP address need not be changed to reflect its actual point of attachment on the network from an L3/IP perspective. In contrast, if a VM moves to a new IP subnet, its address must change, and clients will need to be made aware of that change. From a VM management perspective, management is simplified if all servers are on a single large L2 network.

With virtualization, it is not uncommon to have a single physical server host ten or more VMs, each having its own IP (and Media Access Control (MAC)) addresses. Consequently, the number of addresses per machine (and hence per subnet) is increasing, even when the number of physical machines stays constant. In a few years, the numbers will likely be even higher.

In the past, applications were static in the sense that they tended to stay in one physical place. An application installed on a physical machine would stay on that machine because the cost of moving an application elsewhere was generally high. Moreover, physical servers hosting applications would tend to be placed in such a way as to facilitate communication locality. That is, applications running on servers would be physically located near the servers hosting the applications they communicated with most heavily. The network traffic patterns in such environments could thus be optimized, in some cases keeping significant traffic local to one network segment. In these more static and carefully managed environments, it was possible to build networks that approached scaling limitations but did not actually cross the threshold.

Today, with the proliferation of VMs, traffic patterns are becoming more diverse and less predictable. In particular, there can easily be less locality of network traffic as VMs hosting applications are moved for such reasons as reducing overall power usage (by consolidating VMs and powering off idle machines) or moving a VM to a physical server with more capacity or a lower load. In today's changing environments, it is becoming more difficult to engineer networks as traffic patterns continually shift as VMs move around.

In summary, both the size and density of L2 networks are increasing. In addition, increasingly dynamic workloads and the increased usage of VMs are creating pressure for ever-larger L2 networks. Today, there are already data centers with over 100,000 physical machines and many times that number of VMs. This number will only increase going forward. In addition, traffic patterns within a data center are also constantly changing. Ultimately, the issues described in this document might be observed at any scale, depending on the particular design of the data center.

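The scaling pressure summarized above can be made concrete with a back-of-envelope calculation. The numbers below are illustrative assumptions (the per-endpoint query rate in particular is not a measurement from this document); the point is only that in a single flat L2 domain every node sees the broadcast ARP traffic of every other endpoint:

```python
# Back-of-envelope estimate of broadcast ARP load in one flat L2
# domain.  All figures are illustrative assumptions, not data from
# this document.
hosts = 100_000            # physical machines in the broadcast domain
vms_per_host = 10          # VMs per machine, each with its own IP/MAC
queries_per_minute = 1.0   # assumed ARP Requests per endpoint per minute

endpoints = hosts * vms_per_host
broadcasts_per_second = endpoints * queries_per_minute / 60.0
print(f"{broadcasts_per_second:.0f} broadcast ARP frames/s "
      "seen (and processed) by every node in the domain")
```

Even at one query per endpoint per minute, a million endpoints generate tens of thousands of broadcast frames per second, each of which must be examined by every host, switch, and router in the domain.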

4. Address Resolution in IPv4

In IPv4 over Ethernet, ARP provides the function of address resolution. To determine the link-layer address of a given IP address, a node broadcasts an ARP Request. The request is delivered to all portions of the L2 network, and the node with the requested IP address responds with an ARP Reply. ARP is an old protocol and, by current standards, is sparsely documented. For example, there are no clear requirements for retransmitting ARP Requests in the absence of replies. Consequently, implementations vary in the details of what they actually implement [RFC0826][RFC1122].

From a scaling perspective, there are a number of problems with ARP. First, it uses broadcast, and any network with a large number of attached hosts will see a correspondingly large amount of broadcast ARP traffic. The second problem is that it is not feasible to change host implementations of ARP -- current implementations are too widely entrenched, and any changes to host implementations of ARP would take years to become sufficiently deployed to matter. That said, it may be possible to change ARP implementations in hypervisors, L2/L3 boundary routers, and/or ToR access switches, to leverage such techniques as Proxy ARP. Finally, ARP implementations need to take steps to flush out stale or otherwise invalid entries.

Unfortunately, existing standards do not provide clear implementation guidelines for how to do this. Consequently, implementations vary significantly, and some implementations are "chatty" in that they just periodically flush caches every few minutes and send new ARP queries.

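The cache-management behavior described above can be illustrated with a minimal sketch. This is not code from any particular implementation; the 60-second timeout and the flush-on-lookup policy are assumptions chosen for illustration, and, as noted, real implementations vary widely in exactly these details:

```python
import time

class ArpCache:
    """Toy ARP cache illustrating expiry of stale entries.

    The timeout value and flush-on-lookup policy are illustrative
    assumptions; existing standards mandate neither, and
    implementations differ.
    """

    def __init__(self, timeout=60.0):
        self.timeout = timeout
        self.entries = {}  # ip -> (mac, time the entry was learned)

    def add(self, ip, mac, now=None):
        self.entries[ip] = (mac, time.monotonic() if now is None else now)

    def lookup(self, ip, now=None):
        now = time.monotonic() if now is None else now
        entry = self.entries.get(ip)
        if entry is None:
            return None  # cache miss: caller must broadcast an ARP Request
        mac, learned = entry
        if now - learned > self.timeout:
            del self.entries[ip]  # stale: flush and force re-resolution
            return None
        return mac
```

A "chatty" implementation of the kind mentioned above would instead discard the whole table every few minutes and re-query, multiplying the broadcast load.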

5. Address Resolution in IPv6

Broadly speaking, from the perspective of address resolution, IPv6's Neighbor Discovery (ND) behaves much like ARP, with a few notable differences. First, ARP uses broadcast, whereas ND uses multicast. When querying for a target IP address, ND maps the target address into an IPv6 Solicited Node multicast address. Using multicast rather than broadcast has the benefit that the multicast frames do not necessarily need to be sent to all parts of the network, i.e., the frames can be sent only to segments where listeners for the Solicited Node multicast address reside. In the case where multicast frames are delivered to all parts of the network, sending to a multicast address still has the advantage that most (if not all) nodes will filter out the (unwanted) multicast query via filters installed in the Network Interface Card (NIC) rather than burdening host software with the need to process such packets. Thus, whereas all nodes must process every ARP query, ND queries are processed only by the nodes to which they are intended. In cases where multicast filtering can't effectively be implemented in the NIC (e.g., as on hypervisors supporting virtualization), filtering would need to be done in software (e.g., in the hypervisor's vSwitch).

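The mapping from a target address to its Solicited-Node multicast address is mechanical: the fixed prefix ff02::1:ff00:0/104 is combined with the low-order 24 bits of the target address (per RFC 4291). A short sketch of the mapping:

```python
import ipaddress

def solicited_node_multicast(target):
    """Return the IPv6 Solicited-Node multicast address for a target
    address: the prefix ff02::1:ff00:0/104 plus the target's
    low-order 24 bits (RFC 4291)."""
    low24 = int(ipaddress.IPv6Address(target)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)

# Addresses that happen to share their low 24 bits map to the same
# group, so a node may occasionally receive a solicitation meant for
# another node and must still check the target in the ND payload.
print(solicited_node_multicast("2001:db8::4a5b:6c7d"))  # ff02::1:ff5b:6c7d
```

Because only nodes subscribed to the corresponding multicast group need to process the frame, a query for one target does not interrupt every host on the link in the way a broadcast ARP Request does.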

6. Generalized Data Center Design

There are many different ways in which data center networks might be designed. The designs are usually engineered to suit the particular workloads that are being deployed in the data center. For example, a large web server farm might be engineered in a very different way than a general-purpose multi-tenant cloud hosting service. However, in most cases the designs can be abstracted into a typical three-layer model consisting of an access layer, an aggregation layer, and the Core. The access layer generally refers to the switches that are closest to the physical or virtual servers; the aggregation layer serves to interconnect multiple access-layer devices. The Core switches connect the aggregation switches to the larger network core.

Figure 1 shows a generalized data center design, which captures the essential elements of various alternatives.

                  +-----+-----+     +-----+-----+
                  |   Core0   |     |    Core1  |      Core
                  +-----+-----+     +-----+-----+
                        /    \        /       /
                       /      \----------\   /
                      /    /---------/    \ /
                    +-------+           +------+
                  +/------+ |         +/-----+ |
                  | Aggr11| + --------|AggrN1| +      Aggregation Layer
                  +---+---+/          +------+/
                    /     \            /      \
                   /       \          /        \
                 +---+    +---+      +---+     +---+
                 |T11|... |T1x|      |TN1|     |TNy|  Access Layer
                 +---+    +---+      +---+     +---+
                 |   |    |   |      |   |     |   |
                 +---+    +---+      +---+     +---+
                 |   |... |   |      |   |     |   |
                 +---+    +---+      +---+     +---+  Server Racks
                 |   |... |   |      |   |     |   |
                 +---+    +---+      +---+     +---+
                 |   |... |   |      |   |     |   |
                 +---+    +---+      +---+     +---+
        

Typical Layered Architecture in a Data Center

Figure 1

6.1. Access Layer

The access switches provide connectivity directly to/from physical and virtual servers. The access layer may be implemented by wiring the servers within a rack to a ToR switch or, less commonly, the servers could be wired directly to an EoR switch. A server rack may have a single uplink to one access switch or may have dual uplinks to two different access switches.

6.2. Aggregation Layer

In a typical data center, aggregation switches interconnect many ToR switches. Usually, there are multiple parallel aggregation switches, serving the same group of ToRs to achieve load sharing. It is no longer uncommon to see aggregation switches interconnecting hundreds of ToR switches in large data centers.

6.3. Core

Core switches provide connectivity between aggregation switches and the main data center network. Core switches interconnect different sets of racks and provide connectivity to data center gateways leading to external networks.

6.4. L3/L2 Topological Variations
6.4.1. L3 to Access Switches

In this scenario, the L3 domain is extended all the way from the core network to the access switches. Each rack enclosure consists of a single L2 domain, which is confined to the rack. In general, there are no significant ARP/ND scaling issues in this scenario, as the L2 domain cannot grow very large. Such a topology has benefits in scenarios where servers attached to a particular access switch generally run VMs that are confined to using a single subnet. These VMs and the applications they host aren't moved (migrated) to other racks that might be attached to different access switches (and different IP subnets). A small server farm or very static compute cluster might be well served via this design.

6.4.2. L3 to Aggregation Switches

When the L3 domain extends only to the aggregation switches, hosts in any of the IP subnets configured on the aggregation switches are reachable at L2 through any access switch, provided the access switches enable all the VLANs. Such a topology allows a greater level of flexibility, as servers attached to any access switch can run any VMs that have been provisioned with IP addresses configured on the aggregation switches. In such an environment, VMs can migrate between racks without IP address changes. The drawback of this design, however, is that multiple VLANs have to be enabled on all access switches and on all access-facing ports of the aggregation switches. Even though L2 traffic is still partitioned by VLANs, enabling all VLANs on all ports can lead to broadcast traffic on all VLANs traversing all links and ports, which has the same effect as one big L2 domain on the access-facing side of the aggregation switch. In addition, the internal traffic itself might have to cross different L2 boundaries, resulting in significant ARP/ND load at the aggregation switches. This design provides a good tradeoff between flexibility and L2 domain size. A moderate-sized data center might utilize this approach to provide high-availability services at a single location.

6.4.3. L3 in the Core Only

In some cases, where a wider range of VM mobility is desired (i.e., a greater number of racks among which VMs can move without IP address changes), the L3 routed domain might be terminated at the core routers themselves. In this case, VLANs can span multiple groups of aggregation switches, which allows hosts to be moved among a greater number of server racks without IP address changes. This scenario results in the largest ARP/ND performance impact, as explained later. A data center with very rapid workload shifting may consider this kind of design.

6.4.4. Overlays

There are several approaches where overlay networks can be used to build very large L2 networks to enable VM mobility. Overlay networks using various L2 or L3 mechanisms allow interior switches/routers to mask host addresses. In addition, L3 overlays can help the data center designer control the size of the L2 domain and also enhance the ability to provide multi-tenancy in data center networks. However, the use of overlays does not eliminate traffic associated with address resolution; it simply moves it to regular data traffic. That is, address resolution is implemented in the overlay and is not directly visible to the switches of the data center network.

A potential problem that arises in a large data center is that when a large number of hosts communicate with their peers in different subnets, all these hosts send (and receive) data packets to their respective L2/L3 boundary nodes, as the traffic flows are generally bidirectional. This has the potential to further highlight any scaling problems. These L2/L3 boundary nodes have to process ARP/ND requests sent from originating subnets and resolve physical (MAC) addresses in the target subnets for what are generally bidirectional flows. Therefore, for maximum flexibility in managing the data center workload, it is often desirable to use overlays to place related groups of hosts in the same topological subnet to avoid the L2/L3 boundary translation. The use of overlays in the data center network can be a useful design mechanism to help manage a potential bottleneck at the L2/L3 boundary by redefining where that boundary exists.

6.5. Factors That Affect Data Center Design
6.5.1. Traffic Patterns

Expected traffic patterns play an important role in designing appropriately sized access, aggregation, and core networks. Traffic patterns also vary based on the expected use of the data center.

Broadly speaking, it is desirable to keep as much traffic as possible on the access layer in order to minimize the bandwidth usage at the aggregation layer. If the expected use of the data center is to serve as a large web server farm, where thousands of nodes are doing similar things and the traffic pattern is largely in and out of a large data center, an access layer with EoR switches might be used, as it minimizes complexity, allows for servers and databases to be located in the same L2 domain, and provides for maximum density.

A data center that is expected to host a multi-tenant cloud hosting service might have some completely unique requirements. In order to isolate inter-customer traffic, smaller L2 domains might be preferred, and though the size of the overall data center might be comparable to the previous example, the multi-tenant nature of the cloud hosting application requires a smaller and more compartmentalized access layer. A multi-tenant environment might also require the use of L3 all the way to the access-layer ToR switch.

Yet another example of a workload with a unique traffic pattern is a high-performance compute cluster, where most of the traffic is expected to stay within the cluster but at the same time there is a high degree of crosstalk between the nodes. This would once again call for a large access layer in order to minimize the requirements at the aggregation layer.

6.5.2. Virtualization

Using virtualization in the data center further serves to increase the possible densities that can be achieved. However, virtualization also further complicates the requirements on the access layer, as virtualization restricts the scope of server placement in the event of server failover resulting from hardware failures or server migration for load balancing or other reasons.

Virtualization also can place additional requirements on the aggregation switches in terms of address resolution table size and the scalability of any address-learning protocols that might be used on those switches. The use of virtualization often also requires the use of additional VLANs for high-availability beaconing, which would need to span the entire virtualized infrastructure. This would require the access layer to also span the entire virtualized infrastructure.

6.5.3. Summary

The designs described in this section have a number of tradeoffs. The "L3 to access switches" design described in Section 6.4.1 is the only design that constrains L2 domain size in a fashion that avoids ARP/ND scaling problems. However, that design has limitations and does not address some of the other requirements that lead to configurations that make use of larger L2 domains. Consequently, ARP/ND scaling issues are a real problem in practice.

7. Problem Itemization

This section articulates some specific problems or "pain points" that are related to large data centers.

7.1. ARP Processing on Routers

One pain point with large L2 broadcast domains is that the routers connected to the L2 domain may need to process a significant amount of ARP traffic in some cases. In particular, environments where the aggregate level of ARP traffic is very large may lead to a heavy ARP load on routers. Even though the vast majority of ARP traffic may not be aimed at that router, the router still has to process enough of the ARP Request to determine whether it can safely be ignored. The ARP algorithm specifies that a recipient must update its ARP cache if it receives an ARP query from a source for which it has an entry [RFC0826].

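The cache-update rule from [RFC0826] can be sketched as follows. Note how the sender fields of every ARP packet must be examined, and possibly merged into the cache, before the receiver can decide whether the packet is addressed to it. The function and field names below are illustrative only, not taken from any real router implementation:

```python
def process_arp_packet(arp_cache, my_ip, pkt):
    """Sketch of the RFC 826 receive algorithm.

    'pkt' is a dict with 'sender_ip', 'sender_mac', 'target_ip', 'op'.
    Returns a reply packet (dict) if one should be sent, else None.
    """
    merge_flag = False
    # Step 1: if an entry already exists for the sender, update it --
    # this happens even for packets the receiver will otherwise ignore.
    if pkt["sender_ip"] in arp_cache:
        arp_cache[pkt["sender_ip"]] = pkt["sender_mac"]
        merge_flag = True
    # Step 2: only now check whether the packet targets this node.
    if pkt["target_ip"] != my_ip:
        return None                       # ignored, but already inspected
    # Step 3: we are the target; create an entry if none existed.
    if not merge_flag:
        arp_cache[pkt["sender_ip"]] = pkt["sender_mac"]
    if pkt["op"] == "request":
        return {"op": "reply", "sender_ip": my_ip,
                "target_ip": pkt["sender_ip"],
                "target_mac": pkt["sender_mac"]}
    return None
```

Because Step 1 precedes the target check, a router on a large L2 domain performs cache work for broadcast ARP Requests that are not even aimed at it.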

ARP processing in routers is commonly handled in a "slow path" software processor, rather than directly by a hardware Application-Specific Integrated Circuit (ASIC) as is the case when forwarding packets. Such a design significantly limits the rate at which ARP traffic can be processed compared to the rate at which ASICs can forward traffic. Current implementations at the time of this writing can support ARP processing in the low thousands of ARP packets per second. In some deployments, limitations on the rate of ARP processing have been cited as being a problem.

To further reduce the ARP load, some routers have implemented additional optimizations in their forwarding ASIC paths. For example, some routers can be configured to discard ARP Requests for target addresses other than those assigned to the router. That way, the router's software processor only receives ARP Requests for addresses it owns and must respond to. This can significantly reduce the number of ARP Requests that must be processed by the router.

它拥有的地址和必须响应的地址。这可以显著减少路由器必须处理的ARP请求数量。

Another optimization concerns reducing the number of ARP queries targeted at routers, whether for address resolution or to validate existing cache entries. Some routers can be configured to broadcast periodic gratuitous ARPs [RFC5227]. Upon receipt of a gratuitous ARP, implementations mark the associated entry as "fresh", resetting the aging timer to its maximum setting. Consequently, sending out periodic gratuitous ARPs can effectively prevent nodes from needing to send ARP Requests intended to revalidate stale entries for a router. The net result is an overall reduction in the number of ARP queries routers receive. Gratuitous ARPs, broadcast to all nodes in the L2 broadcast domain, may in some cases also pre-populate ARP caches on neighboring devices, further reducing ARP traffic. But it is not believed that pre-population of ARP entries is supported by most implementations, as the ARP specification [RFC0826] recommends only that pre-existing ARP entries be updated upon receipt of ARP messages; it does not call for the creation of new entries when none already exist.

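The refresh behavior just described can be modeled as follows. This is a simplified sketch, not any particular router's code; note that an unknown sender is ignored rather than added, matching the conservative reading of [RFC0826] above:

```python
import time

ARP_MAX_AGE = 1200  # aging timer in seconds; the value is illustrative


def handle_gratuitous_arp(arp_cache, sender_ip, sender_mac):
    """Refresh an existing cache entry on receipt of a gratuitous ARP.

    Returns True if an entry was marked "fresh" (aging timer reset to
    its maximum), False if the sender was unknown.  Most implementations
    do not pre-populate, so no entry is created in the latter case.
    """
    if sender_ip in arp_cache:
        arp_cache[sender_ip] = {"mac": sender_mac,
                                "expires": time.time() + ARP_MAX_AGE}
        return True   # neighbor need not send a revalidation query
    return False
```

A router broadcasting such gratuitous ARPs periodically keeps its entry "fresh" in every neighbor's cache, suppressing the revalidation queries it would otherwise receive.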

Finally, another area concerns the overhead of processing IP packets for which no ARP entry exists. Existing standards specify that one or more IP packets for which no ARP entries exist should be queued pending successful completion of the address resolution process [RFC1122] [RFC1812]. Once an ARP query has been resolved, any queued packets can be forwarded on. Again, the processing of such packets is handled in the "slow path", effectively limiting the rate at which a router can process ARP "cache misses", and is viewed as a problem in some deployments today. Additionally, if no response is received, the router may send the ARP/ND query multiple times. If no response is received after a number of ARP/ND requests, the router needs to drop any queued data packets and may send an ICMP destination unreachable message as well [RFC0792]. This entire process can be CPU intensive.

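The slow-path sequence above can be sketched roughly as follows, using illustrative names and a made-up retry limit (actual limits and timer handling vary by implementation):

```python
from collections import deque

MAX_ARP_RETRIES = 3   # illustrative; real routers differ


def forward_or_queue(neigh, pending, dst_ip, packet):
    """Model of cache-miss handling per RFC 1122 / RFC 1812.

    'neigh' maps IP -> MAC (the resolved cache); 'pending' maps
    IP -> (retries, queue of held packets).  Returns an action string
    so the control flow is visible.
    """
    if dst_ip in neigh:
        return "forward"                  # fast path: ASIC forwards
    retries, queue = pending.get(dst_ip, (0, deque()))
    queue.append(packet)                  # hold packet(s) pending resolution
    if retries >= MAX_ARP_RETRIES:
        pending.pop(dst_ip, None)         # give up: drop queued packets
        return "icmp-dest-unreachable"    # per RFC 792
    pending[dst_ip] = (retries + 1, queue)
    return "arp-request-sent"             # resolver (re)sends a query
```

Every path through this function except "forward" runs on the software processor, which is why a burst of cache misses can be far more expensive than the same number of forwarded packets.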

Although address resolution traffic remains local to one L2 network, some data center designs terminate L2 domains at individual aggregation switches/routers (e.g., see Section 6.4.2). Such routers can be connected to a large number of interfaces (e.g., 100 or more). While the address resolution traffic on any one interface may be manageable, the aggregate address resolution traffic across all interfaces can become problematic.

Another variant of the above issue has individual routers servicing a relatively small number of interfaces, with the individual interfaces themselves serving very large subnets. Once again, it is the aggregate quantity of ARP traffic seen across all of the router's interfaces that can be problematic. This pain point is essentially the same as the one discussed above, the only difference being whether a given number of hosts are spread across a few large IP subnets or many smaller ones.

可能有问题的接口。这个痛点与上面讨论的痛点基本相同,唯一的区别在于给定数量的主机是分布在几个大型IP子网还是许多较小的IP子网上。

When hosts in two different subnets under the same L2/L3 boundary router need to communicate with each other, the L2/L3 router not only has to initiate ARP/ND requests to the target's subnet, it also has to process the ARP/ND requests from the originating subnet. This process further adds to the overall ARP processing load.

7.2. IPv6 Neighbor Discovery

Though IPv6's Neighbor Discovery behaves much like ARP, there are several notable differences that result in a different set of potential issues. From an L2 perspective, an important difference is that ND address resolution requests are sent via multicast, which results in ND queries only being processed by the nodes for which they are intended. Compared with broadcast ARPs, this reduces the total number of ND packets that an implementation will receive.

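The multicast behavior comes from the solicited-node multicast address: a Neighbor Solicitation for a target is sent to a group derived from the low-order 24 bits of the target address, so only hosts whose addresses share those bits (usually just the target itself) receive the query at all. A small sketch of the derivation:

```python
import ipaddress


def solicited_node_mcast(target: str) -> str:
    """Map an IPv6 address to its solicited-node multicast group,
    ff02::1:ffXX:XXXX, where XX:XXXX are the target's low 24 bits."""
    low24 = int(ipaddress.IPv6Address(target)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return str(ipaddress.IPv6Address(base | low24))
```

A NIC can be programmed to accept only the handful of multicast groups its own addresses map to, which is why ND queries, unlike broadcast ARP, are not delivered to and processed by every node on the link.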

Another key difference concerns revalidating stale ND entries. ND requires that nodes periodically revalidate any entries they are using, to ensure that bad entries are timed out quickly enough that TCP does not terminate a connection. Consequently, some implementations will send out "probe" ND queries to validate in-use ND entries as frequently as every 35 seconds [RFC4861]. Such probes are sent via unicast (unlike in the case of ARP). However, on larger networks, such probes can result in routers receiving many such queries (i.e., many more than with ARP, which does not specify such behavior). Unfortunately, the IPv4 mitigation technique of sending gratuitous ARPs (as described in Section 7.1) does not work in IPv6. The ND specification specifically states that gratuitous ND "updates" cannot cause an ND entry to be marked "valid". Rather, such entries are marked "probe", which causes the receiving node to (eventually) generate a probe back to the sender, which in this case is precisely the behavior that the router is trying to prevent!

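The roughly 35-second figure above corresponds to the default ReachableTime (a base of 30 seconds) plus DELAY_FIRST_PROBE_TIME (5 seconds) in [RFC4861]. A minimal sketch of the entry states that drive this probing follows; it is a simplified model for illustration, not a complete NUD implementation:

```python
REACHABLE_TIME = 30      # seconds (base value; the actual value is randomized)
DELAY_FIRST_PROBE = 5    # seconds


def on_timer(entry):
    """Age an entry one step; return the unicast probe to send, if any."""
    if entry["state"] == "REACHABLE":
        entry["state"] = "STALE"          # reachability information aged out
        return None
    if entry["state"] == "DELAY":
        entry["state"] = "PROBE"          # no upper-layer confirmation arrived
        return "unicast-neighbor-solicitation"
    if entry["state"] == "PROBE":
        return "unicast-neighbor-solicitation"   # retransmit the probe
    return None


def on_send_traffic(entry):
    """Using a STALE entry starts the delay/probe cycle."""
    if entry["state"] == "STALE":
        entry["state"] = "DELAY"


def on_upper_layer_confirmation(entry):
    """RFC 4861, Section 7.3.1: e.g., a TCP ACK confirms forward
    reachability and suppresses the probe entirely."""
    entry["state"] = "REACHABLE"
```

On a busy router with thousands of in-use neighbor entries, each entry cycles through these states independently, and every cycle that lacks an upper-layer confirmation costs a unicast probe.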

Routers implementing Neighbor Unreachability Discovery (NUD) (for neighboring destinations) will need to process neighbor cache state changes such as transitioning entries from REACHABLE to STALE. How this capability is implemented may impact the scalability of ND on a router. For example, one possible implementation is to have the forwarding operation detect when an ND entry is referenced that needs to transition from REACHABLE to STALE, by signaling an event that would need to be processed by the software processor. Such an implementation could increase the load on the service processor in much the same way that high rates of ARP requests have led to problems on some routers.

It should be noted that ND does not require the sending of probes in all cases. Section 7.3.1 of [RFC4861] describes a technique whereby hints from TCP can be used to verify that an existing ND entry is working fine and does not need to be revalidated.

Finally, IPv6 and IPv4 are often run simultaneously and in parallel on the same network, i.e., in dual-stack mode. In such environments, the IPv4 and IPv6 issues enumerated above compound each other.

7.3. MAC Address Table Size Limitations in Switches

L2 switches maintain L2 MAC address forwarding tables for all sources and destinations traversing the switch. These tables are populated through learning and are used to forward L2 frames to their correct destination. The larger the L2 domain, the larger the tables have to be. While in theory a switch only needs to keep track of addresses it is actively using (sometimes called "conversational learning"), switches flood broadcast frames (e.g., from ARP), multicast frames (e.g., from Neighbor Discovery), and unicast frames to unknown destinations. Switches add entries for the source addresses of such flooded frames to their forwarding tables. Consequently, MAC address table size can become a problem as the size of the L2 domain increases. The table size problem is made worse with VMs, where a single physical machine now hosts many VMs (in the 10's today, but growing rapidly as the number of cores per CPU increases), since each VM has its own MAC address that is visible to switches.

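A toy model of the learning-and-flooding behavior makes the table-growth effect concrete. This is illustrative only; real switches implement learning in hardware against fixed-size tables:

```python
class L2Switch:
    """Toy model of source-address learning and unknown-unicast flooding."""

    def __init__(self, ports):
        self.mac_table = {}        # MAC -> port; bounded in real ASICs
        self.ports = set(ports)

    def receive(self, in_port, src_mac, dst_mac):
        """Return the set of ports the frame is sent out on."""
        # Learning: every frame seen -- including flooded ones from
        # elsewhere in the L2 domain -- adds its source MAC to the
        # table.  This is what grows with L2 domain size.
        self.mac_table[src_mac] = in_port
        if dst_mac == "ff:ff:ff:ff:ff:ff" or dst_mac not in self.mac_table:
            return self.ports - {in_port}      # flood to all other ports
        return {self.mac_table[dst_mac]}       # known unicast: one port
```

Because broadcast ARP, multicast ND, and unknown-unicast frames are all flooded, every switch in the domain ends up learning the source MAC of every active host and VM, not just the ones it converses with.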

When L3 extends all the way to access switches (see Section 6.4.1), the size of MAC address tables in switches is not generally a problem. When L3 extends only to aggregation switches (see Section 6.4.2), however, MAC table size limitations can be a real issue.

8. Summary

This document has outlined a number of issues related to address resolution in large data centers. In particular, this document has described different scenarios where such issues might arise and what these potential issues are, along with outlining fundamental factors that cause them. It is hoped that describing specific pain points will facilitate a discussion as to whether they should be addressed and how best to address them.

9. Acknowledgments

This document has been significantly improved by comments from Manav Bhatia, David Black, Stewart Bryant, Ralph Droms, Linda Dunbar, Donald Eastlake, Wesley Eddy, Anoop Ghanwani, Joel Halpern, Sue Hares, Pete Resnick, Benson Schliesser, T. Sridhar, and Lucy Yong. Igor Gashinsky deserves additional credit for highlighting some of the ARP-related pain points and for clarifying the difference between what the standards require and what some router vendors have actually implemented in response to operator requests.

10. Security Considerations

This document describes existing problems and does not itself introduce any new security implications. The security vulnerabilities in ARP are well known, and this document does not change or mitigate them in any way. Security considerations for Neighbor Discovery are discussed in [RFC4861] and [RFC6583].

11. Informative References

[RFC0792] Postel, J., "Internet Control Message Protocol", STD 5, RFC 792, September 1981.

[RFC0826] Plummer, D., "Ethernet Address Resolution Protocol: Or converting network protocol addresses to 48.bit Ethernet address for transmission on Ethernet hardware", STD 37, RFC 826, November 1982.

[RFC1122] Braden, R., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, October 1989.

[RFC1812] Baker, F., "Requirements for IP Version 4 Routers", RFC 1812, June 1995.

[RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, September 2007.

[RFC5227] Cheshire, S., "IPv4 Address Conflict Detection", RFC 5227, July 2008.

[RFC6583] Gashinsky, I., Jaeggli, J., and W. Kumari, "Operational Neighbor Discovery Problems", RFC 6583, March 2012.

Authors' Addresses

Thomas Narten IBM Corporation 3039 Cornwallis Ave. PO Box 12195 Research Triangle Park, NC 27709-2195 USA

   EMail: narten@us.ibm.com
        

Manish Karir Merit Network Inc.

   EMail: mkarir@merit.edu
        

Ian Foo Huawei Technologies

   EMail: Ian.Foo@huawei.com
        