Internet Engineering Task Force (IETF)                       P. Lapukhov
Request for Comments: 7938                                      Facebook
Category: Informational                                        A. Premji
ISSN: 2070-1721                                          Arista Networks
                                                        J. Mitchell, Ed.
                                                             August 2016
        

Use of BGP for Routing in Large-Scale Data Centers

Abstract

Some network operators build and operate data centers that support over one hundred thousand servers. In this document, such data centers are referred to as "large-scale" to differentiate them from smaller infrastructures. Environments of this scale have a unique set of network requirements with an emphasis on operational simplicity and network stability. This document summarizes operational experience in designing and operating large-scale data centers using BGP as the only routing protocol. The intent is to report on a proven and stable routing design that could be leveraged by others in the industry.

Status of This Memo

This document is not an Internet Standards Track specification; it is published for informational purposes.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Not all documents approved by the IESG are a candidate for any level of Internet Standard; see Section 2 of RFC 7841.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc7938.

Copyright Notice

Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Network Design Requirements
     2.1.  Bandwidth and Traffic Patterns
     2.2.  CAPEX Minimization
     2.3.  OPEX Minimization
     2.4.  Traffic Engineering
     2.5.  Summarized Requirements
   3.  Data Center Topologies Overview
     3.1.  Traditional DC Topology
     3.2.  Clos Network Topology
       3.2.1.  Overview
       3.2.2.  Clos Topology Properties
       3.2.3.  Scaling the Clos Topology
       3.2.4.  Managing the Size of Clos Topology Tiers
   4.  Data Center Routing Overview
     4.1.  L2-Only Designs
     4.2.  Hybrid L2/L3 Designs
     4.3.  L3-Only Designs
   5.  Routing Protocol Design
     5.1.  Choosing EBGP as the Routing Protocol
     5.2.  EBGP Configuration for Clos Topology
       5.2.1.  EBGP Configuration Guidelines and Example ASN Scheme
       5.2.2.  Private Use ASNs
       5.2.3.  Prefix Advertisement
       5.2.4.  External Connectivity
       5.2.5.  Route Summarization at the Edge
   6.  ECMP Considerations
     6.1.  Basic ECMP
     6.2.  BGP ECMP over Multiple ASNs
     6.3.  Weighted ECMP
     6.4.  Consistent Hashing
        
   7.  Routing Convergence Properties
     7.1.  Fault Detection Timing
     7.2.  Event Propagation Timing
     7.3.  Impact of Clos Topology Fan-Outs
     7.4.  Failure Impact Scope
     7.5.  Routing Micro-Loops
   8.  Additional Options for Design
     8.1.  Third-Party Route Injection
     8.2.  Route Summarization within Clos Topology
       8.2.1.  Collapsing Tier 1 Devices Layer
       8.2.2.  Simple Virtual Aggregation
     8.3.  ICMP Unreachable Message Masquerading
   9.  Security Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Acknowledgements
   Authors' Addresses
        
1. Introduction

This document describes a practical routing design that can be used in a large-scale data center (DC) design. Such data centers, also known as "hyper-scale" or "warehouse-scale" data centers, have the unique attribute of supporting over a hundred thousand servers. In order to accommodate networks of this scale, operators are revisiting their networking designs and platforms.

The design presented in this document is based on operational experience with data centers built to support large-scale distributed software infrastructure, such as a web search engine. The primary requirements in such an environment are operational simplicity and network stability so that a small group of people can effectively support a significantly sized network.

Experimentation and extensive testing have shown that External BGP (EBGP) [RFC4271] is well suited as a stand-alone routing protocol for these types of data center applications. This is in contrast with more traditional DC designs, which may use simple tree topologies and rely on extending Layer 2 (L2) domains across multiple network devices. This document elaborates on the requirements that led to this design choice and presents details of the EBGP routing design as well as exploring ideas for further enhancements.

This document first presents an overview of network design requirements and considerations for large-scale data centers. Then, traditional hierarchical data center network topologies are contrasted with Clos networks [CLOS1953] that are horizontally scaled out. This is followed by arguments for selecting EBGP with a Clos topology as the most appropriate routing protocol to meet the requirements, and the proposed design is described in detail. Finally, this document reviews some additional considerations and design options. A thorough understanding of BGP is assumed of a reader planning on deploying the design described within the document.

2. Network Design Requirements

This section describes and summarizes network design requirements for large-scale data centers.

2.1. Bandwidth and Traffic Patterns

The primary requirement when building an interconnection network for a large number of servers is to accommodate application bandwidth and latency requirements. Until recently it was quite common to see the majority of traffic entering and leaving the data center, commonly referred to as "north-south" traffic. Traditional "tree" topologies were sufficient to accommodate such flows, even with high oversubscription ratios between the layers of the network. If more bandwidth was required, it was added by "scaling up" the network elements, e.g., by upgrading the device's linecards or fabrics or replacing the device with one with higher port density.

Today many large-scale data centers host applications generating significant amounts of server-to-server traffic, which does not egress the DC, commonly referred to as "east-west" traffic. Examples of such applications could be compute clusters such as Hadoop [HADOOP], massive data replication between clusters needed by certain applications, or virtual machine migrations. Scaling traditional tree topologies to match these bandwidth demands becomes either too expensive or impossible due to physical limitations, e.g., port density in a switch.

2.2. CAPEX Minimization

The Capital Expenditures (CAPEX) associated with the network infrastructure alone constitutes about 10-15% of total data center expenditure (see [GREENBERG2009]). However, the absolute cost is significant, and hence there is a need to constantly drive down the cost of individual network elements. This can be accomplished in two ways:

o Unifying all network elements, preferably using the same hardware type or even the same device. This allows for volume pricing on bulk purchases and reduced maintenance and inventory costs.

o Driving costs down using competitive pressures, by introducing multiple network equipment vendors.

In order to allow for good vendor diversity, it is important to minimize the software feature requirements for the network elements. This strategy provides maximum flexibility of vendor equipment choices while enforcing interoperability using open standards.

2.3. OPEX Minimization

Operating large-scale infrastructure can be expensive as a larger amount of elements will statistically fail more often. Having a simpler design and operating using a limited software feature set minimizes software issue-related failures.

An important aspect of Operational Expenditure (OPEX) minimization is reducing the size of failure domains in the network. Ethernet networks are known to be susceptible to broadcast or unicast traffic storms that can have a dramatic impact on network performance and availability. The use of a fully routed design significantly reduces the size of the data-plane failure domains, i.e., limits them to the lowest level in the network hierarchy. However, such designs introduce the problem of distributed control-plane failures. This observation calls for simpler and fewer control-plane protocols to reduce protocol interaction issues, reducing the chance of a network meltdown. Minimizing software feature requirements as described in the CAPEX section above also reduces testing and training requirements.

2.4. Traffic Engineering

In any data center, application load balancing is a critical function performed by network devices. Traditionally, load balancers are deployed as dedicated devices in the traffic forwarding path. The problem arises in scaling load balancers under growing traffic demand. A preferable solution would be able to scale the load-balancing layer horizontally, by adding more of the uniform nodes and distributing incoming traffic across these nodes. In situations like this, an ideal choice would be to use network infrastructure itself to distribute traffic across a group of load balancers. The combination of anycast prefix advertisement [RFC4786] and Equal Cost Multipath (ECMP) functionality can be used to accomplish this goal. To allow for more granular load distribution, it is beneficial for the network to support the ability to perform controlled per-hop traffic engineering. For example, it is beneficial to directly control the ECMP next-hop set for anycast prefixes at every level of the network hierarchy.

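As a purely illustrative sketch (not part of the original design; the function and addresses below are hypothetical), the ECMP distribution described above can be modeled as a deterministic hash of a flow's 5-tuple over the set of anycast next hops:

```python
import hashlib

def ecmp_next_hop(flow, next_hops):
    """Pick one equal-cost next hop for a flow.

    Hashing the flow 5-tuple keeps every packet of a flow on the
    same path while spreading distinct flows across the group.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.md5(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

# Hypothetical anycast VIP served by three load balancers.
balancers = ["10.0.1.1", "10.0.2.1", "10.0.3.1"]
flow = ("192.0.2.10", "198.51.100.5", 6, 49152, 80)  # src, dst, proto, sport, dport
choice = ecmp_next_hop(flow, balancers)
assert choice in balancers
# The mapping is stable: the same flow always takes the same path.
assert choice == ecmp_next_hop(flow, balancers)
```

Real routers perform an analogous hash in hardware; controlling the next-hop set at each hop (per REQ5 below) then directly shapes how anycast traffic is spread across the load-balancing layer.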

2.5. Summarized Requirements

This section summarizes the list of requirements outlined in the previous sections:

o REQ1: Select a topology that can be scaled "horizontally" by adding more links and network devices of the same type without requiring upgrades to the network elements themselves.

o REQ2: Define a narrow set of software features/protocols supported by a multitude of networking equipment vendors.

o REQ3: Choose a routing protocol that has a simple implementation in terms of programming code complexity and ease of operational support.

o REQ4: Minimize the failure domain of equipment or protocol issues as much as possible.

o REQ5: Allow for some traffic engineering, preferably via explicit control of the routing prefix next hop using built-in protocol mechanics.

3. Data Center Topologies Overview

This section provides an overview of two general types of data center designs -- hierarchical (also known as "tree-based") and Clos-based network designs.

3.1. Traditional DC Topology

In the networking industry, a common design choice for data centers typically looks like an (upside down) tree with redundant uplinks and three layers of hierarchy, namely the core, aggregation/distribution, and access layers (see Figure 1). To accommodate bandwidth demands, each higher layer, from the server towards DC egress or WAN, has higher port density and bandwidth capacity, with the core functioning as the "trunk" of the tree-based design. To keep terminology uniform and for comparison with other designs, in this document these layers will be referred to as Tier 1, Tier 2, and Tier 3 "tiers", instead of core, aggregation, or access layers.

             +------+  +------+
             |      |  |      |
             |      |--|      |           Tier 1
             |      |  |      |
             +------+  +------+
               |  |      |  |
     +---------+  |      |  +----------+
     | +-------+--+------+--+-------+  |
     | |       |  |      |  |       |  |
   +----+     +----+    +----+     +----+
   |    |     |    |    |    |     |    |
   |    |-----|    |    |    |-----|    | Tier 2
   |    |     |    |    |    |     |    |
   +----+     +----+    +----+     +----+
      |         |          |         |
      |         |          |         |
      | +-----+ |          | +-----+ |
      +-|     |-+          +-|     |-+    Tier 3
        +-----+              +-----+
         | | |                | | |
     <- Servers ->        <- Servers ->
        

Figure 1: Typical DC Network Topology

Unfortunately, as noted previously, it is not possible to scale a tree-based design to a large enough degree for handling large-scale designs due to the inability to acquire Tier 1 devices with a large enough port density to sufficiently scale Tier 2. Also, continuous upgrades or replacement of the upper-tier devices are required as deployment size or bandwidth requirements increase, which is operationally complex. For this reason, REQ1 is in place, eliminating this type of design from consideration.

3.2. Clos Network Topology

This section describes a common design for horizontally scalable topology in large-scale data centers in order to meet REQ1.

3.2.1. Overview

A common choice for a horizontally scalable topology is a folded Clos topology, sometimes called "fat-tree" (for example, [INTERCON] and [ALFARES2008]). This topology features an odd number of stages (sometimes known as "dimensions") and is commonly made of uniform elements, e.g., network switches with the same port count. Therefore, the choice of folded Clos topology satisfies REQ1 and facilitates REQ2. See Figure 2 below for an example of a folded 3-stage Clos topology (3 stages counting the Tier 2 stage twice, when tracing a packet flow):

   +-------+
   |       |----------------------------+
   |       |------------------+         |
   |       |--------+         |         |
   +-------+        |         |         |
   +-------+        |         |         |
   |       |--------+---------+-------+ |
   |       |--------+-------+ |       | |
   |       |------+ |       | |       | |
   +-------+      | |       | |       | |
   +-------+      | |       | |       | |
   |       |------+-+-------+-+-----+ | |
   |       |------+-+-----+ | |     | | |
   |       |----+ | |     | | |     | | |
   +-------+    | | |     | | |   ---------> M links
    Tier 1      | | |     | | |     | | |
              +-------+ +-------+ +-------+
              |       | |       | |       |
              |       | |       | |       | Tier 2
              |       | |       | |       |
              +-------+ +-------+ +-------+
                | | |     | | |     | | |
                | | |     | | |   ---------> N Links
                | | |     | | |     | | |
                O O O     O O O     O O O   Servers
        

Figure 2: 3-Stage Folded Clos Topology

This topology is often also referred to as a "Leaf and Spine" network, where "Spine" is the name given to the middle stage of the Clos topology (Tier 1) and "Leaf" is the name of input/output stage (Tier 2). For uniformity, this document will refer to these layers using the "Tier n" notation.

3.2.2. Clos Topology Properties

The following are some key properties of the Clos topology:

o The topology is fully non-blocking, or more accurately non-interfering, if M >= N, and oversubscribed by a factor of N/M otherwise. Here, M and N are the uplink and downlink port counts, respectively, of a Tier 2 switch as shown in Figure 2.

o Utilizing this topology requires control and data-plane support for ECMP with a fan-out of M or more.

o Tier 1 switches have exactly one path to every server in this topology. This is an important property that makes route summarization dangerous in this topology (see Section 8.2 below).

o Traffic flowing from server to server is load balanced over all available paths using ECMP.

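The first property above can be checked with a small sketch (illustrative only; the helper below is not from the original text):

```python
from fractions import Fraction

def oversubscription(m_uplinks, n_downlinks):
    """Oversubscription ratio of a Tier 2 switch with M uplinks and
    N downlinks: 1 when M >= N (non-interfering), N/M otherwise."""
    return max(Fraction(n_downlinks, m_uplinks), Fraction(1))

assert oversubscription(32, 32) == 1            # M = N: non-interfering
assert oversubscription(48, 32) == 1            # M > N: still non-interfering
assert oversubscription(16, 48) == Fraction(3)  # 3:1 oversubscribed
```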

3.2.3. Scaling the Clos Topology

A Clos topology can be scaled either by increasing network element port density or by adding more stages, e.g., moving to a 5-stage Clos, as illustrated in Figure 3 below:

                                      Tier 1
                                     +-----+
          Cluster                    |     |
 +----------------------------+   +--|     |--+
 |                            |   |  +-----+  |
 |                    Tier 2  |   |           |   Tier 2
 |                   +-----+  |   |  +-----+  |  +-----+
 |     +-------------| DEV |------+--|     |--+--|     |-------------+
 |     |       +-----|  C  |------+  |     |  +--|     |-----+       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     |       |              |                              |       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     | +-----------| DEV |------+  |     |  +--|     |-----------+ |
 |     | |     | +---|  D  |------+--|     |--+--|     |---+ |     | |
 |     | |     | |   +-----+  |   |  +-----+  |  +-----+   | |     | |
 |     | |     | |            |   |           |            | |     | |
 |   +-----+ +-----+          |   |  +-----+  |          +-----+ +-----+
 |   | DEV | | DEV |          |   +--|     |--+          |     | |     |
 |   |  A  | |  B  | Tier 3   |      |     |      Tier 3 |     | |     |
 |   +-----+ +-----+          |      +-----+             +-----+ +-----+
 |     | |     | |            |                            | |     | |
 |     O O     O O            |                            O O     O O
 |       Servers              |                              Servers
 +----------------------------+
        

Figure 3: 5-Stage Clos Topology

The small example of the topology in Figure 3 is built from devices with a port count of 4. In this document, one set of directly connected Tier 2 and Tier 3 devices along with their attached servers will be referred to as a "cluster". For example, DEV A, B, C, and D, and the servers that connect to DEV A and B, in Figure 3 form a cluster. The concept of a cluster may also be useful as a single deployment or maintenance unit that can be operated on at a different frequency than the entire topology.

In practice, Tier 3 of the network, which is typically Top-of-Rack switches (ToRs), is where oversubscription is introduced to allow for packaging of more servers in the data center while meeting the bandwidth requirements for different types of applications. The main reason to limit oversubscription at a single layer of the network is to simplify application development that would otherwise need to account for multiple bandwidth pools: within rack (Tier 3), between racks (Tier 2), and between clusters (Tier 1). Since oversubscription does not have a direct relationship to the routing design, it is not discussed further in this document.

3.2.4. Managing the Size of Clos Topology Tiers

If a data center network size is small, it is possible to reduce the number of switches in Tier 1 or Tier 2 of a Clos topology by a factor of two. To understand how this could be done, take Tier 1 as an example. Every Tier 2 device connects to a single group of Tier 1 devices. If half of the ports on each of the Tier 1 devices are not being used, then it is possible to reduce the number of Tier 1 devices by half and simply map two uplinks from a Tier 2 device to the same Tier 1 device that were previously mapped to different Tier 1 devices. This technique maintains the same bandwidth while reducing the number of elements in Tier 1, thus saving on CAPEX. The tradeoff, in this example, is the reduction of maximum DC size in terms of overall server count by half.
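
As a rough illustration of this tradeoff, the sketch below computes the maximum server count of a non-oversubscribed 5-stage Clos built from uniform P-port devices. The P**3/4 formula and the halving effect are back-of-the-envelope approximations for illustration, not part of the design:

```python
def max_servers(port_count: int) -> int:
    """Maximum server count of a non-oversubscribed 5-stage Clos
    (fat-tree) built from identical port_count-port devices."""
    # port_count/2 servers per ToR * port_count/2 ToRs per cluster
    # * port_count clusters == port_count**3 / 4
    return port_count ** 3 // 4

full_fabric = max_servers(32)      # 8192 servers with 32-port devices
# Halving the Tier 1 device count (two Tier 2 uplinks per Tier 1
# device) keeps the bandwidth but halves the maximum DC size:
reduced_fabric = full_fabric // 2  # 4096 servers
```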

In this example, Tier 2 devices will be using two parallel links to connect to each Tier 1 device. If one of these links fails, the other will pick up all traffic of the failed link, possibly resulting in heavy congestion and quality-of-service degradation if the path determination procedure does not take bandwidth into account, since the number of upstream Tier 1 devices is likely greater than two. To avoid this situation, parallel links can be grouped in link aggregation groups (LAGs), e.g., [IEEE8023AD], with widely available implementation settings that take the whole "bundle" down upon a single link failure. Equivalent techniques that enforce "fate sharing" on the parallel links can be used in place of LAGs to achieve the same effect. As a result of such fate sharing, traffic from two or more failed links will be rebalanced over the multitude of remaining paths, which equals the number of Tier 1 devices. This example uses two links for simplicity; having more links in a bundle reduces the impact on capacity upon a member-link failure.
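
A simple model of the failure behavior described above, assuming perfectly even ECMP hashing (the numbers are illustrative only, not drawn from this document):

```python
def worst_link_load(total_traffic: float, n_tier1: int,
                    links_per_bundle: int = 2,
                    fate_sharing: bool = True) -> float:
    """Worst-case load on any surviving uplink of one Tier 2 device
    after a single member-link failure."""
    if fate_sharing:
        # The whole bundle is withdrawn; traffic spreads evenly over
        # the remaining (n_tier1 - 1) bundles.
        return total_traffic / ((n_tier1 - 1) * links_per_bundle)
    # Without fate sharing, ECMP still sends a full 1/n_tier1 share
    # toward the affected Tier 1 device, now over a single link.
    return total_traffic / n_tier1
```

With 8 Tier 1 devices and 2-link bundles, a healthy link carries 1/16 of the traffic; after one link fails, the worst link carries 1/14 with fate sharing but 1/8 (double its normal share) without it.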

4. Data Center Routing Overview

This section provides an overview of three general types of data center protocol designs -- Layer 2 only, hybrid L2/L3, and Layer 3 only.

4.1. L2-Only Designs

Originally, most data center designs used the Spanning Tree Protocol (STP), defined in [IEEE8021D-1990], for loop-free topology creation, typically utilizing variants of the traditional DC topology described in Section 3.1. At the time, many DC switches either did not support Layer 3 routing protocols or supported them only with additional licensing fees, which played a part in the design choice. Although many enhancements have been made through the introduction of the Rapid Spanning Tree Protocol (RSTP) in the latest revision of [IEEE8021D-2004] and the Multiple Spanning Tree Protocol (MST) specified in [IEEE8021Q] that improve convergence, stability, and load balancing in larger topologies, many of the fundamentals of the protocol limit its applicability in large-scale DCs. STP and its newer variants use an active/standby approach to path selection and are therefore hard to deploy in horizontally scaled topologies as described in Section 3.2. Further, operators have had many experiences with large failures due to issues caused by improper cabling, misconfiguration, or flawed software on a single device. These failures regularly affected the entire spanning-tree domain and were very hard to troubleshoot due to the nature of the protocol. For these reasons, and since almost all DC traffic is now IP, therefore requiring a Layer 3 routing protocol at the network edge for external connectivity, designs utilizing STP usually fail all of the requirements of large-scale DC operators. Various enhancements to link-aggregation protocols such as [IEEE8023AD], generally known as Multi-Chassis Link Aggregation (M-LAG), made it possible to use Layer 2 designs with active-active network paths while relying on STP as the backup for loop prevention. The major downsides of this approach are the inability in most implementations to scale linearly past two devices, the lack of standards-based implementations, and the added failure-domain risk of syncing state between the devices.

It should be noted that building large, horizontally scalable, L2-only networks without STP has recently become possible through the introduction of the Transparent Interconnection of Lots of Links (TRILL) protocol in [RFC6325]. TRILL resolves many of the issues STP has for large-scale DC design; however, the limited number of implementations, and often the requirement for specific equipment that supports it, have limited its applicability and increased the cost of such designs.

Finally, neither the base TRILL specification nor the M-LAG approach totally eliminates the problem of the shared broadcast domain that is so detrimental to the operations of any Layer 2, Ethernet-based solution. Later TRILL extensions have been proposed to address this problem, primarily based on the approaches outlined in [RFC7067], but this even further limits the number of available interoperable implementations that can be used to build a fabric. Therefore, TRILL-based designs have issues meeting REQ2, REQ3, and REQ4.

4.2. Hybrid L2/L3 Designs

Operators have sought to limit the impact of data-plane faults and build large-scale topologies through implementing routing protocols in either the Tier 1 or Tier 2 parts of the network and dividing the Layer 2 domain into numerous, smaller domains. This design has allowed data centers to scale up, but at the cost of complexity in managing multiple network protocols. For the following reasons, operators have retained Layer 2 in either the access (Tier 3) or both access and aggregation (Tier 3 and Tier 2) parts of the network:

o Supporting legacy applications that may require direct Layer 2 adjacency or use non-IP protocols.

o Seamless mobility for virtual machines that require the preservation of IP addresses when a virtual machine moves to a different Tier 3 switch.

o Simplified IP addressing: fewer IP subnets are required for the data center.

o Application load balancing may require direct Layer 2 reachability to perform certain functions such as Layer 2 Direct Server Return (DSR). See [L3DSR].

o Continued CAPEX differences between L2- and L3-capable switches.

4.3. L3-Only Designs

Network designs that leverage IP routing down to Tier 3 of the network have gained popularity as well. The main benefit of these designs is improved network stability and scalability, as a result of confining L2 broadcast domains. Commonly, an Interior Gateway Protocol (IGP) such as Open Shortest Path First (OSPF) [RFC2328] is used as the primary routing protocol in such a design. As data centers grow in scale, and server count exceeds tens of thousands, such fully routed designs have become more attractive.

Choosing a L3-only design greatly simplifies the network, facilitating the meeting of REQ1 and REQ2, and has widespread adoption in networks where large Layer 2 adjacency and larger size Layer 3 subnets are not as critical compared to network scalability and stability. Application providers and network operators continue to develop new solutions to meet some of the requirements that previously had driven large Layer 2 domains by using various overlay or tunneling techniques.

5. Routing Protocol Design

In this section, the motivations for using External BGP (EBGP) as the single routing protocol for data center networks having a Layer 3 protocol design and Clos topology are reviewed. Then, a practical approach for designing an EBGP-based network is provided.

5.1. Choosing EBGP as the Routing Protocol

REQ2 would give preference to the selection of a single routing protocol to reduce complexity and interdependencies. While it is common to rely on an IGP in this situation, sometimes with either the addition of EBGP at the device bordering the WAN or Internal BGP (IBGP) throughout, this document proposes the use of an EBGP-only design.

Although EBGP is the protocol used for almost all Inter-Domain Routing in the Internet and has wide support from both vendor and service provider communities, it is not generally deployed as the primary routing protocol within the data center for a number of reasons (some of which are interrelated):

o BGP is perceived as a "WAN-only" protocol and is not often considered for enterprise or data center applications.

o BGP is believed to have a "much slower" routing convergence compared to IGPs.

o Large-scale BGP deployments typically utilize an IGP for BGP next-hop resolution as all nodes in the IBGP topology are not directly connected.

o BGP is perceived to require significant configuration overhead and does not support neighbor auto-discovery.

This document discusses some of these perceptions, especially as applicable to the proposed design, and highlights some of the advantages of using the protocol such as:

o BGP has less complexity in parts of its protocol design -- internal data structures and state machine are simpler as compared to most link-state IGPs such as OSPF. For example, instead of implementing adjacency formation, adjacency maintenance and/or flow-control, BGP simply relies on TCP as the underlying transport. This fulfills REQ2 and REQ3.

o BGP information flooding overhead is less when compared to link-state IGPs. Since every BGP router calculates and propagates only the best-path selected, a network failure is masked as soon as the BGP speaker finds an alternate path, which exists when highly symmetric topologies, such as Clos, are coupled with an EBGP-only design. In contrast, the event propagation scope of a link-state IGP is an entire area, regardless of the failure type. In this way, BGP better meets REQ3 and REQ4. It is also worth mentioning that all widely deployed link-state IGPs feature periodic refreshes of routing information while BGP does not expire routing state, although this rarely impacts modern router control planes.

o BGP supports third-party (recursively resolved) next hops. This allows for manipulating multipath to be non-ECMP-based or forwarding-based on application-defined paths, through establishment of a peering session with an application "controller" that can inject routing information into the system, satisfying REQ5. OSPF provides similar functionality using concepts such as "Forwarding Address", but with more difficulty in implementation and far less control of information propagation scope.

o Using a well-defined Autonomous System Number (ASN) allocation scheme and standard AS_PATH loop detection, "BGP path hunting" (see [JAKMA2008]) can be controlled, and complex unwanted paths will be ignored. See Section 5.2 for an example of a working ASN allocation scheme. In a link-state IGP, accomplishing the same goal would require multi-(instance/topology/process) support, typically not available in all DC devices and quite complex to configure and troubleshoot. A traditional single flooding domain, which most DC designs utilize, may under certain failure conditions pick up unwanted lengthy paths, e.g., traversing multiple Tier 2 devices.

o EBGP configuration that is implemented with minimal routing policy is easier to troubleshoot for network reachability issues. In most implementations, it is straightforward to view contents of the BGP Loc-RIB and compare it to the router's Routing Information Base (RIB). Also, in most implementations, an operator can view every BGP neighbor's Adj-RIB-In and Adj-RIB-Out structures, and therefore incoming and outgoing Network Layer Reachability Information (NLRI) can be easily correlated on both sides of a BGP session. Thus, BGP satisfies REQ3.

5.2. EBGP Configuration for Clos Topology

Clos topologies that have more than 5 stages are very uncommon due to the large numbers of interconnects required by such a design. Therefore, the examples below are made with reference to the 5-stage Clos topology (in unfolded state).

5.2.1. EBGP Configuration Guidelines and Example ASN Scheme

The diagram below illustrates an example of an ASN allocation scheme. The following is a list of guidelines that can be used:

o EBGP single-hop sessions are established over direct point-to-point links interconnecting the network nodes, no multi-hop or loopback sessions are used, even in the case of multiple links between the same pair of nodes.

o Private Use ASNs from the range 64512-65534 are used to avoid ASN conflicts.

o A single ASN is allocated to all of the Clos topology's Tier 1 devices.

o A unique ASN is allocated to each set of Tier 2 devices in the same cluster.

o A unique ASN is allocated to every Tier 3 device (e.g., ToR) in this topology.

                                ASN 65534
                               +---------+
                               | +-----+ |
                               | |     | |
                             +-|-|     |-|-+
                             | | +-----+ | |
                  ASN 646XX  | |         | |  ASN 646XX
                 +---------+ | |         | | +---------+
                 | +-----+ | | | +-----+ | | | +-----+ |
     +-----------|-|     |-|-+-|-|     |-|-+-|-|     |-|-----------+
     |       +---|-|     |-|-+ | |     | | +-|-|     |-|---+       |
     |       |   | +-----+ |   | +-----+ |   | +-----+ |   |       |
     |       |   |         |   |         |   |         |   |       |
     |       |   |         |   |         |   |         |   |       |
     |       |   | +-----+ |   | +-----+ |   | +-----+ |   |       |
     | +-----+---|-|     |-|-+ | |     | | +-|-|     |-|---+-----+ |
     | |     | +-|-|     |-|-+-|-|     |-|-+-|-|     |-|-+ |     | |
     | |     | | | +-----+ | | | +-----+ | | | +-----+ | | |     | |
     | |     | | +---------+ | |         | | +---------+ | |     | |
     | |     | |             | |         | |             | |     | |
   +-----+ +-----+           | | +-----+ | |           +-----+ +-----+
   | ASN | |     |           +-|-|     |-|-+           |     | |     |
   |65YYY| | ... |             | |     | |             | ... | | ... |
   +-----+ +-----+             | +-----+ |             +-----+ +-----+
     | |     | |               +---------+               | |     | |
     O O     O O              <- Servers ->              O O     O O
        

Figure 4: BGP ASN Layout for 5-Stage Clos
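
The guidelines above can be expressed as a small allocation routine. This is a sketch only; the specific numbers it produces are hypothetical examples, not assignments made by this document:

```python
def allocate_asns(n_clusters: int, tors_per_cluster: int,
                  base: int = 64512):
    """Allocate 16-bit Private Use ASNs per the scheme above: one ASN
    shared by all Tier 1 devices, one ASN per Tier 2 cluster, and a
    unique ASN for every Tier 3 (ToR) device."""
    next_asn = base
    tier1_asn = next_asn
    next_asn += 1
    tier2_asn = {}   # cluster id -> ASN shared by the cluster's Tier 2 set
    tier3_asns = {}  # cluster id -> list of per-ToR ASNs
    for cluster in range(n_clusters):
        tier2_asn[cluster] = next_asn
        next_asn += 1
        tier3_asns[cluster] = list(range(next_asn,
                                         next_asn + tors_per_cluster))
        next_asn += tors_per_cluster
    if next_asn - 1 > 65534:
        raise ValueError("exhausted the 16-bit Private Use range")
    return tier1_asn, tier2_asn, tier3_asns
```

The `ValueError` guard reflects the 1023-ASN limit of the original Private Use range, whose workarounds are discussed in the next section.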

5.2.2. Private Use ASNs

The original range of Private Use ASNs [RFC6996] limited operators to 1023 unique ASNs. Since it is quite likely that the number of network devices may exceed this number, a workaround is required. One approach is to re-use the ASNs assigned to the Tier 3 devices across different clusters. For example, Private Use ASNs 65001, 65002 ... 65032 could be used within every individual cluster and assigned to Tier 3 devices.

To avoid route suppression due to the AS_PATH loop detection mechanism in BGP, upstream EBGP sessions on Tier 3 devices must be configured with the "Allowas-in" feature [ALLOWASIN] that allows accepting a device's own ASN in received route advertisements. Although this feature is not standardized, it is widely available across multiple vendors' implementations. Introducing this feature does not make routing loops more likely in the design, since the AS_PATH is added to by routers at each of the topology tiers and AS_PATH length is an early tie breaker in the BGP path selection process. Further loop protection is still in place at the Tier 1 device, which will not accept routes with a path including its own ASN, and Tier 2 devices do not have direct connectivity with each other.
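
A minimal sketch of the acceptance check, assuming the common (but non-standard) "Allowas-in" semantics of permitting a bounded number of occurrences of the local ASN:

```python
def accept_ebgp_route(as_path: list, local_asn: int,
                      allowas_in: int = 0) -> bool:
    """Standard EBGP loop detection rejects any AS_PATH containing the
    local ASN; "Allowas-in" relaxes this to allow up to allowas_in
    occurrences, enabling ASN reuse across clusters."""
    return as_path.count(local_asn) <= allowas_in

# A ToR reusing ASN 65001 receives a path from another cluster that
# also contains 65001; with allowas_in=1 the route is accepted.
```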

Another solution to this problem would be to use Four-Octet ASNs ([RFC6793]), for which additional Private Use ASNs are available; see [IANA.AS]. Use of Four-Octet ASNs puts additional protocol complexity in the BGP implementation and should be balanced against the complexity of re-use when considering REQ3 and REQ4. Perhaps more importantly, they are not yet supported by all BGP implementations, which may limit vendor selection of DC equipment. When supported, ensure that deployed implementations are able to remove the Private Use ASNs when external connectivity (Section 5.2.4) to these ASNs is required.

5.2.3. Prefix Advertisement

A Clos topology features a large number of point-to-point links and associated prefixes. Advertising all of these routes into BGP may create Forwarding Information Base (FIB) overload in the network devices. Advertising these links also puts additional path computation stress on the BGP control plane for little benefit. There are two possible solutions:

o Do not advertise any of the point-to-point links into BGP. Since the EBGP-based design changes the next-hop address at every device, distant networks will automatically be reachable via the advertising EBGP peer and do not require reachability to these prefixes. However, this may complicate operations or monitoring: e.g., using the popular "traceroute" tool will display IP addresses that are not reachable.

o Advertise point-to-point links, but summarize them on every device. This requires an address allocation scheme such as allocating a consecutive block of IP addresses per Tier 1 and Tier 2 device to be used for point-to-point interface addressing to the lower layers (Tier 2 uplinks will be allocated from Tier 1 address blocks and so forth).
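
The second option's address-allocation scheme might look like the following sketch, using the standard-library `ipaddress` module; the 10.128.0.0/16 pool and the /26-per-device sizing are invented for illustration:

```python
import ipaddress

def tier1_p2p_block(device_index: int):
    """Carve each Tier 1 device's point-to-point /31s out of one
    consecutive /26, so the device can advertise a single summary
    covering all of its downstream link subnets."""
    pool = ipaddress.ip_network("10.128.0.0/16")  # hypothetical pool
    summary = list(pool.subnets(new_prefix=26))[device_index]
    p2p_links = list(summary.subnets(new_prefix=31))  # 32 /31 links
    return summary, p2p_links
```

Each Tier 2 uplink toward this Tier 1 device would then be numbered from one of the /31s, and only the /26 summary needs to be advertised.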

Server subnets on Tier 3 devices must be announced into BGP without using route summarization on Tier 2 and Tier 1 devices. Summarizing subnets in a Clos topology results in route black-holing under a single link failure (e.g., between Tier 2 and Tier 3 devices) and hence must be avoided. The use of peer links within the same tier to resolve the black-holing problem by providing "bypass paths" is undesirable due to the O(N^2) complexity of the peering mesh and the waste of ports on the devices. An alternative to a full mesh of peer links would be a simpler bypass topology, e.g., a "ring" as described in [FB4POST], but such a topology adds extra hops and has limited bandwidth. It may require special tweaks to make BGP routing work, such as splitting every device into an ASN of its own. Later in this document, Section 8.2 introduces a less intrusive method for performing a limited form of route summarization in Clos networks and discusses its associated tradeoffs.

5.2.4. External Connectivity

A dedicated cluster (or clusters) in the Clos topology could be used for the purpose of connecting to the Wide Area Network (WAN) edge devices, or WAN Routers. Tier 3 devices in such a cluster would be replaced with WAN routers, and EBGP peering would be used again, though WAN routers are likely to belong to a public ASN if Internet connectivity is required in the design. The Tier 2 devices in such a dedicated cluster will be referred to as "Border Routers" in this document. These devices have to perform a few special functions:

o Hide network topology information when advertising paths to WAN routers, i.e., remove Private Use ASNs [RFC6996] from the AS_PATH attribute. This is typically done to avoid ASN number collisions between different data centers and also to provide a uniform AS_PATH length to the WAN for purposes of WAN ECMP to anycast prefixes originated in the topology. An implementation-specific BGP feature typically called "Remove Private AS" is commonly used to accomplish this. Depending on implementation, the feature should strip a contiguous sequence of Private Use ASNs found in an AS_PATH attribute prior to advertising the path to a neighbor. This assumes that all ASNs used for intra data center numbering are from the Private Use ranges. The process for stripping the Private Use ASNs is not currently standardized, see [REMOVAL]. However, most implementations at least follow the logic described in this vendor's document [VENDOR-REMOVE-PRIVATE-AS], which is enough for the design specified.

o Originate a default route to the data center devices. This is the only place where a default route can be originated, as route summarization is risky for the unmodified Clos topology. Alternatively, Border Routers may simply relay the default route learned from WAN routers. Advertising the default route from Border Routers requires that all Border Routers be fully connected to the WAN Routers upstream, to provide resistance to a single-link failure causing the black-holing of traffic. To prevent black-holing in the situation when all of the EBGP sessions to the WAN routers fail simultaneously on a given device, it is more desirable to readvertise the default route rather than originating the default route via complicated conditional route origination schemes provided by some implementations [CONDITIONALROUTE].
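
The "Remove Private AS" stripping described in the first bullet can be sketched as follows. Vendor behaviors differ; this mirrors the "strip a contiguous leading sequence" logic under the stated assumption that all intra-DC ASNs are from the Private Use range:

```python
PRIVATE_USE = range(64512, 65535)  # 16-bit Private Use ASNs [RFC6996]

def remove_private_as(as_path: list) -> list:
    """Drop the leading contiguous run of Private Use ASNs from an
    AS_PATH before advertising the route to a WAN neighbor."""
    i = 0
    while i < len(as_path) and as_path[i] in PRIVATE_USE:
        i += 1
    return as_path[i:]

# An anycast prefix originated inside the DC carries only Private Use
# ASNs, so the WAN sees a uniform (empty) intra-DC AS_PATH segment.
```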

5.2.5. Route Summarization at the Edge

It is often desirable to summarize network reachability information prior to advertising it to the WAN network due to the high number of IP prefixes originated from within the data center in a fully routed network design. For example, a network with 2000 Tier 3 devices will have at least 2000 server subnets advertised into BGP, along with the infrastructure prefixes. However, as discussed in Section 5.2.3, the proposed network design does not allow for route summarization due to the lack of peer links inside every tier.

However, it is possible to lift this restriction for the Border Routers by devising a different connectivity model for these devices. There are two options possible:

o Interconnect the Border Routers using a full mesh of physical links or any other "peer-mesh" topology, such as ring or hub-and-spoke. Configure BGP accordingly on all Border Routers to exchange network reachability information, e.g., by adding a mesh of IBGP sessions. The interconnecting peer links need to be appropriately sized for the traffic that will be present in the case of a device or link failure in the mesh connecting the Border Routers.

o Tier 1 devices may have additional physical links provisioned toward the Border Routers (which are Tier 2 devices from the perspective of Tier 1). Specifically, if protection from a single link or node failure is desired, each Tier 1 device would have to connect to at least two Border Routers. This puts additional requirements on the port count for Tier 1 devices and Border Routers, potentially making them nonuniform, larger-port-count devices compared with the other devices in the Clos topology. This also reduces the number of ports available to "regular" Tier 2 switches and hence the number of clusters that could be interconnected via Tier 1.

If any of the above options are implemented, it is possible to perform route summarization at the Border Routers toward the WAN network core without risking a routing black-hole condition under a single link failure. Both of the options would result in nonuniform topology as additional links have to be provisioned on some network devices.

6. ECMP Considerations

This section covers the Equal Cost Multipath (ECMP) functionality for Clos topology and discusses a few special requirements.

6.1. Basic ECMP

ECMP is the fundamental load-sharing mechanism used by a Clos topology. Effectively, every lower-tier device will use all of its directly attached upper-tier devices to load-share traffic destined to the same IP prefix. The number of ECMP paths between any two Tier 3 devices in Clos topology is equal to the number of the devices in the middle stage (Tier 1). For example, Figure 5 illustrates a topology where Tier 3 device A has four paths to reach servers X and Y, via Tier 2 devices B and C and then Tier 1 devices 1, 2, 3, and 4, respectively.

                                Tier 1
                               +-----+
                               | DEV |
                            +->|  1  |--+
                            |  +-----+  |
                    Tier 2  |           |   Tier 2
                   +-----+  |  +-----+  |  +-----+
     +------------>| DEV |--+->| DEV |--+--|     |-------------+
     |       +-----|  B  |--+  |  2  |  +--|     |-----+       |
     |       |     +-----+     +-----+     +-----+     |       |
     |       |                                         |       |
     |       |     +-----+     +-----+     +-----+     |       |
     | +-----+---->| DEV |--+  | DEV |  +--|     |-----+-----+ |
     | |     | +---|  C  |--+->|  3  |--+--|     |---+ |     | |
     | |     | |   +-----+  |  +-----+  |  +-----+   | |     | |
     | |     | |            |           |            | |     | |
   +-----+ +-----+          |  +-----+  |          +-----+ +-----+
   | DEV | |     | Tier 3   +->| DEV |--+   Tier 3 |     | |     |
   |  A  | |     |             |  4  |             |     | |     |
   +-----+ +-----+             +-----+             +-----+ +-----+
     | |     | |                                     | |     | |
     O O     O O            <- Servers ->            X Y     O O
        

Figure 5: ECMP Fan-Out Tree from A to X and Y

The ECMP requirement implies that the BGP implementation must support multipath fan-out for up to the maximum number of devices directly attached at any point in the topology in the upstream or downstream direction. Normally, this number does not exceed half of the ports found on a device in the topology. For example, an ECMP fan-out of 32 would be required when building a Clos network using 64-port devices. The Border Routers may need to have wider fan-out to be able to connect to a multitude of Tier 1 devices if route summarization at Border Router level is implemented as described in Section 5.2.5. If a device's hardware does not support wider ECMP, logical link-grouping (link-aggregation at Layer 2) could be used to provide "hierarchical" ECMP (Layer 3 ECMP coupled with Layer 2 ECMP) to compensate for fan-out limitations. However, this approach increases the risk of flow polarization, as less entropy will be available at the second stage of ECMP.
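
As a rough sketch of the sizing arithmetic above (the helper and its field names are hypothetical), the required ECMP fan-out follows directly from the port count when ports are split evenly between the upper and lower tiers:

```python
# Hypothetical sizing helper; the 64-port figure matches the example
# in the text, and other numbers follow the same arithmetic.

def clos_sizing(ports: int) -> dict:
    """Fan-out figures for a Clos stage built from 'ports'-port
    devices, with ports split evenly between up- and downlinks."""
    uplinks = ports // 2         # ports facing the upper tier
    downlinks = ports - uplinks  # ports facing the lower tier or servers
    return {
        "ecmp_fan_out": uplinks,       # paths load-shared toward the upper tier
        "tier2_per_cluster": uplinks,  # e.g., Tier 3 uplinks -> Tier 2 devices
        "tier3_per_cluster": downlinks,
    }

print(clos_sizing(64)["ecmp_fan_out"])  # 32, as in the example above
```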

Most BGP implementations declare paths to be equal from an ECMP perspective if they match up to and including step (e) in Section 9.1.2.2 of [RFC4271]. In the proposed network design there is no underlying IGP, so all IGP costs are assumed to be zero or otherwise the same value across all paths, and policies may be applied as necessary to equalize BGP attributes that vary in vendor defaults, such as the MULTI_EXIT_DISC (MED) attribute and origin code. For historical reasons, it is also useful to not use 0 as the equalized MED value; this and some other useful BGP information is available in [RFC4277]. Routing loops are unlikely due to the BGP best-path selection process (which prefers shorter AS_PATH length), and longer paths through the Tier 1 devices (which don't allow their own ASN in the path) are not possible.

6.2. BGP ECMP over Multiple ASNs

For application load-balancing purposes, it is desirable to have the same prefix advertised from multiple Tier 3 devices. From the perspective of other devices, such a prefix would have BGP paths with different AS_PATH attribute values, while having the same AS_PATH attribute lengths. Therefore, BGP implementations must support load-sharing over the above-mentioned paths. This feature is sometimes known as "multipath relax" or "multipath multiple-AS" and effectively allows for ECMP to be done across different neighboring ASNs if all other attributes are equal as already described in the previous section.
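
A minimal sketch of this eligibility check (a toy model; real implementations compare additional state such as peer type and IGP metric):

```python
# Toy model of BGP multipath eligibility, not a BGP implementation.
from dataclasses import dataclass

@dataclass
class Path:
    local_pref: int
    as_path: tuple  # sequence of ASNs
    origin: str
    med: int

def ecmp_eligible(a: Path, b: Path, multipath_relax: bool) -> bool:
    if (a.local_pref, a.origin, a.med) != (b.local_pref, b.origin, b.med):
        return False
    if multipath_relax:
        # "multipath relax" / "multipath multiple-AS": only the AS_PATH
        # length has to match, not its contents.
        return len(a.as_path) == len(b.as_path)
    return a.as_path == b.as_path

# Same anycast prefix learned via two Tier 3 devices in different ASNs:
p1 = Path(100, (65001, 64601), "igp", 100)
p2 = Path(100, (65002, 64602), "igp", 100)
print(ecmp_eligible(p1, p2, multipath_relax=False))  # False
print(ecmp_eligible(p1, p2, multipath_relax=True))   # True
```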

6.3. Weighted ECMP

It may be desirable for the network devices to implement "weighted" ECMP, to be able to send more traffic over some paths in the ECMP fan-out. This could be helpful to compensate for failures in the network and send more traffic over paths that have more capacity. The prefixes that require weighted ECMP would have to be injected by a remote BGP speaker (central agent) over a multi-hop session, as described further in Section 8.1. If support in implementations is available, weight distribution for multiple BGP paths could be signaled using the technique described in [LINK].
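
One simple way to model weighted ECMP is bucket replication: each next hop occupies a number of hash buckets proportional to its weight, so flows hash onto higher-weight paths more often. The sketch below is illustrative only (device names and the flow-key format are made up):

```python
# Toy model of weighted ECMP via hash-bucket replication.
import zlib

def build_buckets(weights: dict) -> list:
    """weights maps next hop -> integer weight; a next hop occupies a
    number of hash buckets proportional to its weight."""
    buckets = []
    for nh, weight in sorted(weights.items()):
        buckets.extend([nh] * weight)
    return buckets

def pick_next_hop(buckets: list, flow_key: bytes) -> str:
    """Hash the flow key onto a bucket; all packets of a flow take
    the same path."""
    return buckets[zlib.crc32(flow_key) % len(buckets)]

# Send roughly twice as much traffic via tier1-2 as via each other path:
buckets = build_buckets({"tier1-1": 1, "tier1-2": 2, "tier1-3": 1})
nh = pick_next_hop(buckets, b"10.1.1.1|10.2.2.2|49152|80|tcp")
print(nh)
```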

6.4. Consistent Hashing

It is often desirable to have the hashing function used for ECMP to be consistent (see [CONS-HASH]), to minimize the impact on flow to next-hop affinity changes when a next hop is added or removed to an ECMP group. This could be used if the network device is used as a load balancer, mapping flows toward multiple destinations -- in this case, losing or adding a destination will not have a detrimental effect on currently established flows. One particular recommendation on implementing consistent hashing is provided in [RFC2992], though other implementations are possible. This functionality could be naturally combined with weighted ECMP, with the impact of the next hop changes being proportional to the weight of the given next hop. The downside of consistent hashing is increased load on hardware resource utilization, as typically more resources (e.g., Ternary Content-Addressable Memory (TCAM) space) are required to implement a consistent-hashing function.
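
One consistent-hashing scheme, rendezvous (HRW) hashing, can be sketched in a few lines; note that removing a next hop remaps only the flows that were using it, which is the property motivating its use for load balancing:

```python
# Rendezvous (HRW) hashing sketch: each next hop "scores" the flow,
# the highest score wins. Removing a non-winning next hop does not
# change any other flow's binding.
import hashlib

def hrw_pick(next_hops, flow: str) -> str:
    return max(next_hops,
               key=lambda nh: hashlib.sha256((nh + "|" + flow).encode()).digest())

hops = ["nh1", "nh2", "nh3", "nh4"]
flows = ["flow-%d" % i for i in range(100)]
before = {f: hrw_pick(hops, f) for f in flows}

hops_after = [h for h in hops if h != "nh3"]  # next hop nh3 is removed
after = {f: hrw_pick(hops_after, f) for f in flows}

moved = [f for f in flows if before[f] != after[f]]
print(all(before[f] == "nh3" for f in moved))  # True: only nh3's flows moved
```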

7. Routing Convergence Properties

This section reviews routing convergence properties in the proposed design. A case is made that sub-second convergence is achievable if the implementation supports fast EBGP peering session deactivation and timely RIB and FIB updates upon failure of the associated link.

7.1. Fault Detection Timing

BGP typically relies on an IGP to route around link/node failures inside an AS, and implements either a polling-based or an event-driven mechanism to obtain updates on IGP state changes. The proposed routing design does not use an IGP, so the remaining mechanisms that could be used for fault detection are BGP keep-alive time-out (or any other type of keep-alive mechanism) and link-failure triggers.

Relying solely on BGP keep-alive packets may result in high convergence delays, on the order of multiple seconds (on many BGP implementations the minimum configurable BGP hold timer value is three seconds). However, many BGP implementations can shut down local EBGP peering sessions in response to the "link down" event for the outgoing interface used for BGP peering. This feature is sometimes called "fast fallover". Since links in modern data centers are predominantly point-to-point fiber connections, a physical interface failure is often detected in milliseconds and subsequently triggers a BGP reconvergence.

Ethernet links may support failure signaling or detection standards such as Connectivity Fault Management (CFM) as described in [IEEE8021Q]; this may make failure detection more robust. Alternatively, some platforms may support Bidirectional Forwarding Detection (BFD) [RFC5880] to allow for sub-second failure detection and fault signaling to the BGP process. However, the use of either of these presents additional requirements to vendor software and possibly hardware, and may contradict REQ1. Until recently with [RFC7130], BFD also did not allow detection of a single member link failure on a LAG, which would have limited its usefulness in some designs.

7.2. Event Propagation Timing

In the proposed design, the impact of the BGP MinRouteAdvertisementIntervalTimer (MRAI timer), as specified in Section 9.2.1.1 of [RFC4271], should be considered. Per the standard, it is required for BGP implementations to space out consecutive BGP UPDATE messages by at least MRAI seconds, which is often a configurable value. The initial BGP UPDATE messages after an event carrying withdrawn routes are commonly not affected by this timer. The MRAI timer may present significant convergence delays when a BGP speaker "waits" for the new path to be learned from its peers and has no local backup path information.

In a Clos topology, each EBGP speaker typically has either one path (Tier 2 devices don't accept paths from other Tier 2 in the same cluster due to same ASN) or N paths for the same prefix, where N is a significantly large number, e.g., N=32 (the ECMP fan-out to the next tier). Therefore, if a link fails to another device from which a path is received there is either no backup path at all (e.g., from the perspective of a Tier 2 switch losing the link to a Tier 3 device), or the backup is readily available in BGP Loc-RIB (e.g., from the perspective of a Tier 2 device losing the link to a Tier 1 switch). In the former case, the BGP withdrawal announcement will propagate without delay and trigger reconvergence on affected devices. In the latter case, the best path will be re-evaluated, and the local ECMP group corresponding to the new next-hop set will be changed. If the BGP path was the best path selected previously, an "implicit withdraw" will be sent via a BGP UPDATE message as described as Option b in Section 3.1 of [RFC4271] due to the BGP AS_PATH attribute changing.

7.3. Impact of Clos Topology Fan-Outs

Clos topology has large fan-outs, which may impact the "Up->Down" convergence in some cases, as described in this section. In a situation when a link between Tier 3 and Tier 2 device fails, the Tier 2 device will send BGP UPDATE messages to all upstream Tier 1 devices, withdrawing the affected prefixes. The Tier 1 devices, in turn, will relay these messages to all downstream Tier 2 devices (except for the originator). Tier 2 devices other than the one originating the UPDATE should then wait for ALL upstream Tier 1 devices to send an UPDATE message before removing the affected prefixes and sending corresponding UPDATE downstream to connected Tier 3 devices. If the original Tier 2 device or the relaying Tier 1 devices introduce some delay into their UPDATE message announcements, the result could be UPDATE message "dispersion" lasting up to multiple seconds. In order to avoid such behavior, BGP implementations must support "update groups". The "update group" is defined as a collection of neighbors sharing the same outbound policy -- the local speaker will send BGP updates to the members of the group synchronously.
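
The "update group" batching can be sketched as follows (a toy model, with outbound policies represented as hashable tags rather than real route policy):

```python
# Toy model of BGP "update groups": peers sharing an outbound policy
# are batched so one UPDATE is generated per group, not per neighbor.
from collections import defaultdict

def build_update_groups(neighbors: dict) -> dict:
    """neighbors maps peer name -> outbound policy (any hashable tag)."""
    groups = defaultdict(list)
    for peer, policy in sorted(neighbors.items()):
        groups[policy].append(peer)
    return dict(groups)

neighbors = {
    "tier2-a": ("advertise-all",),
    "tier2-b": ("advertise-all",),
    "tier3-x": ("default-only",),
}
groups = build_update_groups(neighbors)
print(len(groups))  # 2: one UPDATE per distinct outbound policy
```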

The impact of such "dispersion" grows with the size of topology fan-out and could also grow under network convergence churn. Some operators may be tempted to introduce "route flap dampening" type features that vendors include to reduce the control-plane impact of rapidly flapping prefixes. However, due to issues described with false positives in these implementations especially under such "dispersion" events, it is not recommended to enable this feature in this design. More background and issues with "route flap dampening" and possible implementation changes that could affect this are well described in [RFC7196].

7.4. Failure Impact Scope

A network is declared to converge in response to a failure once all devices within the failure impact scope are notified of the event and have recalculated their RIBs and consequently updated their FIBs. Larger failure impact scope typically means slower convergence since more devices have to be notified, and results in a less stable network. In this section, we describe BGP's advantages over link-state routing protocols in reducing failure impact scope for a Clos topology.

BGP behaves like a distance-vector protocol in the sense that only the best path from the point of view of the local router is sent to neighbors. As such, some failures are masked if the local node can immediately find a backup path and does not have to send any updates further. Notice that in the worst case, all devices in a data center topology have to either withdraw a prefix completely or update the ECMP groups in their FIBs. However, many failures will not result in such a wide impact. There are two main failure types where impact scope is reduced:

o Failure of a link between Tier 2 and Tier 1 devices: In this case, a Tier 2 device will update the affected ECMP groups, removing the failed link. There is no need to send new information to downstream Tier 3 devices, unless the path was selected as best by the BGP process, in which case only an "implicit withdraw" needs to be sent and this should not affect forwarding. The affected Tier 1 device will lose the only path available to reach a particular cluster and will have to withdraw the associated prefixes. Such a prefix withdrawal process will only affect Tier 2 devices directly connected to the affected Tier 1 device. The Tier 2 devices receiving the BGP UPDATE messages withdrawing prefixes will simply have to update their ECMP groups. The Tier 3 devices are not involved in the reconvergence process.

o Failure of a Tier 1 device: In this case, all Tier 2 devices directly attached to the failed node will have to update their ECMP groups for all IP prefixes from a non-local cluster. The Tier 3 devices are once again not involved in the reconvergence process, but may receive "implicit withdraws" as described above.

Even in the case of such failures where multiple IP prefixes will have to be reprogrammed in the FIB, it is worth noting that all of these prefixes share a single ECMP group on a Tier 2 device. Therefore, in the case of implementations with a hierarchical FIB, only a single change has to be made to the FIB. "Hierarchical FIB" here means FIB structure where the next-hop forwarding information is stored separately from the prefix lookup table, and the latter only stores pointers to the respective forwarding information. See [BGP-PIC] for discussion of FIB hierarchies and fast convergence.
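
The hierarchical-FIB idea can be sketched in a few lines (a toy model; real hardware stores next-hop groups in dedicated adjacency tables, and the names below are illustrative):

```python
# Toy hierarchical FIB: prefixes hold pointers to one shared ECMP
# group object, so a next-hop change is a single mutation rather
# than a reprogramming of every prefix.

class EcmpGroup:
    def __init__(self, next_hops):
        self.next_hops = set(next_hops)

group = EcmpGroup({"tier1-1", "tier1-2", "tier1-3", "tier1-4"})

# All prefixes from non-local clusters point at the same group:
fib = {"10.1.0.0/24": group, "10.2.0.0/24": group, "10.3.0.0/24": group}

# Tier 1 device 3 fails; one update is visible through every prefix:
group.next_hops.discard("tier1-3")
print(sorted(fib["10.2.0.0/24"].next_hops))  # ['tier1-1', 'tier1-2', 'tier1-4']
```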

Even though BGP offers reduced failure scope for some cases, further reduction of the fault domain using summarization is not always possible with the proposed design, since using this technique may create routing black-holes as mentioned previously. Therefore, the worst failure impact scope on the control plane is the network as a whole -- for instance, in the case of a link failure between Tier 2 and Tier 3 devices. The amount of impacted prefixes in this case would be much less than in the case of a failure in the upper layers of a Clos network topology. The property of having such large failure scope is not a result of choosing EBGP in the design but rather a result of using the Clos topology.

7.5. Routing Micro-Loops

When a downstream device, e.g., Tier 2 device, loses all paths for a prefix, it normally has the default route pointing toward the upstream device -- in this case, the Tier 1 device. As a result, it is possible to get in the situation where a Tier 2 switch loses a prefix, but a Tier 1 switch still has the path pointing to the Tier 2 device; this results in a transient micro-loop, since the Tier 1 switch will keep passing packets to the affected prefix back to the Tier 2 device, and the Tier 2 will bounce them back again using the default route. This micro-loop will last for the time it takes the upstream device to fully update its forwarding tables.

To minimize impact of such micro-loops, Tier 2 and Tier 1 switches can be configured with static "discard" or "null" routes that will be more specific than the default route for prefixes missing during network convergence. For Tier 2 switches, the discard route should be a summary route, covering all server subnets of the underlying Tier 3 devices. For Tier 1 devices, the discard route should be a summary covering the server IP address subnets allocated for the whole data center. Those discard routes will only take precedence for the duration of network convergence, until the device learns a more specific prefix via a new path.
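
A sketch of deriving such a discard summary with Python's ipaddress module, using example subnets (the addresses are illustrative, not from the document):

```python
# Derive a Tier 2 discard summary from its attached Tier 3 server
# subnets: the smallest supernet covering them all.
import ipaddress

tier3_subnets = [ipaddress.ip_network(n) for n in
                 ("10.8.0.0/24", "10.8.1.0/24", "10.8.2.0/24", "10.8.3.0/24")]

# Widen the first subnet until it covers every attached server subnet:
summary = tier3_subnets[0]
while not all(n.subnet_of(summary) for n in tier3_subnets):
    summary = summary.supernet()

# 'summary' would be installed as the static discard/null route:
print(summary)  # 10.8.0.0/22
```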

8. Additional Options for Design
8.1. Third-Party Route Injection

BGP allows for a "third-party", i.e., a directly attached BGP speaker, to inject routes anywhere in the network topology, meeting REQ5. This can be achieved by peering via a multi-hop BGP session with some or even all devices in the topology. Furthermore, BGP diverse path distribution [RFC6774] could be used to inject multiple BGP next hops for the same prefix to facilitate load balancing, or using the BGP ADD-PATH capability [RFC7911] if supported by the implementation. Unfortunately, in many implementations, ADD-PATH has been found to only support IBGP properly in the use cases for which it was originally optimized; this limits the "third-party" peering to IBGP only.

To implement route injection in the proposed design, a third-party BGP speaker may peer with Tier 3 and Tier 1 switches, injecting the same prefix, but using a special set of BGP next hops for Tier 1 devices. Those next hops are assumed to resolve recursively via BGP, and could be, for example, IP addresses on Tier 3 devices. The resulting forwarding table programming could provide desired traffic proportion distribution among different clusters.

8.2. Route Summarization within Clos Topology

As mentioned previously, route summarization is not possible within the proposed Clos topology since it makes the network susceptible to route black-holing under single link failures. The main problem is the limited number of redundant paths between network elements, e.g., there is only a single path between any pair of Tier 1 and Tier 3 devices. However, some operators may find route aggregation desirable to improve control-plane stability.

If any technique to summarize within the topology is planned, modeling of the routing behavior and potential for black-holing should be done not only for single or multiple link failures, but also for fiber pathway failures or optical domain failures when the topology extends beyond a physical location. Simple modeling can be done by checking the reachability on devices doing summarization under the condition of a link or pathway failure between a set of devices in every tier as well as to the WAN routers when external connectivity is present.

Route summarization would be possible with a small modification to the network topology, though the tradeoff would be reduction of the total size of the network as well as network congestion under specific failures. This approach is very similar to the technique described above, which allows Border Routers to summarize the entire data center address space.

8.2.1. Collapsing Tier 1 Devices Layer

In order to add more paths between Tier 1 and Tier 3 devices, group Tier 2 devices into pairs, and then connect the pairs to the same group of Tier 1 devices. This is logically equivalent to "collapsing" Tier 1 devices into a group of half the size, merging the links on the "collapsed" devices. The result is illustrated in Figure 6. For example, in this topology DEV C and DEV D connect to the same set of Tier 1 devices (DEV 1 and DEV 2), whereas before they were connecting to different groups of Tier 1 devices.

                    Tier 2       Tier 1       Tier 2
                   +-----+      +-----+      +-----+
     +-------------| DEV |------| DEV |------|     |-------------+
     |       +-----|  C  |--++--|  1  |--++--|     |-----+       |
     |       |     +-----+  ||  +-----+  ||  +-----+     |       |
     |       |              ||           ||              |       |
     |       |     +-----+  ||  +-----+  ||  +-----+     |       |
     | +-----+-----| DEV |--++--| DEV |--++--|     |-----+-----+ |
     | |     | +---|  D  |------|  2  |------|     |---+ |     | |
     | |     | |   +-----+      +-----+      +-----+   | |     | |
     | |     | |                                       | |     | |
   +-----+ +-----+                                   +-----+ +-----+
   | DEV | | DEV |                                   |     | |     |
   |  A  | |  B  | Tier 3                     Tier 3 |     | |     |
   +-----+ +-----+                                   +-----+ +-----+
     | |     | |                                       | |     | |
     O O     O O             <- Servers ->             O O     O O
        

Figure 6: 5-Stage Clos Topology

Having this design in place, Tier 2 devices may be configured to advertise only a default route down to Tier 3 devices. If a link between Tier 2 and Tier 3 fails, the traffic will be re-routed via the second available path known to a Tier 2 switch. It is still not possible to advertise a summary route covering prefixes for a single cluster from Tier 2 devices since each of them has only a single path down to this prefix. It would require dual-homed servers to accomplish that. Also note that this design is only resilient to single link failures. It is possible for a double link failure to isolate a Tier 2 device from all paths toward a specific Tier 3 device, thus causing a routing black-hole.

A result of the proposed topology modification would be a reduction of the port capacity of Tier 1 devices. This limits the maximum number of attached Tier 2 devices, and therefore will limit the maximum DC network size. A larger network would require different Tier 1 devices that have higher port density to implement this change.

Another problem is traffic rebalancing under link failures. Since there are two paths from Tier 1 to Tier 3, a failure of the link between Tier 1 and Tier 2 switch would result in all traffic that was taking the failed link to switch to the remaining path. This will result in doubling the link utilization on the remaining link.

8.2.2. Simple Virtual Aggregation

A completely different approach to route summarization is possible, provided that the main goal is to reduce the FIB size, while allowing the control plane to disseminate full routing information. Firstly, it could be easily noted that in many cases multiple prefixes, some of which are less specific, share the same set of the next hops (same ECMP group). For example, from the perspective of Tier 3 devices, all routes learned from upstream Tier 2 devices, including the default route, will share the same set of BGP next hops, provided that there are no failures in the network. This makes it possible to use the technique similar to that described in [RFC6769] and only install the least specific route in the FIB, ignoring more specific routes if they share the same next-hop set. For example, under normal network conditions, only the default route needs to be programmed into the FIB.
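
A sketch of this FIB-compression logic (a simplified model; a real implementation would track covering routes with a prefix trie rather than linear scans, and the prefixes below are examples):

```python
# Install a more specific prefix only when its next-hop set differs
# from its most specific covering route; otherwise the covering route
# forwards identically and the specific can be suppressed.
import ipaddress

def compress_fib(rib: dict) -> dict:
    """rib maps prefix strings to frozensets of next hops."""
    fib = {}
    for prefix in sorted(rib, key=lambda p: ipaddress.ip_network(p).prefixlen):
        net = ipaddress.ip_network(prefix)
        covers = [c for c in fib
                  if c != prefix and net.subnet_of(ipaddress.ip_network(c))]
        if covers:
            best = max(covers, key=lambda c: ipaddress.ip_network(c).prefixlen)
            if rib[prefix] == fib[best]:
                continue  # suppressed: same forwarding behavior
        fib[prefix] = rib[prefix]
    return fib

rib = {
    "0.0.0.0/0":   frozenset({"tier2-a", "tier2-b"}),  # default from upstream
    "10.0.0.0/8":  frozenset({"tier2-a", "tier2-b"}),  # same next hops
    "10.1.1.0/24": frozenset({"tier2-a"}),             # differs (failure case)
}
print(sorted(compress_fib(rib)))  # ['0.0.0.0/0', '10.1.1.0/24']
```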

Furthermore, if the Tier 2 devices are configured with summary prefixes covering all of their attached Tier 3 device's prefixes, the same logic could be applied in Tier 1 devices as well and, by induction to Tier 2/Tier 3 switches in different clusters. These summary routes should still allow for more specific prefixes to leak to Tier 1 devices, to enable detection of mismatches in the next-hop sets if a particular link fails, thus changing the next-hop set for a specific prefix.

Restating once again, this technique does not reduce the amount of control-plane state (i.e., BGP UPDATEs, BGP Loc-RIB size), but only allows for more efficient FIB utilization, by detecting more specific prefixes that share their next-hop set with a subsuming less specific prefix.


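The FIB-compression logic can be sketched as follows (a simplified model, not an actual router implementation; the data structures and names are ours). A prefix is programmed into the FIB only if it has no covering less-specific route, or if its next-hop set differs from that of its longest covering prefix:

```python
import ipaddress

def compress_fib(rib):
    """S-VA-style FIB compression per the idea in [RFC6769].
    `rib` maps ip_network -> frozenset of next-hop identifiers."""
    fib = {}
    # Shortest prefixes first, so covers are considered before the
    # more specific routes they may subsume.
    routes = sorted(rib.items(), key=lambda kv: kv[0].prefixlen)
    for prefix, nhset in routes:
        # Find the longest less-specific prefix covering this one.
        cover_nh, cover_len = None, -1
        for cand, cand_nh in routes:
            if (cand.prefixlen < prefix.prefixlen
                    and prefix.subnet_of(cand)
                    and cand.prefixlen > cover_len):
                cover_nh, cover_len = cand_nh, cand.prefixlen
        # Program the FIB only on a next-hop mismatch (e.g., after a
        # link failure changed the set for the more specific prefix).
        if cover_nh is None or cover_nh != nhset:
            fib[prefix] = nhset
    return fib
```

Under normal conditions, a more specific route whose next-hop set matches the default route's is suppressed from the FIB; after a failure changes its next-hop set, it is installed again.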
8.3. ICMP Unreachable Message Masquerading

This section discusses some operational aspects of not advertising point-to-point link subnets into BGP, as previously identified as an option in Section 5.2.3. The operational impact of this decision could be seen when using the well-known "traceroute" tool. Specifically, IP addresses displayed by the tool will be the link's point-to-point addresses, and hence will be unreachable for management connectivity. This makes some troubleshooting more complicated.

One way to overcome this limitation is by using the DNS subsystem to create "reverse" entries for these point-to-point IP addresses, pointing to the same name as the loopback address. The connectivity then can be made by resolving this name to the "primary" IP address of the device, e.g., its Loopback interface, which is always advertised into BGP. However, this creates a dependency on the DNS subsystem, which may be unavailable during an outage.

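As an illustration (the hostnames and addresses below are hypothetical, not from any particular deployment), the corresponding DNS data might look like:

```
; Hypothetical example: both point-to-point addresses 10.0.0.1 and
; 10.0.1.1 reverse-resolve to the router's name, and that name
; forward-resolves to the always-advertised loopback address.
1.0.0.10.in-addr.arpa.    IN  PTR  rsw1.dc.example.net.
1.1.0.10.in-addr.arpa.    IN  PTR  rsw1.dc.example.net.
rsw1.dc.example.net.      IN  A    192.0.2.1   ; Loopback ("primary") address
```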
Another option is to make the network device perform IP address masquerading, that is, rewriting the source IP addresses of the appropriate ICMP messages sent by the device with the "primary" IP address of the device. Specifically, the ICMP Destination Unreachable Message (type 3) code 3 (port unreachable) and ICMP Time Exceeded (type 11) code 0 are required for correct operation of the "traceroute" tool. With this modification, the "traceroute" probes sent to the devices will always be sent back with the "primary" IP address as the source, allowing the operator to discover the "reachable" IP address of the box. This has the downside of hiding the address of the "entry point" into the device. If the devices support [RFC5837], this may allow the best of both worlds by providing the information about the incoming interface even if the return address is the "primary" IP address.


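The masquerading decision can be sketched as follows (a simplified model; the function and constant names are ours, not from any vendor's implementation). Only the two traceroute-relevant ICMP messages have their source rewritten:

```python
# ICMP type values from the protocol; the rewrite policy itself is a
# hypothetical sketch of the masquerading behavior described above.
ICMP_DEST_UNREACH = 3
ICMP_TIME_EXCEEDED = 11

def masquerade_source(icmp_type, icmp_code, link_addr, primary_addr):
    """Return the source address to use for an outgoing ICMP message."""
    if (icmp_type, icmp_code) in {(ICMP_DEST_UNREACH, 3),   # port unreachable
                                  (ICMP_TIME_EXCEEDED, 0)}: # TTL exceeded
        return primary_addr   # traceroute probes see the "primary" address
    return link_addr          # everything else keeps the link address
```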
9. Security Considerations

The design does not introduce any additional security concerns. General BGP security considerations are discussed in [RFC4271] and [RFC4272]. Since a DC is a single-operator domain, this document assumes that edge filtering is in place to prevent attacks against the BGP sessions themselves from outside the perimeter of the DC. This may be a more feasible option for most deployments than having to deal with key management for TCP MD5 as described in [RFC2385] or dealing with the lack of implementations of the TCP Authentication Option [RFC5925] available at the time of publication of this document. The Generalized TTL Security Mechanism [RFC5082] could also be used to further reduce the risk of BGP session spoofing.


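The core of the GTSM check from [RFC5082] can be sketched as follows (names are ours; this is a conceptual model, not a router implementation). A directly connected eBGP peer sends with TTL 255, so any packet that crossed at least one router arrives with a lower TTL and is discarded:

```python
# Conceptual sketch of the GTSM hop-count check: accept only packets
# that could not have originated farther away than `trust_radius`
# hops, since each forwarding router decrements the TTL by one.
def gtsm_accept(received_ttl: int, trust_radius: int = 0) -> bool:
    return received_ttl >= 255 - trust_radius
```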
10. References
10.1. Normative References

[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, DOI 10.17487/RFC4271, January 2006, <http://www.rfc-editor.org/info/rfc4271>.

[RFC6996] Mitchell, J., "Autonomous System (AS) Reservation for Private Use", BCP 6, RFC 6996, DOI 10.17487/RFC6996, July 2013, <http://www.rfc-editor.org/info/rfc6996>.

10.2. Informative References

[ALFARES2008] Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable, Commodity Data Center Network Architecture", DOI 10.1145/1402958.1402967, August 2008, <http://dl.acm.org/citation.cfm?id=1402967>.

[ALLOWASIN] Cisco Systems, "Allowas-in Feature in BGP Configuration Example", February 2015, <http://www.cisco.com/c/en/us/support/docs/ip/border-gateway-protocol-bgp/112236-allowas-in-bgp-config-example.html>.

[BGP-PIC] Bashandy, A., Ed., Filsfils, C., and P. Mohapatra, "BGP Prefix Independent Convergence", Work in Progress, draft-ietf-rtgwg-bgp-pic-02, August 2016.

[CLOS1953] Clos, C., "A Study of Non-Blocking Switching Networks", The Bell System Technical Journal, Vol. 32(2), DOI 10.1002/j.1538-7305.1953.tb01433.x, March 1953.

[CONDITIONALROUTE] Cisco Systems, "Configuring and Verifying the BGP Conditional Advertisement Feature", August 2005, <http://www.cisco.com/c/en/us/support/docs/ip/border-gateway-protocol-bgp/16137-cond-adv.html>.

[CONS-HASH] Wikipedia, "Consistent Hashing", July 2016, <https://en.wikipedia.org/w/index.php?title=Consistent_hashing&oldid=728825684>.

[FB4POST] Farrington, N. and A. Andreyev, "Facebook's Data Center Network Architecture", May 2013, <http://nathanfarrington.com/papers/facebook-oic13.pdf>.

[GREENBERG2009] Greenberg, A., Hamilton, J., and D. Maltz, "The Cost of a Cloud: Research Problems in Data Center Networks", DOI 10.1145/1496091.1496103, January 2009, <http://dl.acm.org/citation.cfm?id=1496103>.

[HADOOP] Apache, "Apache Hadoop", April 2016, <https://hadoop.apache.org/>.

[IANA.AS] IANA, "Autonomous System (AS) Numbers", <http://www.iana.org/assignments/as-numbers>.

[IEEE8021D-1990] IEEE, "IEEE Standard for Local and Metropolitan Area Networks: Media Access Control (MAC) Bridges", IEEE Std 802.1D, DOI 10.1109/IEEESTD.1991.101050, 1991, <http://ieeexplore.ieee.org/servlet/opac?punumber=2255>.

[IEEE8021D-2004] IEEE, "IEEE Standard for Local and Metropolitan Area Networks: Media Access Control (MAC) Bridges", IEEE Std 802.1D, DOI 10.1109/IEEESTD.2004.94569, June 2004, <http://ieeexplore.ieee.org/servlet/opac?punumber=9155>.

[IEEE8021Q] IEEE, "IEEE Standard for Local and Metropolitan Area Networks: Bridges and Bridged Networks", IEEE Std 802.1Q, DOI 10.1109/IEEESTD.2014.6991462, <http://ieeexplore.ieee.org/servlet/opac?punumber=6991460>.

[IEEE8023AD] IEEE, "Amendment to Carrier Sense Multiple Access With Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications - Aggregation of Multiple Link Segments", IEEE Std 802.3ad, DOI 10.1109/IEEESTD.2000.91610, October 2000, <http://ieeexplore.ieee.org/servlet/opac?punumber=6867>.

[INTERCON] Dally, W. and B. Towles, "Principles and Practices of Interconnection Networks", ISBN 978-0122007514, January 2004, <http://dl.acm.org/citation.cfm?id=995703>.

[JAKMA2008] Jakma, P., "BGP Path Hunting", 2008, <https://blogs.oracle.com/paulj/entry/bgp_path_hunting>.

[L3DSR] Schaumann, J., "L3DSR - Overcoming Layer 2 Limitations of Direct Server Return Load Balancing", 2011, <https://www.nanog.org/meetings/nanog51/presentations/Monday/NANOG51.Talk45.nanog51-Schaumann.pdf>.

[LINK] Mohapatra, P. and R. Fernando, "BGP Link Bandwidth Extended Community", Work in Progress, draft-ietf-idr-link-bandwidth-06, January 2013.

[REMOVAL] Mitchell, J., Rao, D., and R. Raszuk, "Private Autonomous System (AS) Removal Requirements", Work in Progress, draft-mitchell-grow-remove-private-as-04, April 2015.

[RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, DOI 10.17487/RFC2328, April 1998, <http://www.rfc-editor.org/info/rfc2328>.

[RFC2385] Heffernan, A., "Protection of BGP Sessions via the TCP MD5 Signature Option", RFC 2385, DOI 10.17487/RFC2385, August 1998, <http://www.rfc-editor.org/info/rfc2385>.

[RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm", RFC 2992, DOI 10.17487/RFC2992, November 2000, <http://www.rfc-editor.org/info/rfc2992>.

[RFC4272] Murphy, S., "BGP Security Vulnerabilities Analysis", RFC 4272, DOI 10.17487/RFC4272, January 2006, <http://www.rfc-editor.org/info/rfc4272>.

[RFC4277] McPherson, D. and K. Patel, "Experience with the BGP-4 Protocol", RFC 4277, DOI 10.17487/RFC4277, January 2006, <http://www.rfc-editor.org/info/rfc4277>.

[RFC4786] Abley, J. and K. Lindqvist, "Operation of Anycast Services", BCP 126, RFC 4786, DOI 10.17487/RFC4786, December 2006, <http://www.rfc-editor.org/info/rfc4786>.

[RFC5082] Gill, V., Heasley, J., Meyer, D., Savola, P., Ed., and C. Pignataro, "The Generalized TTL Security Mechanism (GTSM)", RFC 5082, DOI 10.17487/RFC5082, October 2007, <http://www.rfc-editor.org/info/rfc5082>.

[RFC5837] Atlas, A., Ed., Bonica, R., Ed., Pignataro, C., Ed., Shen, N., and JR. Rivers, "Extending ICMP for Interface and Next-Hop Identification", RFC 5837, DOI 10.17487/RFC5837, April 2010, <http://www.rfc-editor.org/info/rfc5837>.

[RFC5880] Katz, D. and D. Ward, "Bidirectional Forwarding Detection (BFD)", RFC 5880, DOI 10.17487/RFC5880, June 2010, <http://www.rfc-editor.org/info/rfc5880>.

[RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP Authentication Option", RFC 5925, DOI 10.17487/RFC5925, June 2010, <http://www.rfc-editor.org/info/rfc5925>.

[RFC6325] Perlman, R., Eastlake 3rd, D., Dutt, D., Gai, S., and A. Ghanwani, "Routing Bridges (RBridges): Base Protocol Specification", RFC 6325, DOI 10.17487/RFC6325, July 2011, <http://www.rfc-editor.org/info/rfc6325>.

[RFC6769] Raszuk, R., Heitz, J., Lo, A., Zhang, L., and X. Xu, "Simple Virtual Aggregation (S-VA)", RFC 6769, DOI 10.17487/RFC6769, October 2012, <http://www.rfc-editor.org/info/rfc6769>.

[RFC6774] Raszuk, R., Ed., Fernando, R., Patel, K., McPherson, D., and K. Kumaki, "Distribution of Diverse BGP Paths", RFC 6774, DOI 10.17487/RFC6774, November 2012, <http://www.rfc-editor.org/info/rfc6774>.

[RFC6793] Vohra, Q. and E. Chen, "BGP Support for Four-Octet Autonomous System (AS) Number Space", RFC 6793, DOI 10.17487/RFC6793, December 2012, <http://www.rfc-editor.org/info/rfc6793>.

[RFC7067] Dunbar, L., Eastlake 3rd, D., Perlman, R., and I. Gashinsky, "Directory Assistance Problem and High-Level Design Proposal", RFC 7067, DOI 10.17487/RFC7067, November 2013, <http://www.rfc-editor.org/info/rfc7067>.

[RFC7130] Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed., Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional Forwarding Detection (BFD) on Link Aggregation Group (LAG) Interfaces", RFC 7130, DOI 10.17487/RFC7130, February 2014, <http://www.rfc-editor.org/info/rfc7130>.

[RFC7196] Pelsser, C., Bush, R., Patel, K., Mohapatra, P., and O. Maennel, "Making Route Flap Damping Usable", RFC 7196, DOI 10.17487/RFC7196, May 2014, <http://www.rfc-editor.org/info/rfc7196>.

[RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder, "Advertisement of Multiple Paths in BGP", RFC 7911, DOI 10.17487/RFC7911, July 2016, <http://www.rfc-editor.org/info/rfc7911>.

[VENDOR-REMOVE-PRIVATE-AS] Cisco Systems, "Removing Private Autonomous System Numbers in BGP", August 2005, <http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a0080093f27.shtml>.

Acknowledgements

This publication summarizes the work of many people who participated in developing, testing, and deploying the proposed network design, some of whom were George Chen, Parantap Lahiri, Dave Maltz, Edet Nkposong, Robert Toomey, and Lihua Yuan. The authors would also like to thank Linda Dunbar, Anoop Ghanwani, Susan Hares, Danny McPherson, Robert Raszuk, and Russ White for reviewing this document and providing valuable feedback, and Mary Mitchell for initial grammar and style suggestions.

Authors' Addresses

Petr Lapukhov Facebook 1 Hacker Way Menlo Park, CA 94025 United States of America

   Email: petr@fb.com
        

Ariff Premji Arista Networks 5453 Great America Parkway Santa Clara, CA 95054 United States of America

   Email: ariff@arista.com
   URI:   http://arista.com/
        

Jon Mitchell (editor)

   Email: jrmitche@puck.nether.net
        