Internet Engineering Task Force (IETF)                       R. Krishnan
Request for Comments: 7424                        Brocade Communications
Category: Informational                                          L. Yong
ISSN: 2070-1721                                               Huawei USA
                                                             A. Ghanwani
                                                                    Dell
                                                                   N. So
                                                           Vinci Systems
                                                           B. Khasnabish
                                                         ZTE Corporation
                                                            January 2015
        

Mechanisms for Optimizing Link Aggregation Group (LAG) and Equal-Cost Multipath (ECMP) Component Link Utilization in Networks

Abstract

Demands on networking infrastructure are growing exponentially due to bandwidth-hungry applications such as rich media applications and inter-data-center communications. In this context, it is important to optimally use the bandwidth in wired networks that extensively use link aggregation groups and equal-cost multipaths as techniques for bandwidth scaling. This document explores some of the mechanisms useful for achieving this.

Status of This Memo

This document is not an Internet Standards Track specification; it is published for informational purposes.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Not all documents approved by the IESG are a candidate for any level of Internet Standard; see Section 2 of RFC 5741.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc7424.

Copyright Notice

Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Acronyms
      1.2. Terminology
   2. Flow Categorization
   3. Hash-Based Load Distribution in LAG/ECMP
   4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization
      4.1. Differences in LAG vs. ECMP
      4.2. Operational Overview
      4.3. Large Flow Recognition
           4.3.1. Flow Identification
           4.3.2. Criteria and Techniques for Large Flow Recognition
           4.3.3. Sampling Techniques
           4.3.4. Inline Data Path Measurement
           4.3.5. Use of Multiple Methods for Large Flow Recognition
      4.4. Options for Load Rebalancing
           4.4.1. Alternative Placement of Large Flows
           4.4.2. Redistributing Small Flows
           4.4.3. Component Link Protection Considerations
           4.4.4. Algorithms for Load Rebalancing
           4.4.5. Example of Load Rebalancing
   5. Information Model for Flow Rebalancing
      5.1. Configuration Parameters for Flow Rebalancing
      5.2. System Configuration and Identification Parameters
      5.3. Information for Alternative Placement of Large Flows
      5.4. Information for Redistribution of Small Flows
      5.5. Export of Flow Information
      5.6. Monitoring Information
           5.6.1. Interface (Link) Utilization
           5.6.2. Other Monitoring Information
   6. Operational Considerations
      6.1. Rebalancing Frequency
      6.2. Handling Route Changes
      6.3. Forwarding Resources
   7. Security Considerations
   8. References
      8.1. Normative References
      8.2. Informative References
   Appendix A.  Internet Traffic Analysis and Load-Balancing Simulation
   Acknowledgements
   Contributors
   Authors' Addresses
        
1. Introduction

Networks extensively use link aggregation groups (LAGs) [802.1AX] and equal-cost multipaths (ECMPs) [RFC2991] as techniques for capacity scaling. For the problems addressed by this document, network traffic can be predominantly categorized into two traffic types: long-lived large flows and other flows. These other flows, which include long-lived small flows, short-lived small flows, and short-lived large flows, are referred to as "small flows" in this document. Long-lived large flows are simply referred to as "large flows".

Stateless hash-based techniques [ITCOM] [RFC2991] [RFC2992] [RFC6790] are often used to distribute both large flows and small flows over the component links in a LAG/ECMP. However, the traffic may not be evenly distributed over the component links due to the traffic pattern.

This document describes mechanisms for optimizing LAG/ECMP component link utilization when using hash-based techniques. The mechanisms comprise the following steps: 1) recognizing large flows in a router, and 2) assigning the large flows to specific LAG/ECMP component links or redistributing the small flows when a component link on the router is congested.

It is useful to keep in mind that in typical use cases for these mechanisms, the large flows consume a significant amount of bandwidth on a link, e.g., greater than 5% of link bandwidth. The number of such flows would necessarily be fairly small, e.g., on the order of 10s or 100s per LAG/ECMP. In other words, the number of large flows is NOT expected to be on the order of millions of flows. Examples of such large flows would be IPsec tunnels in service provider backbone networks or storage backup traffic in data center networks.

1.1. Acronyms

DoS: Denial of Service

ECMP: Equal-Cost Multipath

GRE: Generic Routing Encapsulation

IPFIX: IP Flow Information Export

LAG: Link Aggregation Group

MPLS: Multiprotocol Label Switching

NVGRE: Network Virtualization using Generic Routing Encapsulation

PBR: Policy-Based Routing

QoS: Quality of Service

STT: Stateless Transport Tunneling

VXLAN: Virtual eXtensible LAN

1.2. Terminology

Central management entity: An entity that is capable of monitoring information about link utilization and flows in routers across the network and may be capable of making traffic-engineering decisions for placement of large flows. It may include the functions of a collector [RFC7011].

ECMP component link: An individual next hop within an ECMP group. An ECMP component link may itself comprise a LAG.

ECMP table: A table that is used as the next hop of an ECMP route that comprises the set of ECMP component links and the weights associated with each of those ECMP component links. The input for looking up the table is the hash value for the packet, and the weights are used to determine which values of the hash function map to a given ECMP component link.

Flow (large or small): A sequence of packets for which ordered delivery should be maintained, e.g., packets belonging to the same TCP connection.

LAG component link: An individual link within a LAG. A LAG component link is typically a physical link.

LAG table: A table that is used as the output port, which is a LAG, that comprises the set of LAG component links and the weights associated with each of those component links. The input for looking up the table is the hash value for the packet, and the weights are used to determine which values of the hash function map to a given LAG component link.

Large flow(s): Refers to long-lived large flow(s).

Small flow(s): Refers to any of, or a combination of, long-lived small flow(s), short-lived small flows, and short-lived large flow(s).

2. Flow Categorization

In general, based on the size and duration, a flow can be categorized into any one of the following four types, as shown in Figure 1:

o short-lived large flow (SLLF),

o short-lived small flow (SLSF),

o long-lived large flow (LLLF), and

o long-lived small flow (LLSF).

        Flow Bandwidth
            ^
            |--------------------|--------------------|
            |                    |                    |
      Large |      SLLF          |       LLLF         |
      Flow  |                    |                    |
            |--------------------|--------------------|
            |                    |                    |
      Small |      SLSF          |       LLSF         |
      Flow  |                    |                    |
            +--------------------+--------------------+-->Flow Duration
                 Short-Lived            Long-Lived
                 Flow                   Flow
        

Figure 1: Flow Categorization

In this document, as mentioned earlier, we categorize long-lived large flows as "large flows", and all of the others (long-lived small flows, short-lived small flows, and short-lived large flows) as "small flows".

3. Hash-Based Load Distribution in LAG/ECMP

Hash-based techniques are often used for load balancing of traffic to select among multiple available paths within a LAG/ECMP group. The advantages of hash-based techniques for load distribution are the preservation of the packet sequence in a flow and the real-time distribution without maintaining per-flow state in the router. Hash-based techniques use a combination of fields in the packet's headers

to identify a flow, and the hash function computed using these fields is used to generate a unique number that identifies a link/path in a LAG/ECMP group. The result of the hashing procedure is a many-to-one mapping of flows to component links.

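As an informal illustration of this many-to-one mapping, the following Python sketch selects a component link for a flow by hashing a set of header fields. It is not part of any specification; the choice of a CRC-32 hash and of the 5-tuple fields is an assumption made only for the example.

      import zlib

      def select_component_link(flow_fields, component_links):
          # flow_fields: header fields identifying the flow, e.g., the
          # 5-tuple (src_ip, dst_ip, protocol, src_port, dst_port)
          # component_links: ordered list of component link identifiers
          key = "|".join(str(f) for f in flow_fields).encode()
          hash_value = zlib.crc32(key)      # stateless hash over the fields
          return component_links[hash_value % len(component_links)]

      # Example: map one flow onto a three-link LAG
      links = ["link-1", "link-2", "link-3"]
      flow = ("192.0.2.1", "198.51.100.2", 6, 12345, 80)
      print(select_component_link(flow, links))

Because the mapping is stateless, all packets of a given flow hash to the same component link, which preserves packet order within the flow.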

Hash-based techniques produce good results with respect to utilization of the individual component links if:

o the traffic mix constitutes flows such that the result of the hash function across these flows is fairly uniform so that a similar number of flows is mapped to each component link,

o the individual flow rates are much smaller as compared to the link capacity, and

o the differences in flow rates are not dramatic.

However, if one or more of these conditions are not met, hash-based techniques may result in imbalance in the loads on individual component links.

An example is illustrated in Figure 2. As shown, there are two routers, R1 and R2, and there is a LAG between them that has three component links (1), (2), and (3). A total of ten flows need to be distributed across the links in this LAG. The result of applying the hash-based technique is as follows:

o Component link (1) has three flows (two small flows and one large flow), and the link utilization is normal.

o Component link (2) has three flows (three small flows and no large flows), and the link utilization is light.

- The absence of any large flow causes the component link to be underutilized.

o Component link (3) has four flows (two small flows and two large flows), and the link capacity is exceeded resulting in congestion.

- The presence of two large flows causes congestion on this component link.

                  +-----------+ ->     +-----------+
                  |           | ->     |           |
                  |           | ===>   |           |
                  |        (1)|--------|(1)        |
                  |           | ->     |           |
                  |           | ->     |           |
                  |   (R1)    | ->     |     (R2)  |
                  |        (2)|--------|(2)        |
                  |           | ->     |           |
                  |           | ->     |           |
                  |           | ===>   |           |
                  |           | ===>   |           |
                  |        (3)|--------|(3)        |
                  |           |        |           |
                  +-----------+        +-----------+
        
            Where: ->   small flow
                   ===> large flow
        

Figure 2: Unevenly Utilized Component Links

This document presents mechanisms for addressing the imbalance in load distribution resulting from commonly used hash-based techniques for LAG/ECMP that are shown in the above example. The mechanisms use large flow awareness to compensate for the imbalance in load distribution.

4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization

The suggested mechanisms in this document are local optimization solutions; they are local in the sense that both the identification of large flows and rebalancing of the load can be accomplished completely within individual routers in the network without the need for interaction with other routers.

This approach may not yield a global optimization of the placement of large flows across multiple routers in a network, which may be desirable in some networks. On the other hand, a local approach may be adequate for some environments for the following reasons:

1) Different links within a network experience different levels of utilization; thus, a "targeted" solution is needed for those hot spots in the network. An example is the utilization of a LAG between two routers that needs to be optimized.

2) Some networks may lack end-to-end visibility, e.g., when a certain network, under the control of a given operator, is a transit network for traffic from other networks that are not under the control of the same operator.

4.1. Differences in LAG vs. ECMP

While the mechanisms explained herein are applicable to both LAGs and ECMP groups, it is useful to note that there are some key differences between the two that may impact how effective the mechanisms are. This relates, in part, to the localized information with which the mechanisms are intended to operate.

A LAG is usually established across links that are between two adjacent routers. As a result, the scope of the problem of optimizing the bandwidth utilization on the component links is fairly narrow. It simply involves rebalancing the load across the component links between these two routers, and there is no impact whatsoever to other parts of the network. The scheme works equally well for unicast and multicast flows.

On the other hand, with ECMP, redistributing the load across component links that are part of the ECMP group may impact traffic patterns at all of the routers that are downstream of the given router between itself and the destination. The local optimization may result in congestion at a downstream node. (In its simplest form, an ECMP group may be used to distribute traffic on component links that are between two adjacent routers, and in that case, the ECMP group is no different than a LAG for the purpose of this discussion. It should be noted that an ECMP component link may itself comprise a LAG, in which case the scheme may be further applied to the component links within the LAG.)

To demonstrate the limitations of local optimization, consider a two-level Clos network topology as shown in Figure 3 with three leaf routers (L1, L2, and L3) and two spine routers (S1 and S2). Assume all of the links are 10 Gbps.

Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one flow of 7 Gbps also towards L3. If L1 balances the load optimally between S1 and S2, and L2 sends the flow via S1, then the downlink from S1 to L3 would get congested, resulting in packet discards. On the other hand, if L1 had sent both its flows towards S1 and L2 had sent its flow towards S2, there would have been no congestion at either S1 or S2.

                    +-----+     +-----+
                    | S1  |     | S2  |
                    +-----+     +-----+
                     / \ \       / /\
                    / +---------+ /  \
                   / /  \  \     /    \
                  / /    \  +------+   \
                 / /      \    /    \   \
              +-----+    +-----+   +-----+
              | L1  |    | L2  |   | L3  |
              +-----+    +-----+   +-----+
        

Figure 3: Two-Level Clos Network

The other issue with applying this scheme to ECMP groups is that it may not apply equally to unicast and multicast traffic because of the way multicast trees are constructed.

Finally, it is possible for a single physical link to participate as a component link in multiple ECMP groups, whereas with LAGs, a link can participate as a component link of only one LAG.

4.2. Operational Overview

The various steps in optimizing LAG/ECMP component link utilization in networks are detailed below:

Step 1: This step involves recognizing large flows in routers and maintaining the mapping for each large flow to the component link that it uses. Recognition of large flows is explained in Section 4.3.

Step 2: The egress component links are periodically scanned for link utilization, and the imbalance for the LAG/ECMP group is monitored. If the imbalance exceeds a certain threshold, then rebalancing is triggered. Measurement of the imbalance is discussed further in Section 5.1. In addition to the imbalance, further criteria (such as the maximum utilization of any of the component links) may also be used to determine whether or not to trigger rebalancing. The use of sampling techniques for the measurement of egress component link utilization, including the issues of depending on ingress sampling for these measurements, are discussed in Section 4.3.3.

Step 3: As a part of rebalancing, the operator can choose to rebalance the large flows by placing them on lightly loaded component links of the LAG/ECMP group, redistribute the small flows on the congested link to other component links of the group, or a combination of both.

All of the steps identified above can be done locally within the router itself or could involve the use of a central management entity.

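The following Python sketch ties the three steps together as a periodic control loop that could run either on the router or in a central management entity. It is purely illustrative: the three callables passed in stand for the mechanisms of Sections 4.3, 5.1, and 4.4 and are hypothetical placeholders, not actual router APIs.

      import time

      def rebalancing_loop(group, recognize_large_flows, measure_imbalance,
                           rebalance, imbalance_threshold, rebalancing_interval):
          # group: a LAG/ECMP group and its component links
          while True:
              large_flows = recognize_large_flows(group)   # Step 1 (Section 4.3)
              imbalance = measure_imbalance(group)         # Step 2 (Section 5.1)
              if imbalance > imbalance_threshold:
                  rebalance(group, large_flows)            # Step 3 (Section 4.4)
              time.sleep(rebalancing_interval)             # limit rebalancing frequency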

Providing large flow information to a central management entity provides the capability to globally optimize flow distribution as described in Section 4.1. Consider the following example. A router may have three ECMP next hops that lead down paths P1, P2, and P3. A couple of hops downstream on path P1, there may be a congested link, while paths P2 and P3 may be underutilized. This is something that the local router does not have visibility into. With the help of a central management entity, the operator could redistribute some of the flows from P1 to P2 and/or P3, resulting in a more optimized flow of traffic.

The steps described above are especially useful when bundling links of different bandwidths, e.g., 10 Gbps and 100 Gbps as described in [RFC7226].

4.3. Large Flow Recognition
4.3.1. Flow Identification

Flows are typically identified using one or more fields from the packet header, for example:

o Layer 2: Source Media Access Control (MAC) address, destination MAC address, VLAN ID.

o IP header: IP protocol, IP source address, IP destination address, flow label (IPv6 only).

o Transport protocol header: Source port number, destination port number. These apply to protocols such as TCP, UDP, and the Stream Control Transmission Protocol (SCTP).

o MPLS labels.

For tunneling protocols like Generic Routing Encapsulation (GRE) [RFC2784], Virtual eXtensible LAN (VXLAN) [RFC7348], Network Virtualization using Generic Routing Encapsulation (NVGRE) [NVGRE],

Stateless Transport Tunneling (STT) [STT], Layer 2 Tunneling Protocol (L2TP) [RFC3931], etc., flow identification is possible based on inner and/or outer headers as well as fields introduced by the tunnel header, as any or all such fields may be used for load balancing decisions [RFC5640].

The above list is not exhaustive.

The mechanisms described in this document are agnostic to the fields that are used for flow identification.

This method of flow identification is consistent with that of IPFIX [RFC7011].

4.3.2. Criteria and Techniques for Large Flow Recognition

From the perspective of bandwidth and time duration, in order to recognize large flows, we define an observation interval and measure the bandwidth of the flow over that interval. A flow that exceeds a certain minimum bandwidth threshold over that observation interval would be considered a large flow.

The two parameters -- the observation interval and the minimum bandwidth threshold over that observation interval -- should be programmable to facilitate handling of different use cases and traffic characteristics. For example, a flow that is at or above 10% of link bandwidth for a time period of at least one second could be declared a large flow [DEVOFLOW].

In order to avoid excessive churn in the rebalancing, once a flow has been recognized as a large flow, it should continue to be recognized as a large flow for as long as the traffic received during an observation interval exceeds some fraction of the bandwidth threshold, for example, 80% of the bandwidth threshold.

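A minimal Python sketch of this recognition logic, including the hysteresis, is given below. The parameter values (10% of link bandwidth to recognize a large flow, 8% to maintain it, i.e., 80% of the recognition threshold) are examples only, and byte_counts is assumed to come from one of the measurement methods in Sections 4.3.3 and 4.3.4.

      def classify_large_flows(byte_counts, link_speed_bps, interval_s,
                               recognize_pct=10.0, maintain_pct=8.0,
                               previously_large=frozenset()):
          # byte_counts: {flow_id: bytes observed during the interval}
          interval_link_bytes = link_speed_bps * interval_s / 8.0
          recognize_bytes = interval_link_bytes * recognize_pct / 100.0
          maintain_bytes = interval_link_bytes * maintain_pct / 100.0
          large = set()
          for flow_id, count in byte_counts.items():
              if count >= recognize_bytes:
                  large.add(flow_id)
              elif flow_id in previously_large and count >= maintain_bytes:
                  large.add(flow_id)     # hysteresis: keep known large flows
          return large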

Various techniques to recognize a large flow are described in Sections 4.3.3, 4.3.4, and 4.3.5.

4.3.3. Sampling Techniques

A number of routers support sampling techniques such as sFlow [sFlow-v5] [sFlow-LAG], Packet Sampling (PSAMP) [RFC5475], and NetFlow Sampling [RFC3954]. For the purpose of large flow recognition, sampling needs to be enabled on all of the egress ports in the router where such measurements are desired.

Using sFlow as an example, processing in an sFlow collector can provide an approximate indication of the mapping of large flows to each of the component links in each LAG/ECMP group. Assuming sufficient control plane resources are available, it is possible to implement this part of the collector function in the control plane of the router to reduce dependence on a central management entity.

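As a rough illustration of how such a collector function might estimate per-flow rates, the sketch below scales up 1-in-N packet samples. It is a simplification of sFlow's estimation model, shown only to convey the idea; the field names and the fixed packet sampling rate are assumptions.

      def estimate_flow_rates(samples, sampling_rate_n, interval_s):
          # samples: list of (flow_id, egress_port, packet_length_bytes)
          # Each sampled packet is taken to represent sampling_rate_n packets.
          totals = {}
          for flow_id, egress_port, length in samples:
              key = (flow_id, egress_port)
              totals[key] = totals.get(key, 0) + length * sampling_rate_n
          # Return estimated bits per second per (flow, egress port) pair.
          return {key: total * 8 / interval_s for key, total in totals.items()}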

If egress sampling is not available, ingress sampling can suffice since the central management entity used by the sampling technique typically has visibility across multiple routers in a network and can use the samples from an immediately downstream router to make measurements for egress traffic at the local router.

The option of using ingress sampling for this purpose may not be available if the downstream router is under the control of a different operator or if the downstream device does not support sampling.

Alternatively, since sampling techniques require that the sample be annotated with the packet's egress port information, ingress sampling may suffice. However, this means that sampling would have to be enabled on all ports, rather than only on those ports where such monitoring is desired. There is one situation in which this approach may not work. If there are tunnels that originate from the given router and if the resulting tunnel comprises the large flow, then this cannot be deduced from ingress sampling at the given router. Instead, for this scenario, if egress sampling is unavailable, then ingress sampling from the downstream router must be used.

To illustrate the use of ingress versus egress sampling, we refer to Figure 2. Since we are looking at rebalancing flows at R1, we would need to enable egress sampling on ports (1), (2), and (3) on R1. If egress sampling is not available and if R2 is also under the control of the same administrator, enabling ingress sampling on R2's ports (1), (2), and (3) would also work, but it would necessitate the involvement of a central management entity in order for R1 to obtain large flow information for each of its links. Finally, R1 can only enable ingress sampling on all of its ports (not just the ports that are part of the LAG/ECMP group being monitored), and that would suffice if the sampling technique annotates the samples with the egress port information.

The advantages and disadvantages of sampling techniques are as follows.

Advantages:

o Supported in most existing routers.

o Requires minimal router resources.

Disadvantage:

o In order to minimize the error inherent in sampling, there is a minimum delay for the recognition time of large flows, and in the time that it takes to react to this information.

With sampling, the detection of large flows can be done on the order of one second [DEVOFLOW]. A discussion on determining the appropriate sampling frequency is available in [SAMP-BASIC].

4.3.4. Inline Data Path Measurement

Implementations may perform recognition of large flows by performing measurements on traffic in the data path of a router. Such an approach would be expected to operate at the interface speed on every interface, accounting for all packets processed by the data path of the router. An example of such an approach is described in IPFIX [RFC5470].

Using inline data path measurement, a faster and more accurate indication of large flows mapped to each of the component links in a LAG/ECMP group may be possible (as compared to the sampling-based approach).

The advantages and disadvantages of inline data path measurement are as follows:

Advantages:

o As link speeds get higher, sampling rates are typically reduced to keep the number of samples manageable, which places a lower bound on the detection time. With inline data path measurement, large flows can be recognized in shorter windows on higher link speeds since every packet is accounted for [NDTM].

o Inline data path measurement eliminates the potential dependence on a central management entity for large flow recognition.

Disadvantage:

o Inline data path measurement is more resource intensive in terms of the table sizes required for monitoring all flows.

As mentioned earlier, the observation interval for determining a large flow and the bandwidth threshold for classifying a flow as a large flow should be programmable parameters in a router.

The implementation details of inline data path measurement of large flows is vendor dependent and beyond the scope of this document.

4.3.5. Use of Multiple Methods for Large Flow Recognition

It is possible that a router may have line cards that support a sampling technique while other line cards support inline data path measurement. As long as there is a way for the router to reliably determine the mapping of large flows to component links of a LAG/ECMP group, it is acceptable for the router to use more than one method for large flow recognition.

If both methods are supported, inline data path measurement may be preferable because of its speed of detection [FLOW-ACC].

4.4. Options for Load Rebalancing

The following subsections describe suggested techniques for load balancing. Equipment vendors may implement more than one technique, including those not described in this document, and allow the operator to choose between them.

Note that regardless of the method used, perfect rebalancing of large flows may not be possible since flows arrive and depart at different times. Also, any flows that are moved from one component link to another may experience momentary packet reordering.

4.4.1. Alternative Placement of Large Flows

Within a LAG/ECMP group, member component links with the least average link utilization are identified. Some large flow(s) from the heavily loaded component links are then moved to those lightly loaded member component links using a PBR rule in the ingress processing element(s) in the routers.

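The core of this placement step can be sketched as follows. The sketch simply selects the least-utilized member link as the target for a given large flow; it is not tied to any particular PBR implementation, and the data structures are assumptions for the example.

      def pick_target_link(link_utilization, current_link):
          # link_utilization: {component_link_id: utilization in [0, 1]}
          # Returns the least-utilized link if it is less loaded than the
          # flow's current link; an actual implementation would then install
          # a PBR rule directing the large flow to the returned link.
          target = min(link_utilization, key=link_utilization.get)
          if link_utilization[target] < link_utilization[current_link]:
              return target
          return current_link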

With this approach, only certain large flows are subjected to momentary flow reordering.

Moving a large flow will increase the utilization of the link that it is moved to, potentially once again creating an imbalance in the utilization across the component links. Therefore, when moving a large flow, care must be taken to account for the existing load and the future load after the large flow has been moved. Further, the appearance of new large flows may require a rearrangement of the placement of existing flows.

Consider a case where there is a LAG comprising four 10 Gbps component links and there are four large flows, each of 1 Gbps. These flows are each placed on one of the component links. Subsequently, a fifth large flow of 2 Gbps is recognized, and to maintain equitable load distribution, it may require placement of one of the existing 1 Gbps flows on a different component link. This would still result in some imbalance in the utilization across the component links.

4.4.2. Redistributing Small Flows

Some large flows may consume the entire bandwidth of the component link(s). In this case, it would be desirable for the small flows to not use the congested component link(s).

o The LAG/ECMP table is modified to include only non-congested component link(s). Small flows hash into this table to be mapped to a destination component link. Alternatively, if certain component links are heavily loaded but not congested, the output of the hash function can be adjusted to account for large flow loading on each of the component links.

o The PBR rules for large flows (refer to Section 4.4.1) must have strict precedence over the LAG/ECMP table lookup result.

This method works on some existing router hardware. The idea is to prevent, or reduce the probability, that a small flow hashes into the congested component link(s).

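A minimal sketch of such a table adjustment is shown below, assuming a bucket-based LAG/ECMP table in which a small flow's hash value modulo the number of buckets selects its component link. The congestion threshold and bucket count are illustrative values, not recommendations.

      def build_small_flow_table(links, utilization,
                                 congestion_threshold=0.9, num_buckets=64):
          # links: ordered list of component link IDs
          # utilization: {link_id: utilization in [0, 1]}
          eligible = [l for l in links if utilization[l] < congestion_threshold]
          if not eligible:
              eligible = list(links)   # fall back if every link is congested
          # Spread the hash buckets over the non-congested links only.
          return [eligible[b % len(eligible)] for b in range(num_buckets)]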

With this approach, the small flows that are moved would be subject to reordering.

4.4.3. Component Link Protection Considerations

If desired, certain component links may be reserved for link protection. These reserved component links are not used for any flows in the absence of any failures. When there is a failure of one or more component links, all the flows on the failed component link(s) are moved to the reserved component link(s). The mapping table of large flows to component links simply replaces the failed

component link with the reserved component link. Likewise, the LAG/ECMP table replaces the failed component link with the reserved component link.

4.4.4. Algorithms for Load Rebalancing

Specific algorithms for placement of large flows are out of the scope of this document. One possibility is to formulate the problem for large flow placement as the well-known bin-packing problem and make use of the various heuristics that are available for that problem [BIN-PACK].

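As one possible illustration (not a recommendation or part of this document's mechanisms), the sketch below applies a first-fit-decreasing heuristic: large flows are sorted by rate, and each is placed on the first component link with enough residual capacity.

      def place_large_flows_ffd(flow_rates, residual_capacity):
          # flow_rates: {flow_id: rate in bps}
          # residual_capacity: {link_id: spare capacity in bps}
          residual = dict(residual_capacity)
          placement = {}
          for flow_id, rate in sorted(flow_rates.items(),
                                      key=lambda item: item[1], reverse=True):
              for link_id in residual:              # fixed link order: first fit
                  if residual[link_id] >= rate:
                      placement[flow_id] = link_id
                      residual[link_id] -= rate
                      break
          return placement                          # unplaced flows are omitted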

4.4.5. Example of Load Rebalancing

Optimizing LAG/ECMP component utilization for the use case in Figure 2 is depicted below in Figure 4. The large flow rebalancing explained in Section 4.4.1 is used. The improved link utilization is as follows:

o Component link (1) has three flows (two small flows and one large flow), and the link utilization is normal.

o Component link (2) has four flows (three small flows and one large flow), and the link utilization is normal now.

o Component link (3) has three flows (two small flows and one large flow), and the link utilization is normal now.

                +-----------+ ->     +-----------+
                |           | ->     |           |
                |           | ===>   |           |
                |        (1)|--------|(1)        |
                |           |        |           |
                |           | ===>   |           |
                |           | ->     |           |
                |           | ->     |           |
                |   (R1)    | ->     |     (R2)  |
                |        (2)|--------|(2)        |
                |           |        |           |
                |           | ->     |           |
                |           | ->     |           |
                |           | ===>   |           |
                |        (3)|--------|(3)        |
                |           |        |           |
                +-----------+        +-----------+
        
          Where: ->   small flow
                 ===> large flow
        

Figure 4: Evenly Utilized Composite Links

Basically, the use of the mechanisms described in Section 4.4.1 resulted in a rebalancing of flows where one of the large flows on component link (3), which was previously congested, was moved to component link (2), which was previously underutilized.

5. Information Model for Flow Rebalancing

In order to support flow rebalancing in a router from an external system, the exchange of some information is necessary between the router and the external system. This section provides an exemplary information model covering the various components needed for this purpose. The model is intended to be informational and may be used as a guide for the development of a data model.

5.1. Configuration Parameters for Flow Rebalancing

The following parameters are required for configuration of this feature:

o Large flow recognition parameters:

- Observation interval: The observation interval is the time period in seconds over which packet arrivals are observed for the purpose of large flow recognition.

- Minimum bandwidth threshold: The minimum bandwidth threshold would be configured as a percentage of link speed and translated into a number of bytes over the observation interval. A flow for which the number of bytes received over a given observation interval exceeds this number would be recognized as a large flow.

- Minimum bandwidth threshold for large flow maintenance: The minimum bandwidth threshold for large flow maintenance is used to provide hysteresis for large flow recognition. Once a flow is recognized as a large flow, it continues to be recognized as a large flow until it falls below this threshold. This is also configured as a percentage of link speed and is typically lower than the minimum bandwidth threshold defined above.

o Imbalance threshold: A measure of the deviation of the component link utilizations from the utilization of the overall LAG/ECMP group. Since component links can be different speeds, the imbalance can be computed as follows. Let the utilization of each component link in a LAG/ECMP group with n links of speed b_1, b_2 .. b_n be u_1, u_2 .. u_n. The mean utilization is computed as

      u_ave = [ (u_1 * b_1) + (u_2 * b_2) + .. + (u_n * b_n) ] /
              [b_1 + b_2 + .. + b_n].
        

The imbalance is then computed as

max_{i=1..n} | u_i - u_ave |.

o Rebalancing interval: The minimum amount of time between rebalancing events. This parameter ensures that rebalancing is not invoked too frequently as it impacts packet ordering.

These parameters may be configured on a system-wide basis or may apply to an individual LAG/ECMP group. They may be applied to an ECMP group, provided that the component links are not shared with any other ECMP group.

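The following short Python sketch restates the imbalance computation defined above for a group of component links; the example utilization and speed values are arbitrary.

      def lag_imbalance(utilization, speed):
          # utilization: [u_1, ..., u_n]; speed: [b_1, ..., b_n]
          u_ave = sum(u * b for u, b in zip(utilization, speed)) / sum(speed)
          return max(abs(u - u_ave) for u in utilization)

      # Example: three 10 Gbps links at 50%, 30%, and 90% utilization;
      # u_ave is about 0.57, so the printed imbalance is about 0.33.
      print(lag_imbalance([0.5, 0.3, 0.9], [10e9, 10e9, 10e9]))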

5.2. System Configuration and Identification Parameters

The following parameters are useful for router configuration and operation when using the mechanisms in this document.

o IP address: The IP address of a specific router that the feature is being configured on or that the large flow placement is being applied to.

o LAG ID: Identifies the LAG on a given router. The LAG ID may be required when configuring this feature (to apply a specific set of large flow identification parameters to the LAG) and will be required when specifying flow placement to achieve the desired rebalancing.

o Component Link ID: Identifies the component link within a LAG or ECMP group. This is required when specifying flow placement to achieve the desired rebalancing.

o Component Link Weight: The relative weight to be applied to traffic for a given component link when using hash-based techniques for load distribution.

o ECMP group: Identifies a particular ECMP group. The ECMP group may be required when configuring this feature (to apply a specific set of large flow identification parameters to the ECMP group) and will be required when specifying flow placement to achieve the desired rebalancing. We note that multiple ECMP groups can share an overlapping set (or non-overlapping subset) of component links. This document does not deal with the complexity of addressing such configurations.

The feature may be configured globally for all LAGs and/or for all ECMP groups, or it may be configured specifically for a given LAG or ECMP group.

5.3. Information for Alternative Placement of Large Flows

In cases where large flow recognition is handled by a central management entity (see Section 4.3.3), an information model for flows is required to allow the import of large flow information to the router.

Typical fields used for identifying large flows were discussed in Section 4.3.1. The IPFIX information model [RFC7012] can be leveraged for large flow identification.

Large flow placement is achieved by specifying the relevant flow information along with the following:

o For LAG: router's IP address, LAG ID, LAG component link ID.

o For ECMP: router's IP address, ECMP group, ECMP component link ID.

In the case where the ECMP component link itself comprises a LAG, we would have to specify the parameters for both the ECMP group as well as the LAG to which the large flow is being directed.

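Purely as an illustration of how this information might be carried, the sketch below shows one possible (hypothetical) structure for a large flow placement request in the LAG case; the field names and values are examples only and do not constitute a data model.

      placement_request = {
          "router": "192.0.2.1",              # router's IP address
          "lag_id": "lag-7",                  # or an ECMP group identifier
          "component_link_id": "link-2",      # target component link
          "flow": {                           # flow key (see Section 4.3.1)
              "src_ip": "198.51.100.10",
              "dst_ip": "203.0.113.20",
              "protocol": 6,
              "src_port": 49152,
              "dst_port": 443,
          },
      }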

5.4. Information for Redistribution of Small Flows

Redistribution of small flows is done using the following:

o For LAG: The LAG ID and the component link IDs along with the relative weight of traffic to be assigned to each component link ID are required.

o For ECMP: The ECMP group and the ECMP next hop along with the relative weight of traffic to be assigned to each ECMP next hop are required.

It is possible to have an ECMP next hop that itself comprises a LAG. In that case, we would have to specify the new weights for both the ECMP component links and the LAG component links.

In the case where an ECMP component link itself comprises a LAG, we would have to specify new weights for both the component links within the ECMP group as well as the component links within the LAG.

5.5. Export of Flow Information

Exporting large flow information is required when large flow recognition is being done on a router but the decision to rebalance is being made in a central management entity. Large flow information includes flow identification and the component link ID that the flow is currently assigned to. Other information such as flow QoS and bandwidth may be exported too.

The IPFIX information model [RFC7012] can be leveraged for large flow identification.

5.6. Monitoring Information
5.6.1. Interface (Link) Utilization

The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets), and interface speed (ifSpeed) can be obtained, for example, from the Interfaces table (ifTable) in the MIB module defined in [RFC1213].

The link utilization can then be computed as follows:

   Incoming link utilization = (delta_ifInOctets * 8) / (ifSpeed * T)

   Outgoing link utilization = (delta_ifOutOctets * 8) / (ifSpeed * T)

Where T is the interval over which the utilization is being measured, delta_ifInOctets is the change in ifInOctets over that interval, and delta_ifOutOctets is the change in ifOutOctets over that interval.
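
A minimal, non-normative Python sketch of this computation is shown below; it assumes two successive polls of the counter taken T seconds apart and, for brevity, ignores counter wrap.

   def link_utilization(octets_prev, octets_now, if_speed_bps, interval_s):
       """Fraction of link capacity used over the polling interval.

       octets_prev/octets_now: ifInOctets (or ifOutOctets) values from
       two successive polls taken interval_s seconds apart; counter
       wrap is ignored here for brevity.
       """
       delta_octets = octets_now - octets_prev
       return (delta_octets * 8) / (if_speed_bps * interval_s)

   # Example: 7.5 GB sent in 60 s on a 10 Gb/s link -> 0.1 (10%).
   print(link_utilization(0, 7_500_000_000, 10_000_000_000, 60))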

For high-speed Ethernet links, the etherStatsHighCapacityTable in the MIB module defined in [RFC3273] can be used.

Similar results may be achieved using the corresponding objects of other interface management data models such as YANG [RFC7223] if those are used instead of MIBs.

For scalability, it is recommended to use the counter push mechanism in [sFlow-v5] for the interface counters. Doing so would help avoid counter polling through the MIB interface.

The outgoing link utilization of the component links within a LAG/ECMP group can be used to compute the imbalance (see Section 5.1) for the LAG/ECMP group.
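
As an illustration only (the imbalance metric itself is defined in Section 5.1 and is not restated here), one simple way to summarize the spread of outgoing utilization across the component links of a group is sketched below.

   def utilization_spread(per_link_utilization):
       """Difference between the most and least utilized component
       links; per_link_utilization maps component link ID to outgoing
       utilization as a fraction.  This is only one possible summary
       statistic and is not the normative definition of imbalance.
       """
       values = per_link_utilization.values()
       return max(values) - min(values)

   # Example: {"member-1": 0.9, "member-2": 0.4} -> spread of 0.5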

5.6.2. Other Monitoring Information

Additional monitoring information that is useful includes:

o Number of times rebalancing was done.

o Time since the last rebalancing event.

o The number of large flows currently rebalanced by the scheme.

o A list of the large flows that have been rebalanced, including:

- the rate of each large flow at the time of the last rebalancing for that flow,

- the time that rebalancing was last performed for the given large flow, and

- the interfaces that the large flow was (re)directed to.

o The settings for the weights of the interfaces within a LAG/ECMP group used by the small flows that depend on hashing.

6. Operational Considerations
6.1. Rebalancing Frequency

Flows should be rebalanced only when the imbalance in the utilization across component links exceeds a certain threshold. Frequent rebalancing to achieve precise equitable utilization across component links could be counterproductive as it may result in moving flows back and forth between the component links, impacting packet ordering and system stability. This applies regardless of whether large flows or small flows are redistributed. It should be noted that reordering is a concern for TCP flows with even a few packets because three out-of-order packets would trigger sufficient duplicate ACKs to the sender, resulting in a retransmission [RFC5681].

The operator would have to experiment with various values of the large flow recognition parameters (minimum bandwidth threshold, minimum bandwidth threshold for large flow maintenance, and observation interval) and the imbalance threshold across component links to tune the solution for their environment.
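
As a non-normative sketch of this guidance, a controller might gate rebalancing on both an imbalance threshold and a minimum holdoff time between rebalancing events; the parameter names below are hypothetical.

   def should_rebalance(imbalance, imbalance_threshold,
                        seconds_since_last, min_holdoff_s):
       """Rebalance only when the measured imbalance exceeds the
       configured threshold and sufficient time has elapsed since the
       last rebalancing, to avoid moving flows back and forth between
       component links."""
       return (imbalance > imbalance_threshold and
               seconds_since_last >= min_holdoff_s)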

6.2. Handling Route Changes

Large flow rebalancing must be aware of any changes to the Forwarding Information Base (FIB). In cases where the next hop of a route no longer points to the LAG or to an ECMP group, any PBR entries added as described in Sections 4.4.1 and 4.4.2 must be withdrawn in order to avoid the creation of forwarding loops.

6.3. Forwarding Resources

Hash-based techniques used for load balancing with LAG/ECMP are usually stateless. The mechanisms described in this document require additional resources in the forwarding plane of routers for creating PBR rules that are capable of overriding the forwarding decision from the hash-based approach. These resources may limit the number of flows that can be rebalanced and may also impact the latency experienced by packets due to the additional lookups that are required.

7. Security Considerations

This document does not directly impact the security of the Internet infrastructure or its applications. In fact, the mechanisms described here could help in the case of a DoS attack pattern that causes a hash imbalance, resulting in certain LAG/ECMP component links being heavily loaded by large flows.

An attacker with knowledge of the large flow recognition algorithm and any stateless distribution method can generate flows that are distributed in a way that overloads a specific path. This could be used to cause the creation of PBR rules that exhaust the available PBR rule capacity on routers in the network. If PBR rules are consequently discarded, this could result in congestion on the attacker-selected path. Alternatively, tracking large numbers of PBR rules could result in performance degradation.

8. References
8.1. Normative References

[802.1AX] IEEE, "IEEE Standard for Local and metropolitan area networks - Link Aggregation", IEEE Std 802.1AX-2008, 2008.

[RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and Multicast Next-Hop Selection", RFC 2991, November 2000, <http://www.rfc-editor.org/info/rfc2991>.

[RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information", STD 77, RFC 7011, September 2013, <http://www.rfc-editor.org/info/rfc7011>.

[RFC7012] Claise, B., Ed., and B. Trammell, Ed., "Information Model for IP Flow Information Export (IPFIX)", RFC 7012, September 2013, <http://www.rfc-editor.org/info/rfc7012>.

8.2. Informative References

[BIN-PACK] Coffman, Jr., E., Garey, M., and D. Johnson. "Approximation Algorithms for Bin-Packing -- An Updated Survey" (in "Algorithm Design for Computer System Design"), Springer, 1984.

[CAIDA] "Caida Traffic Analysis Research", <http://www.caida.org/research/traffic-analysis/>.

[DEVOFLOW] Mogul, J., Tourrilhes, J., Yalagandula, P., Sharma, P., Curtis, R., and S. Banerjee, "DevoFlow: Cost-Effective Flow Management for High Performance Enterprise Networks", Proceedings of the ACM SIGCOMM, 2010.

[FLOW-ACC] Zseby, T., Hirsch, T., and B. Claise, "Packet Sampling for Flow Accounting: Challenges and Limitations", Proceedings of the 9th international Passive and Active Measurement Conference, 2008.

[ITCOM] Jo, J., Kim, Y., Chao, H., and F. Merat, "Internet traffic load balancing using dynamic hashing with flow volume", SPIE ITCOM, 2002.

[NDTM] Estan, C. and G. Varghese, "New Directions in Traffic Measurement and Accounting", Proceedings of ACM SIGCOMM, August 2002.

[NVGRE] Garg, P. and Y. Wang, "NVGRE: Network Virtualization using Generic Routing Encapsulation", Work in Progress, draft-sridharan-virtualization-nvgre-07, November 2014.

[RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, March 2000, <http://www.rfc-editor.org/info/rfc2784>.

[RFC6790] Kompella, K., Drake, J., Amante, S., Henderickx, W., and L. Yong, "The Use of Entropy Labels in MPLS Forwarding", RFC 6790, November 2012, <http://www.rfc-editor.org/info/rfc6790>.

[RFC1213] McCloghrie, K. and M. Rose, "Management Information Base for Network Management of TCP/IP-based internets: MIB-II", STD 17, RFC 1213, March 1991, <http://www.rfc-editor.org/info/rfc1213>.

[RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm", RFC 2992, November 2000, <http://www.rfc-editor.org/info/rfc2992>.

[RFC3273] Waldbusser, S., "Remote Network Monitoring Management Information Base for High Capacity Networks", RFC 3273, July 2002, <http://www.rfc-editor.org/info/rfc3273>.

[RFC3931] Lau, J., Ed., Townsley, M., Ed., and I. Goyret, Ed., "Layer Two Tunneling Protocol - Version 3 (L2TPv3)", RFC 3931, March 2005, <http://www.rfc-editor.org/info/rfc3931>.

[RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export Version 9", RFC 3954, October 2004, <http://www.rfc-editor.org/info/rfc3954>.

[RFC5470] Sadasivan, G., Brownlee, N., Claise, B., and J. Quittek, "Architecture for IP Flow Information Export", RFC 5470, March 2009, <http://www.rfc-editor.org/info/rfc5470>.

[RFC5475] Zseby, T., Molina, M., Duffield, N., Niccolini, S., and F. Raspall, "Sampling and Filtering Techniques for IP Packet Selection", RFC 5475, March 2009, <http://www.rfc-editor.org/info/rfc5475>.

[RFC5640] Filsfils, C., Mohapatra, P., and C. Pignataro, "Load-Balancing for Mesh Softwires", RFC 5640, August 2009, <http://www.rfc-editor.org/info/rfc5640>.

[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, September 2009, <http://www.rfc-editor.org/info/rfc5681>.

[RFC7223] Bjorklund, M., "A YANG Data Model for Interface Management", RFC 7223, May 2014, <http://www.rfc-editor.org/info/rfc7223>.

[RFC7226] Villamizar, C., Ed., McDysan, D., Ed., Ning, S., Malis, A., and L. Yong, "Requirements for Advanced Multipath in MPLS Networks", RFC 7226, May 2014, <http://www.rfc-editor.org/info/rfc7226>.

[SAMP-BASIC] Phaal, P. and S. Panchen, "Packet Sampling Basics", <http://www.sflow.org/packetSamplingBasics/>.

[sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5", July 2004, <http://www.sflow.org/sflow_version_5.txt>.

[sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG Counters Structure", September 2012, <http://www.sflow.org/sflow_lag.txt>.

[STT] Davie, B., Ed., and J. Gross, "A Stateless Transport Tunneling Protocol for Network Virtualization (STT)", Work in Progress, draft-davie-stt-06, April 2014.

[RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, August 2014, <http://www.rfc-editor.org/info/rfc7348>.

[YONG] Yong, L. and P. Yang, "Enhanced ECMP and Large Flow Aware Transport", Work in Progress, draft-yong-pwe3-enhance-ecmp-lfat-01, March 2010.

Appendix A. Internet Traffic Analysis and Load-Balancing Simulation

Internet traffic [CAIDA] has been analyzed to obtain flow statistics such as the number of packets in a flow and the flow duration. The 5-tuple in the packet header (IP source address, IP destination address, transport protocol source port number, transport protocol destination port number, and IP protocol) is used for flow identification. The analysis indicates that < ~2% of the flows take ~30% of total traffic volume while the rest of the flows (> ~98%) contribute ~70% [YONG].
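
A minimal sketch of such an analysis over a packet trace is shown below; the trace format and field names are assumed for illustration.

   from collections import defaultdict

   def top_flow_share(packets, top_fraction=0.02):
       """Aggregate bytes per 5-tuple and return the share of total
       traffic carried by the top 'top_fraction' of flows.

       packets: iterable of dicts with keys src_ip, dst_ip, protocol,
       src_port, dst_port, and length (assumed trace format).
       """
       bytes_per_flow = defaultdict(int)
       for pkt in packets:
           key = (pkt["src_ip"], pkt["dst_ip"], pkt["protocol"],
                  pkt["src_port"], pkt["dst_port"])
           bytes_per_flow[key] += pkt["length"]
       sizes = sorted(bytes_per_flow.values(), reverse=True)
       n_top = max(1, int(len(sizes) * top_fraction))
       return sum(sizes[:n_top]) / sum(sizes)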

The simulation has shown that, given Internet traffic patterns, the hash-based technique does not evenly distribute flows over ECMP paths. Some paths may be > 90% loaded while others are < 40% loaded. The greater the number of ECMP paths, the more severe is the imbalance in the load distribution. This implies that hash-based distribution can cause some paths to become congested while other paths are underutilized [YONG].

The simulation also shows substantial improvement from using the large flow-aware, hash-based distribution technique described in this document. Using the same simulated traffic, the improved rebalancing can achieve load differences of less than 10% among the paths. This demonstrates how large flow-aware, hash-based distribution can effectively compensate for the uneven load balancing caused by hashing and the traffic characteristics [YONG].

Acknowledgements

The authors would like to thank the following individuals for their review and valuable feedback on earlier versions of this document: Shane Amante, Fred Baker, Michael Bugenhagen, Zhen Cao, Brian Carpenter, Benoit Claise, Michael Fargano, Wes George, Sriganesh Kini, Roman Krzanowski, Andrew Malis, Dave McDysan, Pete Moyer, Peter Phaal, Dan Romascanu, Curtis Villamizar, Jianrong Wong, George Yum, and Weifeng Zhang. As a part of the IETF Last Call process, valuable comments were received from Martin Thomson and Carlos Pignataro.

Contributors

Sanjay Khanna Cisco Systems EMail: sanjakha@gmail.com

Authors' Addresses

Ram Krishnan Brocade Communications San Jose, CA 95134 United States Phone: +1-408-406-7890 EMail: ramkri123@gmail.com

Lucy Yong Huawei USA 5340 Legacy Drive Plano, TX 75025 United States Phone: +1-469-277-5837 EMail: lucy.yong@huawei.com

Anoop Ghanwani Dell 5450 Great America Pkwy Santa Clara, CA 95054 United States Phone: +1-408-571-3228 EMail: anoop@alumni.duke.edu

Ning So Vinci Systems 2613 Fairbourne Cir Plano, TX 75093 United States EMail: ningso@yahoo.com

Bhumip Khasnabish ZTE Corporation New Jersey 07960 United States Phone: +1-781-752-8003 EMail: vumip1@gmail.com
