Network Working Group D. Thaler Request for Comments: 2991 Microsoft Category: Informational C. Hopps NextHop Technologies November 2000
Network Working Group D. Thaler Request for Comments: 2991 Microsoft Category: Informational C. Hopps NextHop Technologies November 2000
Multipath Issues in Unicast and Multicast Next-Hop Selection
单播和多播下一跳选择中的多路径问题
Status of this Memo
本备忘录的状况
This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.
本备忘录为互联网社区提供信息。它没有规定任何类型的互联网标准。本备忘录的分发不受限制。
Copyright Notice
版权公告
Copyright (C) The Internet Society (2000). All Rights Reserved.
版权所有(C)互联网协会(2000年)。版权所有。
Abstract
摘要
Various routing protocols, including Open Shortest Path First (OSPF) and Intermediate System to Intermediate System (ISIS), explicitly allow "Equal-Cost Multipath" (ECMP) routing. Some router implementations also allow equal-cost multipath usage with RIP and other routing protocols. The effect of multipath routing on a forwarder is that the forwarder potentially has several next-hops for any given destination and must use some method to choose which next-hop should be used for a given data packet.
各种路由协议,包括开放最短路径优先(OSPF)和中间系统到中间系统(ISIS),明确允许“等成本多路径”(ECMP)路由。一些路由器实现还允许与RIP和其他路由协议同等成本的多路径使用。多路径路由对转发器的影响是,转发器对于任何给定的目的地都可能有多个下一跳,并且必须使用某种方法来选择哪个下一跳应用于给定的数据包。
Various routing protocols, including OSPF and ISIS, explicitly allow "Equal-Cost Multipath" routing. Some router implementations also allow equal-cost multipath usage with RIP and other routing protocols. Using equal-cost multipath means that if multiple equal-cost routes to the same destination exist, they can be discovered and used to provide load balancing among redundant paths.
各种路由协议,包括OSPF和ISIS,明确允许“等成本多路径”路由。一些路由器实现还允许与RIP和其他路由协议同等成本的多路径使用。使用等成本多路径意味着,如果存在到同一目的地的多个等成本路由,则可以发现这些路由并使用它们在冗余路径之间提供负载平衡。
The effect of multipath routing on a forwarder is that the forwarder potentially has several next-hops for any given destination and must use some method to choose which next-hop should be used for a given data packet. This memo summarizes current practices, problems, and solutions.
多路径路由对转发器的影响是,转发器对于任何给定的目的地都可能有多个下一跳,并且必须使用某种方法来选择哪个下一跳应用于给定的数据包。本备忘录总结了当前的做法、问题和解决方案。
Several router implementations allow multipath forwarding. This is sometimes done naively via round-robin, where each packet matching a given destination route is forwarded using the subsequent next-hop, in a round-robin fashion. This does provide a form of load balancing, but there are several problems with approaches such as round-robin or random:
一些路由器实现允许多路径转发。这有时是天真地通过循环来完成的,其中与给定目的地路由匹配的每个数据包以循环方式使用后续下一跳转发。这确实提供了一种负载平衡形式,但诸如循环或随机的方法存在一些问题:
Variable Path MTU Since each of the redundant paths may have a different MTU, this means that the overall path MTU can change on a packet-by-packet basis, negating the usefulness of path MTU discovery.
可变路径MTU由于每个冗余路径可以具有不同的MTU,这意味着总体路径MTU可以逐个分组地改变,从而否定路径MTU发现的有用性。
Variable Latencies Since each of the redundant paths may have a different latency involved, having packets take separate paths can cause packets to always arrive out of order, increasing delivery latency and buffering requirements.
可变延迟由于每个冗余路径可能涉及不同的延迟,因此让数据包采用不同的路径可能会导致数据包总是无序到达,从而增加传递延迟和缓冲要求。
Packet reordering causes TCP to believe that loss has taken place when packets with higher sequence numbers arrive before an earlier one. When three or more packets are received before a "late" packet, TCP enters a mode called "fast-retransmit" [6] which consumes extra bandwidth (which could potentially cause more loss, decreasing throughput) as it attempts to unnecessarily retransmit the delayed packet(s). Hence, reordering can be detrimental to network performance.
当序列号较高的数据包在较早的数据包之前到达时,数据包重新排序使TCP相信丢失已经发生。当在“延迟”数据包之前收到三个或更多数据包时,TCP进入称为“快速重传”[6]的模式,当它尝试不必要地重传延迟数据包时,会消耗额外的带宽(这可能会导致更多的丢失,降低吞吐量)。因此,重新排序可能对网络性能有害。
Debugging Common debugging utilities such as ping and traceroute are much less reliable in the presence of multiple paths and may even present completely wrong results.
在存在多条路径的情况下,调试ping和traceroute等常见调试实用程序的可靠性要低得多,甚至可能出现完全错误的结果。
In multicast routing, the problem with multiple paths is that multicast routing protocols prevent loops and duplicates by constructing a single tree to all receivers of the same group address. Multicast routing protocols deployed today (DVMRP, PIM-DM, PIM-SM) [2] construct shortest-path trees rooted at either the source, or another router known as a Core or Rendezvous Point. Hence, the way they ensure that duplicates will not arise is that a given tree must use only a single next-hop towards the root of the tree.
在多播路由中,具有多条路径的问题是,多播路由协议通过向相同组地址的所有接收器构造一棵树来防止循环和重复。现在部署的多播路由协议(DVMRP、PIM-DM、PIM-SM)[2]构建了根于源或另一个路由器(称为核心或集合点)的最短路径树。因此,它们确保不会出现重复的方法是,给定的树必须只使用朝向树根的一个下一跳。
In the remainder of this document, we will use the term "flow" to represent the granularity at which the router keeps state (if at all) for classes of traffic. The exact definition of a flow may depend on the actual implementation. For example, a flow might be identified solely by destination address, or it might be identified by (source address, destination address, protocol id) triplet. Hence "flow" is not necessarily synonymous with the term "microflow" as used in RFC 2474 [7], which also includes port numbers. Indeed, including transport-layer information in the next-hop selection process can actually be problematic. For example, if packets are fragmented, the transport-layer information may not be available in every packet. Furthermore, having the choice of path depend on transport-layer fields may negate the benefit of caching information such as MTU for use in subsequent connections between the same endpoints.
在本文档的其余部分中,我们将使用术语“流”来表示路由器为流量类保持状态(如果有的话)的粒度。流的确切定义可能取决于实际实现。例如,流可以仅由目标地址标识,也可以由(源地址、目标地址、协议id)三元组标识。因此,“流量”不一定与RFC 2474[7]中使用的术语“微流量”同义,后者也包括端口号。实际上,在下一跳选择过程中包含传输层信息实际上可能会有问题。例如,如果分组是分段的,则传输层信息可能在每个分组中都不可用。此外,根据传输层字段选择路径可能会抵消缓存信息(如MTU)的好处,以便在相同端点之间的后续连接中使用。
All of the problems outlined in the previous section arise when packets in the same unicast or multicast "flow" are split among multiple paths. The natural solution is therefore to ensure that packets for the same flow always use the same path.
当同一个单播或多播“流”中的数据包在多条路径之间分割时,会出现上一节中概述的所有问题。因此,自然的解决方案是确保相同流的数据包始终使用相同的路径。
Two additional features are desirable:
需要两个附加功能:
Minimal disruption When multipath is used, meaning that multiple routes contribute valid next-hops, the chances are higher of routes being added and deleted from consideration than when only the "best" route is used (in which case metric changes in alternate routes have no effect on traffic paths). Since a higher number of routes may actually be used for forwarding when multipath is in use, the potential for packet reordering and packet loss due to route flaps can be much greater than when not using multipath. Hence, it is desirable to minimize the number of active flows affected by the addition or deletion of another next-hop.
当使用多路径时,中断最小,这意味着多条路由提供有效的下一跳,与仅使用“最佳”路由时相比,添加和删除路由的可能性更高(在这种情况下,备用路由中的度量变化对交通路径没有影响)。由于在使用多路径时,实际上可能会使用更多的路由进行转发,因此,与不使用多路径时相比,由于路由翻转导致的数据包重新排序和数据包丢失的可能性要大得多。因此,期望最小化受另一下一跳的添加或删除影响的活动流的数量。
Fast implementation The amount of additional computation required to forward a packet should be small. For example, when doing round-robin, this computation might consist of incrementing (modulo the number of next-hops) a next-hop index.
快速实现转发数据包所需的额外计算量应该很小。例如,在执行循环时,此计算可能包括增加下一跳索引(以下一跳数为模)。
We now provide three possible methods for improving the performance of multipath and then discuss their applicability to unicast and multicast forwarding.
现在,我们提供三种可能的方法来提高多径性能,然后讨论它们对单播和多播转发的适用性。
Modulo-N Hash To select a next-hop from the list of N next-hops, the router performs a modulo-N hash over the packet header fields that identify a flow. This has the advantage of being fast, at the expense of (N-1)/N of all flows changing paths whenever a next-hop is added or removed.
模N散列从N个下一跳的列表中选择下一跳,路由器对标识流的数据包头字段执行模N散列。这具有速度快的优点,代价是(N-1)/N在添加或删除下一跳时改变路径的所有流。
Hash-Threshold The router first selects a key by performing a hash over the packet header fields that identify the flow. The N next-hops have been assigned unique regions in the hash function's output space. By comparing the hash value against region boundaries the router can determine which region the hash value belongs to and thus which next-hop to use. This method has the advantage of only affecting flows near the region boundaries (or thresholds) when next-hops are added or removed. For ECMP hash-threshold's lookup can be done with a simple division (hash_value / fixed_region_size). When a next-hop is added or removed, between 1/4 and 1/2 of all flows change paths. An analysis of this method can be found in [3].
哈希阈值路由器首先通过对标识流的数据包头字段执行哈希来选择密钥。N个下一跳在哈希函数的输出空间中被分配了唯一的区域。通过将散列值与区域边界进行比较,路由器可以确定散列值属于哪个区域,从而确定使用哪个下一跳。这种方法的优点是,当添加或删除下一跳时,只影响区域边界(或阈值)附近的流。对于ECMP,哈希阈值的查找可以通过简单的除法(哈希值/固定区域大小)完成。添加或删除下一个跃点时,所有流的1/4和1/2之间会更改路径。对该方法的分析见[3]。
Highest Random Weight (HRW) The router computes a key for EACH next-hop by performing a hash over the packet header fields that identify the flow, as well as over the address of the next-hop. The router then chooses the next-hop with the highest resulting key value [4]. This has the advantage of minimizing the number of flows affected by a next-hop addition or deletion (only 1/N of them), but is approximately N times as expensive as a modulo-N hash.
最高随机权重(HRW)路由器通过对标识流的数据包头字段以及下一跳的地址执行哈希运算,为每个下一跳计算密钥。路由器然后选择具有最高结果密钥值的下一跳[4]。这具有最小化受下一跳添加或删除影响的流的数量(仅1/N)的优点,但其代价大约是模N散列的N倍。
The applicability of these three alternatives depends on (at least) two factors: whether the forwarder maintains per-flow state, and how precious CPU is to a multipath forwarder.
这三种备选方案的适用性取决于(至少)两个因素:转发器是否保持每流状态,以及CPU对多路径转发器的价值。
Some routers may maintain per-flow state for reasons other than for supporting multipath. For example, routers typically keep per-flow state for multicast flows so that they can maintain the list of interfaces to which packets in the flow should be copied.
一些路由器可能出于支持多路径以外的原因而保持每流状态。例如,路由器通常为多播流保留每个流状态,以便它们可以维护流中的数据包应复制到的接口列表。
If per-flow state is maintained in a multipath forwarder, then computation of the next-hop can be done by the router at state creation time. This entails no additional computations at packet forwarding time compared with normal forwarding to a single next-hop, since the next-hop is precomputed. In this case, any method can be used, including round-robin, random, modulo-N, hash-threshold or HRW. Hash functions such as modulo-N, hash-threshold and HRW are better if the forwarder state may be deleted for any reason during the lifetime of a flow since subsequent next-hop computations by the router will
如果在多路径转发器中保持每流状态,则路由器可以在状态创建时计算下一跳。与正常转发到单个下一跳相比,这不需要在数据包转发时进行额外的计算,因为下一跳是预先计算的。在这种情况下,可以使用任何方法,包括循环、随机、模N、哈希阈值或HRW。如果在流的生存期内出于任何原因可以删除转发器状态,则诸如模N、Hash threshold和HRW之类的散列函数更好,因为路由器的后续下一跳计算将被删除
always select the same path. This also improves the usefulness of debugging utilities such as traceroute. Finally, to maximize the stability of paths (and hence the usefulness of traceroute, etc.), the use of HRW is recommended over the other methods mentioned herein.
始终选择相同的路径。这也提高了traceroute等调试实用程序的实用性。最后,为了最大限度地提高路径的稳定性(以及追踪路由的实用性等),建议使用HRW,而不是本文提到的其他方法。
If per-flow state is not maintained by the forwarder, then using multiple next-hops requires that the next-hop be calculated at packet arrival time. When CPU is more precious than stability of flow paths, hash-threshold is recommended over the other methods mentioned herein.
如果转发器未维护每流状态,则使用多个下一跳要求在数据包到达时计算下一跳。当CPU比流路径的稳定性更宝贵时,建议使用哈希阈值,而不是本文提到的其他方法。
Depending on the implementation, unicast forwarding may or may not keep per-flow state. We recommend that where forwarder implementations keep flow state, routers should use HRW at state creation time (and next-hop deletion time) to select the next-hop, and that forwarders without per-flow state use hash-threshold.
根据实现情况,单播转发可能会或可能不会保持每流状态。我们建议在转发器实现保持流状态的情况下,路由器应在状态创建时(和下一跳删除时)使用HRW来选择下一跳,并且没有每流状态的转发器应使用哈希阈值。
Today's multicast forwarding engines use a cache of forwarding entries indexed by group (or group prefix) and source (or source prefix). This means that today's multicast forwarder's always keep per-flow state, although for some multicast routing protocols, the "flow" may be fairly coarse (e.g., traffic from all sources to the same destination). Since per-flow state is kept by the forwarder, it is recommended that the router always use HRW to select the next-hop.
今天的多播转发引擎使用按组(或组前缀)和源(或源前缀)索引的转发条目缓存。这意味着今天的多播转发器始终保持每流状态,尽管对于某些多播路由协议,“流”可能相当粗糙(例如,从所有源到同一目的地的流量)。由于每流状态由转发器保持,因此建议路由器始终使用HRW选择下一跳。
Routers using explicit-joining protocols such as PIM-SM [5] should thus use the multipath information when determining to which neighbor a join message should be sent. For example, when multiple next-hops exist for a given Rendezvous Point (RP) toward which a (*,G) Join should be sent, it is recommended that HRW be used to select the next-hop to use for each group.
因此,使用显式加入协议(如PIM-SM[5])的路由器在确定加入消息应发送到哪个邻居时,应使用多路径信息。例如,当给定集合点(RP)存在多个下一跳时,应向其发送(*,G)连接,建议使用HRW为每个组选择下一跳。
The algorithms discussed above (except round-robin) all rely on some form of hash function. Equal flow distribution is achieved when the hash function is uniformly distributed. Since the commonly used hash functions only become uniformly distributed when the number of inputs is relatively large, these algorithms are more applicable to routers used to route many flows, than in, for example, a small business setting.
上面讨论的算法(循环除外)都依赖于某种形式的哈希函数。当哈希函数均匀分布时,可以实现相等的流分布。由于常用的散列函数只有在输入数量相对较大时才会均匀分布,因此这些算法比在小型企业环境中更适用于用于路由许多流的路由器。
A related problem occurs when multiple parallel links are used between the same pair of routers. A common solution is to bundle the two links together into a "super"-link when is then used for routing. For multicast forwarding, this results in the two links being reduced to a single next-hop (over the combined link) which can be used to prevent duplicates. When a unicast or multicast packet is queued to the combined link, some method, such as those discussed earlier, is still required to determine the physical link on which to transmit the packet. If the parallel links are identical, then most of the concerns discussed in this document are avoided with the combined link. The exception is packet reordering, which can still occur with round-robin, adversely affecting TCP.
当同一对路由器之间使用多个并行链路时,会出现相关问题。一个常见的解决方案是将两个链路捆绑在一起,形成一个“超级”链路,然后将其用于路由。对于多播转发,这将导致两个链路减少为单个下一跳(通过组合链路),可用于防止重复。当单播或多播分组排队到组合链路时,仍然需要一些方法(例如前面讨论的方法)来确定在其上传输分组的物理链路。如果平行链路相同,则使用组合链路可避免本文档中讨论的大多数问题。例外情况是数据包重新排序,这种情况在循环中仍然会发生,对TCP造成不利影响。
This document discusses issues with various methods of choosing a next-hop from among multiple valid next-hops. As such, it does not directly impact the security of the Internet infrastructure or its applications.
本文档讨论了从多个有效下一跳中选择下一跳的各种方法的问题。因此,它不会直接影响互联网基础设施或其应用程序的安全。
One issue that is worth mentioning, however, is that when next-hop selection is predictable, an attacker can synthesize traffic that will all hash the same, making it possible to launch a denial-of-service attack that overloads a particular path. Since a special case of this is when the same (single) next-hop is always selected, such an attack is easiest when multipath is not being used. Introducing multipath routing can make such an attack more difficult; the more unpredictable the hash is, the harder it becomes to conduct a denial-of-service attack against any single link.
然而,值得一提的一个问题是,当下一跳选择是可预测的时,攻击者可以合成所有散列相同的流量,从而有可能发起使特定路径过载的拒绝服务攻击。由于这种情况的一种特殊情况是始终选择相同(单个)下一跳,因此在不使用多路径时,这种攻击最容易发生。引入多路径路由会使这种攻击更加困难;散列越不可预测,就越难对任何单个链接进行拒绝服务攻击。
[1] Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.
[1] Moy,J.,“OSPF版本2”,STD 54,RFC 23281998年4月。
[2] Maufer, T., "Deploying IP Multicast in the Enterprise", Prentice-Hall, 1998.
[2] Maufer,T.,“在企业中部署IP多播”,Prentice Hall,1998年。
[3] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm", RFC 2992, November 2000.
[3] Hopps,C.,“等成本多路径算法的分析”,RFC 2992,2000年11月。
[4] Thaler, D., and C.V. Ravishankar, "Using Name-Based Mappings to Increase Hit Rates", IEEE/ACM Transactions on Networking, February 1998.
[4] Thaler,D.和C.V.Ravishankar,“使用基于名称的映射来提高命中率”,IEEE/ACM网络交易,1998年2月。
[5] Estrin, D., Farinacci, D., Helmy, A., Thaler, D., Deering, S., Handley, M., Jacobson, V., Liu, C., Sharma, P. and L. Wei, "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification", RFC 2362, June 1998.
[5] Estrin,D.,Farinaci,D.,Helmy,A.,Thaler,D.,Deering,S.,Handley,M.,Jacobson,V.,Liu,C.,Sharma,P.和L.Wei,“协议独立多播稀疏模式(PIM-SM):协议规范”,RFC 2362,1998年6月。
[6] Allman, M., Paxson, V. and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999.
[6] Allman,M.,Paxson,V.和W.Stevens,“TCP拥塞控制”,RFC 25811999年4月。
[7] Nichols, K., Blake, S., Baker, F. and D. Black., "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers", RFC 2474, December 1998.
[7] Nichols,K.,Blake,S.,Baker,F.和D.Black.,“IPv4和IPv6报头中区分服务字段(DS字段)的定义”,RFC 2474,1998年12月。
Dave Thaler Microsoft One Microsoft Way Redmond, WA 98052
Dave Thaler Microsoft One Microsoft Way Redmond,WA 98052
Phone: +1 425 703 8835 EMail: dthaler@dthaler.microsoft.com
Phone: +1 425 703 8835 EMail: dthaler@dthaler.microsoft.com
Christian E. Hopps NextHop Technologies, Inc. 517 W. William Street Ann Arbor, MI 48103-4943 U.S.A
Christian E.Hopps NextHop Technologies,Inc.美国密歇根州安娜堡威廉街西517号,邮编:48103-4943
Phone: +1 734 936 0291 EMail: chopps@nexthop.com
Phone: +1 734 936 0291 EMail: chopps@nexthop.com
Copyright (C) The Internet Society (2000). All Rights Reserved.
版权所有(C)互联网协会(2000年)。版权所有。
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.
本文件及其译本可复制并提供给他人,对其进行评论或解释或协助其实施的衍生作品可全部或部分编制、复制、出版和分发,不受任何限制,前提是上述版权声明和本段包含在所有此类副本和衍生作品中。但是,不得以任何方式修改本文件本身,例如删除版权通知或对互联网协会或其他互联网组织的引用,除非出于制定互联网标准的需要,在这种情况下,必须遵循互联网标准过程中定义的版权程序,或根据需要将其翻译成英语以外的其他语言。
The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.
上述授予的有限许可是永久性的,互联网协会或其继承人或受让人不会撤销。
This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
本文件和其中包含的信息是按“原样”提供的,互联网协会和互联网工程任务组否认所有明示或暗示的保证,包括但不限于任何保证,即使用本文中的信息不会侵犯任何权利,或对适销性或特定用途适用性的任何默示保证。
Acknowledgement
确认
Funding for the RFC Editor function is currently provided by the Internet Society.
RFC编辑功能的资金目前由互联网协会提供。