Internet Engineering Task Force (IETF)                        S. Bensley
Request for Comments: 8257                                     D. Thaler
Category: Informational                               P. Balasubramanian
ISSN: 2070-1721                                                Microsoft
                                                               L. Eggert
                                                                  NetApp
                                                                 G. Judd
                                                          Morgan Stanley
                                                            October 2017
        

Data Center TCP (DCTCP): TCP Congestion Control for Data Centers

Abstract

This Informational RFC describes Data Center TCP (DCTCP): a TCP congestion control scheme for data-center traffic. DCTCP extends the Explicit Congestion Notification (ECN) processing to estimate the fraction of bytes that encounter congestion rather than simply detecting that some congestion has occurred. DCTCP then scales the TCP congestion window based on this estimate. This method achieves high-burst tolerance, low latency, and high throughput with shallow-buffered switches. This memo also discusses deployment issues related to the coexistence of DCTCP and conventional TCP, discusses the lack of a negotiating mechanism between sender and receiver, and presents some possible mitigations. This memo documents DCTCP as currently implemented by several major operating systems. DCTCP, as described in this specification, is applicable to deployments in controlled environments like data centers, but it must not be deployed over the public Internet without additional measures.

Status of This Memo

This document is not an Internet Standards Track specification; it is published for informational purposes.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Not all documents approved by the IESG are a candidate for any level of Internet Standard; see Section 2 of RFC 7841.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc8257.

Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  DCTCP Algorithm
     3.1.  Marking Congestion on the L3 Switches and Routers
     3.2.  Echoing Congestion Information on the Receiver
     3.3.  Processing Echoed Congestion Indications on the Sender
     3.4.  Handling of Congestion Window Growth
     3.5.  Handling of Packet Loss
     3.6.  Handling of SYN, SYN-ACK, and RST Packets
   4.  Implementation Issues
     4.1.  Configuration of DCTCP
     4.2.  Computation of DCTCP.Alpha
   5.  Deployment Issues
   6.  Known Issues
   7.  Security Considerations
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Acknowledgments
   Authors' Addresses
        
        
1. Introduction

Large data centers necessarily need many network switches to interconnect their many servers. Therefore, a data center can greatly reduce its capital expenditure by leveraging low-cost switches. However, such low-cost switches tend to have limited queue capacities; thus, they are more susceptible to packet loss due to congestion.

Network traffic in a data center is often a mix of short and long flows, where the short flows require low latencies and the long flows require high throughputs. Data centers also experience incast bursts, where many servers send traffic to a single server at the same time. For example, this traffic pattern is a natural consequence of the MapReduce [MAPREDUCE] workload: the worker nodes complete at approximately the same time, and all reply to the master node concurrently.

These factors place some conflicting demands on the queue occupancy of a switch:

o The queue must be short enough that it does not impose excessive latency on short flows.

o The queue must be long enough to buffer sufficient data for the long flows to saturate the path capacity.

o The queue must be long enough to absorb incast bursts without excessive packet loss.

Standard TCP congestion control [RFC5681] relies on packet loss to detect congestion. This does not meet the demands described above. First, short flows will start to experience unacceptable latencies before packet loss occurs. Second, by the time TCP congestion control kicks in on the senders, most of the incast burst has already been dropped.

[RFC3168] describes a mechanism for using Explicit Congestion Notification (ECN) from the switches for detection of congestion. However, this method only detects the presence of congestion, not its extent. In the presence of mild congestion, the TCP congestion window is reduced too aggressively, and this unnecessarily reduces the throughput of long flows.

Data Center TCP (DCTCP) changes traditional ECN processing by estimating the fraction of bytes that encounter congestion rather than simply detecting that some congestion has occurred. DCTCP then scales the TCP congestion window based on this estimate. This method achieves high-burst tolerance, low latency, and high throughput with shallow-buffered switches. DCTCP is a modification to the processing of ECN by a conventional TCP and requires that standard TCP congestion control be used for handling packet loss.

DCTCP should only be deployed in an intra-data-center environment where both endpoints and the switching fabric are under a single administrative domain. DCTCP MUST NOT be deployed over the public Internet without additional measures, as detailed in Section 5.

The objective of this Informational RFC is to document DCTCP as a new approach (which is known to be widely implemented and deployed) to address TCP congestion control in data centers. The IETF TCPM Working Group reached consensus regarding the fact that a DCTCP standard would require further work. A precise documentation of running code enables follow-up Experimental or Standards Track RFCs through the IETF stream.

This document describes DCTCP as implemented in Microsoft Windows Server 2012 [WINDOWS]. The Linux [LINUX] and FreeBSD [FREEBSD] operating systems have also implemented support for DCTCP in a way that is believed to follow this document. Deployment experiences with DCTCP have been documented in [MORGANSTANLEY].

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

Normative language is used to describe how necessary the various aspects of a DCTCP implementation are for interoperability, but even compliant implementations without the measures in Sections 4-6 would still only be safe to deploy in controlled environments, i.e., not over the public Internet.

3. DCTCP Algorithm

There are three components involved in the DCTCP algorithm:

o The switches (or other intermediate devices in the network) detect congestion and set the Congestion Encountered (CE) codepoint in the IP header.

o The receiver echoes the congestion information back to the sender, using the ECN-Echo (ECE) flag in the TCP header.

o The sender computes a congestion estimate and reacts by reducing the TCP congestion window (cwnd) accordingly.

3.1. Marking Congestion on the L3 Switches and Routers

The Layer 3 (L3) switches and routers in a data-center fabric indicate congestion to the end nodes by setting the CE codepoint in the IP header as specified in Section 5 of [RFC3168]. For example, the switches may be configured with a congestion threshold. When a packet arrives at a switch and its queue length is greater than the congestion threshold, the switch sets the CE codepoint in the packet. For example, Section 3.4 of [DCTCP10] suggests threshold marking with a threshold of K > (RTT * C)/7, where C is the link rate in packets per second. In typical deployments, the marking threshold is set to be a small value to maintain a short average queueing delay. However, the actual algorithm for marking congestion is an implementation detail of the switch and will generally not be known to the sender and receiver. Therefore, the sender and receiver should not assume that a particular marking algorithm is implemented by the switching fabric.
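
As an illustration of the threshold-marking guideline above, the following Python sketch computes the smallest integer K that satisfies K > (RTT * C)/7 and applies the "queue length greater than threshold" test. The function names and the decision to round up to the next whole packet are assumptions made for this example only; actual switches expose their own marking configuration, and endpoints should not depend on it.

```python
import math


def marking_threshold_packets(rtt_seconds: float, link_rate_pps: float) -> int:
    """Smallest integer K with K > (RTT * C) / 7 (guideline from [DCTCP10]).

    rtt_seconds   -- round-trip time in seconds
    link_rate_pps -- link rate C in packets per second
    """
    bound = (rtt_seconds * link_rate_pps) / 7.0
    # K must be strictly greater than the bound, so step past its floor.
    return math.floor(bound) + 1


def should_mark_ce(queue_length_packets: int, threshold_packets: int) -> bool:
    """Threshold marking: set CE when the queue length exceeds K."""
    return queue_length_packets > threshold_packets


if __name__ == "__main__":
    # Hypothetical 10 Gbit/s link carrying 1500-byte packets, RTT of 100 us.
    c_pps = 10e9 / (1500 * 8)
    k = marking_threshold_packets(100e-6, c_pps)
    print("K =", k, "packets")
```

Under these assumed numbers, RTT * C is about 83.3 packets, so the bound is about 11.9 and the sketch yields K = 12 packets.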

3.2. Echoing Congestion Information on the Receiver

According to Section 6.1.3 of [RFC3168], the receiver sets the ECE flag if any of the packets being acknowledged had the CE codepoint set. The receiver then continues to set the ECE flag until it receives a packet with the Congestion Window Reduced (CWR) flag set. However, the DCTCP algorithm requires more-detailed congestion information. In particular, the sender must be able to determine the number of bytes sent that encountered congestion. Thus, the scheme described in [RFC3168] does not suffice.

One possible solution is to ACK every packet and set the ECE flag in the ACK if and only if the CE codepoint was set in the packet being acknowledged. However, this prevents the use of delayed ACKs, which are an important performance optimization in data centers. If the delayed ACK frequency is n, then an ACK is generated every n packets. The typical value of n is 2, but it could be affected by ACK throttling or packet-coalescing techniques designed to improve performance.

Instead, DCTCP introduces a new Boolean TCP state variable, DCTCP Congestion Encountered (DCTCP.CE), which is initialized to false and stored in the Transmission Control Block (TCB). When sending an ACK, the ECE flag MUST be set if and only if DCTCP.CE is true. When receiving packets, the CE codepoint MUST be processed as follows:

1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to true and send an immediate ACK.

2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE to false and send an immediate ACK.

3. Otherwise, ignore the CE codepoint.

Since the immediate ACK reflects the new DCTCP.CE state, it may acknowledge any previously unacknowledged packets in the old state. This can lead to an incorrect rate computation at the sender per Section 3.3. To avoid this, an implementation MAY choose to send two ACKs: one for previously unacknowledged packets and another acknowledging the most recently received packet.

Receiver handling of the CWR bit is also per [RFC3168] (including [Err3639]). That is, on receipt of a segment with both the CE and CWR bits set, CWR is processed first and then CE is processed.

                             Send immediate
                             ACK with ECE=0
                 .-----.     .--------------.     .-----.
    Send 1 ACK  /      v     v              |     |      \
     for every |     .------------.    .------------.     | Send 1 ACK
     n packets |     | DCTCP.CE=0 |    | DCTCP.CE=1 |     | for every
    with ECE=0 |     '------------'    '------------'     | n packets
                \      |     |              ^     ^      /  with ECE=1
                 '-----'     '--------------'     '-----'
                              Send immediate
                              ACK with ECE=1
        
        

Figure 1: ACK Generation State Machine
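
The state machine in Figure 1 can be sketched in Python as follows. This is a minimal model under stated assumptions, not a description of any particular implementation: the class and method names are hypothetical, packets are counted rather than bytes, the delayed-ACK timer is omitted, CWR processing and SYN/RST handling are not modeled, and the optional two-ACK behavior (the MAY clause above) is always applied.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DctcpReceiver:
    delayed_ack_every: int = 2   # n: assumed delayed-ACK frequency
    ce_state: bool = False       # DCTCP.CE, initialized to false
    unacked_packets: int = 0     # packets received since the last ACK

    def on_data_packet(self, ce_marked: bool) -> List[bool]:
        """Return the ECE flag of every ACK emitted for this packet."""
        acks: List[bool] = []
        if ce_marked != self.ce_state:
            # Rules 1 and 2: the CE codepoint differs from DCTCP.CE.
            # First acknowledge packets seen in the old state (optional),
            # then flip the state and send an immediate ACK.
            if self.unacked_packets > 0:
                acks.append(self.ce_state)
            self.ce_state = ce_marked
            self.unacked_packets = 0
            acks.append(self.ce_state)
            return acks
        # Rule 3: no state change, so ordinary delayed ACKs apply.
        self.unacked_packets += 1
        if self.unacked_packets >= self.delayed_ack_every:
            self.unacked_packets = 0
            acks.append(self.ce_state)
        return acks


if __name__ == "__main__":
    receiver = DctcpReceiver()
    for marked in (False, False, True, True, True, True, False):
        print(marked, "->", receiver.on_data_packet(marked))
```

Because every ACK carries ECE equal to DCTCP.CE at the time it is sent, the sender can attribute each acknowledged byte range to either the marked or the unmarked state.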

3.3. Processing Echoed Congestion Indications on the Sender

The sender estimates the fraction of bytes sent that encountered congestion. The current estimate is stored in a new TCP state variable, DCTCP.Alpha, which is initialized to 1 and SHOULD be updated as follows:

      DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M
        

where:

o g is the estimation gain, a real number between 0 and 1. The selection of g is left to the implementation. See Section 4 for further considerations.

o M is the fraction of bytes sent that encountered congestion during the previous observation window, where the observation window is chosen to be approximately the Round-Trip Time (RTT). In particular, an observation window ends when all bytes in flight at the beginning of the window have been acknowledged.
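
To give a feel for how the estimation gain shapes the estimate, the following Python sketch applies the update formula repeatedly for a constant M. The helper names are hypothetical, and holding M constant across windows is an assumption chosen purely for illustration; real traffic produces a different M in each observation window.

```python
def update_alpha(alpha: float, g: float, m: float) -> float:
    """One application of DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M."""
    return alpha * (1.0 - g) + g * m


def alpha_after(windows: int, g: float, m: float, alpha0: float = 1.0) -> float:
    """DCTCP.Alpha after the given number of observation windows,
    assuming the same congestion fraction M in every window."""
    alpha = alpha0
    for _ in range(windows):
        alpha = update_alpha(alpha, g, m)
    return alpha


if __name__ == "__main__":
    g = 1.0 / 16.0
    for n in (1, 5, 11, 50):
        print(n, "windows without marks ->", round(alpha_after(n, g, 0.0), 4))
```

With g = 1/16 and no marked bytes, the estimate decays as (15/16)^k, so it falls to roughly half of its initial value after about eleven observation windows. A larger g reacts faster but smooths less.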

In order to update DCTCP.Alpha, the TCP state variables defined in [RFC0793] are used, and three additional TCP state variables are introduced:

o DCTCP.WindowEnd: the TCP sequence number threshold when one observation window ends and another is to begin; initialized to SND.UNA.

o DCTCP.BytesAcked: the number of sent bytes acknowledged during the current observation window; initialized to 0.

o DCTCP.BytesMarked: the number of bytes sent during the current observation window that encountered congestion; initialized to 0.

The congestion estimator on the sender MUST process acceptable ACKs as follows:

1. Compute the bytes acknowledged (TCP Selective Acknowledgment (SACK) options [RFC2018] are ignored for this computation):

          BytesAcked = SEG.ACK - SND.UNA

2. Update the bytes sent:

          DCTCP.BytesAcked += BytesAcked

3. If the ECE flag is set, update the bytes marked:

          DCTCP.BytesMarked += BytesAcked

4. If the acknowledgment number is less than or equal to DCTCP.WindowEnd, stop processing. Otherwise, the end of the observation window has been reached, so proceed to update the congestion estimate as follows:

5. Compute the congestion level for the current observation window:

          M = DCTCP.BytesMarked / DCTCP.BytesAcked

6. Update the congestion estimate:

          DCTCP.Alpha = DCTCP.Alpha * (1 - g) + g * M
        

7. Determine the end of the next observation window:

          DCTCP.WindowEnd = SND.NXT

8. Reset the byte counters:

          DCTCP.BytesAcked = DCTCP.BytesMarked = 0

9. Rather than always halving the congestion window as described in [RFC3168], the sender SHOULD update cwnd as follows:

          cwnd = cwnd * (1 - DCTCP.Alpha / 2)
        

Just as specified in [RFC3168], DCTCP does not react to congestion indications more than once for every window of data. The setting of the CWR bit is also as per [RFC3168]. This is required for interoperation with classic ECN receivers due to potential misconfigurations.
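
Steps 1 through 9 above can be sketched in Python as follows. This is an illustrative model rather than any operating system's implementation: the class, field, and method names are hypothetical, sequence-number wraparound is ignored, the initial cwnd is an arbitrary placeholder, and the rule of reacting at most once per window of data is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class DctcpSender:
    g: float = 1.0 / 16.0      # estimation gain (assumed value)
    alpha: float = 1.0         # DCTCP.Alpha, initialized to 1
    window_end: int = 0        # DCTCP.WindowEnd, initialized to SND.UNA
    bytes_acked: int = 0       # DCTCP.BytesAcked
    bytes_marked: int = 0      # DCTCP.BytesMarked
    snd_una: int = 0           # SND.UNA
    snd_nxt: int = 0           # SND.NXT, advanced by the caller on send
    cwnd: float = 10 * 1460    # placeholder initial window in bytes

    def on_ack(self, seg_ack: int, ece: bool) -> bool:
        """Process one acceptable ACK; True if an observation window ended."""
        acked = seg_ack - self.snd_una          # step 1 (SACK ignored)
        self.snd_una = seg_ack
        self.bytes_acked += acked               # step 2
        if ece:
            self.bytes_marked += acked          # step 3
        if seg_ack <= self.window_end:          # step 4
            return False
        m = self.bytes_marked / self.bytes_acked                # step 5
        self.alpha = self.alpha * (1.0 - self.g) + self.g * m   # step 6
        self.window_end = self.snd_nxt          # step 7
        self.bytes_acked = self.bytes_marked = 0                # step 8
        return True

    def on_congestion_indication(self) -> None:
        """Step 9: scale cwnd by the estimate instead of halving it."""
        self.cwnd = self.cwnd * (1.0 - self.alpha / 2.0)


if __name__ == "__main__":
    sender = DctcpSender(snd_nxt=3000)
    print(sender.on_ack(1000, ece=False), sender.alpha)
```

When DCTCP.Alpha is 1, step 9 halves cwnd exactly as [RFC3168] would; smaller estimates produce proportionally smaller reductions.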

3.4. Handling of Congestion Window Growth

A DCTCP sender grows its congestion window in the same way as conventional TCP. Slow start and congestion avoidance algorithms are handled as specified in [RFC5681].

3.5. Handling of Packet Loss

A DCTCP sender MUST react to loss episodes in the same way as conventional TCP, including fast retransmit and fast recovery algorithms, as specified in [RFC5681]. For cases where the packet loss is inferred and not explicitly signaled by ECN, the cwnd and other state variables like ssthresh MUST be changed in the same way that a conventional TCP would have changed them. As with ECN, a DCTCP sender will only reduce the cwnd once per window of data across all loss signals. Just as specified in [RFC5681], upon a timeout, the cwnd MUST be set to no more than the loss window (1 full-sized segment), regardless of previous cwnd reductions in a given window of data.

3.6. Handling of SYN, SYN-ACK, and RST Packets

If SYN, SYN-ACK, and RST packets for DCTCP connections have the ECN-Capable Transport (ECT) codepoint set in the IP header, they will receive the same treatment as other DCTCP packets when forwarded by a switching fabric under load. Lack of ECT in these packets can result in a higher drop rate, depending on the switching fabric configuration. Hence, for DCTCP connections, the sender SHOULD set ECT for SYN, SYN-ACK, and RST packets. A DCTCP receiver ignores CE codepoints set on any SYN, SYN-ACK, or RST packets.

4. Implementation Issues
4.1. Configuration of DCTCP

An implementation needs to know when to use DCTCP. Data-center servers may need to communicate with endpoints outside the data center, where DCTCP is unsuitable or unsupported. Thus, a global configuration setting to enable DCTCP will generally not suffice. DCTCP provides no mechanism for negotiating its use. Thus, additional management and configuration functionality is needed to ensure that DCTCP is not used with non-DCTCP endpoints.

Known solutions rely on either configuration or heuristics. Heuristics need to allow endpoints to individually enable DCTCP to ensure a DCTCP sender is always paired with a DCTCP receiver. One approach is to enable DCTCP based on the IP address of the remote endpoint. Another approach is to detect connections that transmit within the bounds of a data center. For example, an implementation could support automatic selection of DCTCP if the estimated RTT is less than a threshold (like 10 msec) and ECN is successfully negotiated under the assumption that if the RTT is low, then the two endpoints are likely in the same data-center network.
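
The two approaches described above might be sketched in Python as follows. The function names, the prefix list, and the default threshold are assumptions for illustration; an actual implementation would draw these from its own configuration and connection state.

```python
from ipaddress import ip_address, ip_network
from typing import Iterable


def enabled_by_address(remote_ip: str, dc_prefixes: Iterable[str]) -> bool:
    """Configuration approach: enable DCTCP only for remote endpoints
    inside prefixes the operator has listed as data-center internal."""
    addr = ip_address(remote_ip)
    return any(addr in ip_network(prefix) for prefix in dc_prefixes)


def enabled_by_rtt(srtt_ms: float, ecn_negotiated: bool,
                   threshold_ms: float = 10.0) -> bool:
    """Heuristic approach: a low estimated RTT plus successful ECN
    negotiation suggests both endpoints share a data-center network."""
    return ecn_negotiated and srtt_ms < threshold_ms


if __name__ == "__main__":
    prefixes = ["10.0.0.0/8"]  # hypothetical internal prefix
    print(enabled_by_address("10.1.2.3", prefixes))
    print(enabled_by_rtt(0.2, ecn_negotiated=True))
```

Neither check proves that the peer implements DCTCP; both only narrow the set of connections on which it is attempted.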

[RFC3168] forbids the ECN-marking of pure ACK packets because of the inability of TCP to mitigate ACK-path congestion. RFC 3168 also forbids ECN-marking of retransmissions, window probes, and RSTs. However, dropping all these control packets -- rather than ECN-marking them -- has considerable performance disadvantages. It is RECOMMENDED that an implementation provide a configuration knob that will cause ECT to be set on such control packets, which can be used in environments where such concerns do not apply. See [ECN-EXPERIMENTATION] for details.

It is useful to implement DCTCP as an additional action on top of an existing congestion control algorithm like Reno [RFC5681]. The DCTCP implementation MAY also allow configuration of resetting the value of DCTCP.Alpha as part of processing any loss episodes.

4.2. Computation of DCTCP.Alpha

As noted in Section 3.3, the implementation will need to choose a suitable estimation gain. [DCTCP10] provides a theoretical basis for selecting the gain. However, it may be more practical to use experimentation to select a suitable gain for a particular network and workload. A fixed estimation gain of 1/16 is used in some implementations. (It should be noted that values of 0 or 1 for g result in problematic behavior; g=0 fixes DCTCP.Alpha to its initial value, and g=1 sets it to M without any smoothing.)

The DCTCP.Alpha computation as per the formula in Section 3.3 involves fractions. An efficient kernel implementation MAY scale the DCTCP.Alpha value for efficient computation using shift operations. For example, if the implementation chooses g as 1/16, multiplications of DCTCP.Alpha by g become right-shifts by 4. A scaling implementation SHOULD ensure that DCTCP.Alpha is able to reach 0 once it falls below the smallest shifted value (16 in the above example). At the other extreme, a scaled update needs to ensure DCTCP.Alpha does not exceed the scaling factor, which would be equivalent to greater than 100% congestion. So, DCTCP.Alpha MUST be clamped after an update.

This results in the following computations replacing steps 5 and 6 in Section 3.3, where SCF is the chosen scaling factor (65536 in the example), and SHF is the shift factor (4 in the example):

1. Compute the congestion level for the current observation window:

          ScaledM = SCF * DCTCP.BytesMarked / DCTCP.BytesAcked
        

2. Update the congestion estimate:

          if (DCTCP.Alpha >> SHF) == 0, then DCTCP.Alpha = 0
        
          DCTCP.Alpha += (ScaledM >> SHF) - (DCTCP.Alpha >> SHF)
        

if DCTCP.Alpha > SCF, then DCTCP.Alpha = SCF
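
The scaled computation above translates directly into integer arithmetic. The Python sketch below uses the example values SCF = 65536 and SHF = 4; the function name is hypothetical, and the caller is assumed to guarantee that at least one byte was acknowledged in the observation window.

```python
SCF = 65536  # scaling factor: DCTCP.Alpha == SCF represents 100% congestion
SHF = 4      # shift factor: g = 1/16 == 2**-SHF


def update_scaled_alpha(alpha: int, bytes_marked: int, bytes_acked: int) -> int:
    """Return the new scaled DCTCP.Alpha for one observation window.

    Assumes bytes_acked > 0 and 0 <= bytes_marked <= bytes_acked.
    """
    # Congestion level for the window, scaled by SCF.
    scaled_m = SCF * bytes_marked // bytes_acked
    # Let the estimate reach 0 once its shifted value underflows.
    if (alpha >> SHF) == 0:
        alpha = 0
    # Shift-based update: alpha += g * M - g * alpha.
    alpha += (scaled_m >> SHF) - (alpha >> SHF)
    # Clamp so the estimate never exceeds 100% congestion.
    if alpha > SCF:
        alpha = SCF
    return alpha


if __name__ == "__main__":
    alpha = SCF
    for _ in range(3):
        alpha = update_scaled_alpha(alpha, bytes_marked=0, bytes_acked=1000)
        print(alpha)
```

Starting from SCF with no marked bytes, each call subtracts one sixteenth of the current value, which mirrors the floating-point update with g = 1/16.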

5. Deployment Issues

DCTCP and conventional TCP congestion control do not coexist well in the same network. In typical DCTCP deployments, the marking threshold in the switching fabric is set to a very low value to reduce queueing delay, and a relatively small amount of congestion will exceed the marking threshold. During such periods of congestion, conventional TCP will suffer packet loss and quickly and drastically reduce cwnd. DCTCP, on the other hand, will use the fraction of marked packets to reduce cwnd more gradually. Thus, the rate reduction in DCTCP will be much slower than that of conventional TCP, and DCTCP traffic will gain a larger share of the capacity compared to conventional TCP traffic traversing the same path. If the traffic in the data center is a mix of conventional TCP and DCTCP, it is RECOMMENDED that DCTCP traffic be segregated from conventional TCP traffic. [MORGANSTANLEY] describes a deployment that uses the IP Differentiated Services Codepoint (DSCP) bits to segregate the network such that Active Queue Management (AQM) [RFC7567] is applied to DCTCP traffic, whereas TCP traffic is managed via drop-tail queueing.

Deployments should take into account segregation of non-TCP traffic as well. Today's commodity switches allow configuration of different marking/drop profiles for non-TCP and non-IP packets. Non-TCP and non-IP packets should be able to pass through such switches, unless they really run out of buffer space.

Since DCTCP relies on congestion marking by the switches, DCTCP's potential can only be realized in data centers where the entire network infrastructure supports ECN. The switches may also support configuration of the congestion threshold used for marking. The proposed parameterization can be configured with switches that implement Random Early Detection (RED) [RFC2309]. [DCTCP10] provides a theoretical basis for selecting the congestion threshold, but, as with the estimation gain, it may be more practical to rely on experimentation or simply to use the default configuration of the device. DCTCP will revert to loss-based congestion control when packet loss is experienced (e.g., when transiting a congested drop-tail link, or a link with an AQM drop behavior).
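
As a rough illustration of threshold-based marking, the sketch below models a single switch queue that carries only ECN-capable packets. It marks an arriving packet with CE when the queue length on arrival exceeds a congestion threshold, and it drops packets only when the buffer is full. The class, its field names, and the numeric values are hypothetical; real switches expose this behavior through vendor-specific RED or ECN profiles.

```python
from dataclasses import dataclass


@dataclass
class MarkingQueue:
    threshold: int  # congestion threshold, in packets
    capacity: int   # total buffer size, in packets
    depth: int = 0  # current queue length, in packets

    def enqueue(self) -> str:
        """Return 'forward', 'mark', or 'drop' for an arriving packet."""
        if self.depth >= self.capacity:
            return "drop"  # buffer exhausted: the packet is lost
        over_threshold = self.depth > self.threshold
        self.depth += 1
        return "mark" if over_threshold else "forward"

    def dequeue(self) -> None:
        """Remove one packet from the head of the queue, if any."""
        if self.depth > 0:
            self.depth -= 1


queue = MarkingQueue(threshold=2, capacity=5)
decisions = [queue.enqueue() for _ in range(6)]
print(decisions)
```

With these sample values, the first three arrivals are forwarded unmarked, the next two are marked, and the sixth is dropped because the buffer is full. The drop case corresponds to the situation in which DCTCP falls back to loss-based congestion control.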

DCTCP requires changes on both the sender and the receiver, so both endpoints must support DCTCP. Furthermore, DCTCP provides no mechanism for negotiating its use, so both endpoints must be configured through some out-of-band mechanism to use DCTCP. A variant of DCTCP that can be deployed unilaterally and that only requires standard ECN behavior has been described in [ODCTCP] and [BSDCAN], but it requires additional experimental evaluation.

6. Known Issues

DCTCP relies on the sender's ability to reconstruct the stream of CE codepoints received by the remote endpoint. To accomplish this, DCTCP avoids using a single ACK packet to acknowledge segments received both with and without the CE codepoint set. However, if one or more ACK packets are dropped, it is possible that a subsequent ACK will cumulatively acknowledge a mix of CE and non-CE segments. This will, of course, result in a less-accurate congestion estimate. There are some potential considerations:

o Even with an inaccurate congestion estimate, DCTCP may still perform better than [RFC3168].

o If the estimation gain is small relative to the packet loss rate, the estimate may remain reasonably accurate.

o If ACK packet loss mostly occurs under heavy congestion, most drops will occur during an unbroken string of CE packets, and the estimate will be unaffected.

However, the effect of packet drops on DCTCP under real-world conditions has not been analyzed.
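
The way a lost ACK can distort the sender's view can be sketched with a toy model. The model assumes the receiver sends one ACK per segment (no delayed ACKs), that each ACK echoes the CE state of the segment it acknowledges, and that the sender attributes every newly acknowledged segment to the ECE state of the ACK that covers it. The function name and the sample trace are hypothetical.

```python
from typing import Iterable, List


def estimate_marked_fraction(ce_flags: List[bool], lost_acks: Iterable[int]) -> float:
    """Fraction of acknowledged segments the sender believes were CE-marked.

    ce_flags[i] is True if segment i arrived with CE set. lost_acks holds the
    indices of ACKs that never reach the sender.
    """
    lost = set(lost_acks)
    last_acked = 0  # number of segments cumulatively acknowledged so far
    marked = 0
    for i, ce in enumerate(ce_flags):
        if i in lost:
            continue  # this ACK is dropped; a later ACK will cover segment i
        newly_acked = (i + 1) - last_acked
        if ce:
            # The whole cumulative range inherits this ACK's ECE state.
            marked += newly_acked
        last_acked = i + 1
    return marked / last_acked if last_acked else 0.0


trace = [False, False, True, True, True, False]
print(estimate_marked_fraction(trace, []))   # no ACK loss
print(estimate_marked_fraction(trace, [3]))  # loss inside the run of CE segments
print(estimate_marked_fraction(trace, [1]))  # loss at the non-CE/CE boundary
```

In this trace, half of the segments carry CE. Losing an ACK inside the unbroken run of CE segments leaves the estimate at 0.5, whereas losing the ACK just before the run causes a non-CE segment to be counted as marked, raising the estimate to 4/6.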

DCTCP provides no mechanism for negotiating its use. The effect of using DCTCP with a standard ECN endpoint has been analyzed in [ODCTCP] and [BSDCAN]. Furthermore, it is possible that other implementations may also modify behavior in the [RFC3168] style without negotiation, causing further interoperability issues.

Much like standard TCP, DCTCP is biased against flows with longer RTTs. A method for improving the RTT fairness of DCTCP has been proposed in [ADCTCP], but it requires additional experimental evaluation.

7. Security Considerations

DCTCP enhances ECN; thus, it inherits the general security considerations discussed in [RFC3168], although additional mitigation options exist due to the limited intra-data-center deployment of DCTCP.

The processing changes introduced by DCTCP do not exacerbate the considerations in [RFC3168] or introduce new ones. In particular, with either algorithm, the network infrastructure or the remote endpoint can falsely report congestion and, thus, cause the sender to reduce cwnd. However, this is no worse than what can be achieved by simply dropping packets.

[RFC3168] requires that a compliant TCP must not set ECT on SYN or SYN-ACK packets. [RFC5562] proposes setting ECT on SYN-ACK packets but maintains the restriction of no ECT on SYN packets. Both these RFCs prohibit ECT in SYN packets due to security concerns regarding malicious SYN packets with ECT set. However, these RFCs are intended for general Internet use; they do not directly apply to a controlled data-center environment. The security concerns addressed by both of these RFCs might not apply in controlled environments like data centers, and it might not be necessary to account for the presence of non-ECN servers. Beyond the security considerations related to virtual servers, additional security can be imposed in the physical servers to intercept and drop traffic resembling an attack.

8. IANA Considerations

This document does not require any IANA actions.

9. References

9.1. Normative References

[RFC0793] Postel, J., "Transmission Control Protocol", STD 7, RFC 793, DOI 10.17487/RFC0793, September 1981, <https://www.rfc-editor.org/info/rfc793>.

[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP Selective Acknowledgment Options", RFC 2018, DOI 10.17487/RFC2018, October 1996, <https://www.rfc-editor.org/info/rfc2018>.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, September 2001, <https://www.rfc-editor.org/info/rfc3168>.

[RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. Ramakrishnan, "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, DOI 10.17487/RFC5562, June 2009, <https://www.rfc-editor.org/info/rfc5562>.

[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, DOI 10.17487/RFC5681, September 2009, <https://www.rfc-editor.org/info/rfc5681>.

[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.

9.2. Informative References

[ADCTCP] Alizadeh, M., Javanmard, A., and B. Prabhakar, "Analysis of DCTCP: Stability, Convergence, and Fairness", DOI 10.1145/1993744.1993753, Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, June 2011, <https://dl.acm.org/citation.cfm?id=1993753>.

[BSDCAN] Kato, M., Eggert, L., Zimmermann, A., van Meter, R., and H. Tokuda, "Extensions to FreeBSD Datacenter TCP for Incremental Deployment Support", BSDCan 2015, June 2015, <https://www.bsdcan.org/2015/schedule/events/559.en.html>.

[DCTCP10] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and M. Sridharan, "Data Center TCP (DCTCP)", DOI 10.1145/1851182.1851192, Proceedings of the ACM SIGCOMM 2010 Conference, August 2010, <http://dl.acm.org/citation.cfm?doid=1851182.1851192>.

[ECN-EXPERIMENTATION] Black, D., "Explicit Congestion Notification (ECN) Experimentation", Work in Progress, draft-ietf-tsvwg-ecn-experimentation-06, September 2017.

[Err3639] RFC Errata, Erratum ID 3639, RFC 3168, <https://www.rfc-editor.org/errata/eid3639>.

[FREEBSD] Kato, M. and H. Panchasara, "DCTCP (Data Center TCP) implementation", January 2015, <https://github.com/freebsd/freebsd/commit/8ad879445281027858a7fa706d13e458095b595f>.

[LINUX] Borkmann, D., Westphal, F., and G. Judd, "net: tcp: add DCTCP congestion control algorithm", LINUX DCTCP Patch, September 2014, <https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e3118e8359bb7c59555aca60c725106e6d78c5ce>.

[MAPREDUCE] Dean, J. and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Proceedings of the 6th ACM/USENIX Symposium on Operating Systems Design and Implementation, October 2004, <https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/dean.html>.

[MORGANSTANLEY] Judd, G., "Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter", Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, May 2015, <https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd>.

[ODCTCP] Kato, M., "Improving Transmission Performance with One-Sided Datacenter TCP", M.S. Thesis, Keio University, 2013, <http://eggert.org/students/kato-thesis.pdf>.

[RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998, <https://www.rfc-editor.org/info/rfc2309>.

[RFC7567] Baker, F., Ed. and G. Fairhurst, Ed., "IETF Recommendations Regarding Active Queue Management", BCP 197, RFC 7567, DOI 10.17487/RFC7567, July 2015, <https://www.rfc-editor.org/info/rfc7567>.

[WINDOWS] Microsoft, "Data Center Transmission Control Protocol (DCTCP)", May 2012, <https://technet.microsoft.com/en-us/library/hh997028(v=ws.11).aspx>.

Acknowledgments

The DCTCP algorithm was originally proposed and analyzed in [DCTCP10] by Mohammad Alizadeh, Albert Greenberg, Dave Maltz, Jitu Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan.

We would like to thank Andrew Shewmaker for identifying the problem of clamping DCTCP.Alpha and proposing a solution for it.

Lars Eggert has received funding from the European Union's Horizon 2020 research and innovation program 2014-2018 under grant agreement No. 644866 ("SSICLOPS"). This document reflects only the authors' views and the European Commission is not responsible for any use that may be made of the information it contains.

Authors' Addresses

Stephen Bensley
Microsoft
One Microsoft Way
Redmond, WA 98052
United States of America

   Phone: +1 425 703 5570
   Email: sbens@microsoft.com
        
Dave Thaler
Microsoft

   Phone: +1 425 703 8835
   Email: dthaler@microsoft.com
        
Praveen Balasubramanian
Microsoft

   Phone: +1 425 538 2782
   Email: pravb@microsoft.com
        
Lars Eggert
NetApp
Sonnenallee 1
Kirchheim 85551
Germany

   Phone: +49 151 120 55791
   Email: lars@netapp.com
   URI:   http://eggert.org/
        
Glenn Judd
Morgan Stanley

   Phone: +1 973 979 6481
   Email: glenn.judd@morganstanley.com
        