Network Working Group                                         A. Romanow
Request for Comments: 4297                                         Cisco
Category: Informational                                         J. Mogul
                                                                      HP
                                                               T. Talpey
                                                                  NetApp
                                                               S. Bailey
                                                               Sandburst
                                                           December 2005
        

Remote Direct Memory Access (RDMA) over IP Problem Statement

Status of This Memo

This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2005).

Abstract

Overhead due to the movement of user data in the end-system network I/O processing path at high speeds is significant, and has limited the use of Internet protocols in interconnection networks, and the Internet itself -- especially where high bandwidth, low latency, and/or low overhead are required by the hosted application.

This document examines this overhead, and addresses an architectural, IP-based "copy avoidance" solution for its elimination, by enabling Remote Direct Memory Access (RDMA).

Table of Contents

   1. Introduction
   2. The High Cost of Data Movement Operations in Network I/O
      2.1. Copy avoidance improves processing overhead.
   3. Memory bandwidth is the root cause of the problem.
   4. High copy overhead is problematic for many key Internet
      applications.
   5. Copy Avoidance Techniques
      5.1. A Conceptual Framework: DDP and RDMA
   6. Conclusions
   7. Security Considerations
   8. Terminology
   9. Acknowledgements
   10. Informative References
        
1. Introduction

This document considers the problem of high host processing overhead associated with the movement of user data to and from the network interface under high speed conditions. This problem is often referred to as the "I/O bottleneck" [CT90]. More specifically, the source of high overhead that is of interest here is data movement operations, i.e., copying. The throughput of a system may therefore be limited by the overhead of this copying. This issue is not to be confused with TCP offload, which is not addressed here. High speed refers to conditions where the network link speed is high, relative to the bandwidths of the host CPU and memory. With today's computer systems, one Gigabit per second (Gbits/s) and over is considered high speed.

High costs associated with copying are an issue primarily for large scale systems. Although smaller systems such as rack-mounted PCs and small workstations would benefit from a reduction in copying overhead, the benefit to smaller machines will be primarily in the next few years as they scale the amount of bandwidth they handle. Today, it is large system machines with high bandwidth feeds, usually multiprocessors and clusters, that are adversely affected by copying overhead. Examples of such machines include all varieties of servers: database servers, storage servers, application servers for transaction processing, for e-commerce, and web serving, content distribution, video distribution, backups, data mining and decision support, and scientific computing.

Note that such servers almost exclusively service many concurrent sessions (transport connections), which, in aggregate, are responsible for > 1 Gbits/s of communication.  Nonetheless, the cost of copying overhead for a particular load is the same whether from few or many sessions.

The I/O bottleneck, and the role of data movement operations, have been widely studied in research and industry over the last approximately 14 years, and we draw freely on these results. Historically, the I/O bottleneck has received attention whenever new networking technology has substantially increased line rates: 100 Megabit per second (Mbits/s) Fast Ethernet and Fibre Distributed Data Interface [FDDI], 155 Mbits/s Asynchronous Transfer Mode [ATM], 1 Gbits/s Ethernet. In earlier speed transitions, the availability of memory bandwidth allowed the I/O bottleneck issue to be deferred. Now however, this is no longer the case. While the I/O problem is significant at 1 Gbits/s, it is the introduction of 10 Gbits/s Ethernet which is motivating an upsurge of activity in industry and research [IB, VI, CGY01, Ma02, MAF+02].

Because of high overhead of end-host processing in current implementations, the TCP/IP protocol stack is not used for high speed transfer. Instead, special purpose network fabrics, using a technology generally known as Remote Direct Memory Access (RDMA), have been developed and are widely used. RDMA is a set of mechanisms that allow the network adapter, under control of the application, to steer data directly into and out of application buffers. Examples of such interconnection fabrics include Fibre Channel [FIBRE] for block storage transfer, Virtual Interface Architecture [VI] for database clusters, and Infiniband [IB], Compaq Servernet [SRVNET], and Quadrics [QUAD] for System Area Networks. These link level technologies limit application scaling in both distance and size, meaning that the number of nodes cannot be arbitrarily large.

This problem statement substantiates the claim that in network I/O processing, high overhead results from data movement operations, specifically copying; and that copy avoidance significantly decreases this processing overhead. It describes when and why the high processing overheads occur, explains why the overhead is problematic, and points out which applications are most affected.

The document goes on to discuss why the problem is relevant to the Internet and to Internet-based applications. Applications that store, manage, and distribute the information of the Internet are well suited to applying the copy avoidance solution. They will benefit by avoiding high processing overheads, which removes limits to the available scaling of tiered end-systems. Copy avoidance also eliminates latency for these systems, which can further benefit effective distributed processing.

In addition, this document introduces an architectural approach to solving the problem, which is developed in detail in [BT05]. It also discusses how the proposed technology may introduce security concerns and how they should be addressed.

Finally, this document includes a Terminology section to aid as a reference for several new terms introduced by RDMA.

2. The High Cost of Data Movement Operations in Network I/O

A wealth of data from research and industry shows that copying is responsible for substantial amounts of processing overhead. It further shows that even in carefully implemented systems, eliminating copies significantly reduces the overhead, as referenced below.

Clark et al. [CJRS89] showed in 1989 that TCP [Po81] overhead processing is attributable to both operating system costs (such as interrupts, context switches, process management, buffer management, timer management) and the costs associated with processing individual bytes (specifically, computing the checksum and moving data in memory). They found that moving data in memory is the more important of the costs, and their experiments showed that memory bandwidth is the greatest source of limitation. In the data presented [CJRS89], 64% of the measured microsecond overhead was attributable to data touching operations, and 48% was accounted for by copying. The system measured Berkeley TCP on a Sun-3/60 using 1460 Byte Ethernet packets.

In a well-implemented system, copying can occur between the network interface and the kernel, and between the kernel and application buffers; there are two copies, each of which requires two memory bus crossings (a read and a write). Although in certain circumstances it is possible to do better, usually two copies are required on receive.

Subsequent work has consistently shown the same phenomenon as the earlier Clark study. A number of studies report results that data-touching operations, checksumming and data movement, dominate the processing costs for messages longer than 128 Bytes [BS96, CGY01, Ch96, CJRS89, DAPP93, KP96]. For smaller sized messages, per-packet overheads dominate [KP96, CGY01].
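
As a purely illustrative sketch of this relationship, the toy cost model below (Python) combines a fixed per-packet term with a term that grows linearly in message size.  The constants are arbitrary, chosen only so that the two terms balance near 128 Bytes; the printed percentages are properties of the model, not measurements from the studies cited above.

   # Hypothetical, illustrative cost model -- not data from [KP96] or
   # any other cited study.
   PER_BYTE_COST = 1.0                    # work per byte touched (copy, checksum)
   PER_PACKET_COST = 128 * PER_BYTE_COST  # fixed work per packet; terms equal at 128 B

   def message_cost(nbytes):
       """Fixed per-packet work plus work that scales linearly with size."""
       return PER_PACKET_COST + PER_BYTE_COST * nbytes

   for size in (64, 128, 1500, 4352, 9180):
       share = PER_BYTE_COST * size / message_cost(size)
       print("%5d Bytes: %4.0f%% of the cost is per-byte work" % (size, share * 100))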

The percentage of overhead due to data-touching operations increases with packet size, since time spent on per-byte operations scales linearly with message size [KP96].  For example, Chu [Ch96] reported substantial per-byte latency costs as a percentage of total networking software costs for an MTU size packet on a SPARCstation/20 running memory-to-memory TCP tests over networks with 3 different MTU sizes.  The percentages of total software costs attributable to per-byte operations were:

                      1500 Byte Ethernet   18-25%
                      4352 Byte FDDI       35-50%
                      9180 Byte ATM        55-65%

Although many studies report results for data-touching operations, including checksumming and data movement together, much work has focused just on copying [BS96, Br99, Ch96, TK95]. For example, [KP96] reports results that separate processing times for checksum from data movement operations. For the 1500 Byte Ethernet size, 20% of total processing overhead time is attributable to copying. The study used 2 DECstations 5000/200 connected by an FDDI network. (In this study, checksum accounts for 30% of the processing time.)

2.1. Copy avoidance improves processing overhead.

A number of studies show that eliminating copies substantially reduces overhead. For example, results from copy-avoidance in the IO-Lite system [PDZ99], which aimed at improving web server performance, show a throughput increase of 43% over an optimized web server, and 137% improvement over an Apache server. The system was implemented in a 4.4BSD-derived UNIX kernel, and the experiments used a server system based on a 333MHz Pentium II PC connected to a switched 100 Mbits/s Fast Ethernet.

There are many other examples where elimination of copying using a variety of different approaches showed significant improvement in system performance [CFF+94, DP93, EBBV95, KSZ95, TK95, Wa97]. We will discuss the results of one of these studies in detail in order to clarify the significant degree of improvement produced by copy avoidance [Ch02].

Recent work by Chase et al. [CGY01], measuring CPU utilization, shows that avoiding copies reduces CPU time spent on data access from 24% to 15% at 370 Mbits/s for a 32 KBytes MTU using an AlphaStation XP1000 and a Myrinet adapter [BCF+95]. This is an absolute improvement of 9% due to copy avoidance.

The total CPU utilization was 35%, with data access accounting for 24%. Thus, the relative importance of reducing copies is 26%. At 370 Mbits/s, the system is not very heavily loaded. The relative improvement in achievable bandwidth is 34%. This is the improvement we would see if copy avoidance were added when the machine was saturated by network I/O.

Note that improvement from the optimization becomes more important if the overhead it targets is a larger share of the total cost. This is what happens if other sources of overhead, such as checksumming, are eliminated. In [CGY01], after removing checksum overhead, copy avoidance reduces CPU utilization from 26% to 10%. This is a 16% absolute reduction, a 61% relative reduction, and a 160% relative improvement in achievable bandwidth.
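
The relative figures quoted above follow directly from the absolute utilizations reported in [CGY01]; the short sketch below (Python) makes the arithmetic explicit, using only the numbers given in the text.

   def copy_avoidance_effect(total_util, component_before, component_after):
       """Derive the relative saving and the bandwidth headroom gained by
       copy avoidance from absolute CPU utilizations (fractions of one CPU)."""
       absolute = component_before - component_after
       relative = absolute / total_util
       # At saturation, achievable bandwidth scales inversely with the CPU
       # cost per unit of data moved.
       bandwidth_gain = total_util / (total_util - absolute) - 1.0
       return absolute, relative, bandwidth_gain

   # Checksum in software: 35% total CPU, data access falls from 24% to 15%.
   print(copy_avoidance_effect(0.35, 0.24, 0.15))   # ~ (0.09, 0.26, 0.34)

   # Checksum offloaded: copy avoidance takes total utilization from 26% to 10%.
   print(copy_avoidance_effect(0.26, 0.26, 0.10))   # ~ (0.16, 0.61, 1.60)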

In fact, today's network interface hardware commonly offloads the checksum, which removes the other source of per-byte overhead. They also coalesce interrupts to reduce per-packet costs. Thus, today copying costs account for a relatively larger part of CPU utilization than previously, and therefore relatively more benefit is to be gained in reducing them. (Of course this argument would be specious if the amount of overhead were insignificant, but it has been shown to be substantial. [BS96, Br99, Ch96, KP96, TK95])

3. Memory bandwidth is the root cause of the problem.

Data movement operations are expensive because memory bandwidth is scarce relative to network bandwidth and CPU bandwidth [PAC+97]. This trend existed in the past and is expected to continue into the future [HP97, STREAM], especially in large multiprocessor systems.

With copies crossing the bus twice per copy, network processing overhead is high whenever network bandwidth is large in comparison to CPU and memory bandwidths. Generally, with today's end-systems, the effects are observable at network speeds over 1 Gbits/s. In fact, with multiple bus crossings it is possible to see the bus bandwidth being the limiting factor for throughput. This prevents such an end-system from simultaneously achieving full network bandwidth and full application performance.

A common question is whether an increase in CPU processing power alleviates the problem of high processing costs of network I/O. The answer is no, it is the memory bandwidth that is the issue. Faster CPUs do not help if the CPU spends most of its time waiting for memory [CGY01].

The widening gap between microprocessor performance and memory performance has long been a widely recognized and well-understood problem [PAC+97]. Hennessy [HP97] shows microprocessor performance grew from 1980-1998 at 60% per year, while the access time to DRAM improved at 10% per year, giving rise to an increasing "processor-memory performance gap".

Another source of relevant data is the STREAM Benchmark Reference Information website, which provides information on the STREAM benchmark [STREAM]. The benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MBytes/s) and the corresponding computation rate for simple vector kernels measured in MFLOPS. The website tracks information on sustainable memory bandwidth for hundreds of machines and all major vendors.

Results show measured system performance statistics. Processing performance from 1985-2001 increased at 50% per year on average, and sustainable memory bandwidth from 1975 to 2001 increased at 35% per year, on average, over all the systems measured. A similar 15% per year lead of processing bandwidth over memory bandwidth shows up in another statistic, machine balance [Mc95], a measure of the relative rate of CPU to memory bandwidth (FLOPS/cycle) / (sustained memory ops/cycle) [STREAM].

Network bandwidth has been increasing about 10-fold roughly every 8 years, which is a 40% per year growth rate.

A typical example illustrates that the memory bandwidth compares unfavorably with link speed. The STREAM benchmark shows that a modern uniprocessor PC, for example the 1.2 GHz Athlon in 2001, will move the data 3 times in doing a receive operation: once for the network interface to deposit the data in memory, and twice for the CPU to copy the data. With 1 GBytes/s of memory bandwidth, meaning one read or one write, the machine could handle approximately 2.67 Gbits/s of network bandwidth, one third the copy bandwidth. But this assumes 100% utilization, which is not possible, and more importantly the machine would be totally consumed! (A rule of thumb for databases is that 20% of the machine should be required to service I/O, leaving 80% for the database application. And, the less, the better.)
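
The 2.67 Gbits/s figure, and what the 20% rule of thumb implies for it, follow from a simple back-of-the-envelope calculation (Python):

   MEMORY_BW_GBYTES_S = 1.0   # sustainable memory bandwidth: one read or one write
   DATA_MOVES = 3             # NIC deposits the data once; the CPU copy reads and writes it

   # Each received byte consumes DATA_MOVES bytes of memory bandwidth, so the
   # memory system caps the achievable network rate.
   max_rate_gbits = MEMORY_BW_GBYTES_S / DATA_MOVES * 8
   print("Memory-limited network rate: %.2f Gbits/s" % max_rate_gbits)            # ~2.67

   # Rule of thumb: only about 20% of the machine should be spent on I/O.
   print("Within a 20%% I/O budget:     %.2f Gbits/s" % (max_rate_gbits * 0.20))  # ~0.53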

In 2001, 1 Gbits/s links were common. An application server may typically have two 1 Gbits/s connections: one connection backend to a storage server and one front-end, say for serving HTTP [FGM+99]. Thus, the communications could use 2 Gbits/s. In our typical example, the machine could handle 2.7 Gbits/s at its theoretical maximum while doing nothing else. This means that the machine basically could not keep up with the communication demands in 2001; with the relative growth trends, the situation only gets worse.

4. High copy overhead is problematic for many key Internet applications.

If a significant portion of resources on an application machine is consumed in network I/O rather than in application processing, it makes it difficult for the application to scale, i.e., to handle more clients, to offer more services.

Several years ago the most affected applications were streaming multimedia, parallel file systems, and supercomputing on clusters [BS96]. In addition, today the applications that suffer from copying overhead are more central in Internet computing -- they store, manage, and distribute the information of the Internet and the enterprise. They include database applications doing transaction processing, e-commerce, web serving, decision support, content distribution, video distribution, and backups. Clusters are typically used for this category of application, since they have advantages of availability and scalability.

Today these applications, which provide and manage Internet and corporate information, are typically run in data centers that are organized into three logical tiers. One tier is typically a set of web servers connecting to the WAN. The second tier is a set of application servers that run the specific applications usually on more powerful machines, and the third tier is backend databases. Physically, the first two tiers -- web server and application server -- are usually combined [Pi01]. For example, an e-commerce server communicates with a database server and with a customer site, or a content distribution server connects to a server farm, or an OLTP server connects to a database and a customer site.

When network I/O uses too much memory bandwidth, performance on network paths between tiers can suffer. (There might also be performance issues on Storage Area Network paths used either by the database tier or the application tier.) The high overhead from network-related memory copies diverts system resources from other application processing. It also can create bottlenecks that limit total system performance.

There is high motivation to maximize the processing capacity of each CPU because scaling by adding CPUs, one way or another, has drawbacks. For example, adding CPUs to a multiprocessor will not necessarily help because a multiprocessor improves performance only when the memory bus has additional bandwidth to spare. Clustering can add additional complexity to handling the applications.

In order to scale a cluster or multiprocessor system, one must proportionately scale the interconnect bandwidth.  Interconnect bandwidth governs the performance of communication-intensive parallel applications; if this (often expressed in terms of "bisection bandwidth") is too low, adding additional processors cannot improve system throughput.  Interconnect latency can also limit the performance of applications that frequently share data between processors.

So, excessive overheads on network paths in a "scalable" system both can require the use of more processors than optimal, and can reduce the marginal utility of those additional processors.

Copy avoidance scales a machine upwards by removing at least two-thirds of the bus bandwidth load from the "very best" 1-copy (on receive) implementations, and removes at least 80% of the bandwidth overhead from the 2-copy implementations.
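
These fractions are consistent with counting memory bus crossings per received byte in the same way as the example of Section 3: one crossing for the network adapter to deposit the data in memory, plus a read and a write crossing for each CPU copy (Python):

   def bus_crossings(cpu_copies):
       """Bus crossings per received byte: one adapter deposit plus a read
       and a write for every CPU copy."""
       return 1 + 2 * cpu_copies

   rdma = bus_crossings(0)            # data placed directly in the application buffer
   for copies in (1, 2):
       before = bus_crossings(copies)
       removed = (before - rdma) / before
       print("%d-copy receive: %d crossings; copy avoidance removes %.0f%%"
             % (copies, before, removed * 100))
   # 1-copy: 3 crossings, 67% removed; 2-copy: 5 crossings, 80% removed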

The removal of bus bandwidth requirements, in turn, removes bottlenecks from the network processing path and increases the throughput of the machine. On a machine with limited bus bandwidth, the advantages of removing this load is immediately evident, as the host can attain full network bandwidth. Even on a machine with bus bandwidth adequate to sustain full network bandwidth, removal of bus bandwidth load serves to increase the availability of the machine for the processing of user applications, in some cases dramatically.

An example showing poor performance with copies and improved scaling with copy avoidance is illustrative. The IO-Lite work [PDZ99] shows higher server throughput servicing more clients using a zero-copy system. In an experiment designed to mimic real world web conditions by simulating the effect of TCP WAN connections on the server, the performance of 3 servers was compared. One server was Apache, another was an optimized server called Flash, and the third was the Flash server running IO-Lite, called Flash-Lite with zero copy. The measurement was of throughput in requests/second as a function of the number of slow background clients that could be served. As the table shows, Flash-Lite has better throughput, especially as the number of clients increases.

              Apache              Flash         Flash-Lite
              ------              -----         ----------
   #Clients   Throughput reqs/s   Throughput    Throughput
        
   0            520                 610           890
   16           390                 490           890
   32           360                 490           850
   64           360                 490           890
   128          310                 450           880
   256          310                 440           820

Traditional Web servers (which mostly send data and can keep most of their content in the file cache) are not the worst case for copy overhead. Web proxies (which often receive as much data as they send) and complex Web servers based on System Area Networks or multi-tier systems will suffer more from copy overheads than in the example above.

5. Copy Avoidance Techniques

There have been extensive research investigation and industry experience with two main alternative approaches to eliminating data movement overhead, often along with improving other Operating System processing costs. In one approach, hardware and/or software changes within a single host reduce processing costs. In another approach, memory-to-memory networking [MAF+02], the exchange of explicit data placement information between hosts allows them to reduce processing costs.

The single host approaches range from new hardware and software architectures [KSZ95, Wa97, DWB+93] to new or modified software systems [BS96, Ch96, TK95, DP93, PDZ99]. In the approach based on using a networking protocol to exchange information, the network adapter, under control of the application, places data directly into and out of application buffers, reducing the need for data movement. Commonly this approach is called RDMA, Remote Direct Memory Access.

As discussed below, research and industry experience has shown that copy avoidance techniques within the receiver processing path alone have proven to be problematic. The research special purpose host adapter systems had good performance and can be seen as precursors for the commercial RDMA-based adapters [KSZ95, DWB+93]. In software, many implementations have successfully achieved zero-copy transmit, but few have accomplished zero-copy receive. And those that have done so make strict alignment and no-touch requirements on the application, greatly reducing the portability and usefulness of the implementation.

In contrast, experience has proven satisfactory with memory-to-memory systems that permit RDMA; performance has been good and there have not been system or networking difficulties. RDMA is a single solution. Once implemented, it can be used with any OS and machine architecture, and it does not need to be revised when either of these are changed.

In early work, one goal of the software approaches was to show that TCP could go faster with appropriate OS support [CJRS89, CFF+94].  While this goal was achieved, further investigation and experience showed that, though possible to craft software solutions, specific system optimizations have been complex, fragile, extremely interdependent with other system parameters in complex ways, and often of only marginal improvement [CFF+94, CGY01, Ch96, DAPP93, KSZ95, PDZ99].  The network I/O system interacts with other aspects of the Operating System such as machine architecture and file I/O, and disk I/O [Br99, Ch96, DP93].

For example, the Solaris Zero-Copy TCP work [Ch96], which relies on page remapping, shows that the results are highly interdependent with other systems, such as the file system, and that the particular optimizations are specific for particular architectures, meaning that for each variation in architecture, optimizations must be re-crafted [Ch96].

With RDMA, application I/O buffers are mapped directly, and the authorized peer may access it without incurring additional processing overhead. When RDMA is implemented in hardware, arbitrary data movement can be performed without involving the host CPU at all.

A number of research projects and industry products have been based on the memory-to-memory approach to copy avoidance. These include U-Net [EBBV95], SHRIMP [BLA+94], Hamlyn [BJM+96], Infiniband [IB], Winsock Direct [Pi01]. Several memory-to-memory systems have been widely used and have generally been found to be robust, to have good performance, and to be relatively simple to implement. These include VI [VI], Myrinet [BCF+95], Quadrics [QUAD], Compaq/Tandem Servernet [SRVNET]. Networks based on these memory-to-memory architectures have been used widely in scientific applications and in data centers for block storage, file system access, and transaction processing.

By exporting direct memory access "across the wire", applications may direct the network stack to manage all data directly from application buffers. A large and growing class of applications that take advantage of such capabilities has already emerged. It includes all the major databases, as well as network protocols such as Sockets Direct [SDP].

5.1. A Conceptual Framework: DDP and RDMA

An RDMA solution can be usefully viewed as being comprised of two distinct components: "direct data placement (DDP)" and "remote direct memory access (RDMA) semantics". They are distinct in purpose and also in practice -- they may be implemented as separate protocols.

The more fundamental of the two is the direct data placement facility. This is the means by which memory is exposed to the remote peer in an appropriate fashion, and the means by which the peer may access it, for instance, reading and writing.

The RDMA control functions are semantically layered atop direct data placement. Included are operations that provide "control" features, such as connection and termination, and the ordering of operations and signaling their completions. A "send" facility is provided.
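
For illustration only, the sketch below (Python) models the split described here.  The class and method names are hypothetical and do not correspond to any wire protocol; they merely separate the DDP role (placing data at a tagged buffer offset with no intermediate copy) from the RDMA role (control operations such as send and ordered completion signaling).

   class DDPBuffer:
       """Direct data placement: an application buffer a peer may address
       directly by (steering tag, offset), with no intermediate copy."""
       def __init__(self, stag, size):
           self.stag = stag
           self.data = bytearray(size)

       def place(self, offset, payload):
           self.data[offset:offset + len(payload)] = payload   # direct placement

   class RDMAConnection:
       """RDMA semantics layered atop DDP: tagged writes, untagged sends,
       and completions signaled in operation order."""
       def __init__(self):
           self.exposed = {}        # steering tag -> DDPBuffer visible to the peer
           self.completions = []    # completion events, in order

       def expose(self, buf):
           self.exposed[buf.stag] = buf

       def rdma_write(self, stag, offset, payload):
           self.exposed[stag].place(offset, payload)            # placement via DDP
           self.completions.append(("write", stag, len(payload)))

       def send(self, message):
           # Untagged message; placement is receiver-managed (not modeled here).
           self.completions.append(("send", len(message)))

   # Usage: the receiver exposes a buffer; the peer steers data into it directly.
   conn = RDMAConnection()
   buf = DDPBuffer("app-buffer-1", 4096)
   conn.expose(buf)
   conn.rdma_write("app-buffer-1", 0, b"record")
   print(bytes(buf.data[:6]), conn.completions)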

While the functions (and potentially protocols) are distinct, historically both aspects taken together have been referred to as "RDMA". The facilities of direct data placement are useful in and of themselves, and may be employed by other upper layer protocols to facilitate data transfer. Therefore, it is often useful to refer to DDP as the data placement functionality and RDMA as the control aspect.

[BT05] develops an architecture for DDP and RDMA atop the Internet Protocol Suite, and is a companion document to this problem statement.

6. Conclusions

This Problem Statement concludes that an IP-based, general solution for reducing processing overhead in end-hosts is desirable.

It has shown that high overhead of the processing of network data leads to end-host bottlenecks. These bottlenecks are in large part attributable to the copying of data. The bus bandwidth of machines has historically been limited, and the bandwidth of high-speed interconnects taxes it heavily.

An architectural solution to alleviate these bottlenecks best satisfies the issue. Further, the high speed of today's interconnects and the deployment of these hosts on Internet Protocol-based networks leads to the desirability of layering such a solution on the Internet Protocol Suite. The architecture described in [BT05] is such a proposal.

7. Security Considerations

Solutions to the problem of reducing copying overhead in high bandwidth transfers may introduce new security concerns. Any proposed solution must be analyzed for security vulnerabilities and any such vulnerabilities addressed. Potential security weaknesses -- due to resource issues that might lead to denial-of-service attacks, overwrites and other concurrent operations, the ordering of completions as required by the RDMA protocol, the granularity of transfer, and any other identified vulnerabilities -- need to be examined, described, and an adequate resolution to them found.

Layered atop Internet transport protocols, the RDMA protocols will gain leverage from and must permit integration with Internet security standards, such as IPsec and TLS [IPSEC, TLS]. However, there may be implementation ramifications for certain security approaches with respect to RDMA, due to its copy avoidance.

IPsec, operating to secure the connection on a packet-by-packet basis, seems to be a natural fit to securing RDMA placement, which operates in conjunction with transport. Because RDMA enables an implementation to avoid buffering, it is preferable to perform all applicable security protection prior to processing of each segment by the transport and RDMA layers. Such a layering enables the most efficient secure RDMA implementation.

The TLS record protocol, on the other hand, is layered on top of reliable transports and cannot provide such security assurance until an entire record is available, which may require the buffering and/or assembly of several distinct messages prior to TLS processing. This defers RDMA processing and introduces overheads that RDMA is designed to avoid. Therefore, TLS is viewed as potentially a less natural fit for protecting the RDMA protocols.

It is necessary to guarantee properties such as confidentiality, integrity, and authentication on an RDMA communications channel. However, these properties cannot defend against all attacks from properly authenticated peers, which might be malicious, compromised, or buggy. Therefore, the RDMA design must address protection against such attacks. For example, an RDMA peer should not be able to read or write memory regions without prior consent.

Further, it must not be possible to evade memory consistency checks at the recipient. The RDMA design must allow the recipient to rely on its consistent memory contents by explicitly controlling peer access to memory regions at appropriate times.

Peer connections that do not pass authentication and authorization checks by upper layers must not be permitted to begin processing in RDMA mode with an inappropriate endpoint. Once associated, peer accesses to memory regions must be authenticated and made subject to authorization checks in the context of the association and connection on which they are to be performed, prior to any transfer operation or data being accessed.
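
As a conceptual illustration of these requirements (again with hypothetical names, not a real RDMA API), peer access to an exposed region can be modeled as an explicitly granted, connection-scoped, revocable permission under local application control:

   class ProtectedRegion:
       """A memory region exposed for RDMA, guarded by explicit, revocable
       per-connection permissions granted by the local application."""
       def __init__(self, size):
           self.data = bytearray(size)
           self.grants = {}                     # connection id -> allowed operations

       def grant(self, conn_id, ops):
           self.grants[conn_id] = set(ops)      # e.g. {"read"} or {"read", "write"}

       def revoke(self, conn_id):
           self.grants.pop(conn_id, None)       # peer access ends immediately

       def remote_write(self, conn_id, offset, payload):
           if "write" not in self.grants.get(conn_id, set()):
               raise PermissionError("peer has no write access to this region")
           self.data[offset:offset + len(payload)] = payload

   region = ProtectedRegion(1024)
   region.grant("conn-42", {"write"})
   region.remote_write("conn-42", 0, b"ok")     # authorized access succeeds
   region.revoke("conn-42")
   try:
       region.remote_write("conn-42", 0, b"x")  # rejected after revocation
   except PermissionError as err:
       print(err)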

The RDMA protocols must ensure that these region protections be under strict application control. Remote access to local memory by a network peer is particularly important in the Internet context, where such access can be exported globally.

8. Terminology

This section contains general terminology definitions for this document and for Remote Direct Memory Access in general.

   Remote Direct Memory Access (RDMA)
      A method of accessing memory on a remote system in which the
      local system specifies the location of the data to be
      transferred.

   RDMA Protocol
      A protocol that supports RDMA Operations to transfer data
      between systems.

   Fabric
      The collection of links, switches, and routers that connect a
      set of systems.

   Storage Area Network (SAN)
      A network where disks, tapes, and other storage devices are made
      available to one or more end-systems via a fabric.

   System Area Network
      A network where clustered systems share services, such as
      storage and interprocess communication, via a fabric.

   Fibre Channel (FC)
      An ANSI standard link layer with associated protocols, typically
      used to implement Storage Area Networks.  [FIBRE]

   Virtual Interface Architecture (VI, VIA)
      An RDMA interface definition developed by an industry group and
      implemented with a variety of differing wire protocols.  [VI]

   Infiniband (IB)
      An RDMA interface, protocol suite and link layer specification
      defined by an industry trade association.  [IB]

9. Acknowledgements

Jeff Chase generously provided many useful insights and information. Thanks to Jim Pinkerton for many helpful discussions.

10. Informative References

[ATM] The ATM Forum, "Asynchronous Transfer Mode Physical Layer Specification" af-phy-0015.000, etc. available from http://www.atmforum.com/standards/approved.html.

[BCF+95] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su. "Myrinet - A gigabit-per-second local-area network", IEEE Micro, February 1995.

[BJM+96] G. Buzzard, D. Jacobson, M. Mackey, S. Marovich, J. Wilkes, "An implementation of the Hamlyn send-managed interface architecture", in Proceedings of the Second Symposium on Operating Systems Design and Implementation, USENIX Assoc., October 1996.

[BLA+94] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, "A virtual memory mapped network interface for the SHRIMP multicomputer", in Proceedings of the 21st Annual Symposium on Computer Architecture, April 1994, pp. 142- 153.

[Br99] J. C. Brustoloni, "Interoperation of copy avoidance in network and file I/O", Proceedings of IEEE Infocom, 1999, pp. 534-542.

[BS96] J. C. Brustoloni, P. Steenkiste, "Effects of buffering semantics on I/O performance", Proceedings OSDI'96, USENIX, Seattle, WA October 1996, pp. 277-291.

[BT05] Bailey, S. and T. Talpey, "The Architecture of Direct Data Placement (DDP) And Remote Direct Memory Access (RDMA) On Internet Protocols", RFC 4296, December 2005.

[CFF+94] C-H Chang, D. Flower, J. Forecast, H. Gray, B. Hawe, A. Nadkarni, K. K. Ramakrishnan, U. Shikarpur, K. Wilde, "High-performance TCP/IP and UDP/IP networking in DEC OSF/1 for Alpha AXP", Proceedings of the 3rd IEEE Symposium on High Performance Distributed Computing, August 1994, pp. 36-42.

[CGY01] J. S. Chase, A. J. Gallatin, and K. G. Yocum, "End system optimizations for high-speed TCP", IEEE Communications Magazine, Volume: 39, Issue: 4 , April 2001, pp 68-74. http://www.cs.duke.edu/ari/publications/end-system.{ps,pdf}.

[Ch96] H.K. Chu, "Zero-copy TCP in Solaris", Proc. of the USENIX 1996 Annual Technical Conference, San Diego, CA, January 1996.

[Ch02] Jeffrey Chase, Personal communication.

[CJRS89] D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An analysis of TCP processing overhead", IEEE Communications Magazine, volume: 27, Issue: 6, June 1989, pp 23-29.

[CT90] D. D. Clark, D. Tennenhouse, "Architectural considerations for a new generation of protocols", Proceedings of the ACM SIGCOMM Conference, 1990.

[DAPP93] P. Druschel, M. B. Abbott, M. A. Pagels, L. L. Peterson, "Network subsystem design", IEEE Network, July 1993, pp. 8-17.

[DP93] P. Druschel, L. L. Peterson, "Fbufs: a high-bandwidth cross-domain transfer facility", Proceedings of the 14th ACM Symposium of Operating Systems Principles, December 1993.

[DWB+93] C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards, J. Lumley, "Afterburner: architectural support for high-performance protocols", Technical Report, HP Laboratories Bristol, HPL-93-46, July 1993.

[EBBV95] T. von Eicken, A. Basu, V. Buch, and W. Vogels, "U-Net: A user-level network interface for parallel and distributed computing", Proc. of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain, Colorado, December 3-6, 1995.

[FDDI] International Standards Organization, "Fibre Distributed Data Interface", ISO/IEC 9314, committee drafts available from http://www.iso.org.

[FGM+99] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [FIBRE]    ANSI Technical Committee T10, "Fibre Channel Protocol
              (FCP)" (and as revised and updated), ANSI X3.269:1996
              [R2001], committee draft available from
              http://www.t10.org/drafts.htm#FibreChannel
        

[HP97] J. L. Hennessy, D. A. Patterson, Computer Organization and Design, 2nd Edition, San Francisco: Morgan Kaufmann Publishers, 1997.

[IB] InfiniBand Trade Association, "InfiniBand Architecture Specification, Volumes 1 and 2", Release 1.1, November 2002, available from http://www.infinibandta.org/specs.

[IPSEC] Kent, S. and R. Atkinson, "Security Architecture for the Internet Protocol", RFC 2401, November 1998.

[KP96] J. Kay, J. Pasquale, "Profiling and reducing processing overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol 4, No. 6, pp.817-828, December 1996.

[KSZ95] K. Kleinpaste, P. Steenkiste, B. Zill, "Software support for outboard buffering and checksumming", SIGCOMM'95.

[Ma02] K. Magoutis, "Design and Implementation of a Direct Access File System (DAFS) Kernel Server for FreeBSD", in Proceedings of USENIX BSDCon 2002 Conference, San Francisco, CA, February 11-14, 2002.

[MAF+02] K. Magoutis, S. Addetia, A. Fedorova, M. I. Seltzer, J. S. Chase, D. Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure and Performance of the Direct Access File System (DAFS)", in Proceedings of the 2002 USENIX Annual Technical Conference, Monterey, CA, June 9-14, 2002.

[Mc95] J. D. McCalpin, "A Survey of memory bandwidth and machine balance in current high performance computers", IEEE TCCA Newsletter, December 1995.

[PAC+97] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick , "A case for intelligient RAM: IRAM", IEEE Micro, April 1997.

[PDZ99] V. S. Pai, P. Druschel, W. Zwaenepoel, "IO-Lite: a unified I/O buffering and caching system", Proc. of the 3rd Symposium on Operating Systems Design and Implementation, New Orleans, LA, February 1999.

[Pi01] J. Pinkerton, "Winsock Direct: The Value of System Area Networks", May 2001, available from http://www.microsoft.com/windows2000/techinfo/ howitworks/communications/winsock.asp.

[Po81] Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981.

[QUAD] Quadrics Ltd., Quadrics QSNet product information, available from http://www.quadrics.com/website/pages/02qsn.html.

[SDP] InfiniBand Trade Association, "Sockets Direct Protocol v1.0", Annex A of InfiniBand Architecture Specification Volume 1, Release 1.1, November 2002, available from http://www.infinibandta.org/specs.

[SRVNET] R. Horst, "TNet: A reliable system area network", IEEE Micro, pp. 37-45, February 1995.

[STREAM] J. D. McCalpin, The STREAM Benchmark Reference Information, http://www.cs.virginia.edu/stream/.

[TK95] M. N. Thadani, Y. A. Khalidi, "An efficient zero-copy I/O framework for UNIX", Technical Report, SMLI TR-95-39, May 1995.

[TLS] Dierks, T. and C. Allen, "The TLS Protocol Version 1.0", RFC 2246, January 1999.

[VI] D. Cameron and G. Regnier, "The Virtual Interface Architecture", ISBN 0971288704, Intel Press, April 2002, more info at http://www.intel.com/intelpress/via/.

[Wa97] J. R. Walsh, "DART: Fast application-level networking via data-copy avoidance", IEEE Network, July/August 1997, pp. 28-38.

Authors' Addresses

   Stephen Bailey
   Sandburst Corporation
   600 Federal Street
   Andover, MA 01810 USA

   Phone: +1 978 689 1614
   EMail: steph@sandburst.com
        

   Jeffrey C. Mogul
   HP Labs
   Hewlett-Packard Company
   1501 Page Mill Road, MS 1117
   Palo Alto, CA 94304 USA

   Phone: +1 650 857 2206 (EMail preferred)
   EMail: JeffMogul@acm.org
        

   Allyn Romanow
   Cisco Systems, Inc.
   170 W. Tasman Drive
   San Jose, CA 95134 USA

   Phone: +1 408 525 8836
   EMail: allyn@cisco.com
        

   Tom Talpey
   Network Appliance
   1601 Trapelo Road
   Waltham, MA 02451 USA

   Phone: +1 781 768 5329
   EMail: thomas.talpey@netapp.com
        

Full Copyright Statement

Copyright (C) The Internet Society (2005).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.
