Network Working Group                                       J. Rosenberg
Request for Comments: 5390                                         Cisco
Category: Informational                                    December 2008
        

Requirements for Management of Overload in the Session Initiation Protocol

Status of This Memo

This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Copyright Notice

Copyright (c) 2008 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

Overload occurs in Session Initiation Protocol (SIP) networks when proxies and user agents have insufficient resources to complete the processing of a request. SIP provides limited support for overload handling through its 503 response code, which tells an upstream element that it is overloaded. However, numerous problems have been identified with this mechanism. This document summarizes the problems with the existing 503 mechanism, and provides some requirements for a solution.

Table of Contents

   1. Introduction
   2. Causes of Overload
   3. Terminology
   4. Current SIP Mechanisms
   5. Problems with the Mechanism
      5.1. Load Amplification
      5.2. Underutilization
      5.3. The Off/On Retry-After Problem
      5.4. Ambiguous Usages
   6. Solution Requirements
   7. Security Considerations
   8. Acknowledgements
   9. References
      9.1. Normative Reference
      9.2. Informative References
        
1. Introduction

Overload occurs in Session Initiation Protocol (SIP) [RFC3261] networks when proxies and user agents have insufficient resources to complete the processing of a request or a response. SIP provides limited support for overload handling through its 503 response code. This code allows a server to tell an upstream element that it is overloaded. However, numerous problems have been identified with this mechanism.

This document describes the general problem of SIP overload and reviews the current SIP mechanisms for dealing with overload. It then explains some of the problems with these mechanisms. Finally, the document provides a set of requirements for fixing these problems.

2. Causes of Overload

Overload occurs when an element, such as a SIP user agent or proxy, has insufficient resources to successfully process all of the traffic it is receiving. Resources include all of the capabilities of the element used to process a request, including CPU processing, memory, I/O, or disk resources. They can also include external resources such as a database or DNS server, in which case the CPU, memory, I/O, and disk resources of those servers are effectively part of the logical element processing the request. Overload can occur for many reasons, including:

Poor Capacity Planning: SIP networks need to be designed with sufficient numbers of servers, hardware, disks, and so on, in order to meet the needs of the subscribers they are expected to serve. Capacity planning is the process of determining these needs. It is based on the number of expected subscribers and the types of flows they are expected to use. If this work is not done properly, the network may have insufficient capacity to handle predictable usages, including regular usages and predictably high ones (such as high voice calling volumes on Mother's Day).

Dependency Failures: A SIP element can become overloaded because a resource on which it is dependent has failed or become overloaded, greatly reducing the logical capacity of the element. In these cases, even minimal traffic might cause the server to go into overload. Examples of such dependencies include DNS servers, databases, disks, and network interfaces.

Component Failures: A SIP element can become overloaded when it is a member of a cluster of servers that each share the load of traffic, and one or more of the other members in the cluster fail. In this case, the remaining elements take over the work of the failed elements. Normally, capacity planning takes such failures into account, and servers are typically run with enough spare capacity to handle failure of another element. However, unusual failure conditions can cause many elements to fail at once. This is often the case with software failures, where a bad packet or bad database entry hits the same bug in a set of elements in a cluster.

Avalanche Restart: One of the most troubling sources of overload is avalanche restart. This happens when a large number of clients all simultaneously attempt to connect to the network with a SIP registration. Avalanche restart can be caused by several events. One is the "Manhattan Reboots" scenario, where there is a power failure in a large metropolitan area, such as Manhattan. When power is restored, all of the SIP phones, whether in PCs or standalone devices, simultaneously power on and begin booting. They will all then connect to the network and register, causing a flood of SIP REGISTER messages. Another cause of avalanche restart is failure of a large network connection, for example, the access router for an enterprise. When it fails, SIP clients will detect the failure rapidly using the mechanisms in [OUTBOUND]. When connectivity is restored, this is detected, and clients re-REGISTER, all within a short time period. Another source of avalanche restart is failure of a proxy server. If clients had all connected to the server with TCP, its failure will be detected, followed by re-connection and re-registration to another server. Note that [OUTBOUND] does provide some remedies to this case.

Flash Crowds: A flash crowd occurs when an extremely large number of users all attempt to simultaneously make a call. One example of how this can happen is a television commercial that advertises a number to call to receive a free gift. If the gift is compelling and many people see the ad, many calls can be simultaneously made to the same number. This can send the system into overload.

Denial of Service (DoS) Attacks: An attacker, wishing to disrupt service in the network, can cause a large amount of traffic to be launched at a target server. This can be done from a central source of traffic or through a distributed DoS attack. In all cases, the volume of traffic well exceeds the capacity of the server, sending the system into overload.

Unfortunately, the overload problem tends to compound itself. When a network goes into overload, this can frequently cause failures of the elements that are trying to process the traffic. This causes even more load on the remaining elements. Furthermore, during overload, the overall capacity of functional elements goes down, since much of their resources are spent just rejecting or treating load that they cannot actually process. In addition, overload tends to cause SIP messages to be delayed or lost, which causes retransmissions to be sent, further increasing the amount of work in the network. This compounding factor can produce substantial multipliers on the load in the system. Indeed, in the case of UDP, with as many as seven retransmits of an INVITE request prior to timeout, overload can multiply the already-heavy message volume by as much as seven!
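
The UDP multiplier mentioned above can be illustrated with a small calculation. The sketch below is not part of the requirements; it assumes the default INVITE client transaction timers from RFC 3261 (T1 = 500 ms, a retransmit interval that starts at T1 and doubles each time, and a transaction timeout of 64*T1). The function name is hypothetical.

```python
T1 = 0.5  # seconds; assumed RFC 3261 default for T1


def invite_transmit_times(t1=T1):
    """Times (in seconds) at which an unanswered INVITE is sent over UDP."""
    timeout = 64 * t1   # Timer B: the transaction gives up at this point
    interval = t1       # Timer A: starts at T1 and doubles after each retransmit
    now, times = 0.0, []
    while now < timeout:
        times.append(now)
        now += interval
        interval *= 2
    return times


times = invite_transmit_times()
print(len(times), times)
```

Under these assumptions an unanswered INVITE is put on the wire seven times before the transaction times out, which is where the factor of seven comes from.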

3. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

4. Current SIP Mechanisms

SIP provides very basic support for overload. It defines the 503 response code, which is sent by an element that is overloaded. RFC 3261 defines it thus:

The server is temporarily unable to process the request due to a temporary overloading or maintenance of the server. The server MAY indicate when the client should retry the request in a Retry-After header field. If no Retry-After is given, the client MUST act as if it had received a 500 (Server Internal Error) response.

A client (proxy or UAC) receiving a 503 (Service Unavailable) SHOULD attempt to forward the request to an alternate server. It SHOULD NOT forward any other requests to that server for the duration specified in the Retry-After header field, if present.

Servers MAY refuse the connection or drop the request instead of responding with 503 (Service Unavailable).

The objective is to provide a mechanism to move the work of the overloaded server to another server so that the request can be processed. The Retry-After header field, when present, is meant to allow a server to tell an upstream element to back off for a period of time, so that the overloaded server can work through its backlog of work.
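
As a rough illustration of the client behavior quoted above, the following sketch tracks which servers have asked for a back-off and picks an alternate server in the meantime. The class and method names are hypothetical, time is passed in explicitly as seconds, and a 503 without Retry-After is simply not recorded, since RFC 3261 says to treat it like a 500.

```python
class BackoffTable:
    """Tracks servers that requested a back-off via 503 with Retry-After."""

    def __init__(self):
        self._blocked_until = {}  # server name -> time at which it may be used again

    def on_503(self, server, now, retry_after=None):
        # Without Retry-After the 503 is handled like a 500: no back-off period.
        if retry_after is not None:
            self._blocked_until[server] = now + retry_after

    def usable(self, server, now):
        return now >= self._blocked_until.get(server, 0.0)

    def pick(self, servers, now):
        """Return the first server not in back-off, or None if all are blocked."""
        for server in servers:
            if self.usable(server, now):
                return server
        return None


table = BackoffTable()
table.on_503("S1", now=0.0, retry_after=30)
print(table.pick(["S1", "S2"], now=10.0))  # S1 is backing off, so S2 is chosen
```

Note that the only states a server can be in are "usable" and "blocked"; this all-or-nothing property is the source of several of the problems discussed in Section 5.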

RFC 3261 also instructs proxies to not forward 503 responses upstream, at SHOULD NOT strength. This is to prevent the upstream server from mistakenly concluding that the proxy is overloaded when, in fact, the problem was an element further downstream.

5. Problems with the Mechanism

On the surface, the 503 mechanism seems workable. Unfortunately, this mechanism has had numerous problems in actual deployment. These problems are described here.

5.1. Load Amplification

The principal problem with the 503 mechanism is that it tends to substantially amplify the load in the network when the network is overloaded, causing further escalation of the problem and introducing the very real possibility of congestive collapse. Consider the topology in Figure 1.

                                         +------+
                                       > |      |
                                      /  |  S1  |
                                     /   |      |
                                    /    +------+
                                   /
                                  /
                                 /
                                /
                      +------+ /         +------+
            --------> |      |/          |      |
                      |  P1  |---------> |  S2  |
            --------> |      |\          |      |
                      +------+ \         +------+
                                \
                                 \
                                  \
                                   \
                                    \
                                     \   +------+
                                      \  |      |
                                       > |  S3  |
                                         |      |
                                         +------+
        

Figure 1

Proxy P1 receives SIP requests from many sources and acts solely as a load balancer, proxying the requests to servers S1, S2, and S3 for processing. The input load increases to the point where all three servers become overloaded. Server S1, when it receives its next request, generates a 503. However, because the server is loaded, it might take some time to generate the 503. If SIP is being run over UDP, this may result in request retransmissions, which further increase the work on S1. Even in the case of TCP, if the server is loaded and the kernel cannot send TCP acknowledgements fast enough, TCP retransmits may occur. When the 503 is received by P1, it retries the request on S2. S2 is also overloaded and eventually generates a 503, but in the interim may also be hit with retransmits. P1 once again tries another server, this time S3, which also eventually rejects it with a 503.

Thus, the processing of this request, which ultimately failed, involved four SIP transactions (client to P1, P1 to S1, P1 to S2, P1 to S3), each of which may have involved many retransmissions -- up to seven in the case of UDP. By comparison, under unloaded conditions, a single request from a client would generate one request (to S1, S2, or S3) and two responses (from S1 to P1, then P1 to the client). When the network is overloaded, a single request from the client, before timing out, could generate as many as 18 requests and as many responses when UDP is used! The situation is better with TCP (or any reliable transport in general), but even if there was never a TCP segment retransmitted, a single request from the client can generate three requests and four responses. Each server had to expend resources to process these messages. Thus, more messages and more work were sent into the network at the point at which the elements became overloaded. The 503 mechanism works well when a single element is overloaded. But when the problem is overall network load, the 503 mechanism actually generates more messages and more work for all servers, ultimately resulting in the rejection of the request anyway.
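
The reliable-transport figures above (one request and two responses when unloaded, three requests and four responses when all servers are overloaded) can be reproduced with a simple count. This is only an illustrative sketch for the Figure 1 topology; the function name is hypothetical and retransmissions are deliberately ignored.

```python
def count_messages(all_overloaded, n_servers=3):
    """Count requests sent by P1 and responses generated for one client request.

    Retransmissions are ignored, as with a reliable transport.
    """
    requests = responses = 0
    for _ in range(n_servers):
        requests += 1    # P1 forwards the request to the next server
        responses += 1   # that server answers (503 if overloaded)
        if not all_overloaded:
            break        # the first server handled the request
    responses += 1       # P1 answers the client
    return requests, responses


print(count_messages(all_overloaded=False))  # unloaded case
print(count_messages(all_overloaded=True))   # every server rejects with 503
```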

The problem becomes amplified further if one considers proxies upstream from P1, as shown in Figure 2.

                                +------+
                              > |      | <
                             /  |  S1  |  \\
                            /   |      |    \\
                           /    +------+      \\
                          /                     \
                         /                       \\
                        /                          \\
                       /                             \
            +------+  /         +------+           +------+
            |      | /          |      |           |      |
            |  P1  | ---------> |  S2  |<----------|  P2  |
            |      | \          |      |           |      |
            +------+  \         +------+           +------+
                ^      \                             / ^
                 \      \                          // /
                  \      \                       //  /
                   \      \                    //   /
                    \      \                  /    /
                     \      \   +------+    //    /
                      \      \  |      |  //     /
                       \      > |  S3  | <      /
                        \       |      |       /
                         \      +------+      /
                          \                  /
                           \                /
                            \              /
                             \            /
                              \          /
                               \        /
                                \      /
                                 \    /
                                +------+
                                |      |
                                |  PA  |
                                |      |
                                +------+
                                 ^   ^
                                 |   |
                                 |   |
        

Figure 2

Here, proxy PA receives requests and sends these to proxies P1 or P2. P1 and P2 both load balance across S1 through S3. Assuming again S1 through S3 are all overloaded, a request arrives at PA, which tries P1 first. P1 tries S1, S2, and then S3, and each transaction results in many request retransmits if UDP is used. Since P1 is ultimately unable to process the request, it rejects it. However, since all of its downstream dependencies are busy, it decides to send a 503. This propagates to PA, which tries P2, which tries S1 through S3 again, resulting in a 503 once more. Thus, in this case, we have doubled the number of SIP transactions and overall work in the network compared to the previous case. The problem here is that the fact that S1 through S3 were overloaded was known to P1, but this information was not passed back to PA and through to P2, so that P2 retries S1 through S3 again.
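
The doubling described above can be checked by counting how often an overloaded server is tried when every attempt ends in a 503. The sketch below encodes the Figure 2 topology as a dictionary; the names `TOPOLOGY` and `server_attempts` are hypothetical, and retransmissions are again ignored.

```python
# Figure 2 topology: each proxy maps to the downstream elements it tries in turn.
TOPOLOGY = {
    "PA": ["P1", "P2"],
    "P1": ["S1", "S2", "S3"],
    "P2": ["S1", "S2", "S3"],
}


def server_attempts(element, topology=TOPOLOGY):
    """Number of requests that reach a server when every server returns 503."""
    children = topology.get(element)
    if not children:
        return 1  # a server: it receives one request and rejects it
    # A proxy exhausts every downstream element before giving up.
    return sum(server_attempts(child, topology) for child in children)


print(server_attempts("P1"))  # Figure 1 case
print(server_attempts("PA"))  # Figure 2 case
```

Because P2 has no knowledge of what P1 already learned, each additional proxy at the same level adds another full pass over S1 through S3.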

5.2. Underutilization

Interestingly, there are also examples of deployments where the network capacity was greatly reduced as a consequence of the overload mechanism. Consider again Figure 1. Unfortunately, RFC 3261 is unclear on the scope of a 503. When it is received by P1, does the proxy cease sending requests to that IP address? To the hostname? To the URI? Some implementations have chosen the hostname as the scope. When the hostname for a URI points to an SRV record in the DNS, which, in turn, maps to a cluster of downstream servers (S1, S2, and S3 in the example), a 503 response from a single one of them will make the proxy believe that the entire cluster is overloaded. Consequently, proxy P1 will cease sending any traffic to any element in the cluster, even though there are elements in the cluster that are underutilized.
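
The effect of the scope choice can be made concrete with a small sketch. The hostname and addresses below are placeholders standing in for an SRV lookup that returns S1, S2, and S3, and the function is a hypothetical illustration rather than the behavior of any particular implementation.

```python
# Hypothetical resolution result: one hostname mapping to S1, S2, and S3.
CLUSTER = {"sip.example.com": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]}


def usable_targets(blocked, scope, cluster=CLUSTER):
    """Addresses P1 may still use after 503s from the (hostname, address) pairs in `blocked`.

    `scope` is "hostname" or "address" and controls how widely a 503 is applied.
    """
    blocked_hosts = {host for host, _ in blocked}
    result = []
    for host, addresses in cluster.items():
        for address in addresses:
            if scope == "hostname" and host in blocked_hosts:
                continue  # one 503 disables every server behind the name
            if scope == "address" and (host, address) in blocked:
                continue  # only the server that sent the 503 is disabled
            result.append(address)
    return result


got_503_from = {("sip.example.com", "192.0.2.1")}  # S1 rejected a request
print(usable_targets(got_503_from, scope="hostname"))
print(usable_targets(got_503_from, scope="address"))
```

With hostname scope, a single 503 from S1 leaves P1 with no usable targets, even though S2 and S3 may have spare capacity.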

5.3. The Off/On Retry-After Problem

The Retry-After mechanism allows a server to tell an upstream element to stop sending traffic for a period of time. The work that would have otherwise been sent to that server is instead sent to another server. The mechanism is an all-or-nothing technique. A server can turn off all traffic towards it, or none. There is nothing in between. This tends to cause highly oscillatory behavior under even mild overload. Consider a proxy P1 that is balancing requests between two servers S1 and S2. The input load just reaches the point where both S1 and S2 are at 100% capacity. A request arrives at P1 and is sent to S1. S1 rejects this request with a 503, and decides to use Retry-After to clear its backlog. P1 stops sending all traffic to S1. Now, S2 gets traffic, but it is seriously overloaded -- at 200% capacity! It decides to reject a request with a 503 and a Retry-After, which now forces P1 to reject all traffic until S1's Retry-After timer expires. At that point, all load is shunted back to S1, which reaches overload, and the cycle repeats.
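
The oscillation can be seen in a toy, tick-based simulation of the two-server example. All numbers here are assumptions chosen for illustration: the offered load is slightly above the combined capacity, the Retry-After period is three ticks, and the proxy is assumed to react to only the first 503 it sees in each tick.

```python
def simulate(offered=210, capacity=100, retry_after=3, ticks=8):
    """Return, for each tick, the load seen by S1 and S2 (None means backed off)."""
    off_until = {"S1": 0, "S2": 0}  # tick at which each server accepts traffic again
    history = []
    for t in range(ticks):
        active = [s for s in off_until if t >= off_until[s]]
        share = offered / len(active) if active else 0
        history.append({s: (share if s in active else None) for s in off_until})
        # Assumption: the proxy honors the first 503 with Retry-After in this tick.
        for s in active:
            if share > capacity:
                off_until[s] = t + retry_after
                break
    return history


for tick, loads in enumerate(simulate()):
    print(tick, loads)
```

In this model each server alternates between receiving the entire offered load and receiving nothing, with ticks in between where no server is usable and P1 must reject everything, which mirrors the cycle described above.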

It is important to note that this problem occurs only for servers that receive traffic from a small number of upstream elements, as is the case in these examples. If a proxy is accessed by a large number of clients, each of which sends a small amount of traffic, the 503 mechanism with Retry-After is quite effective when utilized with a subset of the clients. This is because spreading the 503 out amongst the clients has the effect of providing the proxy more fine-grained controls on the amount of work it receives.

5.4. Ambiguous Usages

Unfortunately, the specific instances under which a server is to send a 503 are ambiguous. The result is that implementations generate 503 for many reasons, only some of which are related to actual overload. For example, RFC 3398 [RFC3398], which specifies interworking from SIP to ISDN User Part (ISUP), defines the usage of 503 when the gateway receives certain ISUP cause codes from downstream switches. In these cases, the gateway has ample capacity; it's just that this specific request could not be processed because of a downstream problem. All subsequent requests might succeed if they take a different route in the Public Switched Telephone Network (PSTN).

This causes two problems. First, during periods of overload, it exacerbates the problems above because it causes additional 503 responses to be fed into the system, causing further work to be generated in conditions of overload. Second, it becomes hard for an upstream element to know whether to retry when a 503 is received. There are classes of failures where trying on another server won't help, since the reason for the failure was that a common downstream resource is unavailable. For example, if servers S1 and S2 share a database and the database fails, a request sent to S1 will result in a 503, but retrying on S2 won't help since the same database is unavailable.

6. Solution Requirements

In this section, we propose requirements for an overload control mechanism for SIP that addresses these problems.

REQ 1: The overload mechanism shall strive to maintain the overall useful throughput (taking into consideration the quality-of-service needs of the using applications) of a SIP server at reasonable levels, even when the incoming load on the network is far in excess of its capacity. The overall throughput under load is the ultimate measure of the value of an overload control mechanism.

REQ 2: When a single network element fails, goes into overload, or suffers from reduced processing capacity, the mechanism should strive to limit the impact of this on other elements in the network. This helps to prevent a small-scale failure from becoming a widespread outage.

REQ 3: The mechanism should seek to minimize the amount of configuration required in order to work. For example, it is better to avoid needing to configure a server with its SIP message throughput, as these kinds of quantities are hard to determine.

REQ 4: The mechanism must be capable of dealing with elements that do not support it, so that a network can consist of a mix of elements that do and don't support it. In other words, the mechanism should not work only in environments where all elements support it. It is reasonable to assume that it works better in such environments, of course. Ideally, there should be incremental improvements in overall network throughput as increasing numbers of elements in the network support the mechanism.

REQ 5: The mechanism should not assume that it will only be deployed in environments with completely trusted elements. It should seek to operate as effectively as possible in environments where other elements are malicious; this includes preventing malicious elements from obtaining more than a fair share of service.

REQ 6: When overload is signaled by means of a specific message, the message must clearly indicate that it is being sent because of overload, as opposed to other, non overload-based failure conditions. This requirement is meant to avoid some of the problems that have arisen from the reuse of the 503 response code for multiple purposes. Of course, overload is also signaled by lack of response to requests. This requirement applies only to explicit overload signals.

REQ 6:当通过特定消息发出过载信号时,该消息必须清楚地表明是由于过载而发送的,而不是其他基于非过载的故障情况。此要求旨在避免因将503响应代码用于多种用途而产生的一些问题。当然,过载的信号还包括对请求的响应不足。此要求仅适用于明确的过载信号。

REQ 7: The mechanism shall provide a way for an element to throttle the amount of traffic it receives from an upstream element. This throttling shall be graded so that it is not all-or-nothing as with the current 503 mechanism. This recognizes the fact that "overload" is not a binary state and that there are degrees of overload.

REQ 7:该机制应为元素提供一种方式,以限制其从上游元素接收的通信量。该节流应分级,而不是像当前503机制那样要么全有、要么全无。这认识到“过载”不是二元状态,并且存在不同程度的过载。

REQ 8: The mechanism shall ensure that, when a request was not processed successfully due to overload (or failure) of a downstream element, the request will not be retried on another element that is also overloaded or whose status is unknown. This requirement derives from REQ 1.

REQ 8:该机制应确保,当由于下游元件过载(或故障)导致请求未成功处理时,不会在另一个同样过载或状态未知的元件上重试该请求。此要求源自REQ 1。

REQ 9: That a request has been rejected from an overloaded element shall not unduly restrict the ability of that request to be submitted to and processed by an element that is not overloaded. This requirement derives from REQ 1.

REQ 9:来自过载元素的请求被拒绝不应过度限制该请求提交给未过载元素并由其处理的能力。此要求源自REQ 1。

REQ 10: The mechanism should support servers that receive requests from a large number of different upstream elements, where the set of upstream elements is not enumerable.

REQ 10:该机制应支持从大量不同上游元素接收请求的服务器,其中上游元素集不可枚举。

REQ 11: The mechanism should support servers that receive requests from a finite set of upstream elements, where the set of upstream elements is enumerable.

REQ 11:该机制应支持从有限的上游元素集接收请求的服务器,其中上游元素集是可枚举的。

REQ 12: The mechanism should work between servers in different domains.

REQ 12:该机制应在不同域中的服务器之间工作。

REQ 13: The mechanism must not dictate a specific algorithm for prioritizing the processing of work within a proxy during times of overload. It must permit a proxy to prioritize requests based on any local policy, so that certain ones (such as a call for emergency services or a call with a specific value of the Resource-Priority header field [RFC4412]) are given preferential treatment, such as not being dropped, being given additional retransmission, or being processed ahead of others.

REQ 13:该机制不得规定在过载期间对代理内的工作处理进行优先级排序的特定算法。它必须允许代理根据任何本地策略对请求进行优先级排序,以便对某些请求(例如紧急服务呼叫或具有资源优先级头字段[RFC4412]特定值的呼叫)给予优先处理,例如不被丢弃、被给予额外重传或被提前处理。

REQ 14: The mechanism should provide unambiguous directions to clients on when they should retry a request and when they should not. This especially applies to TCP connection establishment and SIP registrations, in order to mitigate against avalanche restart.

REQ 14:该机制应向客户端提供明确的指示,说明何时应重试请求,何时不应重试请求。这尤其适用于TCP连接建立和SIP注册,以缓解雪崩重启。

REQ 15: In cases where a network element fails, is so overloaded that it cannot process messages, or cannot communicate due to a network failure or network partition, it will not be able to provide explicit indications of the nature of the failure or its levels of congestion. The mechanism must properly function in these cases.

REQ 15:如果一个网元发生故障,过载到无法处理消息,或者由于网络故障或网络分区而无法通信,那么它将无法提供故障性质或拥塞级别的明确指示。在这些情况下,该机制必须正常运作。

REQ 16: The mechanism should attempt to minimize the overhead of the overload control messaging.

REQ 16:该机制应尝试最小化过载控制消息传递的开销。

REQ 17: The overload mechanism must not provide an avenue for malicious attack, including DoS and DDoS attacks.

REQ 17:过载机制不得提供恶意攻击的途径,包括DoS和DDoS攻击。

REQ 18: The overload mechanism should be unambiguous about whether a load indication applies to a specific IP address, host, or URI, so that an upstream element can determine the load of the entity to which a request is to be sent.

REQ 18:过载机制应该明确负载指示是否适用于特定的IP地址、主机或URI,以便上游元素可以确定要向其发送请求的实体的负载。

REQ 19: The specification for the overload mechanism should give guidance on which message types might be desirable to process over others during times of overload, based on SIP-specific considerations. For example, it may be more beneficial to process a SUBSCRIBE refresh with Expires of zero than a SUBSCRIBE refresh with a non-zero expiration (since the former reduces the overall amount of load on the element), or to process re-INVITEs over new INVITEs.

REQ 19:过载机制的规范应根据SIP的具体考虑,给出过载期间哪些消息类型可能需要优先于其他类型处理的指导。例如,处理Expires为零的SUBSCRIBE刷新可能比处理过期时间非零的SUBSCRIBE刷新更有益(因为前者减少了元素上的总负载量),或者优先于新的INVITE处理re-INVITE。

REQ 20: In a mixed environment of elements that do and do not implement the overload mechanism, no disproportionate benefit shall accrue to the users or operators of the elements that do not implement the mechanism.

REQ 20:在实施和未实施过载机制的元件的混合环境中,未实施过载机制的元件的用户或操作员不得获得不相称的利益。

REQ 21: The overload mechanism should ensure that the system remains stable. When the offered load drops from above the overall capacity of the network to below the overall capacity, the throughput should stabilize and become equal to the offered load.

REQ 21:过载机制应确保系统保持稳定。当提供的负载从网络的总容量以上下降到总容量以下时,吞吐量应稳定并与提供的负载相等。

REQ 22: It must be possible to disable the reporting of load information towards upstream targets based on the identity of those targets. This allows a domain administrator who considers the load of their elements to be sensitive information, to restrict access to that information. Of course, in such cases, there is no expectation that the overload mechanism itself will help prevent overload from that upstream target.

REQ 22:必须能够根据上游目标的标识禁用向这些目标报告负载信息。这允许将其元素的负载视为敏感信息的域管理员限制对该信息的访问。当然,在这种情况下,我们并不期望过载机制本身能够帮助防止来自上游目标的过载。

REQ 23: It must be possible for the overload mechanism to work in cases where there is a load balancer in front of a farm of proxies.

REQ 23:在代理服务器群前面有负载平衡器的情况下,过载机制必须能够工作。
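
As a non-normative illustration of REQ 7 and REQ 13, the sketch below shows one way an upstream element might apply a graded throttle. The class name GradedThrottle, the drop_fraction value, and the priority flag are hypothetical and are not defined by this document; how a downstream element would signal such a value is left to the eventual mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class GradedThrottle:
    """Upstream-side throttle driven by a reported drop fraction.

    drop_fraction is an assumed value in [0.0, 1.0]; this document does
    not define how it would be signaled by the downstream element.
    """

    drop_fraction: float = 0.0
    # Accumulated "drop debt"; gives evenly spaced, deterministic drops.
    _credit: float = field(default=0.0, init=False, repr=False)

    def update(self, drop_fraction: float) -> None:
        # Clamp so that a malformed report cannot push the value out of range.
        self.drop_fraction = min(1.0, max(0.0, drop_fraction))

    def admit(self, priority: bool = False) -> bool:
        """Return True if the request should be forwarded downstream."""
        if priority:
            # Local policy (REQ 13): for example, emergency calls are
            # never withheld by the throttle.
            return True
        self._credit += self.drop_fraction
        if self._credit >= 1.0:
            # Enough debt has accumulated: withhold this request.
            self._credit -= 1.0
            return False
        return True


throttle = GradedThrottle()
throttle.update(0.25)  # downstream asks for a 25% reduction
forwarded = sum(throttle.admit() for _ in range(100))
```

With a reported drop fraction of 0.25, this sketch withholds every fourth non-priority request rather than all of them, while requests flagged as priority by local policy are always forwarded.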

7. Security Considerations
7. 安全考虑

Like all protocol mechanisms, a solution for overload handling must protect against malicious inside and outside attacks. This document includes requirements for such security functions.

与所有协议机制一样,过载处理解决方案必须防止恶意的内部和外部攻击。本文件包括此类安全功能的要求。

Any mechanism that improves the behavior of SIP elements under load will result in more predictable performance in the face of application-layer denial-of-service attacks.

任何改善SIP元素在负载下行为的机制都会在应用层拒绝服务攻击面前带来更可预测的性能。

8. Acknowledgements
8. 致谢

The author would like to thank Steve Mayer, Mouli Chandramouli, Robert Whent, Mark Perkins, Joe Stone, Vijay Gurbani, Steve Norreys, Volker Hilt, Spencer Dawkins, Matt Mathis, Juergen Schoenwaelder, and Dale Worley for their contributions to this document.

作者要感谢史蒂夫·梅尔、穆利·钱德拉穆利、罗伯特·温特、马克·珀金斯、乔·斯通、维杰·古巴尼、史蒂夫·诺里斯、沃尔克·希尔特、斯宾塞·道金斯、马特·马蒂斯、尤尔根·舍恩瓦埃尔德和戴尔·沃利对本文件的贡献。

9. References
9. 参考文献
9.1. Normative Reference
9.1. 规范性引用文件

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2119]Bradner,S.,“RFC中用于表示需求水平的关键词”,BCP 14,RFC 2119,1997年3月。

9.2. Informative References
9.2. 资料性引用

[OUTBOUND] Jennings, C. and R. Mahy, "Managing Client Initiated Connections in the Session Initiation Protocol (SIP)", Work in Progress, October 2008.

[OUTBOUND]Jennings,C.和R.Mahy,“在会话启动协议(SIP)中管理客户端启动的连接”,正在进行的工作,2008年10月。

[RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002.

[RFC3261]Rosenberg,J.,Schulzrinne,H.,Camarillo,G.,Johnston,A.,Peterson,J.,Sparks,R.,Handley,M.,和E.Schooler,“SIP:会话启动协议”,RFC 3261,2002年6月。

[RFC3398] Camarillo, G., Roach, A., Peterson, J., and L. Ong, "Integrated Services Digital Network (ISDN) User Part (ISUP) to Session Initiation Protocol (SIP) Mapping", RFC 3398, December 2002.

[RFC3398]Camarillo,G.,Roach,A.,Peterson,J.,和L.Ong,“综合业务数字网(ISDN)用户部分(ISUP)到会话发起协议(SIP)的映射”,RFC 3398,2002年12月。

[RFC4412] Schulzrinne, H. and J. Polk, "Communications Resource Priority for the Session Initiation Protocol (SIP)", RFC 4412, February 2006.

[RFC4412]Schulzrinne,H.和J.Polk,“会话启动协议(SIP)的通信资源优先级”,RFC 4412,2006年2月。

Author's Address

作者地址

   Jonathan Rosenberg
   Cisco
   Edison, NJ
   US

   Jonathan Rosenberg
   Cisco
   Edison,美国新泽西州

   EMail: jdrosen@cisco.com
   URI:   http://www.jdrosen.net
        