Network Working Group                                       G. Camarillo
Request for Comments: 3960                                      Ericsson
Category: Informational                                   H. Schulzrinne
                                                     Columbia University
                                                           December 2004

Early Media and Ringing Tone Generation in the Session Initiation Protocol (SIP)


Status of This Memo


This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.


Copyright Notice


Copyright (C) The Internet Society (2004).




Abstract


This document describes how to manage early media in the Session Initiation Protocol (SIP) using two models: the gateway model and the application server model. It also describes the inputs one needs to consider in defining local policies for ringing tone generation.


Table of Contents


   1.  Introduction
   2.  Session Establishment in SIP
   3.  The Gateway Model
       3.1.  Forking
       3.2.  Ringing Tone Generation
       3.3.  Absence of an Early Media Indicator
       3.4.  Applicability of the Gateway Model
   4.  The Application Server Model
       4.1.  In-Band Versus Out-of-Band Session Progress Information
   5.  Alert-Info Header Field
   6.  Security Considerations
   7.  Acknowledgments
   8.  References
       8.1.  Normative References
       8.2.  Informative References
       Authors' Addresses
       Full Copyright Statement
1. Introduction

Early media refers to media (e.g., audio and video) that is exchanged before a particular session is accepted by the called user. Within a dialog, early media occurs from the moment the initial INVITE is sent until the User Agent Server (UAS) generates a final response. It may be unidirectional or bidirectional, and can be generated by the caller, the callee, or both. Typical examples of early media generated by the callee are ringing tone and announcements (e.g., queuing status). Early media generated by the caller typically consists of voice commands or dual tone multi-frequency (DTMF) tones to drive interactive voice response (IVR) systems.


The basic SIP specification (RFC 3261 [1]) only supports very simple early media mechanisms. These simple mechanisms have a number of problems which relate to forking and security, and do not satisfy the requirements of most applications. This document goes beyond the mechanisms defined in RFC 3261 [1] and describes two models of early media implementations using SIP: the gateway model and the application server model.


Although both early media models described in this document are superior to the one specified in RFC 3261 [1], the gateway model still presents a set of issues. In particular, the gateway model does not work well with forking. Nevertheless, the gateway model is needed because some SIP entities (in particular, some gateways) cannot implement the application server model.


The application server model addresses some of the issues present in the gateway model. This model uses the early-session disposition type, which is specified in [2].


The remainder of this document is organized as follows: Section 2 describes the offer/answer model in the absence of early media, and Section 3 introduces the gateway model. In this model, the early media session is established using the early dialog established by the original INVITE. Sections 3.1, 3.2, and 3.4 describe the limitations of the gateway model and the scenarios where it is appropriate to use this model. Section 4 introduces the application server model, which, as stated previously, resolves some of the issues present in the gateway model. Section 5 discusses how the Alert-Info header field interacts with both early media models.


The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [9].


2. Session Establishment in SIP

Before presenting both early media models, we will briefly summarize how session establishment works in SIP. This will let us keep separate features that are intrinsic to SIP (e.g., media being played before the 200 (OK) to avoid media clipping) from early media operations.


SIP [1] uses the offer/answer model [3] to negotiate session parameters. One of the user agents - the offerer - prepares a session description that is called the offer. The other user agent - the answerer - responds with another session description called the answer. This two-way handshake allows both user agents to agree upon the session parameters to be used to exchange media.


The offer/answer model decouples the offer/answer exchange from the messages used to transport the session descriptions. For example, the offer can be sent in an INVITE request and the answer can arrive in the 200 (OK) response for that INVITE, or, alternatively, the offer can be sent in the 200 (OK) for an empty INVITE and the answer can be sent in the ACK. When reliable provisional responses [4] and UPDATE requests [5] are used, there are many more possible ways to exchange offers and answers.


Media clipping occurs when the user (or the machine generating media) believes that the media session is already established, but the establishment process has not finished yet. The user starts speaking (i.e., generating media) and the first few syllables or even the first few words are lost.


When the offer/answer exchange takes place in the 200 (OK) response and in the ACK, media clipping is unavoidable. The called user starts speaking at the same time the 200 (OK) is sent, but the UAS cannot send any media until the answer from the User Agent Client (UAC) arrives in the ACK.


On the other hand, media clipping does not appear in the most common offer/answer exchange (an INVITE with an offer and a 200 (OK) with an answer). UACs are ready to play incoming media packets as soon as they send an offer, because they cannot count on the reception of the 200 (OK) to start playing out media for the caller; SIP signalling and media packets typically traverse different paths, and so, media packets may arrive before the 200 (OK) response.


Another form of media clipping (not related to early media either) occurs in the caller-to-callee direction. When the callee picks up and starts speaking, the UAS sends a 200 (OK) response with an answer, in parallel with the first media packets. If the first media packets arrive at the UAC before the answer and the caller starts speaking, the UAC cannot send media until the 200 (OK) response from the UAS arrives.


3. The Gateway Model

SIP uses the offer/answer model to negotiate session parameters (as described in Section 2). An offer/answer exchange that takes place before a final response for the INVITE is sent establishes an "early" media session. Early media sessions terminate when a final response for the INVITE is sent. If the final response is a 200 (OK), the early media session transitions to a regular media session. If the final response is a non-200 class final response, the early media session is simply terminated.
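The fate of an early media session when the final response arrives can be sketched as a small function. This is only an illustration of the rule stated above; the function name and the string labels are hypothetical and do not come from any SIP library.

```python
def early_session_outcome(status_code: int) -> str:
    """Return what happens to an early media session when a final
    response with the given status code is sent for the INVITE.

    The labels "regular" and "terminated" are illustrative only.
    """
    if not 200 <= status_code <= 699:
        # 1xx responses are provisional; they do not end the early session.
        raise ValueError("not a final response: %d" % status_code)
    if status_code < 300:
        # 2xx: the early media session becomes a regular media session.
        return "regular"
    # Any other final response simply terminates the early media session.
    return "terminated"
```

For instance, a 486 (Busy Here) ends the early media session, while a 200 (OK) turns it into the regular session.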


Not surprisingly, media exchanged within an early media session is referred to as early media. The gateway model consists of managing early media sessions using offer/answer exchanges in reliable provisional responses, PRACKs, and UPDATEs.


The gateway model is seriously limited in the presence of forking, as described in Section 3.1. Therefore, its use is only acceptable when the User Agent (UA) cannot distinguish between early and regular media, as described in Section 3.4. In any other situation (the majority of UAs), use of the application server model described in Section 4 is strongly recommended instead.


3.1. Forking

In the absence of forking, assuming that the initial INVITE contains an offer, the gateway model does not introduce media clipping. Following normal SIP procedures, the UAC is ready to play any incoming media as soon as it sends the initial offer in the INVITE. The UAS sends the answer in a reliable provisional response and can send media as soon as there is media to send. Even if the first media packets arrive at the UAC before the 1xx response, the UAC will play them.


Note that, in some situations, the UAC needs to receive the answer before being able to play any media. UAs in such a situation (e.g., QoS, media authorization, or media encryption is required) use preconditions to avoid media clipping.


On the other hand, if the INVITE forks, the gateway model may introduce media clipping. This happens when the UAC receives different answers to its offer in several provisional responses from different UASs. The UAC has to deal with bandwidth limitations and early media session selection.


If the UAC receives early media from different UASs, it needs to present it to the user. If the early media consists of audio, playing several audio streams to the user at the same time may be confusing. On the other hand, other media types (e.g., video) can be presented to the user at the same time. For example, the UAC can build a mosaic with the different inputs.


However, even with media types that can be played at the same time to the user, if the UAC has limited bandwidth, it will not be able to receive early media from all the different UASs at the same time. Therefore, the UAC often needs to choose a single early media session and "mute" the others by sending UPDATE requests.


It is difficult to decide which early media sessions carry more important information from the caller's perspective. In fact, in some scenarios, the UA cannot even correlate media packets with their particular SIP early dialog. Therefore, UACs typically pick one early dialog randomly and mute the rest.
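A minimal sketch of this random selection follows. It assumes early dialogs are represented by opaque identifiers and leaves out the UPDATE requests that would actually mute the remaining sessions; the function name and parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def choose_early_dialog(
    dialog_ids: Sequence[str],
    rng: Optional[random.Random] = None,
) -> Tuple[str, List[str]]:
    """Pick one early dialog at random and return it together with the
    list of dialogs that the UAC would mute."""
    ids = list(dialog_ids)
    if not ids:
        raise ValueError("no early dialogs to choose from")
    rng = rng or random.Random()
    kept = rng.choice(ids)
    # Every other early dialog is a candidate for muting via UPDATE.
    muted = [d for d in ids if d != kept]
    return kept, muted
```

Passing a seeded `random.Random` makes the choice reproducible, which may be convenient in tests.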


If one of the early media sessions that was muted transitions to a regular media session (i.e., the UAS sends a 2xx response), media clipping is likely. The UAC typically sends an UPDATE with a new offer (upon reception of the 200 (OK) for the INVITE) to unmute the media session. The UAS cannot send any media until it receives the offer from the UAC. Therefore, if the callee starts speaking before the offer from the UAC is received, the callee's first words will be lost.


Having the UAS send the UPDATE to unmute the media session (instead of the UAC) does not avoid media clipping in the backward direction and it causes possible race conditions.


3.2. Ringing Tone Generation

In the PSTN, telephone switches typically play ringing tones for the caller, indicating that the callee is being alerted. When, where, and how these ringing tones are generated has been standardized (i.e., the local exchange of the callee generates a standardized ringing tone while the callee is being alerted). It makes sense for a standardized approach to provide this type of feedback for the user in a homogeneous environment such as the PSTN, where all the terminals have a similar user interface.


This homogeneity is not found among SIP user agents. SIP user agents have different capabilities, different user interfaces, and may be used to establish sessions that do not involve audio at all. Because of this, the way a SIP UA provides the user with information about the progress of session establishment is a matter of local policy. For example, a UA with a Graphical User Interface (GUI) may choose to display a message on the screen when the callee is being alerted, while another UA may choose to show a picture of a phone ringing instead. Many SIP UAs choose to imitate the user interface of the PSTN phones. They provide a ringing tone to the caller when the callee is being alerted. Such a UAC is supposed to generate ringing tones locally for its user as long as no early media is received from the UAS. If the UAS generates early media (e.g., an announcement or a special ringing tone), the UAC is supposed to play it rather than generate the ringing tone locally.

The problem is that, sometimes, it is not an easy task for a UAC to know whether it will be receiving early media or it should generate local ringing. A UAS can send early media without using reliable provisional responses (very simple UASs do that) or it can send an answer in a reliable provisional response without any intention of sending early media (this is the case when preconditions are used). Therefore, by only looking at the SIP signalling, a UAC cannot be sure whether or not there will be early media for a particular session. The UAC needs to check if media packets are arriving at a given moment.


An implementation could even choose to look at the contents of the media packets, since they could carry only silence or comfort noise.
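One narrow way to inspect packet contents is to look at the RTP payload type. The sketch below assumes the media is carried in RTP version 2 and only recognizes the static payload type 13 assigned to comfort noise (CN); it does not detect silence encoded with a regular audio codec, nor comfort noise negotiated on a dynamic payload type. The function names are hypothetical.

```python
COMFORT_NOISE_PT = 13  # static RTP payload type for comfort noise (CN)


def rtp_payload_type(packet: bytes) -> int:
    """Extract the payload type from the fixed RTP header."""
    if len(packet) < 12:
        raise ValueError("shorter than the fixed 12-byte RTP header")
    if packet[0] >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    # The second byte holds the marker bit (high bit) and the
    # 7-bit payload type.
    return packet[1] & 0x7F


def carries_comfort_noise(packet: bytes) -> bool:
    """Return True if the packet uses the static comfort noise type."""
    return rtp_payload_type(packet) == COMFORT_NOISE_PT
```

A UAC following this approach might treat a stream made up only of such packets as "no early media" when deciding whether to generate local ringing.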


With this in mind, a UAC should develop its local policy regarding local ringing generation. For example, a POTS ("Plain Old Telephone Service")-like SIP User Agent (UA) could implement the following local policy:


1. Unless a 180 (Ringing) response is received, never generate local ringing.


2. If a 180 (Ringing) has been received but there are no incoming media packets, generate local ringing.


3. If a 180 (Ringing) has been received and there are incoming media packets, play them and do not generate local ringing.


Note that a 180 (Ringing) response means that the callee is being alerted, and a UAS should send such a response if the callee is being alerted, regardless of the status of the early media session.
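The three-rule policy above can be sketched as a pure decision function. The enum and function names are hypothetical. The case where media arrives without a 180 (Ringing) is not covered by the three rules; the sketch assumes the UA plays the media anyway, following the general behavior described in Section 2.

```python
from enum import Enum


class CallerAudio(Enum):
    NOTHING = "nothing"
    LOCAL_RINGING = "local ringing"
    INCOMING_MEDIA = "incoming media"


def caller_audio(received_180: bool, media_arriving: bool) -> CallerAudio:
    """Decide what a POTS-like UAC plays to the caller right now."""
    if media_arriving:
        # Rule 3 (and, by assumption, the no-180 case): play the
        # incoming packets and do not generate local ringing.
        return CallerAudio.INCOMING_MEDIA
    if received_180:
        # Rule 2: the callee is being alerted and no media is arriving.
        return CallerAudio.LOCAL_RINGING
    # Rule 1: without a 180 (Ringing), never generate local ringing.
    return CallerAudio.NOTHING
```

Because the decision depends on whether media is arriving at a given moment, a UAC would re-evaluate it whenever a 180 (Ringing) is received or the media stream starts or stops.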


At first sight, such a policy may look difficult to implement in decomposed UAs (i.e., media gateway controller and media gateway), but this policy is the same as the one described in Section 2, which must be implemented by any UA. That is, any UA should play incoming media packets (and stop local ringing tone generation if it was being performed) in order to avoid media clipping, even if the 200 (OK) response has not arrived. So, the tools to implement this early media policy are already available to any UA that uses SIP.


Note that, while it is not desirable to standardize a common local policy to be followed by every SIP UA, a particular subset of more or less homogeneous SIP UAs could use the same local policy by convention. Examples of such subsets of SIP UAs may be "all the PSTN/SIP gateways" or "every 3GPP IMS (Third Generation Partnership Project IP Multimedia Subsystem) terminal". However, defining the particular common policy that such groups of SIP devices may use is outside the scope of this document.


3.3. Absence of an Early Media Indicator

SIP, as opposed to other signalling protocols, does not provide an early media indicator. That is, there is no information about the presence or absence of early media in SIP. Such an indicator could potentially be used to avoid the generation of a local ringing tone by the UAC when the UAS intends to provide an in-band ringing tone or some type of announcement. However, in the majority of cases, such an indicator would be of little use due to the way SIP works.


One important reason limiting the benefit of a potential early media indicator is the loose coupling between SIP signalling and the media path. SIP signalling traverses a different path than the media. The media path is typically optimized to reduce the end-to-end delay (e.g., minimum number of intermediaries), while the SIP signalling path typically traverses a number of proxies providing different services for the session. Hence, it is very likely that the media packets with early media reach the UAC before any SIP message that could contain an early media indicator.


Nevertheless, sometimes SIP responses arrive at the UAC before any media packet. There are situations in which the UAS intends to send early media but cannot do it straight away. For example, UAs using Interactive Connectivity Establishment (ICE) [6] may need to exchange several Simple Traversal of UDP through NAT (STUN) messages before being able to exchange media. In this situation, an early media indicator would keep the UAC from generating a local ringing tone during this time. However, while the early media is not arriving at the UAC, the user would not be aware that the remote user is being alerted, even though a 180 (Ringing) had been received. Therefore, a better solution would be to apply a local ringing tone until the early media packets could be sent from the UAS to the UAC. This solution does not require any early media indicator.


Note that migrations from local ringing tone to early media at the UAC happen in the presence of forking as well; one UAS sends a 180 (Ringing) response, and later, another UAS starts sending early media.


3.4. Applicability of the Gateway Model

Section 3 described some of the limitations of the gateway model. It produces media clipping in forking scenarios and requires media detection to generate local ringing properly. These issues are addressed by the application server model, described in Section 4, which is the recommended way of generating early media that is not continuous with the regular media generated during the session.


The gateway model is, therefore, acceptable in situations where the UA cannot distinguish between early media and regular media. A PSTN gateway is an example of this type of situation. The PSTN gateway receives media from the PSTN over a circuit, and sends it to the IP network. The gateway is not aware of the contents of the media, and it does not exactly know when the transition from early to regular media takes place. From the PSTN perspective, the circuit is a continuous source of media.


4. The Application Server Model

The application server model consists of having the UAS behave as an application server to establish early media sessions with the UAC. The UAC indicates support for the early-session disposition type (defined in [2]) using the early-session option tag. This way, UASs know that they can keep offer/answer exchanges for early media (early-session disposition type) separate from regular media (session disposition type).
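As a rough illustration, a UAS could check for the early-session option tag as follows. The sketch assumes the values of the Supported header fields of the INVITE have already been extracted as strings by some SIP parser; the function name is hypothetical, and the comparison is case-insensitive on the assumption that option tags follow SIP's default token rules.

```python
from typing import Iterable


def supports_early_session(supported_values: Iterable[str]) -> bool:
    """Return True if any Supported header field value lists the
    early-session option tag.

    Each element is the raw value of one Supported header field,
    for example "100rel, early-session".
    """
    for value in supported_values:
        # A single header field may carry a comma-separated tag list.
        for tag in value.split(","):
            if tag.strip().lower() == "early-session":
                return True
    return False
```

A UAS that finds the tag can place its early media offer in a body part with the early-session disposition type; otherwise it has to fall back to other behavior, such as the gateway model.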


Sending early media using a different offer/answer exchange than the one used for sending regular media helps avoid media clipping in cases of forking. The UAC can reject or mute