Network Working Group                                        Q. Xie, Ed.
Request for Comments: 3557                                Motorola, Inc.
Category: Standards Track                                      July 2003
RTP Payload Format for European Telecommunications Standards Institute (ETSI) European Standard ES 201 108 Distributed Speech Recognition Encoding
Status of this Memo
This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2003). All Rights Reserved.
Abstract
This document specifies an RTP payload format for encapsulating European Telecommunications Standards Institute (ETSI) European Standard (ES) 201 108 front-end signal processing feature streams for distributed speech recognition (DSR) systems.
Table of Contents
1. Conventions and Acronyms
2. Introduction
   2.1. ETSI ES 201 108 DSR Front-end Codec
   2.2. Typical Scenarios for Using DSR Payload Format
3. ES 201 108 DSR RTP Payload Format
   3.1. Consideration on Number of FPs in Each RTP Packet
   3.2. Support for Discontinuous Transmission
4. Frame Pair Formats
   4.1. Format of Speech and Non-speech FPs
   4.2. Format of Null FP
   4.3. RTP header usage
5. IANA Considerations
   5.1. Mapping MIME Parameters into SDP
6. Security Considerations
7. Contributors
8. Acknowledgments
9. References
   9.1. Normative References
   9.2. Informative References
10. IPR Notices
11. Authors' Addresses
12. Editor's Address
13. Full Copyright Statement
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
The following acronyms are used in this document:
DSR - Distributed Speech Recognition
ETSI - the European Telecommunications Standards Institute
FP - Frame Pair
DTX - Discontinuous Transmission
Motivated by technology advances in the field of speech recognition, voice interfaces to services (such as airline information systems, unified messaging) are becoming more prevalent. In parallel, the popularity of mobile devices has also increased dramatically.
However, the voice codecs typically employed in mobile devices were designed to optimize audible voice quality and not speech recognition accuracy, and using these codecs with speech recognizers can result in poor recognition performance. For systems that can be accessed from heterogeneous networks using multiple speech codecs, recognition system designers are further challenged to accommodate the characteristics of these differences in a robust manner. Channel errors and lost data packets in these networks result in further degradation of the speech signal.
In traditional systems as described above, the entire speech recognizer lies on the server. It is forced to use incoming speech in whatever condition it arrives after the network decodes the vocoded speech. To address this problem, we use a distributed speech recognition (DSR) architecture. In such a system, the remote device acts as a thin client, also known as the front-end, in communication with a speech recognition server, also called a speech engine. The remote device processes the speech, compresses the data, and adds error protection to the bitstream in a manner optimal for speech recognition. The speech engine then uses this representation directly, minimizing the signal processing necessary and benefiting from enhanced error concealment.
To achieve interoperability with different client devices and speech engines, a common format is needed. Within the "Aurora" DSR working group of the European Telecommunications Standards Institute (ETSI), a payload has been defined and was published as a standard [ES201108] in February 2000.
For voice dialogues between a caller and a voice service, low latency is a high priority along with accurate speech recognition. While jitter in the speech recognizer input is not particularly important, many issues related to speech interaction over an IP-based connection are still relevant. Therefore, it is desirable to use the DSR payload in an RTP-based session.
The ETSI Standard ES 201 108 for DSR [ES201108] defines a signal processing front-end and compression scheme for speech input to a speech recognition system. Some relevant characteristics of this ETSI DSR front-end codec are summarized below.
The coding algorithm, a standard mel-cepstral technique common to many speech recognition systems, supports three raw sampling rates: 8 kHz, 11 kHz, and 16 kHz. The mel-cepstral calculation is a frame-based scheme that produces an output vector every 10 ms.
After calculation of the mel-cepstral representation, the representation is first quantized via split-vector quantization to reduce the data rate of the encoded stream. Then, the quantized vectors from two consecutive frames are put into an FP, as described in more detail in Section 4.1.
The diagrams in Figure 1 show some typical use scenarios of the ES 201 108 DSR RTP payload format.
+--------+                     +----------+
|IP USER |  IP/UDP/RTP/DSR     |IP SPEECH |
|TERMINAL|-------------------->|  ENGINE  |
|        |                     |          |
+--------+                     +----------+
a) IP user terminal to IP speech engine
+--------+  DSR over      +-------+                +----------+
| Non-IP |  Circuit link  |       | IP/UDP/RTP/DSR |IP SPEECH |
|  USER  |:::::::::::::::>|GATEWAY|--------------->|  ENGINE  |
|TERMINAL|  ETSI payload  |       |                |          |
+--------+  format        +-------+                +----------+
b) non-IP user terminal to IP speech engine via a gateway
+--------+                  +-------+  DSR over       +----------+
|IP USER |  IP/UDP/RTP/DSR  |       |  circuit link   |  Non-IP  |
|TERMINAL|----------------->|GATEWAY|::::::::::::::::>|  SPEECH  |
|        |                  |       |  ETSI payload   |  ENGINE  |
+--------+                  +-------+  format         +----------+
c) IP user terminal to non-IP speech engine via a gateway
Figure 1: Typical Scenarios for Using DSR Payload Format.
For the different scenarios in Figure 1, the speech recognizer always resides in the speech engine. A DSR front-end encoder inside the User Terminal performs front-end speech processing and sends the resultant data to the speech engine in the form of "frame pairs" (FPs). Each FP contains two sets of encoded speech vectors representing 20ms of original speech.
An ES 201 108 DSR RTP payload datagram consists of a standard RTP header [RFC3550] followed by a DSR payload. The DSR payload itself is formed by concatenating a series of ES 201 108 DSR FPs (defined in Section 4).
FPs are always packed bit-contiguously into the payload octets beginning with the most significant bit. For ES 201 108 front-end, the size of each FP is 96 bits or 12 octets (see Sections 4.1 and 4.2). This ensures that a DSR payload will always end on an octet boundary.
The following example shows a DSR RTP datagram carrying a DSR payload containing three 96-bit-long FPs (bit 0 is the MSB):
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
\                                                               \
/                    RTP header in [RFC3550]                    /
\                                                               \
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|                                                               |
+                                                               +
|                        FP #1 (96 bits)                        |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                        FP #2 (96 bits)                        |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                        FP #3 (96 bits)                        |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 2. An example of an ES 201 108 DSR RTP payload.
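The fixed FP size makes payload parsing straightforward. The following is a minimal Python sketch, not part of the specification, that splits a DSR payload into its 12-octet FPs. The function name split_frame_pairs is hypothetical, and the caller is assumed to have already removed the RTP header and any RTP padding.

```python
from typing import List

FP_OCTETS = 12  # each ES 201 108 frame pair is 96 bits


def split_frame_pairs(payload: bytes) -> List[bytes]:
    """Split a DSR payload (RTP header already removed) into 12-octet FPs."""
    if len(payload) % FP_OCTETS != 0:
        # A well-formed payload always ends on an FP boundary.
        raise ValueError(
            f"payload length {len(payload)} is not a multiple of {FP_OCTETS}"
        )
    return [payload[i:i + FP_OCTETS] for i in range(0, len(payload), FP_OCTETS)]
```

For the three-FP payload of Figure 2, the function would return a list of three 12-octet byte strings in transmission order.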
The number of FPs per payload packet should be determined by the latency and bandwidth requirements of the DSR application using this payload format. In particular, using a smaller number of FPs per payload packet in a session will result in lowered bandwidth efficiency due to the RTP/UDP/IP header overhead, while using a larger number of FPs per packet will cause longer end-to-end delay and hence increased recognition latency. Furthermore, carrying a larger number of FPs per packet will increase the possibility of catastrophic packet loss; the loss of a large number of consecutive FPs is a situation most speech recognizers have difficulty dealing with.
It is therefore RECOMMENDED that the number of FPs per DSR payload packet be minimized, subject to meeting the application's requirements on network bandwidth efficiency. RTP header compression techniques, such as those defined in [RFC2508] and [RFC3095], should be considered to improve network bandwidth efficiency.
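The trade-off can be made concrete with a small calculation. The Python sketch below is illustrative only. It assumes uncompressed IPv4 (20 octets), UDP (8 octets), and RTP (12 octets) headers with no IP options and no CSRC list, and it ignores link-layer overhead. The function name packet_tradeoff is hypothetical.

```python
# Assumption: IPv4 + UDP + RTP headers, no options, no CSRC list.
IP_UDP_RTP_HEADER_OCTETS = 20 + 8 + 12
FP_OCTETS = 12        # size of one frame pair
FP_DURATION_MS = 20   # speech represented by one frame pair


def packet_tradeoff(fps_per_packet: int) -> dict:
    """Return packetization delay, payload fraction, and bit rate for a packet size."""
    if fps_per_packet < 1:
        raise ValueError("at least one FP per packet is required")
    payload = fps_per_packet * FP_OCTETS
    total = IP_UDP_RTP_HEADER_OCTETS + payload
    duration_ms = fps_per_packet * FP_DURATION_MS
    return {
        "packetization_delay_ms": duration_ms,
        "payload_fraction": payload / total,
        "bit_rate_bps": total * 8 * 1000 // duration_ms,
    }
```

Under these assumptions, one FP per packet costs 20800 bit/s with 20 ms of packetization delay, while four FPs per packet cost 8800 bit/s with 80 ms of delay and a larger loss burst when a packet is dropped.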
The DSR RTP payloads may be used to support discontinuous transmission (DTX) of speech, which allows DSR FPs to be sent only when speech has been detected at the terminal equipment.
In DTX a set of DSR frames coding an unbroken speech segment transmitted from the terminal to the server is called a transmission segment. A DSR frame inside such a transmission segment can be either a speech frame or a non-speech frame, depending on the nature of the section of the speech signal it represents.
The end of a transmission segment is determined at the sending end equipment when the number of consecutive non-speech frames exceeds a pre-set threshold, called the hangover time. A typical value used for the hangover time is 1.5 seconds.
After all FPs in a transmission segment are sent, the front-end SHOULD indicate the end of the current transmission segment by sending one or more Null FPs (defined in Section 4.2).
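The hangover rule can be sketched as a simple counter over per-frame speech decisions. The Python example below is a hedged illustration, not the ETSI algorithm. It assumes one speech/non-speech decision per 10 ms frame, so the typical 1.5 second hangover corresponds to 150 frames. The function name and the form of the input are hypothetical, and how speech is detected is outside its scope.

```python
from typing import Iterable, Optional


def transmission_segment_end(is_speech: Iterable[bool],
                             hangover_frames: int = 150) -> Optional[int]:
    """Return the index of the frame at which the segment ends, or None.

    The segment ends at the first frame where the run of consecutive
    non-speech frames exceeds hangover_frames (150 frames = 1.5 s at 10 ms).
    """
    run = 0  # length of the current run of non-speech frames
    for index, speech in enumerate(is_speech):
        run = 0 if speech else run + 1
        if run > hangover_frames:
            return index
    return None
```

Once this point is reached, a sender following the text above would flush the remaining FPs of the segment and then send one or more Null FPs.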
The mel-cepstral frame format defined in [ES201108] MUST be used.
As defined in [ES201108], pairs of the quantized 10ms mel-cepstral frames MUST be grouped together and protected with a 4-bit CRC, forming a 92-bit long FP:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Frame #1 (44 bits)                       |
+                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       |          Frame #2 (44 bits)           |
+-+-+-+-+-+-+-+-+-+-+-+-+                       +-+-+-+-+-+-+-+-+
|                                               |  CRC  |0|0|0|0|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The length of each frame is 44 bits representing 10ms of voice. The following mel-cepstral frame formats MUST be used when forming an FP:
Frame #1 in FP:
===============
 (MSB)                                     (LSB)
   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
:  idx(2,3) |             idx(0,1)              |    Octet 1
+-----+-----+-----+-----+-----+-----+-----+-----+
:       idx(4,5)        |    idx(2,3) (cont)    :    Octet 2
+-----+-----+-----+-----+-----+-----+-----+-----+
|             idx(6,7)              |idx(4,5)(cont)  Octet 3
+-----+-----+-----+-----+-----+-----+-----+-----+
 idx(10,11) |             idx(8,9)              |    Octet 4
+-----+-----+-----+-----+-----+-----+-----+-----+
:      idx(12,13)       |   idx(10,11) (cont)   :    Octet 5
+-----+-----+-----+-----+-----+-----+-----+-----+
                        |   idx(12,13) (cont)   :    Octet 6/1
                        +-----+-----+-----+-----+
Frame #2 in FP:
===============
 (MSB)                                     (LSB)
   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+
:       idx(0,1)        |                            Octet 6/2
+-----+-----+-----+-----+-----+-----+-----+-----+
|             idx(2,3)              |idx(0,1)(cont)  Octet 7
+-----+-----+-----+-----+-----+-----+-----+-----+
:  idx(6,7) |             idx(4,5)              |    Octet 8
+-----+-----+-----+-----+-----+-----+-----+-----+
:       idx(8,9)        |    idx(6,7) (cont)    :    Octet 9
+-----+-----+-----+-----+-----+-----+-----+-----+
|            idx(10,11)             |idx(8,9)(cont)  Octet 10
+-----+-----+-----+-----+-----+-----+-----+-----+
|                  idx(12,13)                   |    Octet 11
+-----+-----+-----+-----+-----+-----+-----+-----+
Therefore, each FP represents 20ms of original speech. Note that, as shown above, each FP MUST be padded with 4 zero bits at the end so that it is aligned to a 32-bit word boundary. This makes the size of an FP 96 bits, or 12 octets. Note that this padding is separate from the padding indicated by the P bit in the RTP header.
The 4-bit CRC MUST be calculated using the formula defined in 6.2.4 in [ES201108]. The definition of the indices and the determination of their value are also described in [ES201108].
A Null FP for the ES 201 108 front-end codec is defined by setting the content of the first and second frame in the FP to null (i.e., filling the first 88 bits of the FP with 0's). The 4-bit CRC MUST be calculated the same way as described in 6.2.4 in [ES201108], and 4 zeros MUST be padded to the end of the Null FP to make it 32-bit word aligned.
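A receiver can recognize a Null FP without decoding any indices. The Python sketch below is a hedged illustration that reads the FP word diagram literally, with bit 0 as the most significant bit: the two frames occupy the first 11 octets, the CRC sits in the high-order 4 bits of octet 12, and the padding in the low-order 4 bits. The function names are hypothetical, and the CRC is only extracted, not verified, because the formula from 6.2.4 in [ES201108] is not reproduced here.

```python
FP_OCTETS = 12  # 44 + 44 + 4 (CRC) + 4 (padding) bits


def padding_is_zero(fp: bytes) -> bool:
    """True if fp is 12 octets long and its final 4 padding bits are zero."""
    return len(fp) == FP_OCTETS and (fp[11] & 0x0F) == 0


def crc_field(fp: bytes) -> int:
    """Return the 4-bit CRC field (assumed to be the high nibble of octet 12)."""
    return fp[11] >> 4


def is_null_fp(fp: bytes) -> bool:
    """True if the first 88 bits (both frames) are all zero."""
    return padding_is_zero(fp) and fp[:11] == bytes(11)
```

A complete implementation would also recompute the CRC over the 88 frame bits and compare it with crc_field(fp) before accepting the FP.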
The format of the RTP header is specified in [RFC3550]. This payload format uses the fields of the header in a manner consistent with that specification.
The RTP timestamp corresponds to the sampling instant of the first sample encoded for the first FP in the packet. The timestamp clock frequency is the same as the sampling frequency, so the timestamp unit is in samples.
As defined by ES 201 108 front-end codec, the duration of one FP is 20 ms, corresponding to 160, 220, or 320 encoded samples with sampling rate of 8, 11, or 16 kHz being used at the front-end, respectively. Thus, the timestamp is increased by 160, 220, or 320 for each consecutive FP, respectively.
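The timestamp arithmetic can be sketched as follows. This Python example is illustrative only. The function name fp_timestamp is hypothetical, and the modulo operation reflects the 32-bit width of the RTP timestamp field defined in [RFC3550].

```python
# Samples covered by one 20 ms frame pair at each supported sampling rate.
SAMPLES_PER_FP = {8000: 160, 11000: 220, 16000: 320}


def fp_timestamp(first_timestamp: int, fp_index: int, rate: int = 8000) -> int:
    """RTP timestamp of the FP that is fp_index FPs after the first one."""
    if rate not in SAMPLES_PER_FP:
        raise ValueError(f"unsupported sampling rate: {rate}")
    # The RTP timestamp is a 32-bit value and wraps around.
    return (first_timestamp + fp_index * SAMPLES_PER_FP[rate]) % 2**32
```

For example, with a rate of 8000 and a first timestamp of 1000, the fourth FP (index 3) has timestamp 1480. A packet carrying three FPs therefore advances the timestamp of the next packet by 480.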
The DSR payload for the ES 201 108 front-end codec is always an integral number of octets. If additional padding is required for some other purpose, then the P bit in the RTP header may be set and padding appended as specified in [RFC3550].
The RTP header marker bit (M) should be set following the general rules defined in [RFC3551].
The assignment of an RTP payload type for this new packet format is outside the scope of this document, and will not be specified here. It is expected that the RTP profile under which this payload format is being used will assign a payload type for this encoding or specify that the payload type is to be bound dynamically.
One new MIME subtype registration is required for this payload type, as defined below.
This section also defines the optional parameters that may be used to describe a DSR session. The parameters are defined here as part of the MIME subtype registration. A mapping of the parameters into the Session Description Protocol (SDP) [RFC2327] is also provided in 5.1 for those applications that use SDP.
Media Type name: audio
Media subtype name: dsr-es201108
Required parameters: none
Optional parameters:
rate: Indicates the sample rate of the speech. Valid values include: 8000, 11000, and 16000. If this parameter is not present, a sample rate of 8000 is assumed.
maxptime: The maximum amount of media which can be encapsulated in each packet, expressed as time in milliseconds. The time shall be calculated as the sum of the time the media present in the packet represents. The time SHOULD be a multiple of the frame pair size (i.e., one FP <-> 20ms).
If this parameter is not present, maxptime is assumed to be 80ms.
Note: since the performance of most speech recognizers is extremely sensitive to consecutive FP losses, a user of the payload format who expects a high packet loss ratio for the session MAY explicitly choose a maxptime value for the session that is shorter than the default value.
ptime: see RFC2327 [RFC2327].
Encoding considerations: This type is defined for transfer via RTP [RFC3550] as described in Sections 3 and 4 of RFC 3557.
Security considerations: See Section 6 of RFC 3557.
Person & email address to contact for further information: Qiaobing.Xie@motorola.com
Intended usage: COMMON. It is expected that many VoIP applications (as well as mobile applications) will use this type.
Author/Change controller: Qiaobing.Xie@motorola.com IETF Audio/Video transport working group
The information carried in the MIME media type specification has a specific mapping to fields in the Session Description Protocol (SDP) [RFC2327], which is commonly used to describe RTP sessions. When SDP is used to specify sessions employing the ES 201 108 DSR codec, the mapping is as follows:
o The MIME type ("audio") goes in SDP "m=" as the media name.
o The MIME subtype ("dsr-es201108") goes in SDP "a=rtpmap" as the encoding name.
o The optional parameter "rate" also goes in "a=rtpmap" as clock rate.
o The optional parameters "ptime" and "maxptime" go in the SDP "a=ptime" and "a=maxptime" attributes, respectively.
Example of usage of ES 201 108 DSR:
m=audio 49120 RTP/AVP 101
a=rtpmap:101 dsr-es201108/8000
a=maxptime:40
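The mapping can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a complete SDP generator: it emits only the media-level lines discussed above, and the function names dsr_sdp_lines and max_fps_per_packet are hypothetical. The 80 ms default for maxptime and the 20 ms FP duration come from the parameter definitions in this document.

```python
from typing import List, Optional

VALID_RATES = (8000, 11000, 16000)
FP_DURATION_MS = 20


def dsr_sdp_lines(port: int, payload_type: int, rate: int = 8000,
                  ptime: Optional[int] = None,
                  maxptime: Optional[int] = None) -> List[str]:
    """Build the SDP media-level lines for an ES 201 108 DSR stream."""
    if rate not in VALID_RATES:
        raise ValueError(f"unsupported rate: {rate}")
    lines = [
        f"m=audio {port} RTP/AVP {payload_type}",           # MIME type
        f"a=rtpmap:{payload_type} dsr-es201108/{rate}",     # subtype and rate
    ]
    if ptime is not None:
        lines.append(f"a=ptime:{ptime}")
    if maxptime is not None:
        lines.append(f"a=maxptime:{maxptime}")
    return lines


def max_fps_per_packet(maxptime_ms: int = 80) -> int:
    """Largest number of whole 20 ms FPs that fits within maxptime."""
    return maxptime_ms // FP_DURATION_MS
```

Calling dsr_sdp_lines(49120, 101, maxptime=40) reproduces the three lines of the example above, and max_fps_per_packet(40) shows that such a session carries at most two FPs per packet.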
Implementations using the payload defined in this specification are subject to the security considerations discussed in the RTP specification [RFC3550] and the RTP profile [RFC3551]. This payload does not specify any different security services.
The following individuals contributed to the design of this payload format and the writing of this document: Q. Xie (Motorola), D. Pearce (Motorola), S. Balasuriya (Motorola), Y. Kim (VerbalTek), S. H. Maes (IBM), and Hari Garudadri (Qualcomm).
The design presented here benefits greatly from an earlier work on DSR RTP payload design by Jeff Meunier and Priscilla Walther. The authors also wish to thank Brian Eberman, John Lazzaro, Magnus Westerlund, Rainu Pierce, Priscilla Walther, and others for their review and valuable comments on this document.
[ES201108] European Telecommunications Standards Institute (ETSI) Standard ES 201 108, "Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithms," Ver. 1.1.2, April 11, 2000.
[RFC3550] Schulzrinne, H., Casner, S., Jacobson, V. and R. Frederick, "RTP: A Transport Protocol for Real-Time Applications", RFC 3550, July 2003.
[RFC2026] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2327] Handley, M. and V. Jacobson, "SDP: Session Description Protocol", RFC 2327, April 1998.
[RFC3551] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video Conferences with Minimal Control", RFC 3551, July 2003.
[RFC2508] Casner, S. and V. Jacobson, "Compressing IP/UDP/RTP Headers for Low-Speed Serial Links", RFC 2508, February 1999.
[RFC3095] Bormann, C., Burmeister, C., Degermark, M., Fukushima, H., Hannu, H., Jonsson, L-E, Hakenberg, R., Koren, T., Le, K., Liu, Z., Martensson, A., Miyazaki, A., Svanbro, K., Wiebke, T., Yoshimura, T. and H. Zheng, "RObust Header Compression (ROHC): Framework and four profiles", RFC 3095, July 2001.
The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.
David Pearce
Motorola Labs
UK Research Laboratory
Jays Close
Viables Industrial Estate
Basingstoke, HANTS, RG22 4PD

Phone: +44 (0)1256 484 436
EMail: bdp003@motorola.com
Senaka Balasuriya
Motorola, Inc.
600 U.S Highway 45
Libertyville, IL 60048, USA

Phone: +1-847-523-0440
EMail: Senaka.Balasuriya@motorola.com
Yoon Kim
VerbalTek, Inc.
2921 Copper Rd.
Santa Clara, CA 95051

Phone: +1-408-768-4974
EMail: yoonie@verbaltek.com
Stephane H. Maes, PhD,
Oracle
500 Oracle Parkway, M/S 4op634
Redwood City, CA 94065 USA

Phone: +1-650-607-6296
EMail: stephane.maes@oracle.com
Hari Garudadri
Qualcomm Inc.
5775, Morehouse Dr.
San Diego, CA 92121-1714, USA

Phone: +1-858-651-6383
EMail: hgarudad@qualcomm.com
Qiaobing Xie
Motorola, Inc.
1501 W. Shure Drive, 2-F9
Arlington Heights, IL 60004, USA

Phone: +1-847-632-3028
EMail: Qiaobing.Xie@motorola.com
Copyright (C) The Internet Society (2003). All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.
The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Acknowledgement
Funding for the RFC Editor function is currently provided by the Internet Society.