Network Working Group                                            D. Oran
Request for Comments: 4313                           Cisco Systems, Inc.
Category: Informational                                    December 2005

Requirements for Distributed Control of Automatic Speech Recognition (ASR), Speaker Identification/Speaker Verification (SI/SV), and Text-to-Speech (TTS) Resources

Status of this Memo

This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2005).

Abstract

This document outlines the needs and requirements for a protocol to control distributed speech processing of audio streams. By speech processing, this document specifically means automatic speech recognition (ASR), speaker recognition -- which includes both speaker identification (SI) and speaker verification (SV) -- and text-to-speech (TTS). Other IETF protocols, such as SIP and Real Time Streaming Protocol (RTSP), address rendezvous and control for generalized media streams. However, speech processing presents additional requirements that none of the extant IETF protocols address.

Table of Contents

   1. Introduction
      1.1. Document Conventions
   2. SPEECHSC Framework
      2.1. TTS Example
      2.2. Automatic Speech Recognition Example
      2.3. Speaker Identification Example
   3. General Requirements
      3.1. Reuse Existing Protocols
      3.2. Maintain Existing Protocol Integrity
      3.3. Avoid Duplicating Existing Protocols
      3.4. Efficiency
      3.5. Invocation of Services
      3.6. Location and Load Balancing
      3.7. Multiple Services
      3.8. Multiple Media Sessions
      3.9. Users with Disabilities
      3.10. Identification of Process That Produced Media or
            Control Output
   4. TTS Requirements
      4.1. Requesting Text Playback
      4.2. Text Formats
           4.2.1. Plain Text
           4.2.2. SSML
           4.2.3. Text in Control Channel
           4.2.4. Document Type Indication
      4.3. Control Channel
      4.4. Media Origination/Termination by Control Elements
      4.5. Playback Controls
      4.6. Session Parameters
      4.7. Speech Markers
   5. ASR Requirements
      5.1. Requesting Automatic Speech Recognition
      5.2. XML
      5.3. Grammar Requirements
           5.3.1. Grammar Specification
           5.3.2. Explicit Indication of Grammar Format
           5.3.3. Grammar Sharing
      5.4. Session Parameters
      5.5. Input Capture
   6. Speaker Identification and Verification Requirements
      6.1. Requesting SI/SV
      6.2. Identifiers for SI/SV
      6.3. State for Multiple Utterances
      6.4. Input Capture
      6.5. SI/SV Functional Extensibility
   7. Duplexing and Parallel Operation Requirements
      7.1. Full Duplex Operation
      7.2. Multiple Services in Parallel
      7.3. Combination of Services
   8. Additional Considerations (Non-Normative)
   9. Security Considerations
      9.1. SPEECHSC Protocol Security
      9.2. Client and Server Implementation and Deployment
      9.3. Use of SPEECHSC for Security Functions
   10. Acknowledgements
   11. References
      11.1. Normative References
      11.2. Informative References

1. Introduction

There are multiple IETF protocols for establishment and termination of media sessions (SIP [6]), low-level media control (Media Gateway Control Protocol (MGCP) [7] and Media Gateway Controller (MEGACO) [8]), and media record and playback (RTSP [9]). This document focuses on requirements for one or more protocols to support the control of network elements that perform Automated Speech Recognition (ASR), speaker identification or verification (SI/SV), and rendering text into audio, also known as Text-to-Speech (TTS). Many multimedia applications can benefit from having automatic speech recognition (ASR) and text-to-speech (TTS) processing available as a distributed, network resource. This requirements document limits its focus to the distributed control of ASR, SI/SV, and TTS servers.

There is a broad range of systems that can benefit from a unified approach to control of TTS, ASR, and SI/SV. These include environments such as Voice over IP (VoIP) gateways to the Public Switched Telephone Network (PSTN), IP telephones, media servers, and wireless mobile devices that obtain speech services via servers on the network.

To date, there are a number of proprietary ASR and TTS APIs, as well as two IETF documents that address this problem [13], [14]. However, there are serious deficiencies to the existing documents. In particular, they mix the semantics of existing protocols yet are close enough to other protocols as to be confusing to the implementer.

This document sets forth requirements for protocols to support distributed speech processing of audio streams. For simplicity, and to remove confusion with existing protocol proposals, this document presents the requirements as being for a "framework" that addresses the distributed control of speech resources. It refers to such a framework as "SPEECHSC", for Speech Services Control.

1.1. Document Conventions

In this document, the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in RFC 2119 [3].

2. SPEECHSC Framework

Figure 1 below shows the SPEECHSC framework for speech processing.

                          +-------------+
                          | Application |
                          |   Server    |\
                          +-------------+ \ SPEECHSC
            SIP, VoiceXML,  /              \
             etc.          /                \
           +------------+ /                  \    +-------------+
           |   Media    |/       SPEECHSC     \---| ASR, SI/SV, |
           | Processing |-------------------------| and/or TTS  |
       RTP |   Entity   |           RTP           |    Server   |
      =====|            |=========================|             |
           +------------+                         +-------------+

Figure 1: SPEECHSC Framework

The "Media Processing Entity" is a network element that processes media. It may be a pure media handler, or it may also have an associated SIP user agent, VoiceXML browser, or other control entity. The "ASR, SI/SV, and/or TTS Server" is a network element that performs the back-end speech processing. It may generate an RTP stream as output based on text input (TTS) or return recognition results in response to an RTP stream as input (ASR, SI/SV). The "Application Server" is a network element that instructs the Media Processing Entity on what transformations to make to the media stream. Those instructions may be established via a session protocol such as SIP, or provided via a client/server exchange such as VoiceXML. The framework allows either the Media Processing Entity or the Application Server to control the ASR or TTS Server using SPEECHSC as a control protocol, which accounts for the SPEECHSC protocol appearing twice in the diagram.

Physical embodiments of the entities can reside in one physical instance per entity, or some combination of entities. For example, a VoiceXML [11] gateway may combine the ASR and TTS functions on the same platform as the Media Processing Entity. Note that VoiceXML gateways themselves are outside the scope of this protocol. Likewise, one can combine the Application Server and Media Processing Entity, as would be the case in an interactive voice response (IVR) platform.

One can also decompose the Media Processing Entity into an entity that controls media endpoints and entities that process media directly. Such would be the case with a decomposed gateway using MGCP or MEGACO. However, this decomposition is again orthogonal to the scope of SPEECHSC. The following subsections provide a number of example use cases of the SPEECHSC, one each for TTS, ASR, and SI/SV. They are intended to be illustrative only, and not to imply any restriction on the scope of the framework or to limit the decomposition or configuration to that shown in the example.

2.1. TTS Example

This example illustrates a simple usage of SPEECHSC to provide a Text-to-Speech service for playing announcements to a user on a phone with no display for textual error messages. The example scenario is shown below in Figure 2. In the figure, the VoIP gateway acts as both the Media Processing Entity and the Application Server of the SPEECHSC framework in Figure 1.

                                      +---------+
                                     _|   SIP   |
                                   _/ |  Server |
                +-----------+  SIP/   +---------+
                |           |  _/
    +-------+   |   VoIP    |_/
    | POTS  |___| Gateway   |   RTP   +---------+
    | Phone |   | (SIP UA)  |=========|         |
    +-------+   |           |\_       | SPEECHSC|
                +-----------+  \      |   TTS   |
                                \__   |  Server |
                             SPEECHSC |         |
                                    \_|         |
                                       +---------+

Figure 2: Text-to-Speech Example of SPEECHSC

The Plain Old Telephone Service (POTS) phone on the left attempts to make a phone call. The VoIP gateway, acting as a SIP UA, tries to establish a SIP session to complete the call, but gets an error, such as a SIP "486 Busy Here" response. Without SPEECHSC, the gateway would most likely just output a busy signal to the POTS phone. However, with SPEECHSC access to a TTS server, it can provide a spoken error message. The VoIP gateway therefore constructs a text error string using information from the SIP messages, such as "Your call to 978-555-1212 did not go through because the called party was busy". It then can use SPEECHSC to establish an association with a SPEECHSC server, open an RTP stream between itself and the server, and issue a TTS request for the error message, which will be played to the user on the POTS phone.

2.2. Automatic Speech Recognition Example

This example illustrates a VXML-enabled media processing entity and associated application server using the SPEECHSC framework to supply an ASR-based user interface through an Interactive Voice Response (IVR) system. The example scenario is shown below in Figure 3. The VXML-client corresponds to the "media processing entity", while the IVR application server corresponds to the "application server" of the SPEECHSC framework of Figure 1.

                                      +------------+
                                      |    IVR     |
                                     _|Application |
                               VXML_/ +------------+
                +-----------+  __/
                |           |_/       +------------+
    PSTN Trunk  |   VoIP    | SPEECHSC|            |
   =============| Gateway   |---------| SPEECHSC   |
                |(VXML voice|         |   ASR      |
                | browser)  |=========|  Server    |
                 +-----------+   RTP   +------------+

Figure 3: Automatic Speech Recognition Example

In this example, users call into the service in order to obtain stock quotes. The VoIP gateway answers their PSTN call. An IVR application feeds VXML scripts to the gateway to drive the user interaction. The VXML interpreter on the gateway directs the user's media stream to the SPEECHSC ASR server and uses SPEECHSC to control the ASR server.

When, for example, the user speaks the name of a stock in response to an IVR prompt, the SPEECHSC ASR server attempts recognition of the name, and returns the results to the VXML gateway. The VXML gateway, following standard VXML mechanisms, informs the IVR Application of the recognized result. The IVR Application can then do the appropriate information lookup. The answer, of course, can be sent back to the user using text-to-speech. This example does not show this scenario, but it would work analogously to the scenario shown in Section 2.1.

2.3. Speaker Identification Example

This example illustrates using speaker identification to allow voice-actuated login to an IP phone. The example scenario is shown below in Figure 4. In the figure, the IP Phone acts as both the "Media Processing Entity" and the "Application Server" of the SPEECHSC framework in Figure 1.

   +-----------+         +---------+
   |           |   RTP   |         |
   |   IP      |=========| SPEECHSC|
   |  Phone    |         |  SI/SV  |
   |           |_________|  Server |
   |           | SPEECHSC|         |
   +-----------+         +---------+

Figure 4: Speaker Identification Example

In this example, a user speaks into a SIP phone in order to get "logged in" to that phone to make and receive phone calls using his identity and preferences. The IP phone uses the SPEECHSC framework to set up an RTP stream between the phone and the SPEECHSC SI/SV server and to request verification. The SV server verifies the user's identity and returns the result, including the necessary login credentials, to the phone via SPEECHSC. The IP Phone may use the identity directly to identify the user in outgoing calls, to fetch the user's preferences from a configuration server, or to request authorization from an Authentication, Authorization, and Accounting (AAA) server, in any combination. Since this example uses SPEECHSC to perform a security-related function, be sure to note the associated material in Section 9.

3. General Requirements
3.1. Reuse Existing Protocols

To the extent feasible, the SPEECHSC framework SHOULD use existing protocols.

3.2. Maintain Existing Protocol Integrity

In meeting the requirement of Section 3.1, the SPEECHSC framework MUST NOT redefine the semantics of an existing protocol. Said differently, we will not break existing protocols or cause backward-compatibility problems.

3.3. Avoid Duplicating Existing Protocols

To the extent feasible, SPEECHSC SHOULD NOT duplicate the functionality of existing protocols. For example, network announcements using SIP [12] and RTSP [9] already define how to request playback of audio. The focus of SPEECHSC is new functionality not addressed by existing protocols or extending existing protocols within the strictures of the requirement in Section 3.2. Where an existing protocol can be gracefully extended to support SPEECHSC requirements, such extensions are acceptable alternatives for meeting the requirements.

As a corollary to this, the SPEECHSC should not require a separate protocol to perform functions that could be easily added into the SPEECHSC protocol (like redirecting media streams, or discovering capabilities), unless it is similarly easy to embed that protocol directly into the SPEECHSC framework.

3.4. Efficiency

The SPEECHSC framework SHOULD employ protocol elements known to result in efficient operation. Techniques to be considered include:

   o  Re-use of transport connections across sessions
   o  Piggybacking of responses on requests in the reverse direction
   o  Caching of state across requests

3.5. Invocation of Services

The SPEECHSC framework MUST be compliant with the IAB Open Pluggable Edge Services (OPES) [4] framework. The applicability of the SPEECHSC protocol will therefore be specified as occurring between clients and servers at least one of which is operating directly on behalf of the user requesting the service.

3.6. Location and Load Balancing

To the extent feasible, the SPEECHSC framework SHOULD exploit existing schemes for supporting service location and load balancing, such as the Service Location Protocol [13] or DNS SRV records [14]. Where such facilities are not deemed adequate, the SPEECHSC framework MAY define additional load balancing techniques.

3.7. Multiple Services

The SPEECHSC framework MUST permit multiple services to operate on a single media stream so that either the same or different servers may be performing speech recognition, speaker identification or verification, etc., in parallel.

3.8. Multiple Media Sessions

The SPEECHSC framework MUST allow a 1:N mapping between session and RTP channels. For example, a single session may include an outbound RTP channel for TTS, an inbound for ASR, and a different inbound for SI/SV (e.g., if processed by different elements on the Media Resource Element). Note: All of these can be described via SDP, so if SDP is utilized for media channel description, this requirement is met "for free".

3.9. Users with Disabilities

The SPEECHSC framework must have sufficient capabilities to address the critical needs of people with disabilities. In particular, the set of requirements set forth in RFC 3351 [5] MUST be taken into account by the framework. It is also important that implementers of SPEECHSC clients and servers be cognizant that some interaction modalities of SPEECHSC may be inconvenient or simply inappropriate for disabled users. Hearing-impaired individuals may find TTS of limited utility. Speech-impaired users may be unable to make use of ASR or SI/SV capabilities. Therefore, systems employing SPEECHSC MUST provide alternative interaction modes or avoid the use of speech processing entirely.

3.10. Identification of Process That Produced Media or Control Output

The client of a SPEECHSC operation SHOULD be able to ascertain via the SPEECHSC framework what speech process produced the output. For example, an RTP stream containing the spoken output of TTS should be identifiable as TTS output, and the recognized utterance of ASR should be identifiable as having been produced by ASR processing.

4. TTS Requirements
4.1. Requesting Text Playback

The SPEECHSC framework MUST allow a Media Processing Entity or Application Server, using a control protocol, to request the TTS Server to play back text as voice in an RTP stream.

4.2. Text Formats
4.2.1. Plain Text

The SPEECHSC framework MAY assume that all TTS servers are capable of reading plain text. For reading plain text, the framework MUST allow the language and voicing to be indicated via session parameters. For finer control over such properties, see [1].

4.2.2. SSML

The SPEECHSC framework MUST support Speech Synthesis Markup Language (SSML) [1] <speak> basics, and SHOULD support other SSML tags. The framework assumes all TTS servers are capable of reading SSML-formatted text. Internationalization of TTS in the SPEECHSC framework, including multi-lingual output within a single utterance, is accomplished via SSML xml:lang tags.
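
As a sketch of the multi-lingual case (the sentence itself is invented for illustration), a single SSML document can switch languages mid-utterance:

   <?xml version="1.0"?>
   <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
          xml:lang="en-US">
     The French for "good morning" is
     <voice xml:lang="fr-FR">bonjour</voice>.
   </speak>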

4.2.3. Text in Control Channel

The SPEECHSC framework assumes all TTS servers accept text over the SPEECHSC connection for reading over the RTP connection. The framework assumes the server can accept text either "by value" (embedded in the protocol) or "by reference" (e.g., by de-referencing a Uniform Resource Identifier (URI) embedded in the protocol).

4.2.4. Document Type Indication

A document type specifies the syntax in which the text to be read is encoded. The SPEECHSC framework MUST be capable of explicitly indicating the document type of the text to be processed, as opposed to forcing the server to infer the content by other means.

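One plausible realization, shown here purely as a hypothetical sketch since this document does not define the protocol encoding, is a MIME-style header carrying the registered media type of by-value text, e.g., application/ssml+xml for SSML or text/plain for plain text:

   Content-Type: application/ssml+xml

   <?xml version="1.0"?>
   <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
          xml:lang="en-US">Thank you for calling.</speak>
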
4.3. Control Channel

The SPEECHSC framework MUST be capable of establishing the control channel between the client and server on a per-session basis, where a session is loosely defined to be associated with a single "call" or "dialog". The protocol SHOULD be capable of maintaining a long-lived control channel for multiple sessions serially, and MAY be capable of shorter time horizons as well, including as short as for the processing of a single utterance.

4.4. Media Origination/Termination by Control Elements

The SPEECHSC framework MUST NOT require the controlling element (application server, media processing entity) to accept or originate media streams. Media streams MAY source and sink from the controlled element (ASR, TTS, etc.).

4.5. Playback Controls

The SPEECHSC framework MUST support "VCR controls" for controlling the playout of streaming media output from SPEECHSC processing, and MUST allow for servers with varying capabilities to accommodate such controls. The protocol SHOULD allow clients to state what controls they wish to use, and for servers to report which ones they honor. These capabilities include:

   o  The ability to jump in time to the location of a specific marker.
   o  The ability to jump in time, forwards or backwards, by a
      specified amount of time.  Valid time units MUST include seconds,
      words, paragraphs, sentences, and markers.
   o  The ability to increase and decrease playout speed.
   o  The ability to fast-forward and fast-rewind the audio, where
      snippets of audio are played as the server moves forwards or
      backwards in time.
   o  The ability to pause and resume playout.
   o  The ability to increase and decrease playout volume.

These controls SHOULD be made easily available to users through the client user interface and through per-user customization capabilities of the client. This is particularly important for hearing-impaired users, who will likely desire settings and control regimes different from those that would be acceptable for non-impaired users.

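For comparison, RTSP [9] already expresses several of these controls, and Section 8 observes that extending RTSP is one promising direction. A fast-forward at twice normal speed, for instance, uses the RTSP Scale header; the URI and session identifier below are invented for illustration.

   PLAY rtsp://tts.example.com/prompt42 RTSP/1.0
   CSeq: 5
   Session: 12345678
   Range: npt=10-
   Scale: 2
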
4.6. Session Parameters

The SPEECHSC framework MUST support the specification of session parameters, such as language, prosody, and voicing.

4.7. Speech Markers

The SPEECHSC framework MUST accommodate speech markers, with capability at least as flexible as that provided in SSML [1]. The framework MUST further provide an efficient mechanism for reporting that a marker has been reached during playout.

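In SSML [1], a marker is placed with the <mark> element, as in the sketch below (the marker name is invented for illustration). The efficient-reporting requirement then applies to the control-channel event the server raises when playout reaches that point.

   <?xml version="1.0"?>
   <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
          xml:lang="en-US">
     Press one for sales, or <mark name="support_option"/>
     two for support.
   </speak>
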
5. ASR Requirements
5.1. Requesting Automatic Speech Recognition

The SPEECHSC framework MUST allow a Media Processing Entity or Application Server to request the ASR Server to perform automatic speech recognition on an RTP stream, returning the results over SPEECHSC.

5.2. XML

The SPEECHSC framework assumes that all ASR servers support the VoiceXML speech recognition grammar specification (SRGS) for speech recognition [2].

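For reference, a minimal SRGS grammar in its XML form is sketched below; the rule content is invented for illustration (e.g., company names for the stock-quote service of Section 2.2).

   <?xml version="1.0"?>
   <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
            xml:lang="en-US" root="company" mode="voice">
     <rule id="company" scope="public">
       <one-of>
         <item>cisco</item>
         <item>motorola</item>
         <item>general electric</item>
       </one-of>
     </rule>
   </grammar>
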
5.3. Grammar Requirements
5.3.1. Grammar Specification

The SPEECHSC framework assumes all ASR servers are capable of accepting grammar specifications either "by value" (embedded in the protocol) or "by reference" (e.g., by de-referencing a URI embedded in the protocol). The latter MUST allow the indication of a grammar already known to, or otherwise "built in" to, the server. The framework and protocol further SHOULD exploit the ability to store and later retrieve by reference large grammars that were originally supplied by the client.

5.3.2. Explicit Indication of Grammar Format

The SPEECHSC framework protocol MUST be able to explicitly convey the grammar format in which the grammar is encoded and MUST be extensible to allow for conveying new grammar formats as they are defined.

5.3.3. Grammar Sharing

The SPEECHSC framework SHOULD exploit sharing grammars across sessions for servers that are capable of doing so. This supports applications with large grammars for which dynamic loading on every session is unrealistic. An example is a city-country grammar for a weather service.

5.4. Session Parameters

The SPEECHSC framework MUST accommodate at a minimum all of the protocol parameters currently defined in the Media Resource Control Protocol (MRCP) [10]. In addition, there SHOULD be a capability to reset parameters within a session.

5.5. Input Capture

The SPEECHSC framework MUST support a method directing the ASR Server to capture the input media stream for later analysis and tuning of the ASR engine.

6. Speaker Identification and Verification Requirements
6.1. Requesting SI/SV

The SPEECHSC framework MUST allow a Media Processing Entity to request the SI/SV Server to perform speaker identification or verification on an RTP stream, returning the results over SPEECHSC.

6.2. Identifiers for SI/SV

The SPEECHSC framework MUST accommodate an identifier for each verification resource and permit control of that resource by ID, because voiceprint format and contents are vendor specific.

6.3. State for Multiple Utterances

The SPEECHSC framework MUST work with SI/SV servers that maintain state to handle multi-utterance verification.

6.4. Input Capture

The SPEECHSC framework MUST support a method for capturing the input media stream for later analysis and tuning of the SI/SV engine. The framework may assume all servers are capable of doing so. In addition, the framework assumes that the captured stream contains enough timestamp context (e.g., the NTP time range from the RTP Control Protocol (RTCP) packets, which corresponds to the RTP timestamps of the captured input) to ascertain after the fact exactly when the verification was requested.

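Concretely, each RTCP sender report binds an NTP wall-clock timestamp to an RTP timestamp for the stream, so the wall-clock time of a captured RTP timestamp T can be recovered with the standard RTP mapping sketched below, where NTP_sr and RTP_sr are taken from the nearest sender report and clock_rate is the RTP clock rate of the payload (e.g., 8000 Hz for PCMU):

   wallclock(T) = NTP_sr + (T - RTP_sr) / clock_rate
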
6.5. SI/SV Functional Extensibility

The SPEECHSC framework SHOULD be extensible to additional functions associated with SI/SV, such as prompting, utterance verification, and retraining.

7. Duplexing and Parallel Operation Requirements

One very important requirement for an interactive speech-driven system is that user perception of the quality of the interaction depends strongly on the ability of the user to interrupt a prompt or rendered TTS with speech. Interrupting, or barging, the speech output requires more than energy detection from the user's direction. Many advanced systems halt the media towards the user by employing the ASR engine to decide if an utterance is likely to be real speech, as opposed to a cough, for example.

7.1. Full Duplex Operation

To achieve low latency between utterance detection and halting of playback, many implementations combine the speaking and ASR functions. The SPEECHSC framework MUST support such full-duplex implementations.

7.2. Multiple Services in Parallel

Good spoken user interfaces typically depend upon the ease with which the user can accomplish his or her task. When making use of speaker identification or verification technologies, user interface improvements often come from the combination of the different technologies: simultaneous identity claim and verification (on the same utterance), simultaneous knowledge and voice verification (using ASR and verification simultaneously). Using ASR and verification on the same utterance is in fact the only way to support rolling or dynamically-generated challenge phrases (e.g., "say 51723"). The SPEECHSC framework MUST support such parallel service implementations.

7.3. Combination of Services

It is optionally of interest that the SPEECHSC framework support more complex remote combination and controls of speech engines:

   o  Combination in series of engines that may then act on the input
      or output of ASR, TTS, or speaker recognition engines.  The
      control MAY then extend beyond such engines to include other
      audio input and output processing and natural language
      processing.
   o  Intermediate exchanges and coordination between engines.
   o  Remote specification of flows between engines.

These capabilities MAY benefit from service discovery mechanisms (e.g., engines, properties, and states discovery).

8. Additional Considerations (Non-Normative)

The framework assumes that Session Description Protocol (SDP) will be used to describe media sessions and streams. The framework further assumes RTP carriage of media. However, since SDP can be used to describe other media transport schemes (e.g., ATM) these could be used if they provide the necessary elements (e.g., explicit timestamps).

The working group will not be defining distributed speech recognition (DSR) methods, as exemplified by the European Telecommunications Standards Institute (ETSI) Aurora project. The working group will not be recreating functionality available in other protocols, such as SIP or SDP.

TTS looks very much like playing back a file. Extending RTSP looks promising for when one requires VCR controls or markers in the text to be spoken. When one does not require VCR controls, SIP in a framework such as Network Announcements [12] works directly without modification.

ASR has an entirely different set of characteristics. For barge-in support, ASR requires real-time return of intermediate results. Barring the discovery of a good reuse model for an existing protocol, this will most likely become the focus of SPEECHSC.

9. Security Considerations

Protocols relating to speech processing must take security and privacy into account. Many applications of speech technology deal with sensitive information, such as the use of Text-to-Speech to read financial information. Likewise, popular uses for automatic speech recognition include executing financial transactions and shopping.

There are at least three aspects of speech processing security that intersect with the SPEECHSC requirements -- securing the SPEECHSC protocol itself, implementing and deploying the servers that run the protocol, and ensuring that utilization of the technology for providing security functions is appropriate. Each of these aspects is discussed in the following subsections. While some of these considerations are, strictly speaking, out of scope of the protocol itself, they will be carefully considered and accommodated during protocol design, and will be called out as part of the applicability statement accompanying the protocol specification(s). Privacy considerations are discussed as well.

9.1. SPEECHSC Protocol Security

The SPEECHSC protocol MUST in all cases support authentication, authorization, and integrity, and SHOULD support confidentiality. For privacy-sensitive applications, the protocol MUST support confidentiality. We envision that rather than providing protocol-specific security mechanisms in SPEECHSC itself, the resulting protocol will employ security machinery of either a containing protocol or the transport on which it runs. For example, we will consider solutions such as using Transport Layer Security (TLS) for securing the control channel, and Secure Realtime Transport Protocol (SRTP) for securing the media channel. Third-party dependencies necessitating transitive trust will be minimized or explicitly dealt with through the authentication and authorization aspects of the protocol design.

9.2. Client and Server Implementation and Deployment

Given the possibly sensitive nature of the information carried, SPEECHSC clients and servers need to take steps to ensure confidentiality and integrity of the data and its transformations to and from spoken form. In addition to these general considerations, certain SPEECHSC functions, such as speaker verification and identification, employ voiceprints whose privacy, confidentiality, and integrity must be maintained. Similarly, the requirement to support input capture for analysis and tuning can represent a privacy vulnerability because user utterances are recorded and could be either revealed or replayed inappropriately. Implementers must take care to prevent the exploitation of any centralized voiceprint database and the recorded material from which such voiceprints may be derived. Specific actions that are recommended to minimize these threats include:

   o  End-to-end authentication, confidentiality, and integrity
      protection (like TLS) of access to the database to minimize the
      exposure to external attack.
   o  Database protection measures such as read/write access control
      and local login authentication to minimize the exposure to
      insider threats.
   o  Copies of the database, especially ones that are maintained at
      off-site locations, need the same protection as the operational
      database.

Inappropriate disclosure of this data does not as of the date of this document represent an exploitable threat, but quite possibly might in the future. Specific vulnerabilities that might become feasible are discussed in the next subsection. It is prudent to take measures such as encrypting the voiceprint database and permitting access only through programming interfaces enforcing adequate authorization machinery.

9.3. Use of SPEECHSC for Security Functions

Either speaker identification or verification can be used directly as an authentication technology. Authorization decisions can be coupled with speaker verification in a direct fashion through challenge-response protocols, or indirectly with speaker identification through the use of access control lists or other identity-based authorization mechanisms. When so employed, there are additional security concerns that need to be addressed through the use of protocol security mechanisms for clients and servers. For example, the ability to manipulate the media stream of a speaker verification request could inappropriately permit or deny access based on impersonation, or simple garbling via noise injection, making it critical to properly secure both the control and data channels, as recommended above. The following issues specific to the use of SI/SV for authentication should be carefully considered:

   1. Theft of voiceprints or the recorded samples used to construct
      them represents a future threat against the use of speaker
      identification/verification as a biometric authentication
      technology.  A plausible attack vector (not feasible today) is
      to use the voiceprint information as parametric input to a
      text-to-speech synthesis system that could mimic the user's
      voice accurately enough to match the voiceprint.  Since it is
      not very difficult to surreptitiously record reasonably large
      corpuses of voice samples, the ability to construct voiceprints
      for input to this attack would render the security of
      voice-based biometric authentication, even using advanced
      challenge-response techniques, highly vulnerable.  Users of
      speaker verification for authentication should monitor
      technological developments in this area closely for such future
      vulnerabilities (much as users of other authentication
      technologies should monitor advances in factoring as a way to
      break asymmetric keying systems).

   2. As with other biometric authentication technologies, a downside
      to the use of speech identification is that revocation is not
      possible.  Once compromised, the biometric information can be
      used in identification and authentication to other independent
      systems.

   3. Enrollment procedures can be vulnerable to impersonation if not
      protected both by protocol security mechanisms and some
      independent proof of identity.  (Proof of identity may not be
      needed in systems that only need to verify continuity of
      identity since enrollment, as opposed to association with a
      particular individual.)

Further discussion of the use of SI/SV as an authentication technology, and some recommendations concerning advantages and vulnerabilities, can be found in Chapter 5 of [15].

10. Acknowledgements

Eric Burger wrote the original version of these requirements and has continued to contribute actively throughout their development. He is a co-author in all but formal authorship, and is instead acknowledged here as it is preferable that working group co-chairs have non-conflicting roles with respect to the progression of documents.

11. References
11.1. Normative References

[1] Walker, M., Burnett, D., and A. Hunt, "Speech Synthesis Markup Language (SSML) Version 1.0", W3C REC REC-speech-synthesis-20040907, September 2004.

[2] McGlashan, S. and A. Hunt, "Speech Recognition Grammar Specification Version 1.0", W3C REC REC-speech-grammar-20040316, March 2004.

[3] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[4] Floyd, S. and L. Daigle, "IAB Architectural and Policy Considerations for Open Pluggable Edge Services", RFC 3238, January 2002.

[5] Charlton, N., Gasson, M., Gybels, G., Spanner, M., and A. van Wijk, "User Requirements for the Session Initiation Protocol (SIP) in Support of Deaf, Hard of Hearing and Speech-impaired Individuals", RFC 3351, August 2002.

11.2. Informative References

[6] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002.

[7] Andreasen, F. and B. Foster, "Media Gateway Control Protocol (MGCP) Version 1.0", RFC 3435, January 2003.

[8] Groves, C., Pantaleo, M., Ericsson, LM., Anderson, T., and T. Taylor, "Gateway Control Protocol Version 1", RFC 3525, June 2003.

[9] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming Protocol (RTSP)", RFC 2326, April 1998.

[10] Shanmugham, S., Monaco, P., and B. Eberman, "MRCP: Media Resource Control Protocol", Work in Progress.

[11] World Wide Web Consortium, "Voice Extensible Markup Language (VoiceXML) Version 2.0", W3C Working Draft, April 2002, <http://www.w3.org/TR/2002/WD-voicexml20-20020424/>.

[12] Burger, E., Ed., Van Dyke, J., and A. Spitzer, "Basic Network Media Services with SIP", RFC 4240, December 2005.

[13] Guttman, E., Perkins, C., Veizades, J., and M. Day, "Service Location Protocol, Version 2", RFC 2608, June 1999.

[14] Gulbrandsen, A., Vixie, P., and L. Esibov, "A DNS RR for specifying the location of services (DNS SRV)", RFC 2782, February 2000.

[15] Committee on Authentication Technologies and Their Privacy Implications, National Research Council, "Who Goes There?: Authentication Through the Lens of Privacy", Computer Science and Telecommunications Board (CSTB), 2003, <http://www.nap.edu/catalog/10656.html/>.

Author's Address

   David R. Oran
   Cisco Systems, Inc.
   7 Ladyslipper Lane
   Acton, MA
   USA

   EMail: oran@cisco.com

Full Copyright Statement

Copyright (C) The Internet Society (2005).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.
