Internet Engineering Task Force (IETF)                        D. Burnett
Request for Comments: 6787                                         Voxeo
Category: Standards Track                                  S. Shanmugham
ISSN: 2070-1721                                      Cisco Systems, Inc.
                                                           November 2012


          Media Resource Control Protocol Version 2 (MRCPv2)

Abstract

   The Media Resource Control Protocol Version 2 (MRCPv2) allows client
   hosts to control media service resources such as speech synthesizers,
   recognizers, verifiers, and identifiers residing in servers on the
   network. MRCPv2 is not a "stand-alone" protocol -- it relies on other
   protocols, such as the Session Initiation Protocol (SIP), to
   coordinate MRCPv2 clients and servers and manage sessions between
   them, and the Session Description Protocol (SDP) to describe,
   discover, and exchange capabilities. It also depends on SIP and SDP
   to establish the media sessions and associated parameters between the
   media source or sink and the media server. Once this is done, the
   MRCPv2 exchange operates over the control session established above,
   allowing the client to control the media processing resources on the
   speech resource server.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF). It represents the consensus of the IETF community. It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG). Further information on
   Internet Standards is available in Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc6787.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document. Code Components
   extracted from this document must include Simplified BSD License text
   as described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1. Introduction
   2. Document Conventions
     2.1. Definitions
     2.2. State-Machine Diagrams
     2.3. URI Schemes
   3. Architecture
     3.1. MRCPv2 Media Resource Types
     3.2. Server and Resource Addressing
   4. MRCPv2 Basics
     4.1. Connecting to the Server
     4.2. Managing Resource Control Channels
     4.3. SIP Session Example
     4.4. Media Streams and RTP Ports
     4.5. MRCPv2 Message Transport
     4.6. MRCPv2 Session Termination
   5. MRCPv2 Specification
     5.1. Common Protocol Elements
     5.2. Request
     5.3. Response
     5.4. Status Codes
     5.5. Events
   6. MRCPv2 Generic Methods, Headers, and Result Structure
     6.1. Generic Methods
       6.1.1. SET-PARAMS
       6.1.2. GET-PARAMS
     6.2. Generic Message Headers
       6.2.1. Channel-Identifier
       6.2.2. Accept
       6.2.3. Active-Request-Id-List
       6.2.4. Proxy-Sync-Id
       6.2.5. Accept-Charset
       6.2.6. Content-Type
       6.2.7. Content-ID
       6.2.8. Content-Base
       6.2.9. Content-Encoding
       6.2.10. Content-Location
       6.2.11. Content-Length
       6.2.12. Fetch Timeout
       6.2.13. Cache-Control
       6.2.14. Logging-Tag
       6.2.15. Set-Cookie
       6.2.16. Vendor-Specific Parameters
     6.3. Generic Result Structure
       6.3.1. Natural Language Semantics Markup Language
   7. Resource Discovery
   8. Speech Synthesizer Resource
     8.1. Synthesizer State Machine
     8.2. Synthesizer Methods
     8.3. Synthesizer Events
     8.4. Synthesizer Header Fields
       8.4.1. Jump-Size
       8.4.2. Kill-On-Barge-In
       8.4.3. Speaker-Profile
       8.4.4. Completion-Cause
       8.4.5. Completion-Reason
       8.4.6. Voice-Parameter
       8.4.7. Prosody-Parameters
       8.4.8. Speech-Marker
       8.4.9. Speech-Language
       8.4.10. Fetch-Hint
       8.4.11. Audio-Fetch-Hint
       8.4.12. Failed-URI
       8.4.13. Failed-URI-Cause
       8.4.14. Speak-Restart
       8.4.15. Speak-Length
       8.4.16. Load-Lexicon
       8.4.17. Lexicon-Search-Order
     8.5. Synthesizer Message Body
       8.5.1. Synthesizer Speech Data
       8.5.2. Lexicon Data
     8.6. SPEAK Method
     8.7. STOP
     8.8. BARGE-IN-OCCURRED
     8.9. PAUSE
     8.10. RESUME
     8.11. CONTROL
     8.12. SPEAK-COMPLETE
     8.13. SPEECH-MARKER
     8.14. DEFINE-LEXICON
   9. Speech Recognizer Resource
     9.1. Recognizer State Machine
     9.2. Recognizer Methods
     9.3. Recognizer Events
     9.4. Recognizer Header Fields
       9.4.1. Confidence-Threshold
       9.4.2. Sensitivity-Level
       9.4.3. Speed-Vs-Accuracy
       9.4.4. N-Best-List-Length
       9.4.5. Input-Type
       9.4.6. No-Input-Timeout
       9.4.7. Recognition-Timeout
       9.4.8. Waveform-URI
       9.4.9. Media-Type
       9.4.10. Input-Waveform-URI
       9.4.11. Completion-Cause
       9.4.12. Completion-Reason
       9.4.13. Recognizer-Context-Block
       9.4.14. Start-Input-Timers
       9.4.15. Speech-Complete-Timeout
       9.4.16. Speech-Incomplete-Timeout
       9.4.17. DTMF-Interdigit-Timeout
       9.4.18. DTMF-Term-Timeout
       9.4.19. DTMF-Term-Char
       9.4.20. Failed-URI
       9.4.21. Failed-URI-Cause
       9.4.22. Save-Waveform
       9.4.23. New-Audio-Channel
       9.4.24. Speech-Language
       9.4.25. Ver-Buffer-Utterance
       9.4.26. Recognition-Mode
       9.4.27. Cancel-If-Queue
       9.4.28. Hotword-Max-Duration
       9.4.29. Hotword-Min-Duration
       9.4.30. Interpret-Text
       9.4.31. DTMF-Buffer-Time
       9.4.32. Clear-DTMF-Buffer
       9.4.33. Early-No-Match
       9.4.34. Num-Min-Consistent-Pronunciations
       9.4.35. Consistency-Threshold
       9.4.36. Clash-Threshold
       9.4.37. Personal-Grammar-URI
       9.4.38. Enroll-Utterance
       9.4.39. Phrase-Id
       9.4.40. Phrase-NL
       9.4.41. Weight
       9.4.42. Save-Best-Waveform
       9.4.43. New-Phrase-Id
       9.4.44. Confusable-Phrases-URI
       9.4.45. Abort-Phrase-Enrollment
     9.5. Recognizer Message Body
       9.5.1. Recognizer Grammar Data
       9.5.2. Recognizer Result Data
       9.5.3. Enrollment Result Data
       9.5.4. Recognizer Context Block
     9.6. Recognizer Results
       9.6.1. Markup Functions
       9.6.2. Overview of Recognizer Result Elements and Their
              Relationships
       9.6.3. Elements and Attributes
     9.7. Enrollment Results
       9.7.1. Element
       9.7.2. Element
       9.7.3. Element
       9.7.4. Element
       9.7.5. Element
       9.7.6. Element
       9.7.7. Element
     9.8. DEFINE-GRAMMAR
     9.9. RECOGNIZE
     9.10. STOP
     9.11. GET-RESULT
     9.12. START-OF-INPUT
     9.13. START-INPUT-TIMERS
     9.14. RECOGNITION-COMPLETE
     9.15. START-PHRASE-ENROLLMENT
     9.16. ENROLLMENT-ROLLBACK
     9.17. END-PHRASE-ENROLLMENT
     9.18. MODIFY-PHRASE
     9.19. DELETE-PHRASE
     9.20. INTERPRET
     9.21. INTERPRETATION-COMPLETE
     9.22. DTMF Detection
   10. Recorder Resource
     10.1. Recorder State Machine
     10.2. Recorder Methods
     10.3. Recorder Events
     10.4. Recorder Header Fields
       10.4.1. Sensitivity-Level
       10.4.2. No-Input-Timeout
       10.4.3. Completion-Cause
       10.4.4. Completion-Reason
       10.4.5. Failed-URI
       10.4.6. Failed-URI-Cause
       10.4.7. Record-URI
       10.4.8. Media-Type
       10.4.9. Max-Time
       10.4.10. Trim-Length
       10.4.11. Final-Silence
       10.4.12. Capture-On-Speech
       10.4.13. Ver-Buffer-Utterance
       10.4.14. Start-Input-Timers
       10.4.15. New-Audio-Channel
     10.5. Recorder Message Body
     10.6. RECORD
     10.7. STOP
     10.8. RECORD-COMPLETE
     10.9. START-INPUT-TIMERS
     10.10. START-OF-INPUT
   11. Speaker Verification and Identification
     11.1. Speaker Verification State Machine
     11.2. Speaker Verification Methods
     11.3. Verification Events
     11.4. Verification Header Fields
       11.4.1. Repository-URI
       11.4.2. Voiceprint-Identifier
       11.4.3. Verification-Mode
       11.4.4. Adapt-Model
       11.4.5. Abort-Model
       11.4.6. Min-Verification-Score
       11.4.7. Num-Min-Verification-Phrases
       11.4.8. Num-Max-Verification-Phrases
       11.4.9. No-Input-Timeout
       11.4.10. Save-Waveform
       11.4.11. Media-Type
       11.4.12. Waveform-URI
       11.4.13. Voiceprint-Exists
       11.4.14. Ver-Buffer-Utterance
       11.4.15. Input-Waveform-URI
       11.4.16. Completion-Cause
       11.4.17. Completion-Reason
       11.4.18. Speech-Complete-Timeout
       11.4.19. New-Audio-Channel
       11.4.20. Abort-Verification
       11.4.21. Start-Input-Timers
     11.5. Verification Message Body
       11.5.1. Verification Result Data
       11.5.2. Verification Result Elements
     11.6. START-SESSION
     11.7. END-SESSION
     11.8. QUERY-VOICEPRINT
     11.9. DELETE-VOICEPRINT
     11.10. VERIFY
     11.11. VERIFY-FROM-BUFFER
     11.12. VERIFY-ROLLBACK
     11.13. STOP
     11.14. START-INPUT-TIMERS
     11.15. VERIFICATION-COMPLETE
     11.16. START-OF-INPUT
     11.17. CLEAR-BUFFER
     11.18. GET-INTERMEDIATE-RESULT
   12. Security Considerations
     12.1. Rendezvous and Session Establishment
     12.2. Control Channel Protection
     12.3. Media Session Protection
     12.4. Indirect Content Access
     12.5. Protection of Stored Media
     12.6. DTMF and Recognition Buffers
     12.7. Client-Set Server Parameters
     12.8. DELETE-VOICEPRINT and Authorization
   13. IANA Considerations
     13.1. New Registries
       13.1.1. MRCPv2 Resource Types
       13.1.2. MRCPv2 Methods and Events
       13.1.3. MRCPv2 Header Fields
       13.1.4. MRCPv2 Status Codes
       13.1.5. Grammar Reference List Parameters
       13.1.6. MRCPv2 Vendor-Specific Parameters
     13.2. NLSML-Related Registrations
       13.2.1. 'application/nlsml+xml' Media Type Registration
     13.3. NLSML XML Schema Registration
     13.4. MRCPv2 XML Namespace Registration
     13.5. Text Media Type Registrations
       13.5.1. text/grammar-ref-list
     13.6. 'session' URI Scheme Registration
     13.7. SDP Parameter Registrations
       13.7.1. Sub-Registry "proto"
       13.7.2. Sub-Registry "att-field (media-level)"
   14. Examples
     14.1. Message Flow
     14.2. Recognition Result Examples
       14.2.1. Simple ASR Ambiguity
       14.2.2. Mixed Initiative
       14.2.3. DTMF Input
       14.2.4. Interpreting Meta-Dialog and Meta-Task Utterances
       14.2.5. Anaphora and Deixis
       14.2.6. Distinguishing Individual Items from Sets with One
               Member
       14.2.7. Extensibility
   15. ABNF Normative Definition
   16. XML Schemas
     16.1. NLSML Schema Definition
     16.2. Enrollment Results Schema Definition
     16.3. Verification Results Schema Definition
   17. References
     17.1. Normative References
     17.2. Informative References
   Appendix A. Contributors
   Appendix B. Acknowledgements

1. Introduction

   MRCPv2 is designed to allow a client device to control media
   processing resources on the network. Some of these media processing
   resources include speech recognition engines, speech synthesis
   engines, speaker verification, and speaker identification engines.
   MRCPv2 enables the implementation of distributed Interactive Voice
   Response platforms using VoiceXML [W3C.REC-voicexml20-20040316]
   browsers or other client applications while maintaining separate
   back-end speech processing capabilities on specialized speech
   processing servers.

   MRCPv2 is based on the earlier Media Resource Control Protocol (MRCP)
   [RFC4463] developed jointly by Cisco Systems, Inc., Nuance
   Communications, and Speechworks, Inc. Although some of the method
   names are similar, the way in which these methods are communicated is
   different. There are also more resources and more methods for each
   resource. The first version of MRCP served only as input to the
   development of this protocol. There is no expectation that an MRCPv2
   client will work with an MRCPv1 server or vice versa. There is no
   migration plan or gateway definition between the two protocols.

   The protocol requirements of Speech Services Control (SPEECHSC)
   [RFC4313] include that the solution be capable of reaching a media
   processing server, setting up communication channels to the media
   resources, and sending and receiving control messages and media
   streams to and from the server. The Session Initiation Protocol (SIP)
   [RFC3261] meets these requirements.

   The proprietary version of MRCP ran over the Real Time Streaming
   Protocol (RTSP) [RFC2326]. At the time work on MRCPv2 began, the
   consensus was that this use of RTSP would break the RTSP protocol or
   cause backward-compatibility problems, something forbidden by Section
   3.2 of [RFC4313]. For this reason, MRCPv2 does not run over RTSP.

   MRCPv2 leverages the capabilities of SIP by building upon SIP and the
   Session Description Protocol (SDP) [RFC4566]. MRCPv2 uses SIP to set
   up and tear down media and control sessions with the server.
   In addition, the client can use a SIP re-INVITE method (an INVITE
   request sent within an existing SIP dialog) to change the
   characteristics of these media and control sessions while maintaining
   the SIP dialog between the client and server. SDP is used to describe
   the parameters of the media sessions associated with that dialog. It
   is mandatory to support SIP as the session establishment protocol to
   ensure interoperability. Other protocols can be used for session
   establishment by prior agreement. This document describes only the
   use of SIP and SDP.

   MRCPv2 uses SIP and SDP to create the speech client/server dialog and
   set up the media channels to the server. It also uses SIP and SDP to
   establish MRCPv2 control sessions between the client and the server
   for each media processing resource required for that dialog. The
   MRCPv2 protocol exchange between the client and the media resource is
   carried on that control session. MRCPv2 exchanges do not change the
   state of the SIP dialog, the media sessions, or other parameters of
   the dialog initiated via SIP. Instead, MRCPv2 controls and affects
   the state of the media processing resource associated with the MRCPv2
   session(s).

   MRCPv2 defines the messages to control the different media processing
   resources and the state machines required to guide their operation.
   It also describes how these messages are carried over a
   transport-layer protocol such as the Transmission Control Protocol
   (TCP) [RFC0793] or the Transport Layer Security (TLS) Protocol
   [RFC5246]. (Note: the Stream Control Transmission Protocol (SCTP)
   [RFC4960] is a viable transport for MRCPv2 as well, but the mapping
   onto SCTP is not described in this specification.)

2. Document Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   Since many of the definitions and syntax are identical to those for
   the Hypertext Transfer Protocol -- HTTP/1.1 [RFC2616], this
   specification refers to the section where they are defined rather
   than copying it. For brevity, [HX.Y] is to be taken to refer to
   Section X.Y of RFC 2616.

   All the mechanisms specified in this document are described in both
   prose and an augmented Backus-Naur form (ABNF [RFC5234]).

   The complete message format in ABNF form is provided in Section 15
   and is the normative format definition. Note that productions may be
   duplicated within the main body of the document for reading
   convenience. If a production in the body of the text conflicts with
   one in the normative definition, the normative definition takes
   precedence.

2.1. Definitions

   Media Resource
      An entity on the speech processing server that can be controlled
      through MRCPv2.

   MRCP Server
      Aggregate of one or more "Media Resource" entities on a server,
      exposed through MRCPv2. Often, 'server' in this document refers to
      an MRCP server.

   MRCP Client
      An entity controlling one or more Media Resources through MRCPv2
      ("Client" for short).

   DTMF
      Dual-Tone Multi-Frequency; a method of transmitting key presses
      in-band, either as actual tones (Q.23 [Q.23]) or as named tone
      events (RFC 4733 [RFC4733]).

   Endpointing
      The process of automatically detecting the beginning and end of
      speech in an audio stream. This is critical both for speech
      recognition and for automated recording as one would find in voice
      mail systems.

   Hotword Mode
      A mode of speech recognition where a stream of utterances is
      evaluated for match against a small set of command words. This is
      generally employed either to trigger some action or to control the
      subsequent grammar to be used for further recognition.

2.2. State-Machine Diagrams

   The state-machine diagrams in this document do not show every
   possible method call.
   Rather, they reflect the state of the resource based on the methods
   that have moved to IN-PROGRESS or COMPLETE states (see Section 5.3).
   Because PENDING requests are still queued and have not yet affected
   the resource, they are not reflected in the state-machine diagrams.

2.3. URI Schemes

   This document defines many protocol headers that contain URIs
   (Uniform Resource Identifiers [RFC3986]) or lists of URIs for
   referencing media. The entire document, including the Security
   Considerations section (Section 12), assumes that HTTP or HTTP over
   TLS (HTTPS) [RFC2818] will be used as the URI addressing scheme
   unless otherwise stated. However, implementations MAY support other
   schemes (such as 'file'), provided they have addressed any security
   considerations described in this document and any others particular
   to the specific scheme. For example, implementations where the client
   and server both reside on the same physical hardware and the file
   system is secured by traditional user-level file access controls
   could be reasonable candidates for supporting the 'file' scheme.

3. Architecture

   A system using MRCPv2 consists of a client that requires the
   generation and/or consumption of media streams and a media resource
   server that has the resources or "engines" to process these streams
   as input or generate these streams as output. The client uses SIP and
   SDP to establish an MRCPv2 control channel with the server to use its
   media processing resources. MRCPv2 servers are addressed using SIP
   URIs.

   SIP uses SDP with the offer/answer model described in RFC 3264
   [RFC3264] to set up the MRCPv2 control channels and describe their
   characteristics. A separate MRCPv2 session is needed to control each
   of the media processing resources associated with the SIP dialog
   between the client and server.
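
   The relationship just described, one SIP dialog with a server that is
   addressed by a SIP URI and one control session per controlled
   resource, can be pictured with a small data model. The following is a
   minimal sketch and not part of the protocol: the class names, the
   resource labels, the example URI, and the acceptance of the "sips"
   scheme are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ControlSession:
    """One MRCPv2 control session, dedicated to a single media resource."""
    resource: str  # hypothetical label such as "synthesizer"


@dataclass
class SipDialog:
    """Hypothetical model of a SIP dialog between a client and a server."""
    server_uri: str
    sessions: List[ControlSession] = field(default_factory=list)

    def __post_init__(self) -> None:
        # MRCPv2 servers are addressed using SIP URIs; accepting "sips:"
        # as well is an assumption of this sketch.
        if not self.server_uri.lower().startswith(("sip:", "sips:")):
            raise ValueError(f"not a SIP URI: {self.server_uri!r}")

    def add_resource(self, resource: str) -> ControlSession:
        # Each controlled resource gets its own control session.
        session = ControlSession(resource)
        self.sessions.append(session)
        return session


# Example usage with an illustrative server address.
dialog = SipDialog("sip:mresources@example.com")
dialog.add_resource("synthesizer")
dialog.add_resource("recognizer")
print(len(dialog.sessions))
```

   In this sketch, controlling two resources yields two control
   sessions under the same dialog, and a non-SIP server address is
   rejected when the dialog object is created.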
   Within a SIP dialog, the individual resource control channels for the
   different resources are added or removed through SDP offer/answer
   carried in a SIP re-INVITE transaction.

   The server, through the SDP exchange, provides the client with a
   difficult-to-guess, unambiguous channel identifier and a TCP port
   number (see Section 4.2). The client MAY then open a new TCP
   connection with the server on this port number. Multiple MRCPv2
   channels can share a TCP connection between the client and the
   server. All MRCPv2 messages exchanged between the client and the
   server carry the specified channel identifier, which the server MUST
   ensure is unambiguous among all MRCPv2 control channels that are
   active on that server. The client uses this channel identifier to
   indicate the media processing resource associated with that channel.
   For information on message framing, see Section 5.

   SIP also establishes the media sessions between the client (or other
   source/sink of media) and the MRCPv2 server using SDP "m=" lines.
   One or more media processing resources may share a media session
   under a SIP session, or each media processing resource may have its
   own media session.

   The following diagram shows the general architecture of a system that
   uses MRCPv2. To simplify the diagram, only a few resources are shown.

       MRCPv2 client                      MRCPv2 Media Resource Server
   |--------------------|            |------------------------------------|
   ||------------------||            ||----------------------------------||
   || Application Layer||            ||Synthesis|Recognition|Verification||
   ||------------------||            || Engine  |  Engine   |   Engine   ||
   ||Media Resource API||            ||    ||   |    ||     |    ||      ||
   ||------------------||            ||Synthesis|Recognizer |  Verifier  ||
   || SIP  |  MRCPv2   ||            ||Resource | Resource  |  Resource  ||
   ||Stack |           ||            ||     Media Resource Management    ||
   ||      |           ||            ||----------------------------------||
   ||------------------||            ||   SIP  |        MRCPv2           ||
   ||   TCP/IP Stack   ||---MRCPv2---||  Stack |                         ||
   ||                  ||            ||----------------------------------||
   ||------------------||----SIP-----||           TCP/IP Stack           ||
   |--------------------|            ||                                  ||
            |                        ||----------------------------------||
           SIP                       |------------------------------------|
            |                          /
   |-------------------|             RTP
   |                   |             /
   | Media Source/Sink |------------/
   |                   |
   |-------------------|

                    Figure 1: Architectural Diagram

3.1. MRCPv2 Media Resource Types

   An MRCPv2 server may offer one or more of the following media
   processing resources to its clients.

   Basic Synthesizer
      A speech synthesizer resource that has very limited capabilities
      and can generate its media stream exclusively from concatenated
      audio clips. The speech data is described using a limited subset
      of the Speech Synthesis Markup Language (SSML)
      [W3C.REC-speech-synthesis-20040907] elements. A basic synthesizer
      MUST support the SSML tags ,