Labels: Specs (related to Activity Protocol Specification)
Extend Activity Schema to Support Multimodal Interactions with Streaming
Status: Draft – For Core Committee Review
1. Overview
The current Bot Framework Activity Schema is text-centric and lacks a unified approach for voice and multimodal streaming. As conversational AI evolves toward speech-first and multimodal experiences, there is a need for:
- A consistent schema for messages across modalities (text, voice, video, image, etc.).
- Standard signalling commands for lifecycle states across all modalities.
- Unified streaming semantics for real-time data transfer.
The proposal introduces:
- A single `payload` per message (one modality per activity).
- Streaming modeled as events (`stream.start`, `stream.chunk`, `stream.end`).
- Lifecycle modeled as commands (`session.init`, `session.update`, `session.end`) with `commandResult` responses.
- Reuse of streamInfo attributes inside the event `value` for sequencing and continuity.
- Future: a `modalities` array property to support multiple modalities in one message.
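To make the proposed shapes concrete, here is a rough TypeScript sketch of the extended activity types. The type names and optional fields are illustrative assumptions, not part of the spec:

```typescript
// Illustrative sketch only: names and shapes are hypothetical, not normative.
type Payload = {
  text?: { content: string; textFormat?: string; locale?: string };
  voice?: { contentType: string; contentUrl: string; transcription?: string };
  // image, video, ... would follow the same pattern
};

type Activity =
  | { type: "message"; payload: Payload }                          // final payloads
  | { type: "event";                                               // streaming
      name: "stream.start" | "stream.chunk" | "stream.end";
      value: { streamId: string; seq?: number; isFinal?: boolean };
      payload?: Payload }
  | { type: "command"; id?: string;                                // lifecycle
      name: "session.init" | "session.update" | "session.end";
      value: Record<string, unknown> }
  | { type: "commandResult"; replyToId: string;
      value: Record<string, unknown> };

const msg: Activity = {
  type: "message",
  payload: { text: { content: "Book a flight to Paris" } },
};
```

The discriminated union mirrors the one-modality-per-activity rule: a `message` carries exactly one populated key inside `payload`.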
2. Motivation
Current Limitations
| Area | Limitation |
|---|---|
| Streaming | `typing` is text-specific; there are no unified streaming semantics for voice or multimodal content. |
| Signalling | Lifecycle states are modeled as generic events, which lacks the clarity of explicit commands. |
| Extensibility | Adding new modalities requires schema hacks. |
3. Goals
- Use `message` for all final payloads with a payload-first design.
- Model streaming as events for clarity.
- Standardize lifecycle commands across modalities.
- Maintain backward compatibility.
- Enable future multimodal sessions without schema churn.
4. Proposed Changes
4.1 Activity Types and New Properties
| Type | Description |
|---|---|
| `message` | Final message for any modality (text, voice, video, image). Enhanced with `payload`. |
| `command` | Used for session management (`session.init`, `session.update`, `session.end`). |
| `commandResult` | Response to a command, as per the Activity Schema. |
| `event` | Used for streaming actions and real-time data transfer (e.g., `stream.start`, `stream.chunk`, `stream.end`). |
New Properties
`payload`

Encapsulates modality-specific properties to avoid bloating the top-level schema.

Example:

```json
"payload": { "voice": { "contentType": "audio/webm", "contentUrl": "..." } }
```
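As a sketch of how a consumer might dispatch on `payload` under the one-modality-per-activity rule (the helper name is hypothetical, not part of the spec):

```typescript
// Hypothetical helper: reports which modality a payload carries.
// Assumes one modality per activity, per this proposal.
type Payload = Record<string, unknown>;

function detectModality(payload: Payload): string | undefined {
  const known = ["text", "voice", "image", "video"];
  return known.find((m) => payload[m] !== undefined);
}

const payload = { voice: { contentType: "audio/webm", contentUrl: "..." } };
console.log(detectModality(payload)); // "voice"
```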
4.2 Message Examples
- The existing `text` property in `message` will continue to work for legacy bots.
- New implementations should use `payload` for multimodal support.
Text Message

```json
{
  "type": "message",
  "payload": {
    "text": {
      "content": "Book a flight to Paris",
      "textFormat": "plain",
      "locale": "en-us"
    }
  }
}
```

Voice Message
```json
{
  "type": "message",
  "payload": {
    "voice": {
      "contentType": "audio/webm",
      "contentUrl": "data:audio/webm;base64,...",
      "transcription": "Book a flight to Paris",
      "timestamp": "2025-10-07T10:30:00Z",
      "duration": "3.4s",
      "sentiment": "neutral"
    }
  }
}
```

4.3 Streaming as Events
Streaming is modeled as events to clarify lifecycle and align with industry standards for real-time data transfer.
Stream Start

```json
{
  "type": "event",
  "name": "stream.start",
  "value": {
    "streamId": "abc123",       // Unique stream identifier
    "contentType": "audio/webm" // Content type for the stream
  }
}
```

Stream Chunk
```json
{
  "type": "event",
  "name": "stream.chunk",
  "value": {
    "streamId": "abc123",
    "seq": 2,                           // Sequence number for this chunk
    "isFinal": false,                   // Indicates if this is the last chunk
    "timestamp": "2025-10-07T10:30:05Z" // Timestamp of the chunk
  },
  "payload": {
    "voice": {
      "contentType": "audio/webm",                         // Audio content type
      "contentUrl": "data:audio/webm;base64,...",          // Audio chunk data (Base64 encoded)
      "duration": "2.5s",                                  // Duration of the chunk
      "timestamp": "2025-10-07T10:30:05Z",                 // Timestamp of the chunk
      "transcription": "Can you tell me your destination?" // Transcription of the audio
    }
  }
}
```

`isFinal` defaults to `false`; set it to `true` only for the last chunk. `stream.end` signals completion explicitly, so `isFinal` is not required there.
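A receiver has to tolerate out-of-order chunks and the optional `isFinal` flag. A minimal reassembly sketch (the function and the reduced event shape are assumptions for illustration, not part of the spec):

```typescript
// Hypothetical sketch: collect stream.chunk events for one streamId,
// order them by seq, and join their transcriptions.
interface ChunkEvent {
  name: "stream.chunk";
  value: { streamId: string; seq: number; isFinal?: boolean };
  payload: { voice: { transcription?: string } };
}

function reassemble(chunks: ChunkEvent[], streamId: string): string {
  return chunks
    .filter((c) => c.value.streamId === streamId)
    .sort((a, b) => a.value.seq - b.value.seq) // chunks may arrive out of order
    .map((c) => c.payload.voice.transcription ?? "")
    .join(" ")
    .trim();
}

const chunks: ChunkEvent[] = [
  { name: "stream.chunk", value: { streamId: "abc123", seq: 2, isFinal: true },
    payload: { voice: { transcription: "to Paris" } } },
  { name: "stream.chunk", value: { streamId: "abc123", seq: 1 },
    payload: { voice: { transcription: "Book a flight" } } },
];
console.log(reassemble(chunks, "abc123")); // "Book a flight to Paris"
```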
Stream End

```json
{
  "type": "event",
  "name": "stream.end",
  "value": {
    "streamId": "abc123" // Stream ID to indicate the end of the stream
  }
}
```

4.4 Lifecycle Commands with CommandResult
Commands continue to be used for session management (e.g., initializing a session, updating its state, and ending the session).
Session Init

```json
{
  "type": "command",
  "id": "cmd1",
  "name": "session.init",
  "value": {
    "sessionId": "sess_123" // Session identifier
    // Additional control parameters
  }
}

// Command Result
{
  "type": "commandResult",
  "replyToId": "cmd1",
  "value": { "status": "success", "sessionId": "sess_123" }
}
```

Session Update
```json
{
  "type": "command",
  "id": "cmd2",
  "name": "session.update",
  "value": {
    "state": "listening"
    // Bot's state
    // ENUM: listening (input.expected) | thinking (processing) | speaking (output.generating) | idle | error
  }
}

// Command Result
{
  "type": "commandResult",
  "replyToId": "cmd2",
  "value": { "status": "acknowledged" }
}
```

- `listening`: Bot is awaiting user input (input.expected).
- `thinking`: Bot is processing the input (processing).
- `speaking`: Bot is generating or delivering output (output.generating).
- `idle`: The bot is not currently in an active state.
- `error`: An error has occurred during the interaction.
Session End

```json
{
  "type": "command",
  "name": "session.end",
  "modality": "voice",
  "value": {
    "reason": "completed"
  }
}
```

Barge-In
```json
{
  "type": "command",
  "name": "session.update",
  "value": {
    "signal": "bargeIn",
    "origin": "user" // could be "system" or "user"
  }
}
```

4.5 Lifecycle Flow
Client -> Server:
session.init -> stream.start -> stream.chunk -> stream.end -> bargeIn (optional)
Server -> Client:
session.update (listening -> thinking -> speaking) -> message
Barge-In:
Client sends bargeIn -> Server returns to listening
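The flow above implies a small state machine. A sketch of the allowed transitions, including barge-in returning the bot to `listening` (the transition table is one reading of this proposal, not normative):

```typescript
type SessionState = "listening" | "thinking" | "speaking" | "idle" | "error";

// Assumed transition table derived from the lifecycle flow above;
// "speaking" -> "listening" covers the barge-in case.
const transitions: Record<SessionState, SessionState[]> = {
  idle: ["listening", "error"],
  listening: ["thinking", "idle", "error"],
  thinking: ["speaking", "error"],
  speaking: ["listening", "idle", "error"],
  error: ["idle"],
};

function canTransition(from: SessionState, to: SessionState): boolean {
  return transitions[from].includes(to);
}

console.log(canTransition("speaking", "listening")); // true (barge-in)
console.log(canTransition("listening", "speaking")); // false (must think first)
```

A server could reject a `session.update` whose `state` is not reachable from its current state, instead of silently accepting it.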
5. Backward Compatibility
- The existing `text` property in `message` remains supported.
- The payload-first design is additive for new implementations.
- `command` and `event` usage is also additive.
- No breaking changes for existing bots.
6. Sample Use Cases
- Voice Streaming: `stream.start`, `stream.chunk`, `stream.end`.
- Text Streaming: the same event pattern with `modality: text`.
- Multimodal Message: combined text + voice in `payload`.
- Lifecycle Control: `session.update` for states such as `listening`, `thinking`, `speaking`, `idle`, `error`, and barge-in.
7. Alternatives Considered

| Option | Description | Reason |
|---|---|---|
| Nested multimodal object | Combine voice, text, and other inputs under a single object. | Increases schema complexity; unnecessary for initial voice support. |
| Voice under channelData | Custom per-channel implementation. | Non-standard; leads to fragmentation and lack of portability. |
| Voice as an entity | Represent audio as a metadata entity. | Entities are not intended for content payloads. |
8. Conclusion

By:
- Using a single payload per message (one modality per activity),
- Modeling streaming as explicit events,
- Using clear command semantics for lifecycle,
- Reusing stream attributes for sequencing, and
- Introducing a `modalities` array when multiple modalities in one message are required,

we achieve a clean, extensible, and backward-compatible schema for multimodal interactions.
Appendix A: Full State Transition

Client -> Server

- session.init

```json
{ "type": "command", "id": "cmd1", "name": "session.init", "value": { "sessionId": "sess_123" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd1", "value": { "status": "success", "sessionId": "sess_123" } }
```

- stream.start

```json
{ "type": "event", "name": "stream.start", "value": { "streamId": "abc123", "contentType": "audio/webm" } }
```

- stream.chunk

```json
{ "type": "event", "name": "stream.chunk", "value": { "streamId": "abc123", "seq": 1 }, "payload": { "voice": { "contentUrl": "data:audio/webm;base64,..." } } }
```

- stream.end

```json
{ "type": "event", "name": "stream.end", "value": { "streamId": "abc123" } }
```

Server -> Client

- session.update (listening)

```json
{ "type": "command", "id": "cmd2", "name": "session.update", "value": { "state": "listening" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd2", "value": { "status": "acknowledged" } }
```

- session.update (thinking)

```json
{ "type": "command", "id": "cmd3", "name": "session.update", "value": { "state": "thinking" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd3", "value": { "status": "acknowledged" } }
```

- session.update (speaking)

```json
{ "type": "command", "id": "cmd4", "name": "session.update", "value": { "state": "speaking" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd4", "value": { "status": "acknowledged" } }
```

- Final Message

```json
{
  "type": "message",
  "payload": {
    "voice": { "contentType": "audio/webm", "contentUrl": "data:audio/webm;base64,..." }
  }
}
```