
Extend Activity Schema to Support Multimodal Interactions with Streaming #377

@gurubhg

Description


Status

Draft – For Core Committee Review


1. Overview

The current Bot Framework Activity Schema is text-centric and lacks a unified approach for voice and multimodal streaming. As conversational AI evolves toward speech-first and multimodal experiences, there is a need for:

  • A consistent schema for messages across modalities (text, voice, video, image, etc.).
  • Standard signalling commands for lifecycle states across all modalities.
  • Unified streaming semantics for real-time data transfer.

The proposal introduces:

  • Support for a single payload per message (one modality per activity).
  • Streaming modeled as events (stream.start, stream.chunk, stream.end).
  • Lifecycle modeled as commands (session.init, session.update, session.end) with commandResult responses.
  • Reuse of streamInfo attributes inside event value for sequencing and continuity.
  • Future: introduce a modalities array property to support multiple modalities in one message.

2. Motivation

Current Limitations

Area           Limitation
Streaming      The typing activity is text-specific; there are no unified streaming semantics for voice or multimodal content.
Signalling     Lifecycle states are modeled as generic events; they lack the clarity of commands.
Extensibility  Adding modalities requires schema hacks.

3. Goals

  • Use message for all final payloads with a payload-first design.
  • Model streaming as events for clarity.
  • Standardize lifecycle commands across modalities.
  • Maintain backward compatibility.
  • Enable future multimodal sessions without schema churn.

4. Proposed Changes

4.1 Activity Types and New Properties

Type           Description
message        Final message for any modality (text, voice, video, image). Enhanced with payload.
command        Used for session management (session.init, session.update, session.end).
commandResult  Response to a command, as per the Activity Schema.
event          Used for streaming actions and real-time data transfer (e.g., stream.start, stream.chunk, stream.end).

New Properties

  • payload
    Encapsulates modality-specific properties to avoid bloating the schema.
    Example:
    "payload": {
      "voice": { "contentType": "audio/webm", "contentUrl": "..." }
    }

4.2 Message Examples

  • Existing text property in message will continue to work for legacy bots.
  • New implementations should use payload for multimodal support.

Text Message

{
  "type": "message",
  "payload": {
    "text": {
      "content": "Book a flight to Paris",
      "textFormat": "plain",
      "locale": "en-us"
    }
  }
}

Example: Voice Message

{
  "type": "message",
  "payload": {
    "voice": {
      "contentType": "audio/webm",
      "contentUrl": "data:audio/webm;base64,...",
      "transcription": "Book a flight to Paris",
      "timestamp": "2025-10-07T10:30:00Z",
      "duration": "3.4s",
      "sentiment": "neutral"
    }
  }
}
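The payload-first shape above can be sketched with a small constructor. Note that make_message is a hypothetical helper written for this proposal's examples, not part of any shipped SDK; the field names mirror the JSON above.

```python
# Hypothetical helper illustrating the proposed payload-first message
# shape: one modality key per activity under "payload".

def make_message(modality: str, content: dict) -> dict:
    """Wrap modality-specific content in a payload-first message activity."""
    return {"type": "message", "payload": {modality: dict(content)}}

# Text message, matching the example above
text_msg = make_message("text", {
    "content": "Book a flight to Paris",
    "textFormat": "plain",
    "locale": "en-us",
})

# Voice message, matching the example above
voice_msg = make_message("voice", {
    "contentType": "audio/webm",
    "contentUrl": "data:audio/webm;base64,...",
    "transcription": "Book a flight to Paris",
})
```

Because the modality is a key inside payload rather than a fixed top-level property, adding a new modality later means adding a new key, not a schema change.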

4.3 Streaming as Events

Streaming is modeled as events to clarify lifecycle and align with industry standards for real-time data transfer.

Stream Start

{
  "type": "event",
  "name": "stream.start",
  "value": {
    "streamId": "abc123", // Unique stream identifier
    "contentType": "audio/webm" // Content type for the stream
  }
}

Stream Chunk

{
  "type": "event",
  "name": "stream.chunk",
  "value": {
    "streamId": "abc123",
    "seq": 2, // Sequence number for this chunk
    "isFinal": false, // Indicates if this is the last chunk
    "timestamp": "2025-10-07T10:30:05Z"  // Timestamp of the chunk
  },
  "payload": {
    "voice": {
      "contentType": "audio/webm",  // Audio content type
      "contentUrl": "data:audio/webm;base64,...",  // Audio chunk data (Base64 encoded)
      "duration": "2.5s",  // Duration of the chunk
      "timestamp": "2025-10-07T10:30:05Z",  // Timestamp of the chunk
      "transcription": "Can you tell me your destination?"  // Transcription of the audio
    }
  }
}

Default for isFinal: false.
Set true only for the last chunk.
stream.end signals completion explicitly, so isFinal is not required there.

Stream End

{
  "type": "event",
  "name": "stream.end",
  "value": {
    "streamId": "abc123" // Stream ID to indicate the end of the stream
  }
}
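A receiver can use streamId, seq, and isFinal to reassemble chunks that arrive out of order. The sketch below is a minimal illustration of that sequencing logic; StreamBuffer is a hypothetical name, and only the event/value field names come from this proposal.

```python
# Sketch of a receiver that reorders stream.chunk events by seq and
# detects completion via isFinal or a stream.end event.

class StreamBuffer:
    def __init__(self, stream_id: str):
        self.stream_id = stream_id
        self.chunks = {}       # seq -> payload
        self.final_seq = None  # highest expected seq, once known

    def on_event(self, activity: dict) -> None:
        name = activity.get("name")
        value = activity.get("value", {})
        if value.get("streamId") != self.stream_id:
            return  # event belongs to another stream
        if name == "stream.chunk":
            seq = value["seq"]
            self.chunks[seq] = activity.get("payload")
            if value.get("isFinal", False):  # default is false
                self.final_seq = seq
        elif name == "stream.end":
            # stream.end signals completion even without an isFinal chunk
            self.final_seq = max(self.chunks, default=0)

    def is_complete(self) -> bool:
        if self.final_seq is None:
            return False
        # complete when every sequence number up to final_seq has arrived
        return all(s in self.chunks for s in range(1, self.final_seq + 1))

    def ordered_payloads(self) -> list:
        return [self.chunks[s] for s in sorted(self.chunks)]
```

This also shows why isFinal is optional on the last chunk: stream.end alone is enough for the receiver to close out the buffer.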

4.4 Lifecycle Commands with CommandResult

Commands continue to be used for session management (e.g., initializing a session, updating its state, and ending the session).

Session Init

{
  "type": "command",
  "id": "cmd1",
  "name": "session.init",
  "value": {
    "sessionId": "sess_123" // Session identifier
    // Additional control parameters
  }
}

// Command Result
{
  "type": "commandResult",
  "replyToId": "cmd1",
  "value": { "status": "success", "sessionId": "sess_123" }
}
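The commandResult correlation above relies on replyToId matching the command's id. A minimal sketch of that bookkeeping on the sender's side, assuming a hypothetical CommandTracker type (not an SDK class):

```python
# Sketch of correlating commandResult activities back to the commands
# that produced them via replyToId.

class CommandTracker:
    def __init__(self):
        self.pending = {}   # command id -> command activity awaiting a result
        self.results = {}   # command id -> commandResult value

    def send(self, command: dict) -> None:
        # record the command before (or as) it is transmitted
        self.pending[command["id"]] = command

    def on_result(self, result: dict) -> None:
        reply_to = result.get("replyToId")
        if reply_to in self.pending:
            self.results[reply_to] = result.get("value", {})
            del self.pending[reply_to]

# Usage, mirroring the session.init exchange above
tracker = CommandTracker()
tracker.send({"type": "command", "id": "cmd1", "name": "session.init",
              "value": {"sessionId": "sess_123"}})
tracker.on_result({"type": "commandResult", "replyToId": "cmd1",
                   "value": {"status": "success", "sessionId": "sess_123"}})
```

Leftover entries in pending give the sender a natural place to implement timeouts for commands that never receive a result.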

Session Update

{
  "type": "command",
  "id": "cmd2",
  "name": "session.update",
  "value": {
    "state": "listening"
    // Bot's state
    // ENUM: listening (input.expected) | thinking (processing) | speaking (output.generating) | idle | error
  }
}

// Command Result
{
  "type": "commandResult",
  "replyToId": "cmd2",
  "value": { "status": "acknowledged" }
}
  • listening: Bot is awaiting user input (input.expected).
  • thinking: Bot is processing the input (processing).
  • speaking: Bot is generating or delivering output (output.generating).
  • idle: The bot is not currently in an active state.
  • error: An error has occurred during the interaction.
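One way a server might enforce this enum is with a small transition table. The allowed transitions below are an assumption drawn from the lifecycle flow in section 4.5; the proposal itself defines the states but does not mandate a specific transition graph.

```python
# Sketch of validating session.update states. VALID_STATES matches the
# enum above; TRANSITIONS is an assumed graph, not part of the proposal.

VALID_STATES = {"listening", "thinking", "speaking", "idle", "error"}

# Assumed transitions: error is reachable from anywhere; barge-in
# returns the bot from speaking to listening.
TRANSITIONS = {
    "idle": {"listening"},
    "listening": {"thinking", "idle"},
    "thinking": {"speaking", "listening"},
    "speaking": {"listening", "idle"},
}

def next_state(current: str, requested: str) -> str:
    if requested not in VALID_STATES:
        raise ValueError(f"unknown state: {requested}")
    if requested == "error" or requested in TRANSITIONS.get(current, set()):
        return requested
    raise ValueError(f"invalid transition {current} -> {requested}")
```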

Session End

{
  "type": "command",
  "name": "session.end",
  "modality": "voice",
  "value": {
    "reason": "completed"
  }
}

Barge-In

{
  "type": "command",
  "name": "session.update",
  "value": {
    "signal": "bargeIn",
    "origin": "user" // could be system or user
  }
}

4.5 Lifecycle Flow

Client -> Server:
session.init → stream.start → stream.chunk → stream.end → bargeIn (optional)

Server -> Client:
session.update (listening → thinking → speaking) → message

Barge-In:
Client sends bargeIn → Server returns to listening

5. Backward Compatibility

  • Existing text property in message remains supported.
  • payload-first is additive for new implementations.
  • command and event are also additive.
  • No breaking changes for existing bots.
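The compatibility story can be illustrated by a consumer that reads text from either shape. Note that get_text is a hypothetical helper, not an SDK function; it shows why payload-first is additive rather than breaking.

```python
# Sketch of a consumer that accepts both the legacy top-level `text`
# property and the new payload-first shape.

def get_text(activity: dict):
    # new payload-first shape takes precedence when present
    payload_text = activity.get("payload", {}).get("text", {})
    if "content" in payload_text:
        return payload_text["content"]
    # fall back to the legacy text property for existing bots
    return activity.get("text")

legacy = {"type": "message", "text": "hello"}
modern = {"type": "message",
          "payload": {"text": {"content": "hello", "textFormat": "plain"}}}
```

Both activities yield the same text, so existing bots keep working while new implementations adopt payload.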

6. Sample Use Cases

  • Voice Streaming: stream.start, stream.chunk, stream.end.
  • Text Streaming: Same event pattern (stream.start, stream.chunk, stream.end) with a text payload.
  • Multimodal Message: Combined text + voice in payload.
  • Lifecycle Control: session.update for states like listening, thinking, speaking, idle, error, and for barge-in.

7. Alternatives Considered

  • Nested multimodal object: combine voice, text, and other inputs under a single object. Rejected because it increases schema complexity and is unnecessary for initial voice support.
  • Voice under channelData: a custom per-channel implementation. Rejected because it is non-standard and leads to fragmentation and a lack of portability.
  • Voice as an entity: represent audio as a metadata entity. Rejected because entities are not intended for content payloads.

8. Conclusion

By:

  • Supporting a single payload per message (one modality per activity),
  • Modeling streaming as explicit events,
  • Using clear command semantics for lifecycle,
  • Reusing stream attributes for sequencing, and
  • Introducing a modalities array when multiple modalities in one message are required,

we achieve a clean, extensible, and backward-compatible schema for multimodal interactions.


Appendix A: Full State Transition

Client -> Server

  1. session.init
{ "type": "command", "id": "cmd1", "name": "session.init", "value": { "sessionId": "sess_123" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd1", "value": { "status": "success", "sessionId": "sess_123" } }
  2. stream.start
{ "type": "event", "name": "stream.start", "value": { "streamId": "abc123", "contentType": "audio/webm" } }
  3. stream.chunk
{ "type": "event", "name": "stream.chunk", "value": { "streamId": "abc123", "seq": 1 }, "payload": { "voice": { "contentUrl": "data:audio/webm;base64,..." } } }
  4. stream.end
{ "type": "event", "name": "stream.end", "value": { "streamId": "abc123" } }
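The client-side sequence above can be sketched as a generator that emits the activities in order. client_activity_sequence is a hypothetical helper; the fixed command id cmd1 and the audio/webm content type follow the examples in this appendix.

```python
# Sketch of a client emitting the appendix's activity sequence:
# session.init, stream.start, stream.chunk (per chunk), stream.end.

def client_activity_sequence(session_id: str, stream_id: str, chunks):
    yield {"type": "command", "id": "cmd1", "name": "session.init",
           "value": {"sessionId": session_id}}
    yield {"type": "event", "name": "stream.start",
           "value": {"streamId": stream_id, "contentType": "audio/webm"}}
    # sequence numbers start at 1, one stream.chunk event per payload
    for seq, payload in enumerate(chunks, start=1):
        yield {"type": "event", "name": "stream.chunk",
               "value": {"streamId": stream_id, "seq": seq},
               "payload": payload}
    yield {"type": "event", "name": "stream.end",
           "value": {"streamId": stream_id}}

acts = list(client_activity_sequence("sess_123", "abc123",
                                     [{"voice": {"contentUrl": "..."}}]))
```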

Server -> Client

  1. session.update (listening)
{ "type": "command", "id": "cmd2", "name": "session.update", "value": { "state": "listening" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd2", "value": { "status": "acknowledged" } }
  2. session.update (thinking)
{ "type": "command", "id": "cmd3", "name": "session.update", "value": { "state": "thinking" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd3", "value": { "status": "acknowledged" } }
  3. session.update (speaking)
{ "type": "command", "id": "cmd4", "name": "session.update", "value": { "state": "speaking" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd4", "value": { "status": "acknowledged" } }
  4. Final Message
{
  "type": "message",
  "payload": {
    "voice": { "contentType": "audio/webm", "contentUrl": "data:audio/webm;base64,..." }
  }
}

Labels: Specs (related to the Activity Protocol Specification)