
Extend Activity Schema to Support Multimodal Interactions with Streaming #377

@gurubhg

Description


Status

Draft – For Core Committee Review


1. Overview

The current Bot Framework Activity Schema is text-centric and lacks a unified approach for voice and multimodal streaming. As conversational AI evolves toward speech-first and multimodal experiences, there is a need for:

  • A consistent schema for messages across modalities (text, voice, video, image, etc.).
  • Standard signalling commands for lifecycle states across all modalities.
  • Unified streaming semantics for real-time data transfer.

The proposal introduces:

  • Support for a single payload per message (one modality per activity).
  • Streaming modeled as events (stream.start, stream.chunk, stream.end).
  • Lifecycle modeled as commands (session.init, session.update, session.end) with commandResult responses.
  • Reuse of streamInfo attributes inside event value for sequencing and continuity.
  • Future: introduce a modalities array property to support multiple modalities in one message.

2. Motivation

Current Limitations

Area           Limitation
Streaming      The typing activity is text-specific; there are no unified streaming semantics for voice or multimodal content.
Signalling     Lifecycle states are modeled as generic events; they lack the clarity of commands.
Extensibility  Adding modalities requires schema hacks.

3. Goals

  • Use message for all final payloads with a payload-first design.
  • Model streaming as events for clarity.
  • Standardize lifecycle commands across modalities.
  • Maintain backward compatibility.
  • Enable future multimodal sessions without schema churn.

4. Proposed Changes

4.1 Activity Types and New Properties

Type           Description
message        Final message for any modality (text, voice, video, image). Enhanced with payload.
command        Used for session management (session.init, session.update, session.end).
commandResult  Response to a command, as per the Activity Schema.
event          Used for streaming actions and real-time data transfer (e.g., stream.start, stream.chunk, stream.end).

New Properties

  • payload
    Encapsulates modality-specific properties to avoid bloating the schema.
    Example:
    "payload": {
      "voice": { "contentType": "audio/webm", "contentUrl": "..." }
    }

4.2 Message Examples

  • Existing text property in message will continue to work for legacy bots.
  • New implementations should use payload for multimodal support.

Text Message

{
  "type": "message",
  "payload": {
    "text": {
      "content": "Book a flight to Paris",
      "textFormat": "plain",
      "locale": "en-us"
    }
  }
}

Example: Voice Message

{
  "type": "message",
  "payload": {
    "voice": {
      "contentType": "audio/webm",
      "contentUrl": "data:audio/webm;base64,...",
      "transcription": "Book a flight to Paris",
      "timestamp": "2025-10-07T10:30:00Z",
      "duration": "3.4s",
      "sentiment": "neutral"
    }
  }
}
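The payload-first shape above can be sketched with a small constructor. Note that make_message is a hypothetical helper written for this proposal's examples, not part of any shipped SDK; the field names mirror the JSON above.

```python
# Hypothetical helper illustrating the proposed payload-first message
# shape: one modality key per activity under "payload".

def make_message(modality: str, content: dict) -> dict:
    """Wrap modality-specific content in a payload-first message activity."""
    return {"type": "message", "payload": {modality: dict(content)}}

# Text message, matching the example above
text_msg = make_message("text", {
    "content": "Book a flight to Paris",
    "textFormat": "plain",
    "locale": "en-us",
})

# Voice message, matching the example above
voice_msg = make_message("voice", {
    "contentType": "audio/webm",
    "contentUrl": "data:audio/webm;base64,...",
    "transcription": "Book a flight to Paris",
})
```

Because the modality is a key inside payload rather than a fixed top-level property, adding a new modality later means adding a new key, not a schema change.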

4.3 Streaming as Events

Streaming is modeled as events to clarify lifecycle and align with industry standards for real-time data transfer.

Stream Start

{
  "type": "event",
  "name": "stream.start",
  "value": {
    "streamId": "abc123", // Unique stream identifier
    "contentType": "audio/webm" // Content type for the stream
  }
}

Stream Chunk

{
  "type": "event",
  "name": "stream.chunk",
  "value": {
    "streamId": "abc123",
    "seq": 2, // Sequence number for this chunk
    "isFinal": false, // Indicates if this is the last chunk
    "timestamp": "2025-10-07T10:30:05Z"  // Timestamp of the chunk
  },
  "payload": {
    "voice": {
      "contentType": "audio/webm",  // Audio content type
      "contentUrl": "data:audio/webm;base64,...",  // Audio chunk data (Base64 encoded)
      "duration": "2.5s",  // Duration of the chunk
      "timestamp": "2025-10-07T10:30:05Z",  // Timestamp of the chunk
      "transcription": "Can you tell me your destination?"  // Transcription of the audio
    }
  }
}

Default for isFinal: false.
Set true only for the last chunk.
stream.end signals completion explicitly, so isFinal is not required there.

Stream End

{
  "type": "event",
  "name": "stream.end",
  "value": {
    "streamId": "abc123" // Stream ID to indicate the end of the stream
  }
}
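A receiver can use streamId, seq, and isFinal to reassemble chunks that arrive out of order. The sketch below is a minimal illustration of that sequencing logic; StreamBuffer is a hypothetical name, and only the event/value field names come from this proposal.

```python
# Sketch of a receiver that reorders stream.chunk events by seq and
# detects completion via isFinal or a stream.end event.

class StreamBuffer:
    def __init__(self, stream_id: str):
        self.stream_id = stream_id
        self.chunks = {}       # seq -> payload
        self.final_seq = None  # highest expected seq, once known

    def on_event(self, activity: dict) -> None:
        name = activity.get("name")
        value = activity.get("value", {})
        if value.get("streamId") != self.stream_id:
            return  # event belongs to another stream
        if name == "stream.chunk":
            seq = value["seq"]
            self.chunks[seq] = activity.get("payload")
            if value.get("isFinal", False):  # default is false
                self.final_seq = seq
        elif name == "stream.end":
            # stream.end signals completion even without an isFinal chunk
            self.final_seq = max(self.chunks, default=0)

    def is_complete(self) -> bool:
        if self.final_seq is None:
            return False
        # complete when every sequence number up to final_seq has arrived
        return all(s in self.chunks for s in range(1, self.final_seq + 1))

    def ordered_payloads(self) -> list:
        return [self.chunks[s] for s in sorted(self.chunks)]
```

This also shows why isFinal is optional on the last chunk: stream.end alone is enough for the receiver to close out the buffer.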

4.4 Lifecycle Commands with CommandResult

Commands continue to be used for session management (e.g., initializing a session, updating its state, and ending the session).

Session Init

{
  "type": "command",
  "id": "cmd1",
  "name": "session.init",
  "value": {
    "sessionId": "sess_123" // Session identifier
    // Additional control parameters
  }
}

// Command Result
{
  "type": "commandResult",
  "replyToId": "cmd1",
  "value": { "status": "success", "sessionId": "sess_123" }
}
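The commandResult correlation above relies on replyToId matching the command's id. A minimal sketch of that bookkeeping on the sender's side, assuming a hypothetical CommandTracker type (not an SDK class):

```python
# Sketch of correlating commandResult activities back to the commands
# that produced them via replyToId.

class CommandTracker:
    def __init__(self):
        self.pending = {}   # command id -> command activity awaiting a result
        self.results = {}   # command id -> commandResult value

    def send(self, command: dict) -> None:
        # record the command before (or as) it is transmitted
        self.pending[command["id"]] = command

    def on_result(self, result: dict) -> None:
        reply_to = result.get("replyToId")
        if reply_to in self.pending:
            self.results[reply_to] = result.get("value", {})
            del self.pending[reply_to]

# Usage, mirroring the session.init exchange above
tracker = CommandTracker()
tracker.send({"type": "command", "id": "cmd1", "name": "session.init",
              "value": {"sessionId": "sess_123"}})
tracker.on_result({"type": "commandResult", "replyToId": "cmd1",
                   "value": {"status": "success", "sessionId": "sess_123"}})
```

Leftover entries in pending give the sender a natural place to implement timeouts for commands that never receive a result.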

Session Update

{
  "type": "command",
  "id": "cmd2",
  "name": "session.update",
  "value": {
    "state": "listening"
    // Bot's state
    // ENUM: listening (input.expected) | thinking (processing) | speaking (output.generating) | idle | error
  }
}

// Command Result
{
  "type": "commandResult",
  "replyToId": "cmd2",
  "value": { "status": "acknowledged" }
}
  • listening: Bot is awaiting user input (input.expected).
  • thinking: Bot is processing the input (processing).
  • speaking: Bot is generating or delivering output (output.generating).
  • idle: The bot is not currently in an active state.
  • error: An error has occurred during the interaction.
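One way a server might enforce this enum is with a small transition table. The allowed transitions below are an assumption drawn from the lifecycle flow in section 4.5; the proposal itself defines the states but does not mandate a specific transition graph.

```python
# Sketch of validating session.update states. VALID_STATES matches the
# enum above; TRANSITIONS is an assumed graph, not part of the proposal.

VALID_STATES = {"listening", "thinking", "speaking", "idle", "error"}

# Assumed transitions: error is reachable from anywhere; barge-in
# returns the bot from speaking to listening.
TRANSITIONS = {
    "idle": {"listening"},
    "listening": {"thinking", "idle"},
    "thinking": {"speaking", "listening"},
    "speaking": {"listening", "idle"},
}

def next_state(current: str, requested: str) -> str:
    if requested not in VALID_STATES:
        raise ValueError(f"unknown state: {requested}")
    if requested == "error" or requested in TRANSITIONS.get(current, set()):
        return requested
    raise ValueError(f"invalid transition {current} -> {requested}")
```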

Session End

{
  "type": "command",
  "name": "session.end",
  "modality": "voice",
  "value": {
    "reason": "completed"
  }
}

Barge-In

{
  "type": "command",
  "name": "session.update",
  "value": {
    "signal": "bargeIn",
    "origin": "user" // could be system or user
  }
}

4.5 Lifecycle Flow

Client -> Server:
session.init → stream.start → stream.chunk → stream.end → bargeIn (optional)

Server -> Client:
session.update (listening → thinking → speaking) → message

Barge-In:
Client sends bargeIn → Server returns to listening

5. Backward Compatibility

  • Existing text property in message remains supported.
  • payload-first is additive for new implementations.
  • command and event are also additive.
  • No breaking changes for existing bots.
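The compatibility story can be illustrated by a consumer that reads text from either shape. Note that get_text is a hypothetical helper, not an SDK function; it shows why payload-first is additive rather than breaking.

```python
# Sketch of a consumer that accepts both the legacy top-level `text`
# property and the new payload-first shape.

def get_text(activity: dict):
    # new payload-first shape takes precedence when present
    payload_text = activity.get("payload", {}).get("text", {})
    if "content" in payload_text:
        return payload_text["content"]
    # fall back to the legacy text property for existing bots
    return activity.get("text")

legacy = {"type": "message", "text": "hello"}
modern = {"type": "message",
          "payload": {"text": {"content": "hello", "textFormat": "plain"}}}
```

Both activities yield the same text, so existing bots keep working while new implementations adopt payload.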

6. Sample Use Cases

  • Voice Streaming: stream.start, stream.chunk, stream.end.
  • Text Streaming: Same event pattern (stream.start, stream.chunk, stream.end) with a text payload.
  • Multimodal Message: Combined text + voice in payload.
  • Lifecycle Control: session.update for states like listening, thinking, speaking, idle, error, and for barge-in.

7. Alternatives Considered

  • Nested multimodal object: combine voice, text, and other inputs under a single object. Rejected because it increases schema complexity and is unnecessary for initial voice support.
  • Voice under channelData: a custom per-channel implementation. Rejected because it is non-standard and leads to fragmentation and a lack of portability.
  • Voice as an entity: represent audio as a metadata entity. Rejected because entities are not intended for content payloads.

8. Conclusion

By:

  • Supporting a single payload per message (one modality per activity),
  • Modeling streaming as explicit events,
  • Using clear command semantics for lifecycle,
  • Reusing stream attributes for sequencing, and
  • Introducing a modalities array when multiple modalities in one message are required,

we achieve a clean, extensible, and backward-compatible schema for multimodal interactions.


Appendix A: Full State Transition

Client -> Server

  1. session.init
{ "type": "command", "id": "cmd1", "name": "session.init", "value": { "sessionId": "sess_123" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd1", "value": { "status": "success", "sessionId": "sess_123" } }
  2. stream.start
{ "type": "event", "name": "stream.start", "value": { "streamId": "abc123", "contentType": "audio/webm" } }
  3. stream.chunk
{ "type": "event", "name": "stream.chunk", "value": { "streamId": "abc123", "seq": 1 }, "payload": { "voice": { "contentUrl": "data:audio/webm;base64,..." } } }
  4. stream.end
{ "type": "event", "name": "stream.end", "value": { "streamId": "abc123" } }
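The client-side sequence above can be sketched as a generator that emits the activities in order. client_activity_sequence is a hypothetical helper; the fixed command id cmd1 and the audio/webm content type follow the examples in this appendix.

```python
# Sketch of a client emitting the appendix's activity sequence:
# session.init, stream.start, stream.chunk (per chunk), stream.end.

def client_activity_sequence(session_id: str, stream_id: str, chunks):
    yield {"type": "command", "id": "cmd1", "name": "session.init",
           "value": {"sessionId": session_id}}
    yield {"type": "event", "name": "stream.start",
           "value": {"streamId": stream_id, "contentType": "audio/webm"}}
    # sequence numbers start at 1, one stream.chunk event per payload
    for seq, payload in enumerate(chunks, start=1):
        yield {"type": "event", "name": "stream.chunk",
               "value": {"streamId": stream_id, "seq": seq},
               "payload": payload}
    yield {"type": "event", "name": "stream.end",
           "value": {"streamId": stream_id}}

acts = list(client_activity_sequence("sess_123", "abc123",
                                     [{"voice": {"contentUrl": "..."}}]))
```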

Server -> Client

  1. session.update (listening)
{ "type": "command", "id": "cmd2", "name": "session.update", "value": { "state": "listening" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd2", "value": { "status": "acknowledged" } }
  2. session.update (thinking)
{ "type": "command", "id": "cmd3", "name": "session.update", "value": { "state": "thinking" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd3", "value": { "status": "acknowledged" } }
  3. session.update (speaking)
{ "type": "command", "id": "cmd4", "name": "session.update", "value": { "state": "speaking" } }

// Command Result
{ "type": "commandResult", "replyToId": "cmd4", "value": { "status": "acknowledged" } }
  4. Final Message
{
  "type": "message",
  "payload": {
    "voice": { "contentType": "audio/webm", "contentUrl": "data:audio/webm;base64,..." }
  }
}

Labels: Specs (related to the Activity Protocol Specification)