Test-Driven Development for Assistant Workflows

Build and iterate on AI assistant workflows using visor's test framework. This guide covers the full cycle: define your workflow, write tests with mocks, then run against real AI to iterate on prompt quality and assertions.

The TDD Cycle

Define the workflow — skills, tools, knowledge, intents
Write tests with mocks — expected conversations and assertions
Run with mocks — validate structure, routing, and assertion logic
Run with --no-mocks — real AI + real tools, iterate on quality
Refine — fix prompts, relax over-strict assertions, improve knowledge

Setting Up

Project Structure

my-assistant/
├── assistant.yaml           # main workflow config
├── config/
│   └── skills.yaml          # skill definitions
├── docs/
│   └── api-reference.md     # knowledge docs for skills
├── tests/
│   └── skills.tests.yaml    # test file
└── .env                     # API tokens (not committed)

Test File Basics

Test files extend your main config:

# tests/skills.tests.yaml
version: "1.0"
extends: "../assistant.yaml"

tests:
  defaults:
    strict: false    # required for --no-mocks (internal steps run without expectations)

  cases:
    - name: basic-question
      conversation:
        routing: { max_loops: 2 }
        turns:
          - role: user
            text: "What services do we run?"
            mocks:
              chat:
                text: "We run 3 services: API gateway, dashboard, and pump."
                intent: chat
                skills: []
            expect:
              outputs:
                - step: chat
                  path: text
                  matches: "(?i)gateway|dashboard|pump"

Key fields:

extends — imports the full workflow so the test runner knows all steps, skills, and routing
strict: false — prevents failures from internal steps (routing, config building) that don't have assertions
conversation — sugar syntax that auto-expands turns into flow stages with message history

Writing Conversation Tests

Single-Turn Test

The simplest test: one user message, assert on the response.

- name: greeting
  conversation:
    routing: { max_loops: 2 }
    turns:
      - role: user
        text: "Hello, who are you?"
        mocks:
          chat:
            text: "I'm your engineering assistant. I can help with code, deployments, and more."
            intent: chat
            skills: []
        expect:
          calls:
            - step: chat
              exactly: 1
          outputs:
            - step: chat
              path: text
              matches: "(?i)assistant|help"

Multi-Turn Conversation

Each turn's history automatically includes previous turns. Mock response text becomes assistant messages in subsequent turns.

- name: code-explore-then-explain
  conversation:
    routing: { max_loops: 2 }
    turns:
      - role: user
        text: "Find the authentication middleware in the backend service"
        mocks:
          chat:
            text: "Found `auth.go` in `internal/middleware/`. It checks JWT tokens."
            intent: chat
            skills: [code-explorer]
        expect:
          outputs:
            - step: chat
              path: text
              matches: "(?i)auth|middleware"

      - role: user
        text: "Explain how the JWT validation works in that auth middleware you found"
        mocks:
          chat:
            text: "The middleware extracts the Bearer token, validates the signature..."
            intent: chat
            skills: [code-explorer]
        expect:
          llm_judge:
            - step: chat
              turn: current
              path: text
              prompt: "Does the response explain JWT validation with technical details?"
              schema:
                properties:
                  explains_jwt:
                    type: boolean
                required: [explains_jwt]
              assert:
                explains_jwt: true

Mock Response Format

Mocks simulate what the chat step returns:

mocks:
  chat:
    text: "The response text..."
    intent: chat              # which intent was classified
    skills: [code-explorer]   # which skills were activated

Use fictional data in mocks. The mock text becomes the assistant message in subsequent turn history.

Assertion Types

Regex Matching (`outputs`)

Pattern-match on response fields:

expect:
  outputs:
    - step: chat
      path: text
      matches: "(?i)kubernetes|k8s"
    - step: chat
      turn: current        # only check this turn's output
      path: text
      matches: "(?i)deploy"

Call Counting (`calls`)

Assert how many times a step ran:

expect:
  calls:
    - step: chat
      exactly: 1

LLM Judge (`llm_judge`)

Semantic evaluation — assert on meaning, not exact text. This is the most powerful assertion for AI responses:

expect:
  llm_judge:
    - step: chat
      turn: current
      path: text
      prompt: |
        Does the response provide a clear architectural overview
        with component names and their responsibilities?
      schema:
        properties:
          names_components:
            type: boolean
            description: "Names specific components or services?"
          explains_responsibilities:
            type: boolean
            description: "Explains what each component does?"
        required: [names_components, explains_responsibilities]
      assert:
        names_components: true
        explains_responsibilities: true

The judge returns a JSON object matching your schema. assert checks specific fields. You can include fields in the schema for observability without asserting them.

Cross-Turn Assertions

Reference previous turns using turn: N (1-based):

# In turn 2, verify turn 1 was good too
expect:
  llm_judge:
    - step: chat
      turn: current
      path: text
      prompt: "Does turn 2 build on the context from turn 1?"
      ...
    - step: chat
      turn: 1
      path: text
      prompt: "Did turn 1 provide a good foundation?"
      ...

Running Tests

# Validate structure (no AI calls)
visor test --config tests/skills.tests.yaml

# Run a single case
visor test --config tests/skills.tests.yaml --case basic-question

# Run with real AI and real tools
visor test --config tests/skills.tests.yaml --no-mocks

# Real AI, single case
visor test --config tests/skills.tests.yaml --case basic-question --no-mocks

# Debug mode
visor test --config tests/skills.tests.yaml --case basic-question --no-mocks --debug

Mock vs No-Mock Mode

	Mock mode	`--no-mocks`
AI calls	Mocked responses	Real AI provider
Tools	Not called	Real tool execution
`routing.max_loops`	Use `0`	Use `2+` (AI needs iterations for tool calls)
`strict`	Can be `true`	Must be `false` (internal steps fire)
Speed	Fast (seconds)	Slow (AI latency + tool calls)
Use for	Structure validation, CI	Prompt quality iteration

Iterating with `--no-mocks`

This is where the real work happens. Common issues and fixes:

AI ignores tool guidance

Symptom: AI calls wrong endpoint, uses wrong arguments, or skips required steps.

Fix: Improve the knowledge doc with explicit instructions:

# In your skill knowledge:
knowledge: |
  ### Important: Always search by project ID
  When looking up items, **always** use `/projects/{id}/items`
  — never the global `/items` endpoint which returns all projects.

Also make user prompts more specific:

# Before (too vague):
text: "List the items in review"

# After (explicit):
text: "List the items in review stage for the Backend project"

Assertion too strict for real responses

Symptom: Mock includes specific details but real AI response formats them differently.

Fix: Keep the field in the schema for observability but remove from assert:

schema:
  properties:
    lists_items:
      type: boolean
    includes_links:
      type: boolean
      description: "Includes profile links?"
  required: [lists_items, includes_links]
assert:
  lists_items: true
  # includes_links intentionally not asserted — real API
  # doesn't always return this without extra calls

Turn 2 loses context from Turn 1

Symptom: AI asks "which items?" instead of referencing turn 1 results.

Fix: Make follow-up prompts self-contained:

# Before:
text: "Now compare these candidates"

# After:
text: "Now compare the 3 candidates you just evaluated for the SRE role"

`strict: false` not working

Symptom: Test fails with "Step executed without expect: chat.route-intent".

Fix: strict: false must be at the test case level or in tests.defaults, not inside conversation::

# Wrong — ignored by conversation sugar:
conversation:
  strict: false
  turns: ...

# Correct — applied to expanded flow stages:
- name: my-test
  strict: false
  conversation:
    turns: ...

# Or set globally:
tests:
  defaults:
    strict: false

Example: Adding an API Integration Skill

Here's a complete example adding an external REST API as a skill.

1. Define the Skill

# config/skills.yaml
- id: hr-system
  description: |
    request relates to recruiting, hiring pipeline, candidates,
    job postings, or HR pipeline management.
    Examples: "list candidates", "show open positions", "evaluate candidate"
  tools:
    hr-api:
      type: http_client
      base_url: "https://api.hr-system.com/v3"
      auth:
        type: bearer
        token: "${HR_API_TOKEN}"
      headers:
        Content-Type: "application/json"
  knowledge: |
    {% readfile "docs/hr-api-reference.md" %}

2. Write the Knowledge Doc

## HR API Reference

Call the `hr-api` tool with these arguments:

| Argument | Required | Description |
|----------|----------|-------------|
| `path`   | yes      | API path (e.g. `/jobs`, `/candidates/{id}`) |
| `method` | no       | HTTP method (default: `GET`) |
| `query`  | no       | Query parameters |
| `body`   | no       | Request body for POST/PUT |

### Endpoints

| Operation | Method | Path |
|-----------|--------|------|
| List jobs | GET | `/jobs` |
| List candidates | GET | `/jobs/{id}/candidates` |
| Get candidate | GET | `/jobs/{id}/candidates/{cid}` |

### Important
Always use job-specific endpoints (`/jobs/{id}/candidates`)
— never the global `/candidates` endpoint.

3. Write Tests

# tests/hr.tests.yaml
version: "1.0"
extends: "../assistant.yaml"

tests:
  defaults:
    strict: false

  cases:
    - name: hr-pipeline-stats
      description: "Show candidate counts per pipeline stage"
      conversation:
        routing: { max_loops: 2 }
        turns:
          - role: user
            text: "Show me candidate statistics per stage for the SRE role"
            mocks:
              chat:
                text: |
                  Pipeline for **Site Reliability Engineer**:
                  | Stage | Count |
                  |-------|-------|
                  | Sourced | 12 |
                  | Applied | 8 |
                  | Screening | 5 |
                intent: chat
                skills: [hr-system]
            expect:
              outputs:
                - step: chat
                  path: text
                  matches: "(?i)sourced|applied|screen"
              llm_judge:
                - step: chat
                  turn: current
                  path: text
                  prompt: |
                    Does the response show candidates broken down
                    by pipeline stage with numbers?
                  schema:
                    properties:
                      has_stage_breakdown:
                        type: boolean
                      covers_multiple_stages:
                        type: boolean
                  assert:
                    has_stage_breakdown: true
                    covers_multiple_stages: true

4. Iterate

# First: validate test structure
visor test --config tests/hr.tests.yaml

# Then: run against real AI + API
visor test --config tests/hr.tests.yaml --no-mocks

# Iterate on a specific failing case
visor test --config tests/hr.tests.yaml --case hr-pipeline-stats --no-mocks --debug

Each iteration typically involves:

Run --no-mocks and read the failure
Fix the knowledge doc (wrong endpoint? missing guidance?) or the user prompt (too vague?)
Relax assertions that are too strict for real responses
Re-run until green

Example: MCP Command Tool Skill

Skills can also use external MCP servers:

# config/skills.yaml
- id: jira
  description: |
    request relates to Jira issues, tickets, sprints, or project tracking.
  tools:
    atlassian:
      command: uvx
      args: [mcp-atlassian]
      env:
        JIRA_URL: "${JIRA_URL}"
        JIRA_API_TOKEN: "${JIRA_API_TOKEN}"
      allowedMethods: [jira_get_issue, jira_search, jira_create_issue]
  knowledge: |
    You have access to Jira via the atlassian MCP tool.
    Use `jira_search` with JQL queries to find issues.
    Use `jira_get_issue` with an issue key like `PROJ-123`.

Test the same way — conversation sugar works identically regardless of tool type.

Checklist

When adding a new skill with tests:

Add skill to config/skills.yaml with description, tools, and knowledge
Write knowledge doc in docs/ (be explicit about which endpoints/methods to use)
Add secrets to .env
Create tests/<skill>.tests.yaml with extends
Write first test case with mocks — run visor test to validate
Run with --no-mocks — iterate on knowledge doc and prompts
Add multi-turn tests for complex flows
Relax assertions that are too strict for real responses

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Test-Driven Development for Assistant Workflows

The TDD Cycle

Setting Up

Project Structure

Test File Basics

Writing Conversation Tests

Single-Turn Test

Multi-Turn Conversation

Mock Response Format

Assertion Types

Regex Matching (`outputs`)

Call Counting (`calls`)

LLM Judge (`llm_judge`)

Cross-Turn Assertions

Running Tests

Mock vs No-Mock Mode

Iterating with `--no-mocks`

AI ignores tool guidance

Assertion too strict for real responses

Turn 2 loses context from Turn 1

`strict: false` not working

Example: Adding an API Integration Skill

1. Define the Skill

2. Write the Knowledge Doc

3. Write Tests

4. Iterate

Example: MCP Command Tool Skill

Checklist

Related Documentation

Uh oh!

FilesExpand file tree

tdd-assistant-workflows.md

Latest commit

History

tdd-assistant-workflows.md

File metadata and controls

Test-Driven Development for Assistant Workflows

The TDD Cycle

Setting Up

Project Structure

Test File Basics

Writing Conversation Tests

Single-Turn Test

Multi-Turn Conversation

Mock Response Format

Assertion Types

Regex Matching (outputs)

Call Counting (calls)

LLM Judge (llm_judge)

Cross-Turn Assertions

Running Tests

Mock vs No-Mock Mode

Iterating with --no-mocks

AI ignores tool guidance

Assertion too strict for real responses

Turn 2 loses context from Turn 1

strict: false not working

Example: Adding an API Integration Skill

1. Define the Skill

2. Write the Knowledge Doc

3. Write Tests

4. Iterate

Example: MCP Command Tool Skill

Checklist

Related Documentation

Regex Matching (`outputs`)

Call Counting (`calls`)

LLM Judge (`llm_judge`)

Iterating with `--no-mocks`

`strict: false` not working