LLM-based Multi-Agent System for Depression Assessment from Clinical Interviews
AI Psychiatrist implements a research paper's methodology for automated depression assessment using a four-agent LLM pipeline. The system analyzes clinical interview transcripts to selectively infer PHQ-8 item scores when supported by transcript evidence, abstaining (N/A) when evidence is insufficient.
Task validity note: PHQ-8 is a 2-week frequency self-report instrument, while DAIC-WOZ transcripts are not structured as a PHQ administration. Transcript-only item scoring is often underdetermined; interpret results with coverage-aware metrics (AURC/AUGRC). See docs/clinical/task-validity.md.
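To make the coverage-aware framing concrete, here is a minimal sketch of the standard AURC (area under the risk-coverage curve) definition for selective prediction; this is illustrative only, not the repo's actual evaluation code, which may differ in details such as tie handling:

```python
def aurc(confidences, losses):
    """Area under the risk-coverage curve: lower is better.

    Sort predictions from most to least confident; at coverage c = k/n,
    risk is the mean loss over the k most confident predictions.
    AURC averages that risk across all coverage levels.
    """
    order = sorted(range(len(losses)), key=lambda i: -confidences[i])
    risks, total = [], 0.0
    for k, i in enumerate(order, start=1):
        total += losses[i]
        risks.append(total / k)
    return sum(risks) / len(risks)

# A model that is confident exactly where it is correct scores lower
print(aurc([0.9, 0.8, 0.2], [0, 0, 1]))  # good ranking
print(aurc([0.2, 0.8, 0.9], [0, 0, 1]))  # bad ranking (confident but wrong)
```

A model that abstains (N/A) on hard items effectively trades coverage for risk, which is exactly what this curve captures.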
- Four-Agent Pipeline: Qualitative, Judge, Quantitative, and Meta-Review agents collaborate for comprehensive assessment
- Embedding-Based Few-Shot Learning: Paper reports 22% lower item-level MAE vs zero-shot (0.796 → 0.619, Section 3.2); this repo tracks coverage-adjusted metrics (AURC/AUGRC/Cmax) in run artifacts
- Iterative Self-Refinement: Judge agent feedback loop improves assessment quality
- Engineering-Focused: Clean architecture, strict type checking, structured logging, 80%+ test coverage
Greene et al., "AI Psychiatrist Assistant: An LLM-based Multi-Agent System for Depression Assessment from Clinical Interviews" (OpenReview)
Clinical disclaimer: This repository is a research/engineering implementation intended for paper reproduction and experimentation. It is not a medical device and should not be used for clinical diagnosis or treatment decisions.
- Python 3.11+
- Ollama installed and running
- 16GB+ RAM (for 27B models)
```bash
# Clone repository
git clone https://github.com/The-Obstacle-Is-The-Way/ai-psychiatrist.git
cd ai-psychiatrist

# Install dependencies (uses uv)
make dev  # installs dev + docs + HuggingFace (recommended)

# Pull required models
ollama pull gemma3:27b-it-qat   # or gemma3:27b
ollama pull qwen3-embedding:8b

# Configure (uses validated baseline configuration)
cp .env.example .env

# Start server
make serve
```

Note (embeddings backend): chat and embeddings can use different backends:

- `LLM_BACKEND` controls chat for agents (default: `ollama`)
- `EMBEDDING_BACKEND` controls embeddings (default: `huggingface`)

If you want a pure-Ollama setup (no HuggingFace dependencies), set `EMBEDDING_BACKEND=ollama` in `.env`.
Why HF deps matter even if embeddings exist: few-shot retrieval embeds the query (participant evidence) at runtime in the same embedding space. If `EMBEDDING_BACKEND=huggingface`, you still need HF deps (`make dev`) to compute query embeddings, even when reference `*.npz` files are already present.
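The "same embedding space" requirement can be sketched as a plain cosine-similarity lookup; the function below is illustrative (the repo's retrieval service and its names are not shown here), but it shows why the query must be embedded with the same backend that produced the stored reference vectors:

```python
import math

def top_k_references(query, refs, k=2):
    """Rank reference chunks by cosine similarity to the query embedding.

    The query vector must live in the SAME space as the stored references;
    mixing embedding backends would make these similarities meaningless.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    ranked = sorted(range(len(refs)), key=lambda i: cosine(query, refs[i]), reverse=True)
    return ranked[:k]

# Toy 4-dim vectors (the real pipeline uses 4096-dim Qwen3 embeddings)
refs = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0, 0.0]]
query = [1.0, 0.05, 0.0, 0.0]
print(top_k_references(query, refs, k=2))  # → [0, 2]
```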
Optional (Appendix F): The paper evaluates MedGemma 27B as an alternative model for the quantitative agent. There is no official MedGemma model in the Ollama library; use the HuggingFace backend (`make dev`, `LLM_BACKEND=huggingface`, `MODEL_QUANTITATIVE_MODEL=medgemma:27b`) to load the official gated weights.
```bash
curl -X POST http://localhost:8000/full_pipeline \
  -H "Content-Type: application/json" \
  -d '{
    "transcript_text": "Ellie: How are you doing today?\nParticipant: I have been feeling really down lately."
  }'
```

| Document | Description |
|---|---|
| Quickstart | Get running in 5 minutes |
| Architecture | System design and layers |
| Pipeline | How the 4-agent pipeline works |
| PHQ-8 | Understanding depression assessment |
| Configuration | All configuration options |
| API Reference | REST API documentation |
| Glossary | Terms and definitions |
| Reproduction Results | Current-state reproduction summary |
| Run History | Canonical timeline + per-run statistics |
| Document | Description |
|---|---|
| CLAUDE.md | Development guidelines |
| Specs | Specs index (implemented specs are distilled into canonical docs) |
| Data Schema | Dataset format documentation |
```
ai-psychiatrist/
├── src/ai_psychiatrist/
│   ├── agents/              # Four assessment agents
│   ├── domain/              # Entities, enums, value objects
│   ├── services/            # Business logic (feedback loop, embeddings)
│   ├── infrastructure/      # Ollama client, logging
│   └── config.py            # Pydantic settings
├── tests/
│   ├── unit/                # Unit tests
│   ├── integration/         # Integration tests
│   └── e2e/                 # End-to-end tests
├── docs/
│   ├── getting-started/     # Quickstart and tutorials
│   ├── configs/             # Configuration reference + philosophy
│   ├── embeddings/          # Embeddings + retrieval documentation
│   ├── pipeline-internals/  # Feature wiring and internals
│   ├── preflight-checklist/ # Run checklists (zero-shot / few-shot)
│   ├── results/             # Reproduction results + run history
│   ├── statistics/          # Metrics + evaluation methodology (AURC/AUGRC)
│   └── _specs/              # Specs index (implemented specs distilled into docs)
└── data/                    # DAIC-WOZ dataset (gitignored)
```
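The iterative self-refinement between the Qualitative and Judge agents (Section 2.3.1) can be sketched roughly as follows; the function and agent names here are illustrative stand-ins, not the repo's actual API:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    score: int      # Judge's quality rating for the assessment
    feedback: str   # Suggestions the Qualitative agent should address

def run_feedback_loop(assessment, judge, refine,
                      max_iterations=10, score_threshold=3):
    """Refine a qualitative assessment until the judge is satisfied."""
    for _ in range(max_iterations):
        verdict = judge(assessment)
        if verdict.score >= score_threshold:
            break  # Judge approves; stop refining
        assessment = refine(assessment, verdict.feedback)
    return assessment

# Toy demo: a judge that approves once the assessment cites evidence
toy_judge = lambda a: Verdict(3, "") if "evidence" in a else Verdict(1, "cite evidence")
toy_refine = lambda a, fb: a + " (added evidence)"
print(run_feedback_loop("initial draft", toy_judge, toy_refine))
# → initial draft (added evidence)
```

The defaults (`max_iterations=10`, `score_threshold=3`) mirror the `FEEDBACK_MAX_ITERATIONS` and `FEEDBACK_SCORE_THRESHOLD` settings shown in the configuration section.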
```bash
# Full CI pipeline
make ci

# Individual commands
make test        # Run all tests with coverage
make test-unit   # Fast unit tests only
make lint-fix    # Auto-fix linting issues
make typecheck   # mypy strict mode
make format      # Format code with ruff

# Development server with hot reload
make serve

# Enable Ollama integration tests
AI_PSYCHIATRIST_OLLAMA_TESTS=1 make test-e2e
```

All settings via environment variables or a `.env` file:
```bash
# Models (recommended defaults; see `.env.example`)
MODEL_QUALITATIVE_MODEL=gemma3:27b-it-qat    # or gemma3:27b
MODEL_QUANTITATIVE_MODEL=gemma3:27b-it-qat   # or gemma3:27b
MODEL_EMBEDDING_MODEL=qwen3-embedding:8b

# Backends (chat vs embeddings)
LLM_BACKEND=ollama
EMBEDDING_BACKEND=huggingface

# Few-shot retrieval (Appendix D optimal)
EMBEDDING_DIMENSION=4096
EMBEDDING_CHUNK_SIZE=8
EMBEDDING_TOP_K_REFERENCES=2

# Reference embeddings selection (NPZ + JSON sidecar)
# Default: FP16 HuggingFace embeddings (paper-train)
EMBEDDING_EMBEDDINGS_FILE=huggingface_qwen3_8b_paper_train_participant_only
# Transcript source must match how embeddings were built
DATA_TRANSCRIPTS_DIR=data/transcripts_participant_only

# Alternative: legacy Ollama embeddings (paper-train)
# EMBEDDING_EMBEDDINGS_FILE=paper_reference_embeddings
# DATA_EMBEDDINGS_PATH=/absolute/or/relative/path/to/artifact.npz  # full-path override

# Chunk scoring (Spec 35; requires {name}.chunk_scores.json sidecar)
EMBEDDING_REFERENCE_SCORE_SOURCE=chunk

# Feedback loop (Section 2.3.1)
FEEDBACK_MAX_ITERATIONS=10
FEEDBACK_SCORE_THRESHOLD=3

# Appendix F (optional): use official MedGemma via HuggingFace backend
# LLM_BACKEND=huggingface
# MODEL_QUANTITATIVE_MODEL=medgemma:27b
```

See the Configuration Reference for all options.
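The repo reads these variables through Pydantic v2 settings (`src/ai_psychiatrist/config.py`); as a backend-free illustration of the defaulting behavior (the function name and dict layout below are hypothetical, not the repo's actual settings class), the retrieval knobs resolve like this:

```python
import os

def load_embedding_config(env=os.environ):
    """Resolve the few-shot retrieval knobs, falling back to Appendix D defaults."""
    return {
        "dimension": int(env.get("EMBEDDING_DIMENSION", "4096")),
        "chunk_size": int(env.get("EMBEDDING_CHUNK_SIZE", "8")),
        "top_k": int(env.get("EMBEDDING_TOP_K_REFERENCES", "2")),
    }

# Defaults apply when no .env or environment overrides are present
print(load_embedding_config({}))
# → {'dimension': 4096, 'chunk_size': 8, 'top_k': 2}
```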
From the paper:
- Quantitative (PHQ-8 item scoring, 0–3 per item): MAE 0.796 (zero-shot) vs 0.619 (few-shot)
- Appendix F (optional): MedGemma 27B MAE 0.505, with lower coverage (“fewer predictions overall”)
- Meta-review (binary classification): 78% accuracy (comparable to the human expert)
Note: The MAE values are item-level (per PHQ-8 item) and exclude items marked “N/A”.
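Concretely, item-level MAE with N/A exclusion can be computed as below (a minimal sketch, assuming abstentions are represented as `None`; the repo's metrics code is not shown here):

```python
def item_level_mae(pred, truth):
    """MAE over PHQ-8 items (0-3 each), skipping items the model abstained on.

    Abstentions (None) are excluded from the mean, so MAE must be read
    alongside coverage: a model that answers fewer items can look better.
    """
    pairs = [(p, t) for p, t in zip(pred, truth) if p is not None]
    return sum(abs(p - t) for p, t in pairs) / len(pairs)

# 8 PHQ-8 items; the model abstains (N/A) on two of them
pred  = [1, None, 2, 0, 3, None, 1, 2]
truth = [1, 2,    1, 0, 3, 1,    2, 2]
print(item_level_mae(pred, truth))  # MAE over the 6 scored items (coverage 6/8)
```

This coverage sensitivity is why the MedGemma result above (lower MAE, "fewer predictions overall") is reported with coverage-adjusted metrics in this repo.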
| Tool | Purpose |
|---|---|
| uv | Package management |
| Ollama | Local LLM inference |
| FastAPI | REST API |
| Pydantic v2 | Configuration & validation |
| structlog | Structured logging |
| pytest | Testing |
| Ruff | Linting & formatting |
| mypy | Type checking |
Licensed under Apache 2.0. See LICENSE and NOTICE.
This repository is a clean-room, production-grade reimplementation of the paper’s method. It does not distribute the DAIC-WOZ dataset. The original research code is referenced in the paper under “Data and Code Availability”.
- Fork the repository
- Create a feature branch
- Make changes following CLAUDE.md guidelines
- Run `make ci` to verify
- Submit a pull request