feat(rlm): multimodal media, budget controls, multi-model routing, LocalInterpreter, depth>1, GEPA resilience #9295
Add llm_query_with_media() tool that lets RLM's sandboxed code send Audio and Image inputs to sub-LLM calls for multimodal reasoning.

Changes:
- python_interpreter.py: Handle Audio/Image in _serialize_value() and _to_json_compatible() by converting to descriptor strings
- rlm.py: Add media registry that captures Audio/Image inputs, new llm_query_with_media(prompt, *media_var_names) tool, and dynamic instruction generation when media fields are detected

Media objects can't be serialized into the Deno/WASM sandbox, so they are stored in a registry and referenced by variable name. The sandbox sees descriptor strings; the actual media data flows through llm_query_with_media() to the sub-LLM as proper multimodal content.

Tested with Gemini 3 Flash (via OpenRouter) on both audio transcription and image description tasks.
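A minimal sketch of the registry pattern this commit describes: media stays on the host, and the sandbox only ever sees descriptor strings keyed by variable name. The names here (`MediaRegistry`, `register`, `resolve`) are illustrative, not the PR's actual API.

```python
class MediaRegistry:
    """Host-side store for media objects that cannot cross the sandbox boundary."""

    def __init__(self):
        self._items = {}

    def register(self, var_name, obj):
        # Keep the real object on the host; hand the sandbox a descriptor string.
        self._items[var_name] = obj
        return (f"<media object '{var_name}' ({type(obj).__name__}); "
                f"pass this variable name to llm_query_with_media()>")

    def resolve(self, *var_names):
        # Called by llm_query_with_media() to recover the real objects.
        missing = [n for n in var_names if n not in self._items]
        if missing:
            raise KeyError(f"unknown media variables: {missing}")
        return [self._items[n] for n in var_names]


registry = MediaRegistry()
descriptor = registry.register("clip", object())  # object() stands in for dspy.Audio
```

The key property is that `llm_query_with_media(prompt, "clip")` can resolve `"clip"` back to the real object and attach it as multimodal content, while sandbox code only manipulates the string.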
- Fix ruff lint errors: import sorting, unused loop var, list()[0] → next(iter())
- Add 15 unit tests for RLM media detection, registry, llm_query_with_media
- Add 12 unit tests for python_interpreter media serialization helpers
- All 78 unit tests pass (37 deno tests skipped as expected)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable sandbox code to select different LMs for sub-queries by passing model='name' to llm_query, llm_query_batched, and llm_query_with_media. This matches the upstream RLM design where registered models can be chosen at call time within the REPL.

- Add sub_lms: dict[str, dspy.LM] parameter to RLM.__init__
- Add _resolve_lm() helper with 3-tier routing: named → sub_lm → default
- Add model parameter to all llm_query* tool functions
- Auto-document available model names in action instructions template
- Shared call counter across all model choices

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
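The 3-tier routing described above (named model → sub_lm → default) can be sketched as a closure; `make_resolve_lm` and the string LM stand-ins are illustrative, not the PR's code.

```python
def make_resolve_lm(sub_lms, sub_lm, default_lm):
    """Return a resolver implementing: named -> sub_lm -> default."""

    def resolve_lm(model=None):
        if model is not None:
            # Tier 1: explicit name must exist in the registered dict.
            if model not in sub_lms:
                raise ValueError(f"unknown model {model!r}; available: {sorted(sub_lms)}")
            return sub_lms[model]
        # Tier 2: the single configured sub_lm, if any.  Tier 3: global default.
        return sub_lm if sub_lm is not None else default_lm

    return resolve_lm


resolve = make_resolve_lm({"flash": "lm_fast", "pro": "lm_pro"},
                          sub_lm=None, default_lm="lm_default")
```

Sandbox code would then call `llm_query(prompt, model="flash")`, which routes through this resolver before making the completion call.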
Add LocalInterpreter as an alternative to PythonInterpreter (Deno/Pyodide)
for RLM code execution. Executes code directly in the host Python process
via exec(), giving RLM agents access to any installed package (PIL, numpy,
soundfile, etc.) without WASM sandbox restrictions.
Implements the CodeInterpreter protocol: start(), execute(), shutdown(),
tools property. State persists across execute() calls. SUBMIT() maps
positional args to output_fields (matching PythonInterpreter behavior).
Usage:

    from dspy.primitives.local_interpreter import LocalInterpreter
    rlm = dspy.RLM(sig, interpreter=LocalInterpreter())
Includes 33 unit tests covering execution, state persistence, variable
injection, error handling, SUBMIT/FinalOutput, tools, and context manager.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
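A rough sketch of the exec()-based design described in this commit — persistent namespace across execute() calls, with SUBMIT() mapping positional args to output fields. `TinyLocalInterpreter` and `_Submit` are simplified stand-ins, not the PR's implementation.

```python
class _Submit(Exception):
    """Raised by SUBMIT() to unwind out of exec() with the final values."""

    def __init__(self, values):
        self.values = values


class TinyLocalInterpreter:
    def __init__(self, output_fields=("answer",)):
        self.output_fields = list(output_fields)
        # One namespace reused for every execute() call -> state persists.
        self.namespace = {"SUBMIT": self._submit}
        self.final = None

    def _submit(self, *args):
        raise _Submit(args)

    def execute(self, code):
        try:
            exec(code, self.namespace)  # runs directly in the host process
        except _Submit as s:
            # Map positional SUBMIT args onto the signature's output fields.
            self.final = dict(zip(self.output_fields, s.values))
        return self.final


interp = TinyLocalInterpreter()
interp.execute("x = 21")            # state persists into the next call
result = interp.execute("SUBMIT(x * 2)")
```

Note the trade-off this commit is explicit about: `exec()` gives full host access (PIL, numpy, soundfile), and therefore no sandboxing at all.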
The agent can now call budget() from the REPL to check remaining iterations, LLM calls, and wall-clock time. This enables cost-aware recursive strategies — the agent can decide to use cheaper/fewer sub-queries as resources dwindle.

Changes:
- Add max_time parameter: optional wall-clock seconds limit per forward() call. Gracefully falls back to extract when exceeded (no exception, just early termination).
- Add budget() tool: injected alongside llm_query/llm_query_batched, returns human-readable summary of remaining resources.
- Track execution state (start_time, current iteration) via mutable dict shared between forward() loop and budget() closure.
- Update ACTION_INSTRUCTIONS_TEMPLATE to mention budget().
- Add 'budget' to reserved tool names.
- Mirror all changes in aforward() for async path.

13 new tests covering: initialization, budget output format, iteration tracking, LLM call tracking, time tracking, reserved name, action instructions, timeout fallback, and end-to-end budget updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
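The shared-mutable-state pattern described above can be sketched like this: the forward() loop mutates a dict that the injected budget() closure reads. Names are illustrative assumptions, not the PR's code.

```python
import time


def make_budget_tool(max_iters, max_llm_calls, max_time, state):
    """Build a budget() closure over the loop's mutable execution state."""

    def budget():
        parts = [
            f"iterations: {max_iters - state['iteration']} remaining",
            f"LLM calls: {max_llm_calls - state['llm_calls']} remaining",
        ]
        if max_time is not None:
            elapsed = time.monotonic() - state["start_time"]
            parts.append(f"time: {max_time - elapsed:.1f}s remaining")
        return "; ".join(parts)

    return budget


# The forward() loop updates this dict in place; budget() sees the updates.
state = {"iteration": 3, "llm_calls": 7, "start_time": time.monotonic()}
budget = make_budget_tool(10, 50, None, state)
```

Because the closure holds a reference to the same dict the loop mutates, no plumbing is needed to keep the tool's view current.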
Extends the budget system with actual dollar cost tracking:
- Add max_cost parameter: optional dollar limit per forward() call. Tracked via litellm's per-call cost reporting in lm.history entries. Gracefully falls back to extract when exceeded.
- budget() now reports: iterations, LLM calls, time, cost, and tokens. Cost is computed by summing lm.history entries added since tool creation (snapshot offsets at start of each forward() call).
- Cost enforcement in both forward() and aforward() loops.
- Expose _get_cost_and_tokens via execution_state dict so the forward() loop can check cost without reaching into tool internals.

5 new tests for max_cost init, budget cost display, cost fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
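The snapshot-offset accounting described above can be sketched as follows. The history entry shape (`{"cost": ..., "usage": {...}}`) is an assumption for illustration; the real entries come from litellm via lm.history.

```python
def cost_and_tokens_since(history, offset):
    """Sum cost/tokens of history entries appended after the snapshot offset."""
    cost, tokens = 0.0, 0
    for entry in history[offset:]:
        cost += entry.get("cost") or 0.0          # per-call cost may be None
        usage = entry.get("usage") or {}
        tokens += usage.get("total_tokens", 0)
    return cost, tokens


history = [{"cost": 0.01, "usage": {"total_tokens": 100}}]  # pre-existing calls
offset = len(history)            # snapshot taken at the start of forward()
history.append({"cost": 0.02, "usage": {"total_tokens": 250}})  # call made this run
cost, tokens = cost_and_tokens_since(history, offset)
```

Snapshotting `len(history)` at the start of each forward() call means earlier runs against the same LM object are never double-counted.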
- Sum both provider cost (litellm response_cost) and upstream inference cost (usage.cost_details.upstream_inference_cost) for accurate BYOK tracking (e.g. OpenRouter BYOK → Vertex with 5% markup)
- Add budget warnings when any resource drops below 20% remaining: iterations, LLM calls, time, or cost
- budget() output now prefixed with "⚠ LOW: ..." when resources are low

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
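The 20%-remaining warning rule could look something like this; the function name and dict shapes are hypothetical.

```python
def low_resources(limits, used, threshold=0.2):
    """Return names of resources with less than `threshold` fraction remaining."""
    low = []
    for name, limit in limits.items():
        if limit is not None and (limit - used[name]) / limit < threshold:
            low.append(name)
    return low


# iterations: 1 of 10 left (10% < 20%) -> flagged; cost: 50% left -> fine
warnings = low_resources({"iterations": 10, "cost": 1.0},
                         {"iterations": 9, "cost": 0.5})
```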
…ilures

- Rename _SubmitCalled → _SubmitCalledError (N818)
- Add noqa for SUBMIT() function name (N802, intentional API)
- Use X | Y union syntax in isinstance calls (UP038)
- Remove unused imports and variables (F401, F841)
- Sort imports (I001)
- Add zip strict=False (B905)
- Catch non-parse exceptions in bootstrap_trace so GEPA can reflect on partial traces from RLM timeout/crash/cost overrun
- RLM + LocalInterpreter integration (7 tests): forward, state persistence, tool access, stdlib imports, error recovery, max_time, aforward
- Media content construction (4 tests): audio/image content parts sent to LM, multiple media objects, model routing with media
- max_cost mid-run fallback (1 test): cost exceeded during iteration triggers extract fallback, not crash
- Async budget/time/cost (3 tests): aforward respects max_time, max_cost, budget() tool works in async path
- bootstrap_trace resilience (3 tests): RuntimeError captured as FailedPrediction, partial trace preserved on cost overrun, KeyboardInterrupt not swallowed
- LocalInterpreter output_fields setter (2 tests): post-init configuration, default single-output wrapping
When max_depth > 1, llm_query() spawns a child RLM with its own LocalInterpreter REPL instead of making a plain LM completion call. Each child manages its own iteration loop, tools, and budget.

Key design (mirrors vanilla RLM PR stanfordnlp#84 pattern):
- RLM._subcall() method spawns child RLM(signature='prompt -> response')
- Child gets fresh LocalInterpreter (isolated namespace, stdout capture)
- Budget propagation: remaining time/cost passed to child
- model= param selects child's sub_lm via resolve_lm closure
- User tools inherited by children
- Interpreter cleanup in finally block
- At leaf depth (depth >= max_depth - 1), falls back to plain LM call
- llm_query_batched runs children sequentially when recursive

Params added to RLM.__init__:
- depth: int = 0 (current recursion depth, 0-indexed)
- max_depth: int = 1 (max recursion depth, default=no recursion)

Tests: 24 new tests adapted from vanilla RLM test_subcall.py covering parameter propagation, budget enforcement, interpreter isolation, and end-to-end execution with DummyLM.
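The leaf-vs-recursive routing described above can be sketched as a closure over the current depth; `plain_call` and `spawn_child` are stand-ins for the real completion call and child-RLM machinery.

```python
def make_llm_query(depth, max_depth, plain_call, spawn_child):
    """Route llm_query to a plain completion at leaf depth, else a child RLM."""

    def llm_query(prompt):
        if depth >= max_depth - 1:
            # Leaf: no further recursion allowed -> plain LM completion.
            return plain_call(prompt)
        # Recursive: spawn a child RLM one level deeper, with its own REPL.
        return spawn_child(prompt, depth + 1)

    return llm_query


leaf = make_llm_query(0, 1, lambda p: f"plain:{p}", lambda p, d: f"child@{d}:{p}")
recursive = make_llm_query(0, 2, lambda p: f"plain:{p}", lambda p, d: f"child@{d}:{p}")
```

With the default max_depth=1, depth 0 is already the leaf, which is why default behavior is unchanged.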
…test

Verifies that when the parent RLM uses PythonInterpreter (Deno sandbox), llm_query with max_depth=2 correctly spawns a child RLM with its own LocalInterpreter. The tool callback crosses the Deno→host JSON-RPC boundary, _subcall runs on the host, and the child operates in an isolated LocalInterpreter namespace.
- Reject max_depth < 1 and depth < 0 with ValueError at init
- Add async aforward() test for depth=2 (async parent, sync _subcall)
- All 185 tests pass with --deno flag
Child now inherits the parent's interpreter type:
- LocalInterpreter parent → LocalInterpreter child
- Default (PythonInterpreter/Deno) parent → PythonInterpreter child
- Custom interpreter → LocalInterpreter fallback (can't clone)

Previously the child always got LocalInterpreter, which broke sandboxing when the parent used PythonInterpreter (Deno).

Tests updated: interpreter isolation tests specify parent type; new test verifies PythonInterpreter child when parent uses default.
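The selection rule in this commit reduces to a small dispatch; the empty classes stand in for the real interpreters.

```python
class PythonInterpreter: ...   # stand-in for the Deno/Pyodide sandbox
class LocalInterpreter: ...    # stand-in for host-process exec()
class CustomInterpreter: ...   # stand-in for a user-provided interpreter


def child_interpreter_for(parent):
    """Give the child the same sandboxing level as the parent where possible."""
    if isinstance(parent, (PythonInterpreter, LocalInterpreter)):
        return type(parent)()      # fresh instance of the same type
    return LocalInterpreter()      # custom interpreters can't be cloned


child = child_interpreter_for(PythonInterpreter())
```

The point of the fix: a sandboxed parent no longer silently hands its children unsandboxed execution.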
…port

Replace the hardcoded Audio/Image media registry with the generic to_sandbox() protocol from PR stanfordnlp#9283 (kmad's DataFrame type approach).

Changes:
- Add rlm_preview(), to_sandbox(), sandbox_setup(), sandbox_assignment() to Audio and Image types
- Replace _is_media/_media_descriptor with generic _has_rlm_support() in python_interpreter.py
- Replace _detect_media_fields/_build_media_registry with generic _detect_multimodal_fields/_build_multimodal_registry in rlm.py
- Add _wrap_rlm_inputs() for auto-wrapping raw values into dspy.Type
- Add _inject_pending_vars() for unified sandbox variable injection
- llm_query_with_media is now always available as a tool
- _build_variables uses rlm_preview() for better LLM context
- Update all tests to use new API names (multimodal_registry, etc.)
- Add tests for custom types implementing the protocol

Any dspy.Type with to_sandbox() + format() is now automatically:
1. Detected as multimodal input
2. Injected into sandbox via to_sandbox() protocol
3. Available for llm_query_with_media()
4. Previewed via rlm_preview() in variable context

204 tests pass (0 failures, 38 deno-skips).
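A sketch of what a custom type opting into the protocol might look like. The method names (`rlm_preview`, `to_sandbox`) follow the commit message; the `Waveform` class and duck-typed detection helper are illustrative only.

```python
class Waveform:
    """Hypothetical user type opting into the RLM sandbox protocol."""

    def __init__(self, samples):
        self.samples = samples

    def rlm_preview(self):
        # Short preview shown in the agent's variable listing.
        return f"Waveform({len(self.samples)} samples)"

    def to_sandbox(self):
        # Value the sandbox actually receives: a descriptor string for media.
        return f"<Waveform: {len(self.samples)} samples; use llm_query_with_media>"


def has_rlm_support(obj):
    # Duck-typed detection, mirroring the generic _has_rlm_support() idea.
    return callable(getattr(obj, "to_sandbox", None))


w = Waveform([0.0, 0.5, -0.5])
```

Because detection is duck-typed rather than a hardcoded isinstance list, a type like this needs no RLM-specific registration.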
…, protocol docs, max_output_chars

Incorporate changes from 07e3f39 and 61e81ac on PR stanfordnlp#9283:
- Restore head+tail truncation in REPLVariable and REPLEntry (was simplified to head-only, now shows first half + '...' + last half)
- Add REPLEntry.format_output() static method for verbose logging
- Put max_output_chars back on REPLHistory (threaded through append())
- Revert max_output_chars default to 10_000 (was 100_000)
- Simplify _format_output() to passthrough (truncation in REPLHistory)
- Add RLM sandbox protocol documentation to Type base class
- Update tests: head+tail assertions, REPLHistory threading, new tests

206 tests pass (0 failures, 38 deno-skips).
The hero we needed, not the hero we deserved; thanks @rawwerks !!
@isaacbmiller - i think 1 & 7 probably go first (and together) since it looks like @kmad 's types are now in main. that seems like the next logical increment. do you agree?
Thanks for the PR! Going through these changes one by one:
Awesome!
I understand the desire to give the model flexibility in choosing a model. I don’t know if this change is currently worth the potential confusion it will cause the model. Do you have current use cases where, inside of a single run, you want to use multiple models?
Great! Can we add an extra flag to verify that the user knows they are doing something dangerous (see BaseModule.load for how we force the user to pass an extra flag)? Separately, you can install packages from pip inside of pyodide. Is there a package you need that you are unable to install inside of pyodide?
These are great! If we do hit max_time or budget, should we throw or should we extract? I would lean towards throw, as if we had enough information to extract, the model would have hopefully called SUBMIT.
I also do not fully understand this problem. Is it just that after doing (4), that breaks GEPA somehow? Can you explain more?
I am very excited about this change too! I am hesitant about having the signature only be “prompt → response”. IMO the stronger types via signatures are one of the main benefits of the DSPy version of RLMs.
TLDR:
Making RLM production-ready: multimodal media, budget controls, multi-model routing, LocalInterpreter, depth>1, and GEPA compatibility
Closes #9289
Summary
Seven additive features that came from using RLM + GEPA on real multimodal tasks. Each addresses a failure mode hit in practice. All changes are backwards-compatible — default behavior is unchanged.
+3,115 lines across 10 files. 206 tests pass (0 failures, 38 deno-skips). Rebased on current main.

The seven changes
1. Multimodal media support via the `dspy.Type` sandbox protocol

Problem: RLM can only work with text. Audio/Image inputs are invisible to the agent.

Solution: Implements the `to_sandbox()` / `rlm_preview()` / `sandbox_setup()` / `sandbox_assignment()` protocol (from #9283) on `Audio` and `Image`. Any `dspy.Type` implementing this protocol is automatically:
- detected as a multimodal input (`_detect_multimodal_fields`)
- injected into the sandbox via `to_sandbox()` (descriptor string for media)
- previewed via `rlm_preview()` in the agent's variable context
- available to `llm_query_with_media(prompt, *media_var_names)`, which sends the actual content parts to the sub-LLM

Also adds `_wrap_rlm_inputs()` to auto-wrap raw values into their annotated `dspy.Type`, and `_inject_pending_vars()` for unified sandbox injection — both patterns from #9283.

Documents the protocol on the `Type` base class so future type authors know what to implement for RLM support.

2. Multi-model sub-call routing via `sub_lms`

Problem: Some tasks benefit from cheap exploration + strong verification. RLM only supports a single `sub_lm`.

Solution: `sub_lms={"flash": lm_fast, "pro": lm_pro}` — sandbox code selects with `llm_query(prompt, model="flash")`. Falls back to `sub_lm` → `dspy.settings.lm`.

3. LocalInterpreter — unsandboxed host-process execution

Problem: Deno/Pyodide sandbox can't access host packages (PIL, numpy, soundfile, etc.).

Solution: New `LocalInterpreter` (175 lines) implements `CodeInterpreter` via `exec()`. State persists, tools injected, `SUBMIT()` identical. Intentionally unsandboxed for local experiments.

4. Budget awareness — `budget()`, `max_time`, `max_cost`

Problem: Agents burn through iterations blindly. A single runaway example can blow your API budget in optimization loops.

Solution:
- `max_time=120` — wall-clock limit, triggers extract fallback (not crash)
- `max_cost=0.50` — dollar limit via litellm cost + BYOK `upstream_inference_cost`
- `budget()` — sandbox tool showing remaining resources with low-resource warnings

5. GEPA bootstrap_trace resilience

Problem: RLM timeouts/cost overruns raise exceptions that `bootstrap_trace_data` doesn't catch, losing all trace data.

Solution: Broad `except Exception` handler preserves partial trace as `FailedPrediction` (+7 lines in `bootstrap_trace.py`). `KeyboardInterrupt` is NOT caught.

6. Depth > 1 — recursive subcalls

Problem: `llm_query()` makes plain LM completions — no tools, no iteration. Sub-LLMs miss information in long contexts ("lost in the middle").

Solution: `max_depth=2` makes `llm_query()` spawn a child RLM with its own REPL. The child can write code, call its own `llm_query()`, and iterate before SUBMITting.

Budget propagation: children get remaining time/cost and their own iteration/call counters. Interpreter type matches the parent (LocalInterpreter→LocalInterpreter, PythonInterpreter→PythonInterpreter). Tested E2E: 5/5 on LongMemEval_S with Gemini 3 Flash. Adapted from vanilla RLM depth>1.

7. Type protocol documentation on base class

Adds protocol comments to the `dspy.Type` base class documenting the four methods for RLM sandbox opt-in, matching the pattern established by #9283.

File-by-file
- `dspy/adapters/types/audio.py` — `rlm_preview`, `to_sandbox`, `sandbox_setup`, `sandbox_assignment`
- `dspy/adapters/types/image.py`
- `dspy/adapters/types/base_type.py`
- `dspy/predict/rlm.py`
- `dspy/primitives/local_interpreter.py`
- `dspy/primitives/python_interpreter.py` — `_has_rlm_support` + `to_sandbox` injection
- `dspy/teleprompt/bootstrap_trace.py`
- `tests/predict/test_rlm.py`
- `tests/primitives/test_local_interpreter.py`
- `tests/primitives/test_media_serialization.py`

How they compose
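An illustration of how the features might compose in a single constructor call, using the parameter names stated in this PR. The `RLM` function here is a stand-in that just records its arguments (the real call is `dspy.RLM`, whose exact signature may differ).

```python
def RLM(signature, interpreter=None, sub_lms=None,
        max_time=None, max_cost=None, max_depth=1):
    """Stand-in recording the PR's parameters; not the real dspy.RLM."""
    return {"signature": signature, "interpreter": interpreter,
            "sub_lms": sub_lms or {}, "max_time": max_time,
            "max_cost": max_cost, "max_depth": max_depth}


agent = RLM(
    "question, recording -> answer",      # 1: media input detected via protocol
    interpreter="LocalInterpreter()",     # 3: host-process execution
    sub_lms={"flash": "lm_fast"},         # 2: multi-model routing in llm_query
    max_time=120, max_cost=0.50,          # 4: budget limits + budget() tool
    max_depth=2,                          # 6: llm_query spawns child RLMs
)
```

Features 5 and 7 need no parameters here: GEPA resilience lives in bootstrap_trace, and the protocol docs live on the Type base class.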
Test summary
206 passed, 0 failures, 38 skipped (skips are pre-existing Deno-dependent tests)
Relationship to #9283
The multimodal media support adopts the same `dspy.Type` sandbox protocol that #9283 introduces for `DataFrame`. We implemented `to_sandbox()` / `rlm_preview()` / `sandbox_setup()` / `sandbox_assignment()` on `Audio` and `Image`, and use the same generic `_has_rlm_support()` detection in `PythonInterpreter`. This means any future `dspy.Type` with the protocol works automatically with RLM — no RLM-specific code needed.

Implementation notes
- Defaults preserve existing behavior (`max_time=None`, `max_cost=None`, `sub_lms={}`, `max_depth=1`, default `PythonInterpreter`)
- Duck-typed protocol detection (`hasattr(annotation, 'to_sandbox')`), not hardcoded type lists
- Passes `ruff check`