feat(rlm): multimodal media, budget controls, multi-model routing, LocalInterpreter, depth>1, GEPA resilience#9295

Open
rawwerks wants to merge 15 commits into stanfordnlp:main from rawwerks:feat/rlm-media-types-protocol

Conversation

@rawwerks

Making RLM production-ready: multimodal media, budget controls, multi-model routing, LocalInterpreter, depth>1, and GEPA compatibility

Closes #9289

Summary

Seven additive features that came from using RLM + GEPA on real multimodal tasks. Each addresses a failure mode hit in practice. All changes are backwards-compatible — default behavior is unchanged.

+3,115 lines across 10 files. 206 tests pass (0 failures, 38 deno-skips). Rebased on current main.

The seven changes

1. Multimodal media support via the dspy.Type sandbox protocol

Problem: RLM can only work with text. Audio/Image inputs are invisible to the agent.

Solution: Implements the to_sandbox() / rlm_preview() / sandbox_setup() / sandbox_assignment() protocol (from #9283) on Audio and Image. Any dspy.Type implementing this protocol is automatically:

  • Detected as a multimodal input field (_detect_multimodal_fields)
  • Injected into the sandbox via to_sandbox() (descriptor string for media)
  • Previewed via rlm_preview() in the agent's variable context
  • Available for llm_query_with_media(prompt, *media_var_names) which sends the actual content parts to the sub-LLM

Also adds _wrap_rlm_inputs() to auto-wrap raw values into their annotated dspy.Type, and _inject_pending_vars() for unified sandbox injection — both patterns from #9283.

Documents the protocol on the Type base class so future type authors know what to implement for RLM support.

class DescribeImage(dspy.Signature):
    image: dspy.Image = dspy.InputField()
    description: str = dspy.OutputField()

rlm = dspy.RLM(DescribeImage)
result = rlm(image=dspy.Image("photo.jpg"))
# Agent code: text = llm_query_with_media("Describe this", "image")

2. Multi-model sub-call routing via sub_lms

Problem: Some tasks benefit from cheap exploration + strong verification. RLM only supports a single sub_lm.

Solution: sub_lms={"flash": lm_fast, "pro": lm_pro} — sandbox code selects a model with llm_query(prompt, model="flash"). Falls back to sub_lm, then to dspy.settings.lm.
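The three-tier fallback can be sketched as a small resolver (a minimal illustration — the helper name and signature are assumptions; the actual _resolve_lm in rlm.py may differ):

```python
def resolve_lm(model_name, sub_lms, sub_lm, default_lm):
    """Pick the LM for a sub-call: named model -> sub_lm -> global default."""
    if model_name is not None:
        if model_name not in sub_lms:
            raise ValueError(f"Unknown model {model_name!r}; known: {sorted(sub_lms)}")
        return sub_lms[model_name]
    # No explicit model requested: fall back to sub_lm, then the default LM.
    return sub_lm if sub_lm is not None else default_lm
```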

3. LocalInterpreter — unsandboxed host-process execution

Problem: Deno/Pyodide sandbox can't access host packages (PIL, numpy, soundfile, etc.).

Solution: New LocalInterpreter (175 lines) implements CodeInterpreter via exec(). State persists, tools injected, SUBMIT() identical. Intentionally unsandboxed for local experiments.

from dspy.primitives.local_interpreter import LocalInterpreter
rlm = dspy.RLM(MySignature, interpreter=LocalInterpreter())
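The core mechanism — a single namespace reused across exec() calls, with stdout captured — can be sketched in a few lines (an illustration of the approach only, not the actual LocalInterpreter implementation):

```python
import contextlib
import io

class MiniLocalInterpreter:
    """Toy exec()-based interpreter with persistent state across calls."""

    def __init__(self, tools=None):
        # One namespace reused for every execute() call -> state persists.
        self.namespace = dict(tools or {})

    def execute(self, code):
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.namespace)  # intentionally unsandboxed
        return buf.getvalue()

interp = MiniLocalInterpreter()
interp.execute("x = 21")
out = interp.execute("print(x * 2)")  # x survives from the prior call
```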

4. Budget awareness — budget(), max_time, max_cost

Problem: Agents burn through iterations blindly. A single runaway example can blow your API budget in optimization loops.

Solution:

  • max_time=120 — wall-clock limit, triggers extract fallback (not crash)
  • max_cost=0.50 — dollar limit via litellm cost + BYOK upstream_inference_cost
  • budget() — sandbox tool showing remaining resources with low-resource warnings
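The budget() closure reading from a shared execution-state dict might look roughly like this (field and parameter names here are illustrative assumptions, not the PR's exact internals):

```python
import time

def make_budget_tool(state, max_iterations, max_llm_calls, max_time):
    """Build a budget() tool over a mutable state dict shared with forward()."""
    def budget():
        lines = [
            f"iterations: {max_iterations - state['iteration']} remaining",
            f"llm_calls: {max_llm_calls - state['llm_calls']} remaining",
        ]
        if max_time is not None:
            left = max_time - (time.monotonic() - state["start_time"])
            line = f"time: {left:.0f}s remaining"
            # Prefix a warning when under 20% of the allowance remains.
            lines.append("⚠ LOW: " + line if left < 0.2 * max_time else line)
        return "\n".join(lines)
    return budget

state = {"iteration": 3, "llm_calls": 10, "start_time": time.monotonic()}
report = make_budget_tool(state, 15, 30, None)()
```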

5. GEPA bootstrap_trace resilience

Problem: RLM timeouts/cost overruns raise exceptions that bootstrap_trace_data doesn't catch, losing all trace data.

Solution: Broad except Exception handler preserves partial trace as FailedPrediction (+7 lines in bootstrap_trace.py). KeyboardInterrupt is NOT caught.
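The resilience pattern is roughly the following (a simplified stand-in — the real handler wraps trace capture in bootstrap_trace.py and returns a FailedPrediction, not a dict):

```python
def run_example(program, example, trace):
    """Run one example; on failure, preserve whatever trace accumulated."""
    try:
        return program(example)
    except KeyboardInterrupt:
        raise  # user interrupts are never swallowed
    except Exception as e:
        # Partial trace survives so GEPA can still reflect on it.
        return {"failed": True, "error": str(e), "trace": list(trace)}

trace = []

def flaky(example):
    trace.append(("llm_call", example))
    raise RuntimeError("max_cost exceeded")

result = run_example(flaky, "q1", trace)
```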

6. Depth > 1 — recursive subcalls

Problem: llm_query() makes plain LM completions — no tools, no iteration. Sub-LLMs miss information in long contexts ("lost in the middle").

Solution: max_depth=2 makes llm_query() spawn a child RLM with its own REPL. Child can write code, call its own llm_query(), and iterate before SUBMITting.

rlm = dspy.RLM(
    MySignature,
    interpreter=LocalInterpreter(),
    max_depth=2,          # child RLMs get own REPLs
    max_iterations=15,    # per-level
    max_llm_calls=30,     # per-level (independent counters)
    max_time=300,         # remaining propagated to children
    max_cost=2.00,        # remaining propagated to children
)

Budget propagation: children get remaining time/cost, own iteration/call counters. Interpreter type matches parent (LocalInterpreter→LocalInterpreter, PythonInterpreter→PythonInterpreter). Tested E2E: 5/5 on LongMemEval_S with Gemini 3 Flash. Adapted from vanilla RLM depth>1.

7. Type protocol documentation on base class

Adds protocol comments to dspy.Type base class documenting the four methods for RLM sandbox opt-in, matching the pattern established by #9283.
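Because detection is duck-typed, any object exposing the four methods opts in. A standalone sketch (this class is an illustrative stand-in, not a real dspy.Type subclass, and the detection helper mirrors the hasattr-based check described in the implementation notes):

```python
class MyAudioClip:
    """Hypothetical custom type implementing the RLM sandbox protocol."""

    def __init__(self, path):
        self.path = path

    def to_sandbox(self):
        # Media can't be serialized into the sandbox: send a descriptor string.
        return f"<AudioClip path={self.path!r}>"

    def rlm_preview(self):
        # Shown in the agent's variable context.
        return f"AudioClip({self.path}); pass its variable name to llm_query_with_media()"

    def sandbox_setup(self):
        return ""  # no extra setup code needed for a plain descriptor

    def sandbox_assignment(self, var_name):
        return f"{var_name} = {self.to_sandbox()!r}"

def has_rlm_support(obj):
    # Generic protocol check instead of a hardcoded type list.
    return all(hasattr(obj, m) for m in
               ("to_sandbox", "rlm_preview", "sandbox_setup", "sandbox_assignment"))
```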

File-by-file

| File | Lines | What |
| --- | --- | --- |
| dspy/adapters/types/audio.py | +21 | rlm_preview, to_sandbox, sandbox_setup, sandbox_assignment |
| dspy/adapters/types/image.py | +25 | Same protocol methods |
| dspy/adapters/types/base_type.py | +10 | Protocol documentation |
| dspy/predict/rlm.py | +567 | All features: media protocol, sub_lms, budget, cost, depth>1, _subcall, _wrap_rlm_inputs |
| dspy/primitives/local_interpreter.py | +175 | New: unsandboxed exec()-based interpreter |
| dspy/primitives/python_interpreter.py | +76/−40 | Generic _has_rlm_support + to_sandbox injection |
| dspy/teleprompt/bootstrap_trace.py | +7 | GEPA exception handler |
| tests/predict/test_rlm.py | +1,792 | All feature tests + depth>1 tests |
| tests/primitives/test_local_interpreter.py | +312 | LocalInterpreter unit tests |
| tests/primitives/test_media_serialization.py | +170 | Type protocol tests (Audio, Image, custom types) |

How they compose

rlm = dspy.RLM(
    MyMultimodalSignature,
    sub_lms={"pro": lm_pro},        # multi-model routing
    max_time=120, max_cost=0.50,     # budget controls
    max_llm_calls=25,
    max_depth=2,                     # recursive subcalls
    interpreter=LocalInterpreter(),  # full Python access
)

# GEPA can optimize this without crashing on budget-exceeded examples
optimizer = dspy.GEPA(metric=my_metric, max_rounds=5)
optimized = optimizer.compile(rlm, trainset=examples)

Test summary

206 passed, 0 failures, 38 skipped (skips are pre-existing Deno-dependent tests)

| Test Area | Count |
| --- | --- |
| Multimodal detection & registry (types protocol) | 8 |
| llm_query_with_media validation | 5 |
| Media instructions | 2 |
| Media content construction (LM calls) | 4 |
| sub_lms routing | 8 |
| Budget/cost tracking | 12 |
| LocalInterpreter | 33 |
| Type protocol (Audio, Image, custom) | 21 |
| GEPA resilience | 3 |
| Depth > 1 (init, propagation, E2E) | 30 |
| _wrap_rlm_inputs / auto-wrapping | 2 |

Relationship to #9283

The multimodal media support adopts the same dspy.Type sandbox protocol that #9283 introduces for DataFrame. We implemented to_sandbox() / rlm_preview() / sandbox_setup() / sandbox_assignment() on Audio and Image, and use the same generic _has_rlm_support() detection in PythonInterpreter. This means any future dspy.Type with the protocol works automatically with RLM — no RLM-specific code needed.

Implementation notes

  • All changes additive — no breaking changes to existing behavior
  • Default values preserve current behavior (max_time=None, max_cost=None, sub_lms={}, max_depth=1, default PythonInterpreter)
  • Generic type detection: uses protocol checks (hasattr(annotation, 'to_sandbox')) not hardcoded type lists
  • All files pass ruff check

rawwerks and others added 15 commits February 11, 2026 11:23
Add llm_query_with_media() tool that lets RLM's sandboxed code send
Audio and Image inputs to sub-LLM calls for multimodal reasoning.

Changes:
- python_interpreter.py: Handle Audio/Image in _serialize_value() and
  _to_json_compatible() by converting to descriptor strings
- rlm.py: Add media registry that captures Audio/Image inputs, new
  llm_query_with_media(prompt, *media_var_names) tool, and dynamic
  instruction generation when media fields are detected

Media objects can't be serialized into the Deno/WASM sandbox, so they
are stored in a registry and referenced by variable name. The sandbox
sees descriptor strings; actual media data flows through
llm_query_with_media() to the sub-LLM as proper multimodal content.

Tested with Gemini 3 Flash (via OpenRouter) for both audio
transcription and image description tasks.
- Fix ruff lint errors: import sorting, unused loop var, list()[0] → next(iter())
- Add 15 unit tests for RLM media detection, registry, llm_query_with_media
- Add 12 unit tests for python_interpreter media serialization helpers
- All 78 unit tests pass (37 deno tests skipped as expected)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable sandbox code to select different LMs for sub-queries by passing
model='name' to llm_query, llm_query_batched, and llm_query_with_media.
This matches the upstream RLM design where registered models can be
chosen at call time within the REPL.

- Add sub_lms: dict[str, dspy.LM] parameter to RLM.__init__
- Add _resolve_lm() helper with 3-tier routing: named → sub_lm → default
- Add model parameter to all llm_query* tool functions
- Auto-document available model names in action instructions template
- Shared call counter across all model choices

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add LocalInterpreter as an alternative to PythonInterpreter (Deno/Pyodide)
for RLM code execution. Executes code directly in the host Python process
via exec(), giving RLM agents access to any installed package (PIL, numpy,
soundfile, etc.) without WASM sandbox restrictions.

Implements the CodeInterpreter protocol: start(), execute(), shutdown(),
tools property. State persists across execute() calls. SUBMIT() maps
positional args to output_fields (matching PythonInterpreter behavior).

Usage:
    from dspy.primitives.local_interpreter import LocalInterpreter
    rlm = dspy.RLM(sig, interpreter=LocalInterpreter())

Includes 33 unit tests covering execution, state persistence, variable
injection, error handling, SUBMIT/FinalOutput, tools, and context manager.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The agent can now call budget() from the REPL to check remaining
iterations, LLM calls, and wall-clock time. This enables cost-aware
recursive strategies — the agent can decide to use cheaper/fewer
sub-queries as resources dwindle.

Changes:
- Add max_time parameter: optional wall-clock seconds limit per
  forward() call. Gracefully falls back to extract when exceeded
  (no exception, just early termination).
- Add budget() tool: injected alongside llm_query/llm_query_batched,
  returns human-readable summary of remaining resources.
- Track execution state (start_time, current iteration) via mutable
  dict shared between forward() loop and budget() closure.
- Update ACTION_INSTRUCTIONS_TEMPLATE to mention budget().
- Add 'budget' to reserved tool names.
- Mirror all changes in aforward() for async path.

13 new tests covering: initialization, budget output format, iteration
tracking, LLM call tracking, time tracking, reserved name, action
instructions, timeout fallback, and end-to-end budget updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extends the budget system with actual dollar cost tracking:

- Add max_cost parameter: optional dollar limit per forward() call.
  Tracked via litellm's per-call cost reporting in lm.history entries.
  Gracefully falls back to extract when exceeded.
- budget() now reports: iterations, LLM calls, time, cost, and tokens.
  Cost is computed by summing lm.history entries added since tool
  creation (snapshot offsets at start of each forward() call).
- Cost enforcement in both forward() and aforward() loops.
- Expose _get_cost_and_tokens via execution_state dict so the
  forward() loop can check cost without reaching into tool internals.

5 new tests for max_cost init, budget cost display, cost fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Sum both provider cost (litellm response_cost) and upstream inference
  cost (usage.cost_details.upstream_inference_cost) for accurate BYOK
  tracking (e.g. OpenRouter BYOK → Vertex with 5% markup)
- Add budget warnings when any resource drops below 20% remaining:
  iterations, LLM calls, time, or cost
- budget() output now prefixed with "⚠ LOW: ..." when resources are low

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ilures

- Rename _SubmitCalled → _SubmitCalledError (N818)
- Add noqa for SUBMIT() function name (N802, intentional API)
- Use X | Y union syntax in isinstance calls (UP038)
- Remove unused imports and variables (F401, F841)
- Sort imports (I001)
- Add zip strict=False (B905)
- Catch non-parse exceptions in bootstrap_trace so GEPA can
  reflect on partial traces from RLM timeout/crash/cost overrun
- RLM + LocalInterpreter integration (7 tests): forward, state persistence,
  tool access, stdlib imports, error recovery, max_time, aforward
- Media content construction (4 tests): audio/image content parts sent to LM,
  multiple media objects, model routing with media
- max_cost mid-run fallback (1 test): cost exceeded during iteration triggers
  extract fallback, not crash
- Async budget/time/cost (3 tests): aforward respects max_time, max_cost,
  budget() tool works in async path
- bootstrap_trace resilience (3 tests): RuntimeError captured as FailedPrediction,
  partial trace preserved on cost overrun, KeyboardInterrupt not swallowed
- LocalInterpreter output_fields setter (2 tests): post-init configuration,
  default single-output wrapping
When max_depth > 1, llm_query() spawns a child RLM with its own
LocalInterpreter REPL instead of making a plain LM completion call.
Each child manages its own iteration loop, tools, and budget.

Key design (mirrors vanilla RLM PR stanfordnlp#84 pattern):
- RLM._subcall() method spawns child RLM(signature='prompt -> response')
- Child gets fresh LocalInterpreter (isolated namespace, stdout capture)
- Budget propagation: remaining time/cost passed to child
- model= param selects child's sub_lm via resolve_lm closure
- User tools inherited by children
- Interpreter cleanup in finally block
- At leaf depth (depth >= max_depth - 1), falls back to plain LM call
- llm_query_batched runs children sequentially when recursive

Params added to RLM.__init__:
- depth: int = 0 (current recursion depth, 0-indexed)
- max_depth: int = 1 (max recursion depth, default=no recursion)

Tests: 24 new tests adapted from vanilla RLM test_subcall.py covering
parameter propagation, budget enforcement, interpreter isolation, and
end-to-end execution with DummyLM.
…test

Verifies that when the parent RLM uses PythonInterpreter (Deno sandbox),
llm_query with max_depth=2 correctly spawns a child RLM with its own
LocalInterpreter. The tool callback crosses the Deno→host JSON-RPC
boundary, _subcall runs on the host, and the child operates in an
isolated LocalInterpreter namespace.
- Reject max_depth < 1 and depth < 0 with ValueError at init
- Add async aforward() test for depth=2 (async parent, sync _subcall)
- All 185 tests pass with --deno flag
Child now inherits the parent's interpreter type:
- LocalInterpreter parent → LocalInterpreter child
- Default (PythonInterpreter/Deno) parent → PythonInterpreter child
- Custom interpreter → LocalInterpreter fallback (can't clone)

Previously child always got LocalInterpreter, which broke sandboxing
when parent used PythonInterpreter (Deno).

Tests updated: interpreter isolation tests specify parent type,
new test verifies PythonInterpreter child when parent uses default.
…port

Replace hardcoded Audio/Image media registry with the generic to_sandbox()
protocol from PR stanfordnlp#9283 (kmad's DataFrame type approach).

Changes:
- Add rlm_preview(), to_sandbox(), sandbox_setup(), sandbox_assignment()
  to Audio and Image types
- Replace _is_media/_media_descriptor with generic _has_rlm_support()
  in python_interpreter.py
- Replace _detect_media_fields/_build_media_registry with generic
  _detect_multimodal_fields/_build_multimodal_registry in rlm.py
- Add _wrap_rlm_inputs() for auto-wrapping raw values into dspy.Type
- Add _inject_pending_vars() for unified sandbox variable injection
- llm_query_with_media is now always available as a tool
- _build_variables uses rlm_preview() for better LLM context
- Update all tests to use new API names (multimodal_registry, etc.)
- Add tests for custom types implementing the protocol

Any dspy.Type with to_sandbox() + format() is now automatically:
1. Detected as multimodal input
2. Injected into sandbox via to_sandbox() protocol
3. Available for llm_query_with_media()
4. Previewed via rlm_preview() in variable context

204 tests pass (0 failures, 38 deno-skips).
…, protocol docs, max_output_chars

Incorporate changes from 07e3f39 and 61e81ac on PR stanfordnlp#9283:

- Restore head+tail truncation in REPLVariable and REPLEntry (was
  simplified to head-only, now shows first half + '...' + last half)
- Add REPLEntry.format_output() static method for verbose logging
- Put max_output_chars back on REPLHistory (threaded through append())
- Revert max_output_chars default to 10_000 (was 100_000)
- Simplify _format_output() to passthrough (truncation in REPLHistory)
- Add RLM sandbox protocol documentation to Type base class
- Update tests: head+tail assertions, REPLHistory threading, new tests

206 tests pass (0 failures, 38 deno-skips).
@okhat
Collaborator

okhat commented Feb 11, 2026

The hero we needed, not the hero we deserved; thanks @rawwerks !!

@rawwerks
Author

rawwerks commented Feb 11, 2026

@isaacbmiller - i think 1 & 7 probably go first (and together) since it looks like @kmad 's types are now in main. that seems like the next logical increment. do you agree?

@isaacbmiller
Collaborator

Thanks for the PR! Going through these changes one by one:

 1. Multimodal media support via the dspy.Type sandbox protocol
Problem: RLM can only work with text. Audio/Image inputs are invisible to the agent.

Awesome!
I want to combine (1) and (7) on this PR with @kmad’s (Note #9283 is not yet merged), as these changes are connected. Will follow up on that PR.

  2. Multi-model sub-call routing via sub_lms

I understand the desire to give the model flexibility in choosing a model. I don’t know if this change is currently worth the potential confusion that it will cause the model.

Do you have current usecases where inside of a single run you want to use multiple models?

  3. LocalInterpreter — unsandboxed host-process execution

Great! Can we add an extra flag to verify that the user knows they are doing something dangerous (see BaseModule.load for how we force the user to pass an extra flag)

Separately, you can install packages from pip inside of pyodide. Is there a package you need that you are unable to install inside of pyodide?

  4. Budget awareness — budget(), max_time, max_cost

These are great!
You could make the argument that max_time and budget should both be at a different level of abstraction in DSPy, but until we get there, this works.

If we do hit max_time or budget, should we throw or should we extract? I would lean towards throw, as if we had enough information to extract, the model would have hopefully called SUBMIT.

  5. GEPA bootstrap_trace resilience
    Problem: RLM timeouts/cost overruns raise exceptions that bootstrap_trace_data doesn't catch, losing all trace data.

I also do not fully understand this problem. Is it just that after doing (4), that breaks GEPA somehow? Can you explain more?

  6. Depth > 1 — recursive subcalls

I am very excited about this change too! I am hesitant about having the signature only be “prompt → response”, though.

IMO the stronger types via signatures is one of the main benefits of the DSPy version of RLMs.

  7. Type protocol documentation on base class
    See (1)

TLDR:

  • Will combine (1) and (7) in RLM Generic Type Support + Dataframe implementation #9283.
  • 3 and 4 are great changes. Let’s break them out as separate PRs and make sure to add the flag for security on local interpreter.
  • Need more information on (2). I am hesitant about adding this in currently for the sake of confusing the model with too much information.
  • (5) I need more info
  • (6) is very cool! lets discuss how to make it a fully fledged sub_rlm implementation with signatures.
