feat(rlm): multimodal media, budget controls, multi-model routing, LocalInterpreter, depth>1, GEPA resilience #9295
Add llm_query_with_media() tool that lets RLM's sandboxed code send Audio and Image inputs to sub-LLM calls for multimodal reasoning.

Changes:
- python_interpreter.py: Handle Audio/Image in _serialize_value() and _to_json_compatible() by converting to descriptor strings
- rlm.py: Add media registry that captures Audio/Image inputs, new llm_query_with_media(prompt, *media_var_names) tool, and dynamic instruction generation when media fields are detected

Media objects can't be serialized into the Deno/WASM sandbox, so they are stored in a registry and referenced by variable name. The sandbox sees descriptor strings; the actual media data flows through llm_query_with_media() to the sub-LLM as proper multimodal content.

Tested with Gemini 3 Flash (via OpenRouter) on both audio transcription and image description tasks.
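A minimal sketch of the registry pattern this commit describes: media stays on the host, and the sandbox only ever sees descriptor strings keyed by variable name. The names here (`MediaRegistry`, `register`, `resolve`) are illustrative, not the PR's actual API.

```python
class MediaRegistry:
    """Host-side store for media objects that cannot cross the sandbox boundary."""

    def __init__(self):
        self._items = {}

    def register(self, var_name, obj):
        # Keep the real object on the host; hand the sandbox a descriptor string.
        self._items[var_name] = obj
        return (f"<media object '{var_name}' ({type(obj).__name__}); "
                f"pass this variable name to llm_query_with_media()>")

    def resolve(self, *var_names):
        # Called by llm_query_with_media() to recover the real objects.
        missing = [n for n in var_names if n not in self._items]
        if missing:
            raise KeyError(f"unknown media variables: {missing}")
        return [self._items[n] for n in var_names]


registry = MediaRegistry()
descriptor = registry.register("clip", object())  # object() stands in for dspy.Audio
```

The key property is that `llm_query_with_media(prompt, "clip")` can resolve `"clip"` back to the real object and attach it as multimodal content, while sandbox code only manipulates the string.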
- Fix ruff lint errors: import sorting, unused loop var, list()[0] → next(iter())
- Add 15 unit tests for RLM media detection, registry, llm_query_with_media
- Add 12 unit tests for python_interpreter media serialization helpers
- All 78 unit tests pass (37 deno tests skipped as expected)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable sandbox code to select different LMs for sub-queries by passing model='name' to llm_query, llm_query_batched, and llm_query_with_media. This matches the upstream RLM design where registered models can be chosen at call time within the REPL.

- Add sub_lms: dict[str, dspy.LM] parameter to RLM.__init__
- Add _resolve_lm() helper with 3-tier routing: named → sub_lm → default
- Add model parameter to all llm_query* tool functions
- Auto-document available model names in action instructions template
- Shared call counter across all model choices

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
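The 3-tier routing described above (named model → sub_lm → default) can be sketched as a closure; `make_resolve_lm` and the string LM stand-ins are illustrative, not the PR's code.

```python
def make_resolve_lm(sub_lms, sub_lm, default_lm):
    """Return a resolver implementing: named -> sub_lm -> default."""

    def resolve_lm(model=None):
        if model is not None:
            # Tier 1: explicit name must exist in the registered dict.
            if model not in sub_lms:
                raise ValueError(f"unknown model {model!r}; available: {sorted(sub_lms)}")
            return sub_lms[model]
        # Tier 2: the single configured sub_lm, if any.  Tier 3: global default.
        return sub_lm if sub_lm is not None else default_lm

    return resolve_lm


resolve = make_resolve_lm({"flash": "lm_fast", "pro": "lm_pro"},
                          sub_lm=None, default_lm="lm_default")
```

Sandbox code would then call `llm_query(prompt, model="flash")`, which routes through this resolver before making the completion call.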
Add LocalInterpreter as an alternative to PythonInterpreter (Deno/Pyodide)
for RLM code execution. Executes code directly in the host Python process
via exec(), giving RLM agents access to any installed package (PIL, numpy,
soundfile, etc.) without WASM sandbox restrictions.
Implements the CodeInterpreter protocol: start(), execute(), shutdown(),
tools property. State persists across execute() calls. SUBMIT() maps
positional args to output_fields (matching PythonInterpreter behavior).
Usage:

    from dspy.primitives.local_interpreter import LocalInterpreter
    rlm = dspy.RLM(sig, interpreter=LocalInterpreter())
Includes 33 unit tests covering execution, state persistence, variable
injection, error handling, SUBMIT/FinalOutput, tools, and context manager.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
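A rough sketch of the exec()-based design described in this commit — persistent namespace across execute() calls, with SUBMIT() mapping positional args to output fields. `TinyLocalInterpreter` and `_Submit` are simplified stand-ins, not the PR's implementation.

```python
class _Submit(Exception):
    """Raised by SUBMIT() to unwind out of exec() with the final values."""

    def __init__(self, values):
        self.values = values


class TinyLocalInterpreter:
    def __init__(self, output_fields=("answer",)):
        self.output_fields = list(output_fields)
        # One namespace reused for every execute() call -> state persists.
        self.namespace = {"SUBMIT": self._submit}
        self.final = None

    def _submit(self, *args):
        raise _Submit(args)

    def execute(self, code):
        try:
            exec(code, self.namespace)  # runs directly in the host process
        except _Submit as s:
            # Map positional SUBMIT args onto the signature's output fields.
            self.final = dict(zip(self.output_fields, s.values))
        return self.final


interp = TinyLocalInterpreter()
interp.execute("x = 21")            # state persists into the next call
result = interp.execute("SUBMIT(x * 2)")
```

Note the trade-off this commit is explicit about: `exec()` gives full host access (PIL, numpy, soundfile), and therefore no sandboxing at all.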
The agent can now call budget() from the REPL to check remaining iterations, LLM calls, and wall-clock time. This enables cost-aware recursive strategies — the agent can decide to use cheaper/fewer sub-queries as resources dwindle.

Changes:
- Add max_time parameter: optional wall-clock seconds limit per forward() call. Gracefully falls back to extract when exceeded (no exception, just early termination).
- Add budget() tool: injected alongside llm_query/llm_query_batched, returns human-readable summary of remaining resources.
- Track execution state (start_time, current iteration) via mutable dict shared between forward() loop and budget() closure.
- Update ACTION_INSTRUCTIONS_TEMPLATE to mention budget().
- Add 'budget' to reserved tool names.
- Mirror all changes in aforward() for async path.

13 new tests covering: initialization, budget output format, iteration tracking, LLM call tracking, time tracking, reserved name, action instructions, timeout fallback, and end-to-end budget updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
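The shared-mutable-state pattern described above can be sketched like this: the forward() loop mutates a dict that the injected budget() closure reads. Names are illustrative assumptions, not the PR's code.

```python
import time


def make_budget_tool(max_iters, max_llm_calls, max_time, state):
    """Build a budget() closure over the loop's mutable execution state."""

    def budget():
        parts = [
            f"iterations: {max_iters - state['iteration']} remaining",
            f"LLM calls: {max_llm_calls - state['llm_calls']} remaining",
        ]
        if max_time is not None:
            elapsed = time.monotonic() - state["start_time"]
            parts.append(f"time: {max_time - elapsed:.1f}s remaining")
        return "; ".join(parts)

    return budget


# The forward() loop updates this dict in place; budget() sees the updates.
state = {"iteration": 3, "llm_calls": 7, "start_time": time.monotonic()}
budget = make_budget_tool(10, 50, None, state)
```

Because the closure holds a reference to the same dict the loop mutates, no plumbing is needed to keep the tool's view current.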
Extends the budget system with actual dollar cost tracking:
- Add max_cost parameter: optional dollar limit per forward() call. Tracked via litellm's per-call cost reporting in lm.history entries. Gracefully falls back to extract when exceeded.
- budget() now reports: iterations, LLM calls, time, cost, and tokens. Cost is computed by summing lm.history entries added since tool creation (snapshot offsets at start of each forward() call).
- Cost enforcement in both forward() and aforward() loops.
- Expose _get_cost_and_tokens via execution_state dict so the forward() loop can check cost without reaching into tool internals.

5 new tests for max_cost init, budget cost display, cost fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
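The snapshot-offset accounting described above can be sketched as follows. The history entry shape (`{"cost": ..., "usage": {...}}`) is an assumption for illustration; the real entries come from litellm via lm.history.

```python
def cost_and_tokens_since(history, offset):
    """Sum cost/tokens of history entries appended after the snapshot offset."""
    cost, tokens = 0.0, 0
    for entry in history[offset:]:
        cost += entry.get("cost") or 0.0          # per-call cost may be None
        usage = entry.get("usage") or {}
        tokens += usage.get("total_tokens", 0)
    return cost, tokens


history = [{"cost": 0.01, "usage": {"total_tokens": 100}}]  # pre-existing calls
offset = len(history)            # snapshot taken at the start of forward()
history.append({"cost": 0.02, "usage": {"total_tokens": 250}})  # call made this run
cost, tokens = cost_and_tokens_since(history, offset)
```

Snapshotting `len(history)` at the start of each forward() call means earlier runs against the same LM object are never double-counted.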
- Sum both provider cost (litellm response_cost) and upstream inference cost (usage.cost_details.upstream_inference_cost) for accurate BYOK tracking (e.g. OpenRouter BYOK → Vertex with 5% markup)
- Add budget warnings when any resource drops below 20% remaining: iterations, LLM calls, time, or cost
- budget() output now prefixed with "⚠ LOW: ..." when resources are low

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
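The 20%-remaining warning rule could look something like this; the function name and dict shapes are hypothetical.

```python
def low_resources(limits, used, threshold=0.2):
    """Return names of resources with less than `threshold` fraction remaining."""
    low = []
    for name, limit in limits.items():
        if limit is not None and (limit - used[name]) / limit < threshold:
            low.append(name)
    return low


# iterations: 1 of 10 left (10% < 20%) -> flagged; cost: 50% left -> fine
warnings = low_resources({"iterations": 10, "cost": 1.0},
                         {"iterations": 9, "cost": 0.5})
```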
…ilures

- Rename _SubmitCalled → _SubmitCalledError (N818)
- Add noqa for SUBMIT() function name (N802, intentional API)
- Use X | Y union syntax in isinstance calls (UP038)
- Remove unused imports and variables (F401, F841)
- Sort imports (I001)
- Add zip strict=False (B905)
- Catch non-parse exceptions in bootstrap_trace so GEPA can reflect on partial traces from RLM timeout/crash/cost overrun
- RLM + LocalInterpreter integration (7 tests): forward, state persistence, tool access, stdlib imports, error recovery, max_time, aforward
- Media content construction (4 tests): audio/image content parts sent to LM, multiple media objects, model routing with media
- max_cost mid-run fallback (1 test): cost exceeded during iteration triggers extract fallback, not crash
- Async budget/time/cost (3 tests): aforward respects max_time, max_cost, budget() tool works in async path
- bootstrap_trace resilience (3 tests): RuntimeError captured as FailedPrediction, partial trace preserved on cost overrun, KeyboardInterrupt not swallowed
- LocalInterpreter output_fields setter (2 tests): post-init configuration, default single-output wrapping
When max_depth > 1, llm_query() spawns a child RLM with its own LocalInterpreter REPL instead of making a plain LM completion call. Each child manages its own iteration loop, tools, and budget.

Key design (mirrors vanilla RLM PR stanfordnlp#84 pattern):
- RLM._subcall() method spawns child RLM(signature='prompt -> response')
- Child gets fresh LocalInterpreter (isolated namespace, stdout capture)
- Budget propagation: remaining time/cost passed to child
- model= param selects child's sub_lm via resolve_lm closure
- User tools inherited by children
- Interpreter cleanup in finally block
- At leaf depth (depth >= max_depth - 1), falls back to plain LM call
- llm_query_batched runs children sequentially when recursive

Params added to RLM.__init__:
- depth: int = 0 (current recursion depth, 0-indexed)
- max_depth: int = 1 (max recursion depth, default=no recursion)

Tests: 24 new tests adapted from vanilla RLM test_subcall.py covering parameter propagation, budget enforcement, interpreter isolation, and end-to-end execution with DummyLM.
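The leaf-vs-recursive routing described above can be sketched as a closure over the current depth; `plain_call` and `spawn_child` are stand-ins for the real completion call and child-RLM machinery.

```python
def make_llm_query(depth, max_depth, plain_call, spawn_child):
    """Route llm_query to a plain completion at leaf depth, else a child RLM."""

    def llm_query(prompt):
        if depth >= max_depth - 1:
            # Leaf: no further recursion allowed -> plain LM completion.
            return plain_call(prompt)
        # Recursive: spawn a child RLM one level deeper, with its own REPL.
        return spawn_child(prompt, depth + 1)

    return llm_query


leaf = make_llm_query(0, 1, lambda p: f"plain:{p}", lambda p, d: f"child@{d}:{p}")
recursive = make_llm_query(0, 2, lambda p: f"plain:{p}", lambda p, d: f"child@{d}:{p}")
```

With the default max_depth=1, depth 0 is already the leaf, which is why default behavior is unchanged.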
…test

Verifies that when the parent RLM uses PythonInterpreter (Deno sandbox), llm_query with max_depth=2 correctly spawns a child RLM with its own LocalInterpreter. The tool callback crosses the Deno→host JSON-RPC boundary, _subcall runs on the host, and the child operates in an isolated LocalInterpreter namespace.
- Reject max_depth < 1 and depth < 0 with ValueError at init
- Add async aforward() test for depth=2 (async parent, sync _subcall)
- All 185 tests pass with --deno flag
Child now inherits the parent's interpreter type:
- LocalInterpreter parent → LocalInterpreter child
- Default (PythonInterpreter/Deno) parent → PythonInterpreter child
- Custom interpreter → LocalInterpreter fallback (can't clone)

Previously the child always got LocalInterpreter, which broke sandboxing when the parent used PythonInterpreter (Deno).

Tests updated: interpreter isolation tests specify parent type; new test verifies PythonInterpreter child when parent uses default.
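The selection rule in this commit reduces to a small dispatch; the empty classes stand in for the real interpreters.

```python
class PythonInterpreter: ...   # stand-in for the Deno/Pyodide sandbox
class LocalInterpreter: ...    # stand-in for host-process exec()
class CustomInterpreter: ...   # stand-in for a user-provided interpreter


def child_interpreter_for(parent):
    """Give the child the same sandboxing level as the parent where possible."""
    if isinstance(parent, (PythonInterpreter, LocalInterpreter)):
        return type(parent)()      # fresh instance of the same type
    return LocalInterpreter()      # custom interpreters can't be cloned


child = child_interpreter_for(PythonInterpreter())
```

The point of the fix: a sandboxed parent no longer silently hands its children unsandboxed execution.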
…port

Replace the hardcoded Audio/Image media registry with the generic to_sandbox() protocol from PR stanfordnlp#9283 (kmad's DataFrame type approach).

Changes:
- Add rlm_preview(), to_sandbox(), sandbox_setup(), sandbox_assignment() to Audio and Image types
- Replace _is_media/_media_descriptor with generic _has_rlm_support() in python_interpreter.py
- Replace _detect_media_fields/_build_media_registry with generic _detect_multimodal_fields/_build_multimodal_registry in rlm.py
- Add _wrap_rlm_inputs() for auto-wrapping raw values into dspy.Type
- Add _inject_pending_vars() for unified sandbox variable injection
- llm_query_with_media is now always available as a tool
- _build_variables uses rlm_preview() for better LLM context
- Update all tests to use new API names (multimodal_registry, etc.)
- Add tests for custom types implementing the protocol

Any dspy.Type with to_sandbox() + format() is now automatically:
1. Detected as multimodal input
2. Injected into sandbox via to_sandbox() protocol
3. Available for llm_query_with_media()
4. Previewed via rlm_preview() in variable context

204 tests pass (0 failures, 38 deno-skips).
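A sketch of what a custom type opting into the protocol might look like. The method names (`rlm_preview`, `to_sandbox`) follow the commit message; the `Waveform` class and duck-typed detection helper are illustrative only.

```python
class Waveform:
    """Hypothetical user type opting into the RLM sandbox protocol."""

    def __init__(self, samples):
        self.samples = samples

    def rlm_preview(self):
        # Short preview shown in the agent's variable listing.
        return f"Waveform({len(self.samples)} samples)"

    def to_sandbox(self):
        # Value the sandbox actually receives: a descriptor string for media.
        return f"<Waveform: {len(self.samples)} samples; use llm_query_with_media>"


def has_rlm_support(obj):
    # Duck-typed detection, mirroring the generic _has_rlm_support() idea.
    return callable(getattr(obj, "to_sandbox", None))


w = Waveform([0.0, 0.5, -0.5])
```

Because detection is duck-typed rather than a hardcoded isinstance list, a type like this needs no RLM-specific registration.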
…, protocol docs, max_output_chars

Incorporate changes from 07e3f39 and 61e81ac on PR stanfordnlp#9283:
- Restore head+tail truncation in REPLVariable and REPLEntry (was simplified to head-only, now shows first half + '...' + last half)
- Add REPLEntry.format_output() static method for verbose logging
- Put max_output_chars back on REPLHistory (threaded through append())
- Revert max_output_chars default to 10_000 (was 100_000)
- Simplify _format_output() to passthrough (truncation in REPLHistory)
- Add RLM sandbox protocol documentation to Type base class
- Update tests: head+tail assertions, REPLHistory threading, new tests

206 tests pass (0 failures, 38 deno-skips).
The hero we needed, not the hero we deserved; thanks @rawwerks !!
@isaacbmiller - i think 1 & 7 probably go first (and together) since it looks like @kmad 's types are now in main. that seems like the next logical increment. do you agree?
Thanks for the PR! Going through these changes one by one:
Awesome!
I understand the desire to give the model flexibility in choosing a model. I don’t know if this change is currently worth the potential confusion it will cause the model. Do you have current use cases where, inside of a single run, you want to use multiple models?
Great! Can we add an extra flag to verify that the user knows they are doing something dangerous (see BaseModule.load for how we force the user to pass an extra flag)? Separately, you can install packages from pip inside of pyodide. Is there a package you need that you are unable to install inside of pyodide?
These are great! If we do hit max_time or budget, should we throw or should we extract? I would lean towards throw, as if we had enough information to extract, the model would have hopefully called SUBMIT.
I also do not fully understand this problem. Is it just that after doing (4), that breaks GEPA somehow? Can you explain more?
I am very excited about this change too! I am hesitant about having the signature only be “prompt → response”. IMO the stronger types via signatures are one of the main benefits of the DSPy version of RLMs.
TLDR:
Making RLM production-ready: multimodal media, budget controls, multi-model routing, LocalInterpreter, depth>1, and GEPA compatibility
Closes #9289
Summary
Seven additive features that came from using RLM + GEPA on real multimodal tasks. Each addresses a failure mode hit in practice. All changes are backwards-compatible — default behavior is unchanged.
+3,115 lines across 10 files. 206 tests pass (0 failures, 38 deno-skips). Rebased on current main.

The seven changes
1. Multimodal media support via the `dspy.Type` sandbox protocol

Problem: RLM can only work with text. Audio/Image inputs are invisible to the agent.

Solution: Implements the `to_sandbox()` / `rlm_preview()` / `sandbox_setup()` / `sandbox_assignment()` protocol (from #9283) on `Audio` and `Image`. Any `dspy.Type` implementing this protocol is automatically:
- detected as a multimodal input (`_detect_multimodal_fields`)
- injected into the sandbox via `to_sandbox()` (descriptor string for media)
- previewed via `rlm_preview()` in the agent's variable context
- available to `llm_query_with_media(prompt, *media_var_names)`, which sends the actual content parts to the sub-LLM

Also adds `_wrap_rlm_inputs()` to auto-wrap raw values into their annotated `dspy.Type`, and `_inject_pending_vars()` for unified sandbox injection — both patterns from #9283.

Documents the protocol on the `Type` base class so future type authors know what to implement for RLM support.

2. Multi-model sub-call routing via `sub_lms`

Problem: Some tasks benefit from cheap exploration + strong verification. RLM only supports a single `sub_lm`.

Solution: `sub_lms={"flash": lm_fast, "pro": lm_pro}` — sandbox code selects with `llm_query(prompt, model="flash")`. Falls back to `sub_lm` → `dspy.settings.lm`.

3. LocalInterpreter — unsandboxed host-process execution

Problem: Deno/Pyodide sandbox can't access host packages (PIL, numpy, soundfile, etc.).

Solution: New `LocalInterpreter` (175 lines) implements `CodeInterpreter` via `exec()`. State persists, tools injected, `SUBMIT()` identical. Intentionally unsandboxed for local experiments.

4. Budget awareness — `budget()`, `max_time`, `max_cost`

Problem: Agents burn through iterations blindly. A single runaway example can blow your API budget in optimization loops.

Solution:
- `max_time=120` — wall-clock limit, triggers extract fallback (not crash)
- `max_cost=0.50` — dollar limit via litellm cost + BYOK `upstream_inference_cost`
- `budget()` — sandbox tool showing remaining resources with low-resource warnings

5. GEPA bootstrap_trace resilience

Problem: RLM timeouts/cost overruns raise exceptions that `bootstrap_trace_data` doesn't catch, losing all trace data.

Solution: Broad `except Exception` handler preserves partial trace as `FailedPrediction` (+7 lines in `bootstrap_trace.py`). `KeyboardInterrupt` is NOT caught.

6. Depth > 1 — recursive subcalls

Problem: `llm_query()` makes plain LM completions — no tools, no iteration. Sub-LLMs miss information in long contexts ("lost in the middle").

Solution: `max_depth=2` makes `llm_query()` spawn a child RLM with its own REPL. The child can write code, call its own `llm_query()`, and iterate before SUBMITting.

Budget propagation: children get remaining time/cost and their own iteration/call counters. Interpreter type matches the parent (LocalInterpreter→LocalInterpreter, PythonInterpreter→PythonInterpreter). Tested E2E: 5/5 on LongMemEval_S with Gemini 3 Flash. Adapted from vanilla RLM depth>1.

7. Type protocol documentation on base class

Adds protocol comments to the `dspy.Type` base class documenting the four methods for RLM sandbox opt-in, matching the pattern established by #9283.

File-by-file
- `dspy/adapters/types/audio.py` — `rlm_preview`, `to_sandbox`, `sandbox_setup`, `sandbox_assignment`
- `dspy/adapters/types/image.py`
- `dspy/adapters/types/base_type.py`
- `dspy/predict/rlm.py`
- `dspy/primitives/local_interpreter.py`
- `dspy/primitives/python_interpreter.py` — `_has_rlm_support` + `to_sandbox` injection
- `dspy/teleprompt/bootstrap_trace.py`
- `tests/predict/test_rlm.py`
- `tests/primitives/test_local_interpreter.py`
- `tests/primitives/test_media_serialization.py`

How they compose
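An illustration of how the features might compose in a single constructor call, using the parameter names stated in this PR. The `RLM` function here is a stand-in that just records its arguments (the real call is `dspy.RLM`, whose exact signature may differ).

```python
def RLM(signature, interpreter=None, sub_lms=None,
        max_time=None, max_cost=None, max_depth=1):
    """Stand-in recording the PR's parameters; not the real dspy.RLM."""
    return {"signature": signature, "interpreter": interpreter,
            "sub_lms": sub_lms or {}, "max_time": max_time,
            "max_cost": max_cost, "max_depth": max_depth}


agent = RLM(
    "question, recording -> answer",      # 1: media input detected via protocol
    interpreter="LocalInterpreter()",     # 3: host-process execution
    sub_lms={"flash": "lm_fast"},         # 2: multi-model routing in llm_query
    max_time=120, max_cost=0.50,          # 4: budget limits + budget() tool
    max_depth=2,                          # 6: llm_query spawns child RLMs
)
```

Features 5 and 7 need no parameters here: GEPA resilience lives in bootstrap_trace, and the protocol docs live on the Type base class.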
Test summary
206 passed, 0 failures, 38 skipped (skips are pre-existing Deno-dependent tests)
Relationship to #9283
The multimodal media support adopts the same `dspy.Type` sandbox protocol that #9283 introduces for `DataFrame`. We implemented `to_sandbox()` / `rlm_preview()` / `sandbox_setup()` / `sandbox_assignment()` on `Audio` and `Image`, and use the same generic `_has_rlm_support()` detection in `PythonInterpreter`. This means any future `dspy.Type` with the protocol works automatically with RLM — no RLM-specific code needed.

Implementation notes
- Defaults preserve existing behavior (`max_time=None`, `max_cost=None`, `sub_lms={}`, `max_depth=1`, default `PythonInterpreter`)
- Duck-typed protocol detection (`hasattr(annotation, 'to_sandbox')`), not hardcoded type lists
- Passes `ruff check`