Bilevel Autoresearch: use autoresearch to research autoresearch #375
Update: Controlled ablation on Karpathy's benchmark (3×3 repeats)
Since the original post, we ran a proper controlled experiment directly on the autoresearch benchmark (train.py, val_bpb, 300s budget) using 3 RTX 5090 servers in parallel.
Setup
Same LLM (DeepSeek) for all levels — eliminates model capability as a confound. 3 independent repeats × 30 iterations each; train.py was reset to baseline between repeats.
What Level 2 discovered
Each repeat independently generated different mechanisms from different domains — no human specified which domains to explore:
Why Level 2 wins
Group A follows a near-deterministic search path: it tries WEIGHT_DECAY, then WINDOW_PATTERN, then repeats the same failed proposal up to 22 times consecutively. Level 2's mechanisms (Tabu Search, Orthogonal Exploration) break this loop and guide the LLM to discover that reducing TOTAL_BATCH_SIZE dramatically improves val_bpb — a direction Groups A and B never explored.
Details
Full ablation report, experiment logs, and all Level 2 generated code: GitHub
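The Tabu Search mechanism mentioned above, which stops the loop from re-proposing recently failed changes, can be sketched in a few lines. This is a toy illustration; the class and method names are ours, not from the repo:

```python
from collections import deque

class TabuFilter:
    """Block the search loop from re-proposing recently failed proposals.

    A minimal sketch of the Tabu Search idea: a bounded short-term
    memory of failures; anything in it is off-limits until evicted.
    """
    def __init__(self, capacity: int = 10):
        self.tabu = deque(maxlen=capacity)  # oldest entries fall out

    def is_allowed(self, proposal: str) -> bool:
        return proposal not in self.tabu

    def record_failure(self, proposal: str) -> None:
        self.tabu.append(proposal)

f = TabuFilter(capacity=3)
f.record_failure("WEIGHT_DECAY=0.1")
assert not f.is_allowed("WEIGHT_DECAY=0.1")  # blocked: already failed
assert f.is_allowed("TOTAL_BATCH_SIZE=8")    # new direction still open
```

This directly addresses the failure mode of Group A, which repeated the same failed proposal up to 22 times: a tabu list makes that repetition impossible within the memory window.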
What we built
We extended the autoresearch pattern with a second optimization loop. The inner loop runs the standard propose → evaluate → iterate cycle on a task. The outer loop treats the inner loop's configuration as its own optimization target — analyzing traces, diagnosing bottlenecks, and updating the pipeline.
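As a rough sketch of the two nested loops, here is a toy numeric stand-in: the inner loop hill-climbs a candidate under a given configuration, and the outer loop reads the score trace and updates the configuration. All function names and the stagnation heuristic are illustrative assumptions, not the repo's API:

```python
import random

random.seed(0)

def propose(x, cfg):
    # Inner loop "propose": perturb the candidate; in the real
    # system this is an LLM call, here a random step.
    return x + random.uniform(-1, 1) * cfg["step"]

def evaluate(x):
    # Measurable objective (maximum at x = 3); stands in for the
    # rubric score or val_bpb.
    return -(x - 3.0) ** 2

def inner_loop(cfg, x0=0.0, iters=20):
    # propose -> evaluate -> iterate, keeping only improvements.
    best, best_score = x0, evaluate(x0)
    for _ in range(iters):
        cand = propose(best, cfg)
        s = evaluate(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

def outer_loop(cfg, cycles=5):
    # Outer loop: treat the inner loop's config as the target.
    # Run the inner loop, inspect the score trace, and adjust
    # the config when progress stalls (a toy "diagnose" step).
    history = []
    for _ in range(cycles):
        _, score = inner_loop(cfg)
        history.append(score)
        if len(history) > 1 and history[-1] <= history[-2]:
            cfg["step"] *= 0.5  # diagnosed stagnation: refine search
    return cfg, max(history)

cfg, best = outer_loop({"step": 1.0})
```

The point of the sketch is the shape, not the numbers: the outer loop never touches candidates directly; it only reads traces and edits the inner loop's configuration.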
The key question: autoresearch treats the pipeline structure as fixed. What happens when the pipeline itself becomes the research subject?
Repo: EdwardOptimization/Bilevel-Autoresearch
Architecture
The framework is domain-agnostic. Our demo optimizes research articles against a 5-dimension rubric (Argumentative Rigor / Conceptual Clarity / Cross-Article Consistency / Insight Novelty / Actionability), but the inner loop can be anything with a measurable objective.
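For concreteness, a scorer over that rubric might look like the following. The equal-weight averaging and the 0-10 scale are our assumptions; only the five dimension names come from the post:

```python
# Hypothetical rubric scorer; not the repo's actual implementation.
RUBRIC = (
    "Argumentative Rigor",
    "Conceptual Clarity",
    "Cross-Article Consistency",
    "Insight Novelty",
    "Actionability",
)

def rubric_score(scores: dict) -> float:
    """Collapse per-dimension scores into one measurable objective."""
    missing = set(RUBRIC) - scores.keys()
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    return sum(scores[d] for d in RUBRIC) / len(RUBRIC)

assert rubric_score({d: 7.0 for d in RUBRIC}) == 7.0
```

Anything that can be reduced to a single scalar like this can serve as the inner loop's objective, which is what makes the framework domain-agnostic.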
Results
Single-layer (inner loop only, 17 runs):
Dual-layer (outer loop automated, 4 cycles × 5 runs):
Cycle 2's stability (4/5 runs at 7.0, versus declining scores in Cycle 1) is direct evidence that the outer loop is working.
Level 2: Mechanism Research via Code Generation
This is where it gets interesting. Prompt-level optimization has a ceiling — you can't discover a fundamentally new search mechanism by rewording a prompt. So we asked: can the outer loop research new mechanisms the same way autoresearch researches any topic?
The outer LLM (DeepSeek) runs a multi-round research session: it draws on a source domain, generates a candidate mechanism as Python code, loads it via importlib, injects it into the pipeline, runs the inner loop, and measures the improvement.
In our first successful run, DeepSeek drew from Behavioral Psychology / Curriculum Learning and autonomously generated a SubskillFeedbackLoopStage — a stage that decomposes "argumentative rigor" into sub-skills (premise clarity, transition logic, jargon usage, conclusion support), scores each, and provides targeted revision directives for weak areas.
The code was generated on the first attempt (0 retries), dynamically loaded, and injected after the edit planning stage. Result:
Modest improvement, but the point is: the outer loop autonomously wrote working Python code that modified the inner pipeline's behavior — no human specified what mechanism to try or which domain to draw from.
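The load-and-inject step described above (generate code, load it at runtime, splice it into the pipeline) can be sketched with stdlib importlib. The function name and the pipeline API in the usage comment are hypothetical:

```python
import importlib.util
import sys

def load_generated_stage(path: str, class_name: str):
    """Load an LLM-generated pipeline stage from a .py file at runtime.

    A minimal sketch of the importlib step; names are illustrative,
    not the repo's actual API.
    """
    spec = importlib.util.spec_from_file_location("generated_stage", path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[spec.name] = module
    spec.loader.exec_module(module)     # executes the generated code
    return getattr(module, class_name)  # e.g. SubskillFeedbackLoopStage

# Usage (hypothetical pipeline API): insert after the edit planning stage.
# Stage = load_generated_stage("generated/subskill_stage.py",
#                              "SubskillFeedbackLoopStage")
# pipeline.insert_after("edit_planning", Stage())
```

One design note: `exec_module` runs arbitrary generated code with full interpreter privileges, so in practice this step wants sandboxing or at least a review gate before injection.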
The recursive structure
This creates three levels of the same pattern:
Each level uses the same core loop: propose → evaluate → iterate. The boundary isn't the search space — it's whether the research question is measurable. As long as we can score the inner loop's output, the outer loop can research how to improve it.
Honest limitations
Connection to autoresearch
This project was directly inspired by autoresearch. The core observation: autoresearch, AutoResearchClaw, and EvoScientist each represent a human-designed mechanism change to the base loop. We asked whether an outer optimization loop could discover such improvements autonomously.
The theoretical framing maps to bilevel optimization:
The inner problem is solved only approximately by the LLM, making this an instance of approximate bilevel optimization with LLM solvers.
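Written out as a standard bilevel program, the framing looks like the following sketch. The symbols are our notation (outer variables $\phi$ for the pipeline configuration, inner variables $\theta$ for the artifact the inner loop produces), not taken from the repo:

```latex
\begin{aligned}
\max_{\phi} \quad & F\bigl(\theta^{*}(\phi)\bigr)
  && \text{outer: pipeline configuration } \phi \text{ (stages, prompts, mechanisms)} \\
\text{s.t.} \quad & \theta^{*}(\phi) \in \arg\max_{\theta} \; f(\theta;\, \phi)
  && \text{inner: artifact produced under configuration } \phi
\end{aligned}
```

Here $F$ and $f$ are both the measurable objective (rubric score or val_bpb); because the inner $\arg\max$ is only ever approximated by a finite LLM-driven search, $\theta^{*}(\phi)$ is an approximate solution, which is what makes this approximate bilevel optimization.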