AeroRL is a compact library for vision-language RL experimentation. It packages the repetitive parts of VLM-RL work that teams usually rebuild by hand: trainer/runtime wiring, masked loss over mixed vision-language tokens, lightweight benchmarking, and replayable reward evaluation.
- runtime setup for train and reference model roles
- masked cross-entropy for text tokens with vision-token exclusion
- a minimal trainer lifecycle for experimentation
- a weighted reward stack for offline replay scoring
- benchmark scripts for throughput, VRAM, and reward-quality checks
MLE teams working on VLM-RL usually need the same scaffolding before they can even test an idea:
- choose and normalize trainer backend behavior
- wire train/reference model roles
- get token masking right for mixed-modal sequences
- measure throughput and memory in a repeatable way
- iterate on reward logic before spending GPU time on RL
AeroRL standardizes that layer so reward design and experiment iteration are easier to reason about and easier to reproduce.
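The token-masking item above is the subtlest of these. As a minimal sketch in plain PyTorch of masking vision tokens out of a cross-entropy loss (illustrative only, not AeroRL's `masked_cross_entropy_loss` implementation):

```python
import torch
import torch.nn.functional as F

def masked_text_ce(logits, labels, vision_mask):
    # Vision-token positions are excluded from the loss: mark them with the
    # ignore index so cross_entropy scores only the text tokens.
    labels = labels.masked_fill(vision_mask, -100)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )

# Toy mixed-modal batch: 2 sequences, 4 positions, 8-way vocab.
logits = torch.randn(2, 4, 8, requires_grad=True)
labels = torch.tensor([[1, 2, 3, 4], [2, 3, 4, 5]])
vision_mask = torch.tensor([[True, False, False, True],
                            [False, False, True, False]])

loss = masked_text_ce(logits, labels, vision_mask)
loss.backward()  # masked positions receive zero gradient
```

Because masked positions are ignored rather than zeroed after the fact, the mean reduction divides by the number of text tokens only.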
Core library:
```shell
git clone https://github.com/sabdulmajid/aeroRL.git
cd aeroRL
python -m pip install -e .
```

Library + benchmark tooling + test runner:

```shell
python -m pip install -e ".[benchmark,dev]"
```

`benchmark` installs the image / Arrow / Transformers dependencies used by the benchmark scripts. `dev` installs pytest.
Library + local support for the external lmms-eval benchmark wrapper:
```shell
python -m pip install -e ".[benchmark,benchmark_standard,dev]"
python -m pip install --no-deps git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git@v0.7.1
```

The extra step is intentional. lmms-eval has a very wide dependency surface, so AeroRL keeps the CUDA / PyTorch stack under your control and installs the harness itself without letting it re-resolve torch.
```python
import torch

from aerorl import AeroRLConfig, AeroRLTrainer, wrap_vlm_for_rl

cfg = AeroRLConfig(trainer_backend="auto", quant_ref_bits=8, mask_vision_tokens=True)
train_runtime, ref_runtime = wrap_vlm_for_rl("Qwen/Qwen2.5-VL-7B-Instruct", cfg)

trainer = AeroRLTrainer(cfg)
trainer.on_train_start()

logits = torch.randn(2, 4, 8, requires_grad=True)
labels = torch.tensor([[1, 2, 3, 4], [2, 3, 4, 5]])
vision_mask = torch.tensor([[True, False, False, True], [False, False, True, False]])
step_result = trainer.train_step(logits=logits, labels=labels, vision_mask=vision_mask)

print(train_runtime["trainer"])
print(ref_runtime["quantization_mode"])
print(step_result)
print(trainer.on_train_end())
```

The replay evaluator expects newline-delimited JSON records with `prompt`, `response`, `reference`, and `metadata`.
`reference` may be either:
- a single canonical answer string
- a list of acceptable answer strings when the dataset provides multiple valid aliases
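Handling both shapes uniformly is straightforward; a hypothetical verifier helper (not AeroRL's actual code) that accepts either form:

```python
def matches_reference(answer, reference):
    # reference may be a single canonical string or a list of aliases;
    # normalize both shapes to a list and compare case-insensitively.
    refs = reference if isinstance(reference, list) else [reference]
    return answer.strip().lower() in (r.strip().lower() for r in refs)

print(matches_reference("April 15, 2014", ["april 15,2014", "april 15, 2014"]))  # True
```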
Example:
```json
{"id":"ex1","prompt":"...","response":"{\"answer\":\"april 15, 2014\"}","reference":["april 15,2014","april 15, 2014"],"metadata":{"evidence_entities":["april 15,2014","april 15, 2014"],"claimed_entities":["april 15, 2014"],"latency_ms":120}}
```

The built-in reward stack combines four components:
- `verifier`: answer correctness against one or more acceptable references
- `grounding`: claimed vs evidence entity consistency, with normalization for common surface-form variants
- `format`: JSON / regex contract checks
- `cost`: latency and verbosity pressure
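How the weighted combination and pass threshold interact can be sketched with hypothetical component scores (the weights mirror the evaluator invocation below; the scores are made up):

```python
# Hypothetical component scores for one replayed record (illustrative only).
scores = {"verifier": 1.0, "grounding": 0.8, "format": 1.0, "cost": 0.5}
weights = {"verifier": 0.45, "grounding": 0.3, "format": 0.2, "cost": 0.05}

total = sum(weights[k] * scores[k] for k in weights)
passed = total >= 0.5  # --pass-threshold
print(round(total, 4), passed)  # 0.915 True
```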
Run the replay evaluator:
```shell
python benchmarks/reward_replay_evaluator.py \
  --input examples/reward_replay_example.jsonl \
  --output reports/reward-eval-example-2026-03-23.json \
  --require-json \
  --regex-pattern '^\{.*\}$' \
  --weight verifier=0.45 \
  --weight grounding=0.3 \
  --weight format=0.2 \
  --weight cost=0.05 \
  --pass-threshold 0.5 \
  --top-k 2
```

Small deterministic benchmark:
```shell
python benchmarks/reward_value_benchmark.py \
  --input examples/reward_value_benchmark_dataset.jsonl \
  --output reports/reward-value-benchmark-2026-03-23.json
```

Current checked-in result:

- dataset size: 6
- manual pass rate: 0.833333
- AeroRL pass rate: 0.5
- hidden false passes caught by AeroRL: 2
This benchmark is intentionally tiny. Its job is to show the scoring contract, not to claim model quality.
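To make the "hidden false passes" figure concrete: with 6 records, the two rates imply 5 manual passes and 3 AeroRL passes, and (assuming every AeroRL pass also passes the manual check) the gap is the 2 caught false passes:

```python
# Back out absolute counts from the reported rates (6-record dataset).
records = 6
manual_passes = round(0.833333 * records)  # 5
aerorl_passes = round(0.5 * records)       # 3

# If AeroRL's passes are a subset of the manual ones, the difference is the
# number of manual passes AeroRL rejects -- the "hidden false passes".
print(manual_passes - aerorl_passes)  # 2
```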
CPU-only replay benchmark over cached DocVQA + ChartQA rows:
```shell
python benchmarks/reward_large_scale_real_dataset_benchmark.py \
  --limit 50000 \
  --report-output reports/reward-large-scale-benchmark-2026-03-23.json
```

Current checked-in artifact: `reports/reward-large-scale-benchmark-2026-03-23.json`

- total records: 29,299
- manual pass rate: 0.997747
- AeroRL pass rate: 0.624083
- hidden false passes caught: 10,948
- evaluation throughput: 52,646 records/s on CPU
This benchmark uses deterministic corruption patterns over real prompts and references. It is useful for stress-testing the reward stack, but it is not a substitute for model-generated evaluation.
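The concrete patterns live in the benchmark script; purely as an illustration of what a deterministic, seed-indexed corruption can look like (hypothetical helper, not the script's actual code):

```python
def corrupt(answer: str, seed: int) -> str:
    # Deterministic surface corruptions of a reference answer, selected by
    # seed so every run produces the same stress cases (illustrative only).
    if seed % 2 == 0:
        return answer + "."          # trailing-punctuation variant
    return answer.replace(" ", "")   # whitespace-stripped variant

print(corrupt("april 15, 2014", 0))  # april 15, 2014.
print(corrupt("april 15, 2014", 1))  # april15,2014
```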
This is the primary benchmark in the repo. It uses real cached images and questions, generates answers with a real VLM, and then scores those answers with both the manual baseline and the AeroRL reward stack.
First-principles takeaway:
- the old March 27 small-model run underused the GPU because it paired a 256M model with batch size 1
- the current benchmark fixes that by batching generation and sampling GPU usage continuously during the run
- the goal is not to maximize VRAM for its own sake; the goal is to maximize useful throughput while keeping quality strong
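Continuous sampling, as in the second bullet, reduces to collecting periodic utilization/memory readings and summarizing them; a minimal sketch over made-up `nvidia-smi`-style CSV samples:

```python
# Hypothetical samples in "utilization.gpu, memory.used" CSV form, one line
# per 0.5 s tick (real numbers come from polling the driver during the run).
raw = """12, 4096
88, 57000
97, 68000
100, 68702
91, 61000"""

utils, mems = [], []
for line in raw.splitlines():
    u, m = (float(x) for x in line.split(","))
    utils.append(u)
    mems.append(m)

print(f"avg util {sum(utils) / len(utils):.1f}%  max mem {max(mems):.0f} MiB")
# avg util 77.6%  max mem 68702 MiB
```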
Always preflight GPU usage before a run:
```shell
nvidia-smi
```

We first ran a balanced 200-sample matrix on one physical GPU to find a good model / batch-size point:
```shell
CUDA_VISIBLE_DEVICES=1 python benchmarks/reward_model_generated_benchmark.py \
  --models 'HuggingFaceTB/SmolVLM-500M-Instruct@32,Qwen/Qwen2-VL-2B-Instruct@16,Qwen/Qwen2.5-VL-3B-Instruct@16,Qwen/Qwen2.5-VL-7B-Instruct@8' \
  --device cuda:0 \
  --cache-dir /pub7/neel2/.cache_hf \
  --limit-docvqa 100 \
  --limit-chartqa 100 \
  --max-new-tokens 16 \
  --gpu-sample-interval-sec 0.5 \
  --output reports/reward-model-generated-matrix-2026-03-27.json
```

Artifact: `reports/reward-model-generated-matrix-2026-03-27.json`
| Model | Batch | Samples/s | AeroRL pass rate | Avg GPU util | Peak reserved GB |
|---|---|---|---|---|---|
| SmolVLM-500M | 32 | 0.835 | 0.69 | 2.783% | 18.459 |
| Qwen2-VL-2B | 16 | 2.416 | 0.76 | 84.947% | 48.008 |
| Qwen2.5-VL-3B | 16 | 1.941 | 0.61 | 73.409% | 43.316 |
| Qwen2.5-VL-7B | 8 | 1.553 | 0.39 | 64.632% | 42.719 |
What this means:
- On this machine, `Qwen2-VL-2B-Instruct` was the best measured balance of quality, throughput, and actual GPU use.
- Bigger was not automatically better. The 7B model used more compute than the old tiny baseline, but it did not beat the 2B model on this slice.
- Batching was the real lever. That is what moved the GPU from “barely used” to “meaningfully busy.”
After the matrix, we promoted the winning configuration to the full dated run:
```shell
CUDA_VISIBLE_DEVICES=1 python benchmarks/reward_model_generated_benchmark.py \
  --model-id 'Qwen/Qwen2-VL-2B-Instruct' \
  --device cuda:0 \
  --cache-dir /pub7/neel2/.cache_hf \
  --limit-docvqa 300 \
  --limit-chartqa 500 \
  --batch-size 16 \
  --max-new-tokens 16 \
  --gpu-sample-interval-sec 0.5 \
  --output reports/reward-model-generated-benchmark-qwen2-vl-2b-2026-03-27.json \
  --replay-output reports/reward-model-generated-replay-qwen2-vl-2b-2026-03-27.jsonl
```

Artifact: `reports/reward-model-generated-benchmark-qwen2-vl-2b-2026-03-27.json`
- model: `Qwen/Qwen2-VL-2B-Instruct`
- process-visible device: `cuda:0`
- physical GPU used: GPU 1 via `CUDA_VISIBLE_DEVICES=1`
- total samples: 800
- batch size: 16
- generation throughput: 3.05 samples/s
- token throughput: 13.13 generated tokens/s
- average sample latency: 327.795 ms
- p95 sample latency: 857.34 ms
- average GPU utilization during the run: 86.845%
- max GPU utilization during the run: 100%
- average GPU memory in use during the run: 57,742.51 MiB
- max GPU memory in use during the run: 68,702 MiB
- peak PyTorch reserved memory: 66.404 GB
- manual pass rate: 0.71625
- AeroRL pass rate: 0.6825
- manual false passes caught by AeroRL: 27
- replay rows with multiple acceptable references preserved: 122
- empty generations recorded as prompt echoes: 0
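One sanity check worth knowing: the reported average latency is essentially the inverse of the steady-state generation throughput (per-sample latency appears to be amortized over the batch), so the two numbers cross-validate each other:

```python
# Quick consistency check on the reported numbers (illustrative arithmetic).
throughput_sps = 3.05     # reported samples/s
avg_latency_ms = 327.795  # reported average sample latency

implied_ms = 1000 / throughput_sps
print(round(implied_ms, 1))  # 327.9, within rounding of the reported 327.795
```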
The model-generated benchmark is also where the recent scoring fixes matter most:
- DocVQA rows now keep every acceptable answer alias instead of collapsing to one string.
- Grounding tolerates common surface-form variants like trailing punctuation and percent formatting.
- Empty generations stay empty instead of being decoded as the prompt template.
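The surface-form tolerance in the second bullet amounts to normalizing before comparison; a sketch of the kind of normalization meant (hypothetical helper, not AeroRL's actual grounding code):

```python
import re

def normalize_entity(text: str) -> str:
    # Fold common surface-form variants: case, "42 %" vs "42%", and
    # trailing punctuation (illustrative only).
    text = text.strip().lower()
    text = re.sub(r"\s+%", "%", text)
    return text.rstrip(".,;:")

print(normalize_entity("42 %.") == normalize_entity("42%"))  # True
```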
This benchmark answers a different question than the AeroRL reward benchmark:
- the AeroRL benchmark measures reward quality and replay integrity on real model generations
- the public harness benchmark measures the same model on a standard external evaluation stack
We keep the external harness at batch size 1 on purpose. lmms-eval explicitly warns that some multimodal backends can change behavior across batch sizes, so for the README numbers below we chose the conservative setting over maximum GPU saturation.
Run the public lite VQA suite:
```shell
CUDA_VISIBLE_DEVICES=1 python benchmarks/lmms_eval_standard_benchmark.py \
  --device cuda:0 \
  --batch-size 1 \
  --process-with-media \
  --force-simple \
  --output reports/lmms-eval-standard-benchmark-2026-03-27.json \
  --raw-output-dir reports/lmms-eval-standard-raw-2026-03-27 \
  --log-output reports/lmms-eval-standard-2026-03-27.log
```

Artifact: `reports/lmms-eval-standard-benchmark-2026-03-27.json`
Public-harness results for Qwen/Qwen2-VL-2B-Instruct:
| Task | Samples | Metric | Score |
|---|---|---|---|
| aerorl_docvqa_val_lite | 500 | anls | 0.860572 |
| aerorl_chartqa_lite | 500 | relaxed_overall | 0.650000 |
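For context on the metric names: `anls` is DocVQA's Average Normalized Levenshtein Similarity. A minimal sketch of the per-answer score under the standard definition (threshold 0.5; not the harness's exact implementation):

```python
def nls(pred: str, gold: str, tau: float = 0.5) -> float:
    # Normalized Levenshtein similarity with the usual ANLS cutoff:
    # scores below tau are zeroed out.
    a, b = pred.strip().lower(), gold.strip().lower()
    # Classic single-row dynamic-programming edit distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    dist = dp[len(b)]
    sim = 1 - dist / max(len(a), len(b), 1)
    return sim if sim >= tau else 0.0

print(nls("april 15,2014", "april 15, 2014"))  # ~0.93: one missing space
```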
Run summary:
- total public-harness samples: 1,000
- elapsed time: 235.42 s
- throughput: 4.248 samples/s
- physical GPU used: GPU 1 via `CUDA_VISIBLE_DEVICES=1`
- average GPU utilization during the run: 27.249%
- max GPU utilization during the run: 61%
- average GPU memory in use during the run: 4,769.815 MiB
- max GPU memory in use during the run: 5,230 MiB
- average GPU power draw during the run: 157.478 W
- max GPU power draw during the run: 213.42 W
What this means:
- AeroRL now has a public benchmark path that does not depend on AeroRL’s own scoring logic.
- The external harness run is intentionally easier to compare across teams, but it is not the right benchmark to maximize single-GPU utilization.
- The batched AeroRL benchmark remains the right place to measure local throughput and reward-pipeline behavior.
`benchmarks/vlm_grpo_benchmark.py` is still useful, but it should be read as a synthetic torch smoke benchmark, not the primary README performance claim. The README performance story above comes from real model generation on real image/question pairs.
```shell
python benchmarks/vlm_grpo_benchmark.py \
  --mode real \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --steps 20 \
  --matrix-size 512
```

Current checked-in artifacts:

- `reports/benchmark-real-2026-03-23.json`
- `reports/benchmark-matrix-real-2026-03-23.json`
- `AeroRLConfig`
- `wrap_vlm_for_rl(...)`
- `AeroRLTrainer`
- `masked_cross_entropy_loss(...)`
- `create_quantized_reference_runtime(...)`
- `build_default_reward_stack(...)`
- `build_reward_stack(...)`
- `evaluate_records(...)`
Local validation completed on March 27, 2026:
- `python -m pytest -q` result: 29 passed
- model-generated matrix: `reports/reward-model-generated-matrix-2026-03-27.json`
- model-generated dated benchmark: `reports/reward-model-generated-benchmark-qwen2-vl-2b-2026-03-27.json`
- public harness benchmark: `reports/lmms-eval-standard-benchmark-2026-03-27.json`
- prototyping VLM-RL reward or loss ideas before integrating into a larger trainer
- replay-scoring offline trajectories to debug false positives and hidden failures
- generating dated JSON artifacts for throughput, VRAM, and reward-quality tracking