- Reviewed existing repo files:
AGENTS.md,README.md,BOOKMARK_LOG.md. - Initialized a runnable Python package scaffold:
pyproject.tomlaerorl/__init__.pyaerorl/config.pyaerorl/wrapper.py
- Added synthetic benchmark entrypoint:
benchmarks/vlm_grpo_benchmark.py
- Added smoke test:
tests/test_public_api.py
- Updated
README.mdto match current implemented state.
- User requested durable progress that survives disconnects and can be resumed by another agent.
- Repo was docs-only and untracked; this creates a concrete baseline with executable artifacts.
- Keep public API stable via
aerorl.__init__exports. - Benchmark script is synthetic for smoke checks only; real benchmark integration is pending.
- Integrate a real VLM policy/ref-model loading path (likely HF + TRL/verl adapters).
- Implement vision-mask-aware loss computation path and tests.
- Add kernel scaffolding under
aerorl/kernels/per AGENTS guidance. - Replace synthetic benchmark outputs with measured VRAM + throughput metrics.
- Continue frequent small commits and push each checkpoint.
- Run:
git -C /pub7/neel2/aerorl status - Read this file first, then
README.md. - Continue from "Immediate next steps" and commit after each logical sub-step.
python -m pip install -e .failed withOSError: [Errno 28] No space left on device.pytestexecutable is not installed on host PATH.- Functional smoke checks passed using
PYTHONPATH=. python ...invocations.
- Diagnosed free space across mounted disks and confirmed
/pub7has large free capacity. - Created dedicated execution environment on
/pub7:- venv:
/pub7/neel2/.venvs/aerorl - temp:
/pub7/neel2/tmp - pip cache:
/pub7/neel2/pip-cache
- venv:
- Successfully installed project with:
TMPDIR=/pub7/neel2/tmp PIP_CACHE_DIR=/pub7/neel2/pip-cache /pub7/neel2/.venvs/aerorl/bin/python -m pip install -e .
- Ran tests:
/pub7/neel2/.venvs/aerorl/bin/python -m pytest -q- Result:
1 passed in 0.01s
- Ran benchmark and persisted output:
/pub7/neel2/.venvs/aerorl/bin/python benchmarks/vlm_grpo_benchmark.py --model Qwen/Qwen2.5-VL-7B-Instruct --steps 25- Output saved to
reports/benchmark-smoke-2026-03-23.json
- Added verification report:
reports/verification-2026-03-23.md
steps: 25elapsed_sec: 0.2515iters_per_sec: 99.3925
- Use
/pub7/neel2/.venvs/aerorl/bin/pythonfor all project commands to avoid/pub3disk-space issues.
- Real TRL/verl GRPO integration surface:
- Added backend adapter resolution in
aerorl/adapters.pyand wired towrap_vlm_for_rloutput.
- Added backend adapter resolution in
- Vision-token masking in loss path:
- Added
aerorl/losses.pywithbuild_text_token_maskandmasked_cross_entropy_loss.
- Added
- Quantized reference runtime path:
- Added
aerorl/quant_ref.pyand integrated into wrapper outputs (int4/int8/fp16-referencemodes).
- Added
- Real VRAM/throughput benchmark path:
- Upgraded benchmark script to
--mode realwith measuredpeak_vram_gband throughput.
- Upgraded benchmark script to
pytest:5 passed in 1.15s- Real benchmark (
cuda) persisted toreports/benchmark-real-2026-03-23.json - Synthetic benchmark persisted to
reports/benchmark-synth-2026-03-23.json
trlandverlare not installed in this machine, so backend auto-resolution reports scaffold mode.- The integration layer is implemented and ready; full trainer run requires installing one of those backends.
- Added
AeroRLTrainerlifecycle +train_stepintegration for masked loss path. - Added backend-aware quantization runtime selection (
torch/bitsandbytes/torchaofallback logic). - Added multi-model matrix benchmark mode with
--models. - Rewrote README to minimal user-facing format with direct stats and easy examples.
- Tests:
7 passed - Real benchmark file:
reports/benchmark-real-2026-03-23.json - Real matrix file:
reports/benchmark-matrix-real-2026-03-23.json
- Added
aerorl/rewards.pywith composable reward framework:VerifierRewardGroundingRewardFormatRewardCostRewardWeightedRewardStackbuild_default_reward_stackevaluate_records
- Added offline evaluator CLI:
benchmarks/reward_replay_evaluator.py
- Exported reward APIs in
aerorl/__init__.py.
- Added tests:
tests/test_rewards.pytests/test_reward_replay_evaluator_cli.py
- Full suite result:
12 passed. - Added replay sample + summary artifacts:
reports/reward-replay-sample-2026-03-23.jsonlreports/reward-eval-summary-2026-03-23.json
- AeroRL now supports fast reward-iteration loops offline before expensive RL runs.
- This directly improves practical usefulness for reward-function experimentation.
benchmarks/reward_replay_evaluator.pynow supports:- configurable weights via repeated
--weight name=value - JSON/regex format constraints (
--require-json,--regex-pattern) - cost controls (
--target-tokens,--latency-budget-ms) - aggregate pass metrics (
--pass-threshold) - best/worst surfacing (
--top-k)
- configurable weights via repeated
aerorl/rewards.pynow returns richer summaries:pass_ratecomponent_averagesbest_examples/worst_examples
- Added example replay dataset and output:
examples/reward_replay_example.jsonlreports/reward-eval-example-2026-03-23.json
- Full suite:
14 passed
- New benchmark script:
benchmarks/reward_value_benchmark.py
- New benchmark dataset:
examples/reward_value_benchmark_dataset.jsonl
- New benchmark artifact:
reports/reward-value-benchmark-2026-03-23.json
- New unit test:
tests/test_reward_value_benchmark.py
- Manual pass rate:
0.833333 - AeroRL pass rate:
0.5 - AeroRL caught hidden manual false passes:
2 - False pass rate among manual passes:
0.4 - Observability dimensions: manual
1vs AeroRL4
- Demonstrates concrete quality-gate value, not just aggregate score output.
- Shows exactly what AeroRL catches that naive manual scoring misses.
- Real cached HF datasets from
/pub7/neel2/.cache_hf/datasets:nielsr/docvqa_1200_examplestrainHuggingFaceM4/ChartQAtrain shards
- Total records evaluated:
29,299
- Ran
nvidia-smibefore benchmark. - GPUs were idle (0% utilization except display processes).
- Benchmark intentionally executed in CPU mode.
- Script:
benchmarks/reward_large_scale_real_dataset_benchmark.py - Artifact:
reports/reward-large-scale-benchmark-2026-03-23.json
- Manual pass rate:
0.997747 - AeroRL pass rate:
0.624083 - Hidden manual false passes caught by AeroRL:
10,948 - Hidden false-pass fraction among manual passes:
0.374508 - Evaluation throughput:
52,646 records/s(CPU)
- Manual scoring overestimates quality on large datasets.
- AeroRL provides materially stronger quality gating and diagnostics at scale.
- New benchmark script:
benchmarks/reward_model_generated_benchmark.py
- New focused tests:
tests/test_reward_model_generated_benchmark.py
- Loads real cached datasets from
/pub7/neel2/.cache_hf/datasets:- DocVQA train Arrow
- ChartQA train Arrow shards
- Decodes real image bytes and runs image-conditioned generation with cached SmolVLM (
HuggingFaceTB/SmolVLM-256M-Instruct). - Normalizes outputs into a JSON answer contract and scores with AeroRL reward stack.
- Writes both replay-level JSONL and benchmark summary JSON artifacts.
- Added local snapshot/cache resolution + forced HF cache env handling to avoid
/pub3disk-full path.
- Checked GPU state before run via
nvidia-smi(idle). - Ran benchmark on
cuda:0using local cached model files only.
- Summary artifact:
reports/reward-model-generated-benchmark-2026-03-24.json
- Replay artifact:
reports/reward-model-generated-replay-2026-03-24.jsonl
- Run shape:
40DocVQA +60ChartQA =100model-generated samples
- Key outcomes:
- manual pass rate:
0.47 - AeroRL pass rate:
0.3 - hidden manual false passes caught by AeroRL:
17 - hidden false-pass fraction among manual passes:
0.361702 - generation throughput:
1.784samples/sec (GPU)
- manual pass rate:
- New test file passes:
tests/test_reward_model_generated_benchmark.py:2 passed
- Script:
benchmarks/reward_model_generated_benchmark.py - Model:
HuggingFaceTB/SmolVLM-256M-Instruct - Device:
cuda:0 - Data: real cached DocVQA + ChartQA rows with image-conditioned generation
- Limits:
300DocVQA +500ChartQA =800total
reports/reward-model-generated-benchmark-2026-03-24-large.jsonreports/reward-model-generated-replay-2026-03-24-large.jsonl
- manual pass rate:
0.47125 - AeroRL pass rate:
0.2875 - hidden manual false passes caught by AeroRL:
147 - hidden false-pass fraction among manual passes:
0.38992 - generation throughput:
1.462samples/sec
- Reported GPU status before run: near idle.
- Reported GPU status after run: very high utilization/memory pressure (
~96-99%util on two GPUs reported bynvidia-smiquery snapshot in artifact).