This repository is built for high-velocity AI coding agents (Claude 4 Sonnet, Cursor, GPT-4o, etc.).
Rules every agent must follow:
- Never change the public API in `aerorl/__init__.py` unless you bump the major version.
- All kernels must be in `aerorl/kernels/` as pure Triton Python (no raw CUDA unless absolutely required for IPC).
- Every new feature must ship with a benchmark delta vs. vanilla verl/TRL on Qwen2.5-VL-7B.
- Vision token masking is mandatory for any loss function: use the `vision_mask` tensor passed from the VLM processor.
- Zero-copy KV must survive `torch.compile` and FSDP2.
- All benchmarks must run on a single RTX PRO 6000 96 GB card and report exact VRAM delta and throughput.
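
To make the vision-masking rule concrete, here is a minimal, framework-free sketch of what "masked" means for a token-level loss. The function name `masked_mean_loss` is hypothetical, and the mask polarity (True marks a vision token to exclude) is an assumption; check the actual `vision_mask` convention before relying on it. A real loss would apply the same reduction to tensors rather than Python lists.

```python
from typing import Sequence


def masked_mean_loss(token_losses: Sequence[float],
                     vision_mask: Sequence[bool]) -> float:
    """Mean of per-token losses over text tokens only.

    Assumption: vision_mask[i] is True when token i is a vision token
    and must be excluded from the loss.
    """
    if len(token_losses) != len(vision_mask):
        raise ValueError("token_losses and vision_mask must have the same length")
    # Keep only the losses whose position is NOT flagged as a vision token.
    kept = [loss for loss, is_vision in zip(token_losses, vision_mask)
            if not is_vision]
    # Guard against sequences made up entirely of vision tokens.
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Illustrative values only: positions 1 and 2 are vision tokens.
losses = [0.5, 2.0, 4.0, 1.5]
mask = [False, True, True, False]
text_only_loss = masked_mean_loss(losses, mask)
```

Normalizing by the number of kept tokens, rather than the full sequence length, keeps the loss scale independent of how many image tokens a sample contains.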
Preferred workflow:
- Agent 1 (Architect): Update README + plan new feature
- Agent 2 (Kernel): Implement Triton kernel + test
- Agent 3 (Benchmark): Run on 96 GB card, update tables, create PR
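
For the benchmark step, a small helper along these lines could turn two runs into a table row with the VRAM delta and throughput ratio. This is only a sketch: `BenchResult`, `delta_row`, and the column layout are hypothetical, and the numbers below are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchResult:
    """One benchmark run (hypothetical record layout)."""
    name: str
    peak_vram_gb: float
    tokens_per_s: float


def delta_row(baseline: BenchResult, candidate: BenchResult) -> str:
    """Format a markdown table row: VRAM delta and throughput ratio."""
    # Negative delta means the candidate used less memory than the baseline.
    vram_delta = candidate.peak_vram_gb - baseline.peak_vram_gb
    # Ratio above 1.0 means the candidate is faster than the baseline.
    speedup = candidate.tokens_per_s / baseline.tokens_per_s
    return (f"| {candidate.name} vs. {baseline.name} | "
            f"{vram_delta:+.1f} GB | {speedup:.2f}x |")


# Placeholder numbers for illustration only, not real results.
baseline = BenchResult("baseline", peak_vram_gb=40.0, tokens_per_s=1000.0)
candidate = BenchResult("candidate", peak_vram_gb=30.0, tokens_per_s=1500.0)
row = delta_row(baseline, candidate)
```

Reporting the signed VRAM difference alongside the throughput ratio keeps both required numbers in one row, so regressions in either direction are visible at a glance.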
Current priority queue:
- Finish quantized ref model + DAPO filter fusion
- Add InternVL2 and Phi-3-Vision processor support
- Publish wheel to PyPI
When in doubt: make it faster and lower-VRAM than verl on VLMs. Ship numbers, not promises.