Skip to content

feat: Add VERL-based GRPO training implementation#43

Open
yxjiang wants to merge 4 commits intomainfrom
example-verl
Open

feat: Add VERL-based GRPO training implementation#43
yxjiang wants to merge 4 commits intomainfrom
example-verl

Conversation

@yxjiang
Copy link
Member

@yxjiang yxjiang commented Sep 16, 2025

  • Add reasoning_grpo_verl.py with VERL GRPOTrainer implementation
  • Unified reward function combining format, thinking quality, and answer correctness
  • VERL-specific GRPO configuration with actor_rollout parameters
  • Built-in validation and error handling for VERL availability
  • Support for LoRA fine-tuning and wandb logging
  • Comprehensive documentation in README_VERL.md
  • Compatible with existing TRL implementation for comparison

Usage:
python reasoning_grpo_verl.py --model-size 3B --use-lora

- Add reasoning_grpo_verl.py with VERL GRPOTrainer implementation
- Unified reward function combining format, thinking quality, and answer correctness
- VERL-specific GRPO configuration with actor_rollout parameters
- Built-in validation and error handling for VERL availability
- Support for LoRA fine-tuning and wandb logging
- Comprehensive documentation in README_VERL.md
- Compatible with existing TRL implementation for comparison

Usage:
  python reasoning_grpo_verl.py --model-size 3B --use-lora
- Replace GRPOConfig/GRPOTrainer with HybridEngineEntrypointConfig
- Use AlgoConfig with adv_estimator='grpo' for GRPO functionality
- Fix OmegaConf serialization issues with reward functions and datasets
- Update workspace directory to avoid read-only filesystem issues
- Script now runs successfully with VERL 0.5.0

Resolves import errors on remote GPU VM where GRPOConfig was not available.
- Fix training completion issue: Uncomment actual training execution
- Add robust VERL API compatibility with fallback imports
- Implement smart workspace path detection for remote VMs
- Add write permission testing for workspace directories
- Enhance error handling with detailed debugging information
- Integrate reward function and dataset properly with VERL trainer
- Add graceful degradation when VERL components unavailable
- Fix workspace paths in both VERL and TRL implementations

Key improvements:
- Training now actually runs instead of just printing success messages
- Automatic detection of /root/workspace and /workspace on remote VMs
- Better error messages for troubleshooting on remote systems
- Fallback suggestions when VERL setup is incomplete
…tasks

- Implement ReasoningRewardManager with unified reward computation
- Add VERL-compatible configuration and worker setup
- Support for FSDP strategy and multi-GPU training
- Include comprehensive documentation with usage examples
- Add LoRA support for memory-efficient fine-tuning
- Integrate wandb logging and debug capabilities
- Provide comparison with TRL implementation

Key features:
- Scalable multi-GPU training with Ray workers
- Advanced resource pool management
- Fault tolerance and checkpointing support
- Detailed reward breakdown (format, thinking, answer correctness)
- Clean architecture following VERL best practices
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant