# Add Direct Preference Optimization (DPO) Support

## Summary

This PR implements Direct Preference Optimization (DPO) for MLX-LM, enabling users to fine-tune language models using human preference data without requiring a separate reward model.
## What is DPO?

DPO is a simpler alternative to RLHF that optimizes the policy directly on preference pairs (chosen vs. rejected responses), avoiding the complexity of training a separate reward model and running PPO. It optimizes the same KL-constrained objective as RLHF in closed form, which in practice makes training more stable and efficient.
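For reference, the core of the method is a single logistic loss over per-pair log-probability ratios between the trainable policy and a frozen reference model; `beta` is the temperature that controls how far the policy may drift from the reference (the same value exposed as `--beta` in the usage example below). A minimal sketch in MLX follows; the function name and signature are illustrative, not necessarily the exact code this PR adds to `mlx_lm/tuner/losses.py`:

```python
import mlx.core as mx


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over a batch of preference pairs.

    Each argument is a 1-D array of summed per-token log-probabilities
    of the chosen/rejected completions under the trainable policy or
    the frozen reference model.
    """
    # Per-response log-ratios of policy vs. reference
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # -log sigmoid(beta * margin); logaddexp(0, -x) == -log sigmoid(x)
    logits = beta * (chosen_logratios - rejected_logratios)
    return mx.mean(mx.logaddexp(0.0, -logits))
```

Smaller `beta` values let the policy move further from the reference model; larger values keep it closer.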
## Key Features Added

### Core Implementation

### Integration

- `python -m mlx_lm dpo` with full argument parsing

### Data Format Support
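A typical preference record pairs one prompt with a preferred and a dispreferred completion. As a rough sketch (the exact field names this PR accepts may differ; `prompt`/`chosen`/`rejected` are assumed here), one line of a JSONL training file could be produced like this:

```python
import json
import os

# Hypothetical preference-pair record: one prompt with a preferred
# ("chosen") and a dispreferred ("rejected") completion.
example = {
    "prompt": "Explain what a hash map is in one sentence.",
    "chosen": "A hash map stores key-value pairs and uses a hash "
              "function to look up values by key in roughly constant time.",
    "rejected": "It is a kind of list.",
}

# Preference data is commonly stored one JSON object per line (JSONL),
# matching the --data preference_data/ layout used in the example below.
os.makedirs("preference_data", exist_ok=True)
with open("preference_data/train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```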
### Documentation & Testing

- `mlx_lm/DPO.md` with usage examples and best practices

## Usage Example
```bash
# Basic DPO training
mlx_lm.dpo \
  --model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
  --train \
  --data preference_data/ \
  --beta 0.1 \
  --fine-tune-type lora
```
## Files Added/Modified

- `mlx_lm/dpo.py` - Main DPO module
- `mlx_lm/tuner/losses.py` - DPO loss implementation
- `mlx_lm/tuner/datasets.py` - PreferenceDataset class
- `mlx_lm/tuner/trainer.py` - DPO training functions
- `mlx_lm/__main__.py` - CLI registration
- `mlx_lm/DPO.md` - Documentation
- `tests/test_dpo.py` - DPO-specific tests
- `tests/test_losses.py` - Enhanced with DPO loss tests

## Testing
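As a sketch of the kind of unit check the DPO loss tests can include (the `dpo_loss` helper below restates the earlier sketch and is not the exact function in `mlx_lm/tuner/losses.py`): when the policy and reference assign identical log-probabilities, the implicit reward margin is zero and the loss should equal log 2.

```python
import math
import unittest

import mlx.core as mx


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Same DPO objective as the sketch above, repeated so this test runs standalone.
    logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return mx.mean(mx.logaddexp(0.0, -logits))


class TestDPOLoss(unittest.TestCase):
    def test_loss_is_log2_when_policy_matches_reference(self):
        # With policy == reference, every log-ratio cancels and the
        # loss reduces to -log sigmoid(0) = log 2 for every pair.
        logps = mx.array([-5.0, -3.2, -7.1])
        loss = dpo_loss(logps, logps, logps, logps, beta=0.1)
        self.assertAlmostEqual(loss.item(), math.log(2.0), places=5)


if __name__ == "__main__":
    unittest.main()
```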
## Benefits for MLX-LM Users

This implementation enables MLX-LM users to easily train more helpful, harmless, and honest models using preference data.