Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) introduce additional difficulty: differing preferences can complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. To enhance training robustness, RL has adopted techniques from supervised learning, such as ensembles and layer normalization. In this work, we improve the stability of RL training by adapting reverse cross entropy (RCE), developed in supervised learning for noisy data, to define a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments on discrete action tasks (Atari games) and continuous action space tasks (the MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO), with and without added noise; SPPO shows especially notable performance across different hyperparameters. Furthermore, we validate the benefits of the symmetric RL loss for large language models by using SPPO in RLHF tasks, with improved performance on IMDB positive sentiment and TL;DR summarization.
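For intuition, reverse cross entropy swaps the roles of the prediction and label distributions, clipping log 0 on the one-hot labels to a finite constant. Below is a minimal supervised-learning sketch of a symmetric (CE + RCE) loss; it is only an illustration of the underlying idea, not the paper's RL formulation, and the `alpha`, `beta`, and `log_zero` parameters are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def symmetric_cross_entropy(logits, targets, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Sketch of symmetric CE = alpha * CE + beta * RCE (supervised setting).

    RCE swaps prediction and label distributions; log(0) on the one-hot
    labels is clipped to the finite constant `log_zero`.
    """
    probs = softmax(logits)
    n = logits.shape[0]
    # standard cross entropy: -log p(target)
    ce = -np.log(probs[np.arange(n), targets] + 1e-12).mean()
    # reverse cross entropy: labels inside the log, clipped at log_zero
    one_hot = np.eye(logits.shape[-1])[targets]
    log_labels = np.clip(np.log(one_hot + 1e-12), log_zero, 0.0)
    rce = -(probs * log_labels).sum(axis=-1).mean()
    return alpha * ce + beta * rce
```

Because the RCE term weights the (clipped) label log-probabilities by the model's own predictions, it is more tolerant of noisy targets than plain cross entropy.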
We implement our method based on TRIL. This repository covers IMDB positive sentiment analysis and TL;DR summarization; we add advantage normalization and the symmetric RL loss on top of TRIL. For Atari games, the MuJoCo benchmark, and Box2D, please refer here.
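The advantage normalization we add on top of TRIL is, in essence, standard batch standardization of the advantage estimates before the policy update. A minimal sketch (the in-repo implementation may differ in details such as the epsilon value):

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Standardize a batch of advantage estimates to zero mean, unit std.

    `eps` guards against division by zero for a constant batch; its exact
    value here is an illustrative assumption.
    """
    adv = np.asarray(advantages, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)
```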
Note that we use accelerate==0.27.2, which differs from the original code, to resolve an error.
conda create -n tril python=3.10
conda activate tril
pip install -e .
To run SPPO for IMDB positive sentiment:
./examples/imdb/imdb_sppo.sh
To run PPO for IMDB positive sentiment:
./examples/imdb/imdb_ppo.sh
To run SPPO for TL;DR summarization:
./examples/tldr/tldr_sppo.sh
To run PPO for TL;DR summarization:
./examples/tldr/tldr_ppo.sh
We follow TRIL, which evaluates model perplexity after training, and we provide the evaluation script below. For the perplexity metric, you need to comment lines in and out in cfgs/task/tldr.yaml (please see the script).
./examples/tldr/tldr_eval.sh
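For reference, perplexity is just the exponentiated mean per-token negative log-likelihood. A minimal sketch of the metric itself (not TRIL's actual evaluation code):

```python
import math

def perplexity_from_nll(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood).

    `token_nlls` is a list of per-token NLL values (natural log);
    lower perplexity means the model assigns higher probability
    to the reference text.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))
```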