Main roadmap
- Create toy environments
  - Create new toy environments (@timorl's repo)
  - Clean up toy environments for use with the Gym API
  - Add toy environments as a dependency (#38)
  - Debug toy environments ("There's something wrong with the Toy Gridworlds...", david-lindner/safe-grid-gym#15)
- Refactor for use with the Gym API (#32); see the wrapper sketch after this list
  - Modify ai_safety_gridworlds_gym to fit our needs (@david-lindner's fork)
  - Improve dependency management (#31)
  - Switch all code referencing envs to use the Gym env
- Improve tooling for hyperparameter tuning (e.g. Ray; see the tuning sketch after this list)
- Estimate compute costs and finalize logistics
  - First guess at an upper bound: 1 agent x 4 environments x 3 experiments = 12 sets of hyperparameters to tune; at ~30 training runs per set, that's 360 runs, and at ~2 hours each, roughly 720 compute-hours
- Do experiments (start January 11)
  - Check if hyperparameters tuned on Solver generalize to Cheater (and vice versa, though that is less important/rigorous)
- Investigate corrupt versions of harder environments
  - Maybe a bigger / more realistic boat race
  - Maybe a modified Atari env
  - Maybe a modified MuJoCo env
  - Maybe a modified BipedalWalker env
- Finish experiments by February 15
- Deadline: February 22
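For the Gym refactor, a minimal sketch of what a toy environment could look like behind the classic Gym API (`reset()` returning an observation, `step()` returning the 4-tuple). The gridworld below is a stand-in for illustration, not the actual environments from @timorl's repo; the name `ToyGridworldEnv` and the `info["agent_pos"]` field are assumptions.

```python
import numpy as np
import gym
from gym import spaces


class ToyGridworldEnv(gym.Env):
    """Stand-in toy gridworld behind the classic Gym API (sketch)."""

    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=5):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.pos = (0, 0)
        self.action_space = spaces.Discrete(len(self.ACTIONS))
        self.observation_space = spaces.Box(
            0.0, 1.0, shape=(size, size), dtype=np.float32)

    def _obs(self):
        # One-hot grid marking the agent's position.
        grid = np.zeros((self.size, self.size), dtype=np.float32)
        grid[self.pos] = 1.0
        return grid

    def reset(self):
        self.pos = (0, 0)
        return self._obs()

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0
        # "agent_pos" in info is an assumed convention, used by the
        # corruption wrapper sketched further down.
        return self._obs(), reward, done, {"agent_pos": self.pos}
```

Once everything is behind `gym.Env`, the "switch all code referencing envs" item reduces to constructing this class (or registering it in Gym's registry and calling `gym.make`) wherever the old env objects were used.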
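And for the tuning tooling, a sketch of a Ray Tune grid search using its function-trainable API (`tune.run` / `tune.report`). The search space and the fake reward curve are placeholders, not settled hyperparameter choices.

```python
import random

from ray import tune


def train_agent(config):
    # Placeholder for a real PPO training loop: fakes a score that
    # peaks around lr = 3e-4 so the search has something to find.
    for _ in range(10):
        mean_reward = -abs(config["lr"] - 3e-4) * 1e4 + random.random()
        tune.report(mean_reward=mean_reward)


analysis = tune.run(
    train_agent,
    config={
        "lr": tune.grid_search([1e-4, 3e-4, 1e-3]),
        "entropy_coef": tune.grid_search([0.0, 0.01]),
    },
)
print(analysis.get_best_config(metric="mean_reward", mode="max"))
```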
Environments:
- TomatoWateringCRMDP
- TransitionBoatRaceCRMDP
- Toy environments
  - corrupt corners (satisfies our assumptions for guaranteed learnability)
  - corrupt path to goal (does not satisfy assumptions for guaranteed learnability)
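The two toy variants above could be prototyped as a reward-corruption wrapper over any Gym env: the agent trains on a corrupted reward in designated states, while the true reward travels in `info` for evaluation only. A sketch, assuming the env reports its position under `info["agent_pos"]` as in the gridworld stand-in earlier; the set of corrupt states and the corruption value are free parameters.

```python
import gym


class CorruptRewardWrapper(gym.Wrapper):
    """CRMDP-style sketch: the observed reward is overwritten in
    designated corrupt states; the true reward is kept in `info`
    for evaluation only."""

    def __init__(self, env, corrupt_states, corrupt_reward=1.0):
        super().__init__(env)
        self.corrupt_states = set(corrupt_states)
        self.corrupt_reward = corrupt_reward

    def step(self, action):
        obs, true_reward, done, info = self.env.step(action)
        info["hidden_reward"] = true_reward  # never shown to the agent
        corrupted = info.get("agent_pos") in self.corrupt_states
        observed = self.corrupt_reward if corrupted else true_reward
        return obs, observed, done, info


# e.g. "corrupt corners": only corner cells misreport the reward
# env = CorruptRewardWrapper(ToyGridworldEnv(5), corrupt_states={(0, 4), (4, 0)})
```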
Experiments per env
- Baseline (learns corrupt reward)
- Cheater (learns with access to true reward)
- Solver (learns intended behavior from corrupt reward)
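Read this way, the three conditions differ only in which reward signal the learner is allowed to see. A sketch, assuming the wrapper above stashes the true reward under `info["hidden_reward"]`:

```python
def training_reward(experiment, observed_reward, info):
    """Pick the reward the agent trains on (sketch).

    Baseline: trains on the observed (possibly corrupt) reward.
    Cheater:  trains on the true reward, as an upper bound.
    Solver:   also sees only the corrupt reward, but is judged on
              whether it recovers the intended behavior.
    """
    if experiment == "cheater":
        return info["hidden_reward"]
    return observed_reward  # baseline and solver
```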
Optional
- Generalize PPO (#17)
- Improve test coverage (Test cases #29)