Pixels-Based Sim2Real Demo for Aloha Peg Insertion #76
Status: Open. Andrew-Luo1 wants to merge 20 commits into google-deepmind:main from Andrew-Luo1:main.
Commits (20), all by Andrew-Luo1:

- 60cc7df aloha sim to real code first pass
- d390cf8 Update README.md
- bf98c98 code formatting
- e7ebcd5 Update README.md
- 6782db2 Update README.md
- 4dd6616 pass code quality checks
- 7bfd533 update brax dagger api
- 2cce751 clean up domain randomization
- b4ae622 clean up train_dagger.py
- bc3bfb0 Update README.md
- c534d57 Revert README.md
- ccace15 update visionmlp to use orbax checkpointing, update for compat with n…
- 25d8db7 merge the s2r into the main aloha folder
- ec030db Merge branch 'main' of https://github.com/Andrew-Luo1/mujoco_playgrou…
- 6c31bc2 add frozen encoder orbax checkpoint
- aea048c Update README.md
- 9b3d755 remove unnecessary helper function, rename files
- d40eb3f remove learning/train_jax_ppo.py dependency on brax.io.model
- fb34541 everything working with orbax checkpointing
- 179d980 use updated bc checkpoint API
### Quickstart

**Pre-requisites**

- *Handover, Pick, Peg Insertion:* The standard Playground setup
- *Behaviour Cloning for Peg Insertion:* Madrona MJX
- *JAX-to-ONNX Conversion:* ONNX, TensorFlow, tf2onnx
```bash
# Train Aloha Handover. Documentation at https://github.com/google-deepmind/mujoco_playground/pull/29
python learning/train_jax_ppo.py --env_name AlohaHandOver
```

```bash
# Plots for pick and peg-insertion at https://github.com/google-deepmind/mujoco_playground/pull/76
cd <PATH_TO_YOUR_CLONE>
export PARAMS_PATH=mujoco_playground/_src/manipulation/aloha/params

# Train a single arm to pick up a cube.
python learning/train_jax_ppo.py --env_name AlohaPick --domain_randomization --norender_final_policy --save_params_path $PARAMS_PATH/AlohaPick.prms
sleep 0.5

# Train a bi-arm to insert a peg into a socket. Requires the above policy.
python learning/train_jax_ppo.py --env_name AlohaPegInsertion --save_params_path $PARAMS_PATH/AlohaPegInsertion.prms
sleep 0.5

# Train a student policy to insert a peg into a socket using *pixel inputs*. Requires the above policy.
python mujoco_playground/experimental/bc_peg_insertion.py --domain-randomization --num-evals 0 --print-loss

# Convert checkpoints from the above run to ONNX for easy robot deployment.
# ONNX policies are written to `experimental/jax2onnx/onnx_policies`.
python mujoco_playground/experimental/jax2onnx/aloha_nets_to_onnx.py --checkpoint_path <YOUR_DISTILL_CHECKPOINT_DIR>
```
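A typical JAX-to-ONNX route goes through TensorFlow (`jax2tf`, then `tf2onnx`), which matches the prerequisites listed above. The sketch below illustrates that route under that assumption; `policy_fn` and its shapes are invented placeholders, and the actual conversion lives in `aloha_nets_to_onnx.py`.

```python
import jax.numpy as jnp
import tensorflow as tf
import tf2onnx
from jax.experimental import jax2tf

def policy_fn(obs):
  # Placeholder policy standing in for the distilled student network.
  return jnp.tanh(obs @ jnp.ones((64, 14)))

spec = [tf.TensorSpec([1, 64], tf.float32)]
# enable_xla=False keeps the TF graph free of XLA ops that tf2onnx cannot handle.
tf_policy = tf.function(jax2tf.convert(policy_fn, enable_xla=False), input_signature=spec)
tf2onnx.convert.from_function(tf_policy, input_signature=spec, opset=17, output_path="policy.onnx")
```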
### Sim-to-Real Transfer of a Bi-Arm RL Policy via Pixel-Based Behaviour Cloning

https://github.com/user-attachments/assets/205fe8b9-1773-4715-8025-5de13490d0da

---
**Distillation**

In this module, we demonstrate policy distillation: a straightforward method for deploying a simulation-trained reinforcement learning policy that initially relies on privileged state observations (such as object positions). The process involves two steps, and a minimal sketch of one training update follows the list:

1. **Teacher Policy Training:** A state-based teacher policy is trained using RL.
2. **Student Policy Distillation:** The teacher is then distilled into a student policy via behaviour cloning (BC), where the student learns to map its observations $o_s(x)$ (e.g., exteroceptive RGBD images) to the teacher's deterministic actions $\pi_t(o_t(x))$. For example, while both policies observe joint angles, the student uses RGBD images, whereas the teacher directly accesses (noisy) object positions.
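To make step 2 concrete, here is a minimal sketch of one BC update in JAX/Optax. The network, observation keys, and shapes are illustrative assumptions; the repo's actual implementation lives in `bc_peg_insertion.py`.

```python
import jax
import jax.numpy as jnp
import optax

def student_apply(params, obs):
  # Placeholder two-layer policy; the real student is a CNN over RGBD images.
  hidden = jnp.tanh(obs @ params["w1"] + params["b1"])
  return jnp.tanh(hidden @ params["w2"] + params["b2"])

def bc_loss(params, batch):
  # L2 distance between the student's action and the teacher's
  # deterministic action, averaged over the batch.
  pred = student_apply(params, batch["student_obs"])
  return jnp.mean(jnp.sum((pred - batch["teacher_action"]) ** 2, axis=-1))

optimizer = optax.adam(3e-4)

@jax.jit
def bc_step(params, opt_state, batch):
  loss, grads = jax.value_and_grad(bc_loss)(params, batch)
  updates, opt_state = optimizer.update(grads, opt_state)
  params = optax.apply_updates(params, updates)
  return params, opt_state, loss
```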
The distillation process, in which the student uses left and right wrist-mounted RGBD cameras for exteroception, takes about **3 minutes** on an RTX 4090. This rapid turnaround is due to three factors:

1. [Very fast rendering](https://github.com/google-deepmind/mujoco_playground/blob/main/mujoco_playground/experimental/madrona_benchmarking/figures/cartpole_benchmark_full.png) provided by Madrona MJX.
2. The sample efficiency of behaviour cloning.
3. The use of low-resolution (32×32) rendering, which is sufficient for precise alignment given the wrist camera placement.

For further details on the teacher policy and RGBD sim-to-real techniques, please refer to the [technical report](https://docs.google.com/presentation/d/1v50Vg-SJdy5HV5JmPHALSwph9mcVI2RSPRdrxYR3Bkg/edit?usp=sharing).
---

**A Note on Sample Efficiency**

Behaviour cloning (BC) can be orders of magnitude more sample-efficient than reinforcement learning. In our approach, we use an L2 loss defined as:

$\lVert \pi_s(o_s(x)) - \pi_t(o_t(x)) \rVert_2$
In contrast, the policy gradient in RL generally takes the form:

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]$
Two key observations highlight why BC's direct supervision is more efficient (see the gradient expansion below):

- **Explicit Loss Signal:** The BC loss compares against the teacher action, giving explicit feedback on how the action should be adjusted. In contrast, the policy gradient only provides directional guidance, instructing the optimizer to increase or decrease an action's likelihood based solely on its downstream rewards.
- **Per-Dimension Supervision:** While the policy gradient applies a uniform weighting across all action dimensions, BC supplies per-dimension information, making it easier to scale to high-dimensional action spaces.
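Both points are visible if we differentiate the (squared) BC loss above; this expansion is standard chain rule, not taken from the report. Each action dimension $i$ receives its own error term, whereas every term of the policy gradient is weighted by the same scalar return $R(\tau)$:

$$\nabla_\theta \tfrac{1}{2} \lVert \pi_s(o_s(x)) - \pi_t(o_t(x)) \rVert_2^2 = \sum_i \big( \pi_{s,i}(o_s(x)) - \pi_{t,i}(o_t(x)) \big) \, \nabla_\theta \pi_{s,i}(o_s(x))$$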
---

**Frozen Encoders**

*VisionMLP2ChanCIFAR10_OCP* is an Orbax checkpoint of [NatureCNN](https://github.com/google/brax/blob/241f9bc5bbd003f9cfc9ded7613388e2fe125af6/brax/training/networks.py#L153) (AtariCNN) pre-trained on CIFAR10 to over 70% classification accuracy. We omit the supervised training code; see [this tutorial](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/JAX/tutorial5/Inception_ResNet_DenseNet.html) for reference.
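For illustration, restoring such a checkpoint and freezing it during BC might look like the sketch below. The checkpoint path and the `encoder_apply`/`head_apply` callables are hypothetical placeholders, not the repo's actual interfaces.

```python
import jax
import jax.numpy as jnp
import orbax.checkpoint as ocp

# Hypothetical path to the pre-trained encoder checkpoint.
encoder_params = ocp.PyTreeCheckpointer().restore("VisionMLP2ChanCIFAR10_OCP")

def student_forward(head_params, pixels, proprio, encoder_apply, head_apply):
  # stop_gradient freezes the CIFAR10-pretrained features, so only the
  # policy head receives behaviour-cloning gradients.
  feats = jax.lax.stop_gradient(encoder_apply(encoder_params, pixels))
  return head_apply(head_params, jnp.concatenate([feats, proprio], axis=-1))
```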
---

**Aloha Deployment Setup**

For deployment, the ONNX policy is executed on the Aloha robot using a custom fork of [OpenPI](https://github.com/Physical-Intelligence/openpi) along with the Interbotix Aloha ROS packages. Thanks to Kevin Zakka, Laura Smith, and the Levine Lab for the robot deployment setup!
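On the robot side, running an exported policy reduces to a standard ONNX Runtime call. The file name and observation layout below are assumptions for illustration only.

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("onnx_policies/AlohaPegInsertion.onnx")
input_name = session.get_inputs()[0].name
# Assumed layout: stacked 32x32 RGBD from both wrist cameras.
obs = np.zeros((1, 32, 32, 8), dtype=np.float32)
action = session.run(None, {input_name: obs})[0]
```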
The accompanying change to `get_assets` registers the new `s2r` XML assets:

```diff
@@ -37,6 +37,9 @@ def get_assets() -> Dict[str, bytes]:
   path = mjx_env.ROOT_PATH / "manipulation" / "aloha" / "xmls"
   mjx_env.update_assets(assets, path, "*.xml")
   mjx_env.update_assets(assets, path / "assets")
+  path = mjx_env.ROOT_PATH / "manipulation" / "aloha" / "xmls" / "s2r"
+  mjx_env.update_assets(assets, path, "*.xml")
+  mjx_env.update_assets(assets, path / "assets")
   return assets
```
**Review comment:**
@Andrew-Luo1 would really like to not use the pkl stuff, is this absolutely necessary?