Add MVP from Berkeley
patrick-llgc committed Mar 8, 2024
1 parent 781ca33 commit dd45597
Showing 2 changed files with 35 additions and 3 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -34,7 +34,7 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
- [Multimodal Regression](https://towardsdatascience.com/anchors-and-multi-bin-loss-for-multi-modal-target-regression-647ea1974617)
- [Paper Reading in 2019](https://towardsdatascience.com/the-200-deep-learning-papers-i-read-in-2019-7fb7034f05f7?source=friends_link&sk=7628c5be39f876b2c05e43c13d0b48a3)

-## 2024-03 (8)
+## 2024-03 (11)
- [Genie: Generative Interactive Environments](https://arxiv.org/abs/2402.15391) [[Notes](paper_notes/genie.md)] [DeepMind, World Model]
- [DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving](https://arxiv.org/abs/2309.09777) [[Notes](paper_notes/drive_dreamer.md)] [Jiwen Lu, World Model]
- [WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens](https://arxiv.org/abs/2401.09985) [[Notes](paper_notes/world_dreamer.md)] [Jiwen Lu, World Model]
@@ -43,8 +43,8 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
- [Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos](https://arxiv.org/abs/2206.11795) <kbd>NeurIPS 2022</kbd> [[Notes](paper_notes/vpt.md)] [OpenAI]
- [MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge](https://arxiv.org/abs/2206.08853) <kbd>NeurIPS 2022</kbd> [NVidia, Outstanding paper award]
- [Humanoid Locomotion as Next Token Prediction](https://arxiv.org/abs/2402.19469) [[Notes](paper_notes/locomotion_next_token_pred.md)] [Berkeley, EAI]
-- [RPT: Robot Learning with Sensorimotor Pre-training](https://arxiv.org/abs/2306.10007) [[Notes](paper_notes/rpt.md)] <kbd>CoRL 2023 Oral</kbd>
-- [MVP: Real-World Robot Learning with Masked Visual Pre-training](https://arxiv.org/abs/2210.03109) [[Notes](paper_notes/mvp.md)] <kbd>CoRL 2022</kbd>
+- [RPT: Robot Learning with Sensorimotor Pre-training](https://arxiv.org/abs/2306.10007) [[Notes](paper_notes/rpt.md)] <kbd>CoRL 2023 Oral</kbd> [Berkeley, EAI]
+- [MVP: Real-World Robot Learning with Masked Visual Pre-training](https://arxiv.org/abs/2210.03109) [[Notes](paper_notes/mvp.md)] <kbd>CoRL 2022</kbd> [Berkeley, EAI]
- [BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning](https://arxiv.org/abs/2202.02005) [[Notes](paper_notes/bc_z.md)] <kbd>CoRL 2021</kbd> [Eric Jang, 1X]
- [HG-DAgger: Interactive Imitation Learning with Human Experts](https://arxiv.org/abs/1810.02890) [DAgger]
- [DriveGAN: Towards a Controllable High-Quality Neural Simulation](https://arxiv.org/abs/2104.15060) [[Notes](paper_notes/drive_gan.md)] <kbd>CVPR 2021 oral</kbd> [Nvidia, Sanja]
32 changes: 32 additions & 0 deletions paper_notes/mvp.md
@@ -0,0 +1,32 @@
# [MVP: Real-World Robot Learning with Masked Visual Pre-training](https://arxiv.org/abs/2210.03109)

_March 2024_

tl;dr: Large-scale visual pre-training helps robotic learning tasks by increasing sample efficiency: better performance is achieved with fewer learning samples.

#### Overall impression
We can approach representation learning for robotics from two ends: shared representations on the perception side or shared representations on the action side. The focus of MVP is on shared visual representations.

Visual pretraining is done via a masked autoencoder (MAE); the encoder is then frozen and its features are passed into a learnable control module. A control policy is trained per task, on top of the same frozen encoder, for all downstream robotic tasks and embodiments.
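
A minimal sketch of this frozen-encoder setup (my own illustration, not the authors' code), assuming a pretrained ViT module `vit_encoder` that returns a pooled `(B, embed_dim)` feature, and with hypothetical proprioception and action dimensions:

```python
import torch
import torch.nn as nn


class FrozenEncoderPolicy(nn.Module):
    """Per-task control policy on top of a frozen, MAE-pretrained visual encoder."""

    def __init__(self, vit_encoder, embed_dim=768, proprio_dim=7, action_dim=7):
        super().__init__()
        self.encoder = vit_encoder
        for p in self.encoder.parameters():  # freeze the pretrained encoder
            p.requires_grad = False
        # small learnable head: visual feature + proprioception -> action
        self.head = nn.Sequential(
            nn.Linear(embed_dim + proprio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, proprio):
        with torch.no_grad():  # gradients never reach the frozen encoder
            feat = self.encoder(image)  # assumed to return a (B, embed_dim) pooled feature
        return self.head(torch.cat([feat, proprio], dim=-1))
```

Since the encoder is shared and frozen, only the small head needs to be trained for each new task, which is where the sample efficiency comes from.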

Note that this pretraining is vision-only and is NOT multimodal. See the follow-up work [RPT](rpt.md), which extends this to multimodal pretraining and reuses the frozen vision pretraining from [MVP](mvp.md).

[MVP](mvp.md) generates vision tokens and is essentially one type of continuous vision tokenizer, in contrast with discrete vision tokenizers such as VQ-VAE or MAGVIT-v2.
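
To make the contrast concrete, a toy comparison under assumed shapes (not from either paper): a continuous tokenizer like MVP emits the patch embeddings as-is, while a discrete tokenizer in the VQ-VAE family snaps each embedding to its nearest codebook entry and emits the integer index:

```python
import torch


def continuous_tokens(patch_embeddings):
    # MVP-style: the ViT patch embeddings themselves act as (continuous) vision tokens
    return patch_embeddings  # (B, num_patches, embed_dim)


def discrete_tokens(patch_embeddings, codebook):
    # VQ-VAE-style: nearest-neighbor lookup into a finite codebook -> integer token ids
    dists = torch.cdist(patch_embeddings, codebook.unsqueeze(0))  # (B, num_patches, codebook_size)
    return dists.argmin(dim=-1)  # (B, num_patches) integer ids
```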

#### Key ideas
- Pretraining works under 3 conditions:
- Diverse, real-world data.
- Self-supervision loss.
- Scalable architecture with transformers.
- Data source
- Internet and egocentric videos
- 4.5 million images, sufficiently large and diverse
- Model arch is heavily based on MAE (see the sketch after this list).
- MAE pretraining is better than both CLIP pretraining and training tabula rasa (from scratch).
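
A toy sketch of the MAE-style pretraining step mentioned above (assumed `encoder`, `decoder`, and `patchify` callables; not the released code): mask a large fraction of patches, encode only the visible ones, and regress the masked pixels:

```python
import torch


def mae_pretrain_step(encoder, decoder, patchify, images, mask_ratio=0.75):
    """One MAE-style training step: reconstruct masked patches from the visible ones."""
    patches = patchify(images)  # (B, N, patch_dim) flattened image patches
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    # random permutation of patch indices per image; keep only a small visible subset
    ids = torch.argsort(torch.rand(B, N, device=patches.device), dim=1)
    keep_ids, mask_ids = ids[:, :num_keep], ids[:, num_keep:]
    visible = torch.gather(patches, 1, keep_ids.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible, keep_ids)  # encode only the visible patches
    pred = decoder(latent, mask_ids)     # lightweight decoder predicts the masked patches
    target = torch.gather(patches, 1, mask_ids.unsqueeze(-1).expand(-1, -1, D))
    return ((pred - target) ** 2).mean()  # pixel-space reconstruction loss
```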


#### Technical details
- High sample efficiency == low sample complexity: fewer samples are needed to learn a task.

#### Notes
- [Masked Visual Pre-training for Motor Control](https://arxiv.org/abs/2203.06173), the foundation of this MVP work, was most likely not accepted at a conference.
