<html>
<head>
<title>DMotion: Robotic Visuomotor Control with Unsupervised Forward Model Learned from Videos</title>
<link rel="SHORTCUT ICON" href="favicon.ico"/>
<link href='css/paperstyle.css' rel='stylesheet' type='text/css'>
</head>
<body>
<div class="pageTitle">
DMotion: Robotic Visuomotor Control with Unsupervised Forward Model Learned from Videos
<br>
<br>
<span class = "Authors">
<a href="https://YHQpkueecs.github.io/" target="_blank">Haoqi Yuan<sup>1*</sup>
<a href="https://www.researchgate.net/profile/Yan-Zhao-182" target="_blank">Yan Zhao<sup>1*</sup>
<a href="https://www.cs.stanford.edu/~kaichun/" target="_blank">Kaichun Mo</a><sup>2*</sup>
<a href="https://guozz.cn/" target="_blank">Zizheng Guo</a><sup>1</sup>
<a href="https://galaxy-qazzz.github.io/" target="_blank">Yian Wang</a><sup>1</sup> <br>
<a href="https://moistchi.github.io/tianhaowu.github.io/" target="_blank">Tianhao Wu</a><sup>2</sup>
<a href="https://fqnchina.github.io/" target="_blank">Qingnan Fan</a><sup>3</sup>
<a href="https://xuelin-chen.github.io/" target="_blank">Xuelin Chen</a><sup>1</sup>
<a href="https://geometry.stanford.edu/member/guibas/" target="_blank">Leonidas J. Guibas</a><sup>2</sup>
<a href="https://zsdonghao.github.io/" target="_blank">Hao Dong</a><sup>1</sup> <br>
<i>(* indicates joint first authors)</i><br><br>
<sup>1</sup><a href = "http://english.pku.edu.cn/" target="_blank"> Peking University </a>
<sup>2</sup><a href = "http://www.stanford.edu/" target="_blank"> Stanford University </a>
<sup>3</sup><a href = "https://ai.tencent.com/ailab/zh/index" target="_blank"> Tencent AI Lab </a> <br><br>
</span>
</div>
<br>
<div class = "material">
<a href="https://arxiv.org/abs/2103.04301" target="_blank">[ArXiv Preprint]</a>
<a href="paper.bib" target="_blank">[BibTex]</a>
</div>
<div class = "abstractTitle">
Abstract
</div>
<p class = "abstractText">
Learning an accurate model of the environment is essential for model-based control tasks. Existing methods in robotic visuomotor control usually learn from data heavily labelled with actions, object entities, or locations, which can be demanding to obtain in many cases. To cope with this limitation, we propose a method, dubbed DMotion, that trains a forward model from video data only, by disentangling the motion of the controllable agent to model the transition dynamics. An object extractor and an interaction learner are trained in an end-to-end manner without supervision. The agent's motions are explicitly represented using spatial transformation matrices that carry physical meaning. In the experiments, DMotion achieves superior performance in learning an accurate forward model in a Grid World environment, as well as in a more realistic simulated robot control environment. With the accurate learned forward models, we further demonstrate their use in model predictive control as an effective approach for robotic manipulation.
</p>
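<p class="abstractText">
As a rough illustration of the last point, the Python sketch below runs a simple random-shooting form of model predictive control on top of a learned forward model. It is a minimal sketch under stated assumptions: <code>forward_model(obs, action)</code> and <code>cost(obs, goal)</code> are hypothetical placeholders standing in for the learned dynamics and a task cost, not the released DMotion code or its interfaces.
</p>
<pre>
# Minimal sketch: random-shooting MPC with a learned forward model.
# "forward_model" and "cost" are hypothetical placeholders, not DMotion APIs.
import numpy as np

def plan_action(forward_model, cost, obs, goal, action_dim,
                horizon=5, num_samples=256):
    """Sample random action sequences, roll them out with the learned model,
    and return the first action of the lowest-cost sequence."""
    best_cost, best_action = float("inf"), None
    for _ in range(num_samples):
        seq = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        pred, total = obs, 0.0
        for a in seq:
            pred = forward_model(pred, a)   # predicted next observation
            total += cost(pred, goal)       # accumulated task cost
        if best_cost > total:
            best_cost, best_action = total, seq[0]
    return best_action
</pre>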
<div class="abstractTitle">
Video Presentation
</div>
<center>
<iframe width="660" height="415" src="https://player.bilibili.com/player.html?aid=432089228&bvid=BV1sG411A7no&cid=873856845&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>
</center>
<br>
<br>
<br>
<div class="abstractTitle">
Motivation
</div>
<p class="abstractText">
Existing research in cognitive science demonstrates infants' capability to understand the physical world and make predictions through unsupervised visual observation. By observing moving objects in the world, infants acquire self-awareness and build internal physical models of the world. Such models help humans to predict the outcomes of physical events and to control tools to interact with the environment.
</p>
<p class="abstractText">
Consider the example in the figure: a player is showing someone how to play a video game. To learn the game, the observer builds an internal predictive model of it, learning how to control the agent and how the actions taken affect the game state. However, the observer does not need to watch the player's hands on the keyboard all the time. Most of the time, they simply watch the video on the screen and observe the interactions in the game; the model of the game is built mostly from visual observations. To learn how to control the agent, a few additional glances at the keyboard are enough.
</p>
<p class="abstractText">
This example inspires us to learn unsupervised world models from videos. Note that common approaches to learning world models usually require a large amount of action-labelled video from the environment for supervised training. We propose DMotion, an unsupervised learning method that is less data demanding and more intuitive and human-like. As shown in the figure below, in a world with several rigid bodies, DMotion first disentangles the agent object and learns the objects' interactions from unlabelled videos. Then, using a few samples labelled with the agent's actions, DMotion learns how each action affects the agent's motion, as sketched below.
</p>
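<p class="abstractText">
To make the second step concrete, the short Python sketch below estimates, from a few action-labelled samples, how each action moves the agent. The data format (action, agent position before, agent position after), the function name, and the mean-displacement estimate are illustrative assumptions: a simplified stand-in for the spatial-transformation representation used in the paper, not the DMotion implementation.
</p>
<pre>
# Minimal sketch, assuming a few samples labelled with the agent's action.
# The (action, position-before, position-after) format is an illustrative
# assumption, not the DMotion data pipeline.
from collections import defaultdict
import numpy as np

def estimate_action_motions(labelled_samples):
    """Return the mean 2D displacement of the agent for each action."""
    displacements = defaultdict(list)
    for action, pos_before, pos_after in labelled_samples:
        displacements[action].append(np.asarray(pos_after) - np.asarray(pos_before))
    return {action: np.mean(deltas, axis=0) for action, deltas in displacements.items()}

# Toy usage: two "right" moves and one "up" move.
samples = [("right", (0, 0), (1, 0)), ("right", (2, 3), (3, 3)),
           ("up", (1, 1), (1, 2))]
print(estimate_action_motions(samples))  # right -> (1, 0), up -> (0, 1)
</pre>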
<div class="abstractTitle">
Tasks
</div>
<img class="bannerImage" src="./images/teaser.png" ,="" width="800"><br>
<table width="800" align="center"><tbody><tr><td><p class="figureTitleText">
Figure 1.
Given an input 3D articulated object (a), we propose a novel perception-interaction handshaking point for robotic manipulation tasks - object-centric actionable visual priors, including per-point visual action affordance predictions (b) indicating where to interact, and diverse trajectory proposals (c) for selected contact points (marked with green dots) suggesting how to interact.
</p></td></tr></tbody></table>
<div class="abstractTitle">
Framework and Network Architecture
</div>
<img class="bannerImage" src="./images/fig2.png" ,="" width="800"><br>
<table width="800" align="center"><tbody><tr><td><p class="figureTitleText">
Figure 2.
Our proposed VAT-MART framework is composed of an RL policy (left) exploring interaction trajectories and a perception system (right) learning the desired actionable visual priors. We build bidirectional supervisory channels between the two parts: 1) the RL policy collects data to supervise the perception system, and 2) the perception system produces curiosity feedback encouraging the RL networks to explore diverse solutions.
</p></td></tr></tbody></table>
<div class="abstractTitle">
Qualitative Results
</div>
<img class="bannerImage" src="./images/fig3.png" ,="" width="800"><br>
<table width="800" align="center"><tbody><tr><td><p class="figureTitleText">
Figure 3.
We show qualitative results of the actionability prediction and trajectory proposal modules. In each result block, from left to right, we present the input shape with the task, the predicted actionability heatmap, and three example trajectory proposals at a selected contact point.
</p></td></tr></tbody></table>
<img class="bannerImage" src="./images/fig4.png" ,="" width="800"><br>
<table width="800" align="center"><tbody><tr><td><p class="figureTitleText">
Figure 4.
We present a qualitative analysis of the learned trajectory scoring module. In each result block, from left to right, we show the input shape with the task, the input trajectory with its close-up view, and the network's predicted success likelihood of applying the trajectory at every point.
</p></td></tr></tbody></table>
<img class="bannerImage" src="./images/fig5.png" ,="" width="800"><br>
<table width="800" align="center"><tbody><tr><td><p class="figureTitleText">
Figure 5.
Left: qualitative analysis of the trajectory scoring prediction (each column shares the same task; every row uses the same trajectory); Middle: promising results testing on real-world data (from left to right: input, affordance prediction, trajectory proposals); Right: real-robot experiment.
</p></td></tr></tbody></table>
<div class="abstractTitle">
Acknowledgements
</div>
<p class="abstractText">
This work was supported by the National Natural Science Foundation of China Youth Science Fund (No. 62006006), a grant from the Toyota Research Institute University 2.0 program, NSF grant IIS-1763268, and a Vannevar Bush Faculty Fellowship.
We would like to thank <a href="https://blog.csdn.net/tritone" target="_blank">Yourong Zhang</a> for setting up the ROS environment and helping with the real-robot experiments.
</p>
<p></p>
</body></html>