-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSeaPearl.html
251 lines (231 loc) · 16.8 KB
/
SeaPearl.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
<!DOCTYPE HTML>
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<!--<base target="_blank">-->
<title>Tom Marty</title>
<meta name="author" content="Tom Marty">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" type="text/css" href="stylesheet.css">
<link rel="icon" type="image/png" href="images/favicon.png">
<script src="scramble.js"></script>
<!-- Place this tag in your head or just before your close body tag. -->
<script async defer src="https://buttons.github.io/buttons.js"></script>
<script src="hidebib.js" type="text/javascript"></script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});
</script>
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
</head>
<body>
<table style="width:100%;max-width:920px;border:0px;border-spacing:0px;border-collapse:separate;margin-right:auto;margin-left:auto;"><tbody>
<tbody><tr><td>
<p style="text-align:center">
<img src="images/Corail_full.png" alt="nav_sign" width="700" height="150">
<br>
<email>
<font id="email" size="+1" style="display:inline;">[email protected]</font>
<script>
emailScramble = new scrambledString(document.getElementById('email'),
'emailScramble', '[email protected]',
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17, 18,19,20]);
</script>
</email>
</p>
<!--<table style="width:100%;max-width:800px;border:0px;border-spacing:0px;border-collapse:separate;margin-right:auto;margin-left:auto;">-->
<table style="width:100%;max-width:800px;border:0px;border-spacing:0px;border-collapse:separate;margin-right:auto;margin-left:auto;">
<tr style="padding:0px">
<td style="padding:0px">
<table style="width:100%;border:0px;border-spacing:0px;border-collapse:separate;margin-right:auto;margin-left:auto;">
<tr style="padding:0px">
<td style="padding:1.5%;width:16%;vertical-align:middle;text-align:center">
<a href="index.html" target="_self"><big>Home</big></a>
</td>
</tr>
</table>
<table style="width:100%;border:0px;border-spacing:0px;border-collapse:separate;margin-right:auto;margin-left:auto;"><tbody>
In a nutshell, SeaPearl is an intelligent generic solver guided by <b class="term">Reinforcement Learning</b> (RL) to solve NP-hard <b class="term">discrete optimization problems</b> using <b class="term">Constraint Programming</b>.
<br>
<br>
At that moment, you can feel a little confused with all these concepts if you've never heard of them before. No worries, let's start from the beginning :).
</tbody></table>
<br>
<table width="100%" border="0" cellspacing="15" cellpadding="10">
<heading><b class="term">What are discrete optimization problems ?</b></heading>
<br>
<br>
<tr>
Discrete optimization problems are problems for which an outcome is to be optimized and for which there is a finite number of solutions. For example, consider the following problem :
<br>
<br>
A deliveryman has to deliver products from a factory to all the customers at varying distances from each other as detailed below (left).In practice, we represent the problem as a graph with values on edges (right). The goal for the deliveryman is to <b class="term">minimise the total distance</b> travelled to deliver all the customers, this is the objective function of the problem under the constraint that <b class="term">every customer must be delivered</b>.
<br>
<br>
<img src="images/tsp_1.png" alt="nthu" width="800" height="250">
</tr>
<tr>
<br>
<br>
Of course, there are several solutions to this problem and some are better than others : Here <b class="term">solution 1 is better than solution 2</b> because its route has a shorter total distance. For each problem, it could exist one or more optimal solutions, but prooving its optimality (ie. solving the problem) requires to have listed all possible solutions.
<br>
<br>
</tr>
<tr>
<img src="images/tsp2.drawio.png" alt="nthu" width="800" height="280">
</tr>
<tr>
<br>
<br>
In this small problem it is possible to list all possible solutions and choose the best one, which become not possible anymore when considering larger problems with more nodes, as NP-complete class of problems, which TSP belongs to, could not be solved in polynomial time ( unless P = NP <img src="images/blink.png" width="18px">).
<br>
<br>
That said, it may not be interesting to prove the optimality of a solution but simply to find good enough solutions. The proof of optimality is sacrificed for the benefit of execution time.
</tr>
</table>
<br>
<table width="100%" border="0" cellspacing="15" cellpadding="10">
<heading><b class="term">Constraint Programming</b></heading>
<br>
<br>
<tr>
Now that we have introduced the notion of discrete optimisation problems, we will discuss a method used for searching solutions : <b class="term">constraint programming</b> (CP). CP is a paradigm to solve discrete optimization problems. It lies on methods for <b class="term">problem declaration</b> and <b class="term">problem solving</b> using a solver that automates the constraint propagation mechanism (the process of inferring the possible values given the value of the variables already assigned, this is the process that everyone uses to solve a sudoku for example). For more information about it, please look at <a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwjb1KrqiZX6AhU-FVkFHaUACBkQFnoECCoQAQ&url=https%3A%2F%2Fwww.cs.upc.edu%2F~erodri%2Fwebpage%2Fcps%2Ftheory%2Fcp%2Fintro%2Fslides.pdf&usg=AOvVaw3uGZ4zJqrdVZ5Z9THZWZNP" target="_blank">this presentation from Enric Rodrıguez-Carbonell </a>.
<br>
the strength of the CP comes from the diversity of the problems it is able to solve. It allows, among other things, to manage non-linear constraints or variables of different types (boolean, integer ...).
<br>
<br>
To understand how the research works, let's go back to our problem above. Let's consider the deliveryman starts his delivery tour at the factory.
<br>
<br>
Each node of the graph can be considered as a <b class="term">variable</b> of the problem. Each variable can take for values the node they are connected to, this define their <b class="term">domain of definition</b>. FInally, <b class="term">constraints</b> governs the values taken simultaneously by the variables (ex : You can't go through the same node twice.)
<br>
<br>
Below is how a search is conducted in a CP solver, which alternates <b class="term">decision phases</b> and <b class="term">fix-point phases</b> (constraints propagations) :
<ul>
<li>The <i style="color:red;">red</i> assignations are the one obtained after a variable/value assignation</li>
<li>The <i style="color:green;">green</i> assignations are the one infered directly by the constraint propagation mechanism</li>
</ul>
</tr>
<tr>
<img src="images/solve.gif" alt="nthu" width="800" height="280">
</tr>
As can be seen, constraint propagation is a very powerful mechanism that allowed us to automatically infer the only possible values for 3 of the 5 variables in the problem.
<br>
<br>
During the decision phase, 2 decision heuristics are to be considered:
<ul>
<li><b class="term">Variable Selection Heuristic</b> : Select the next variable to assign</li>
<li><b class="term">Value Selection Heuristic</b> : Select the value to assign to the previously selected variable</li>
</ul>
The design of these heuristics can have a tremendous impact on search performance, this is what makes the difference between a advanced sudoku player, who will find a solution on the first try and a beginner, who do multiple errors and changes before finding a solution. Moreover, for some problems, we don't even know heuristics that works well in the general case.
<br>
This is where we bring
<span class="block-line"><span><span style="color:#dd1313;">R</span><span style="color:#dd5013;">e</span><span style="color:#dd8913;">i</span><span style="color:#ddc213;">n</span><span style="color:#bedd13;">f</span><span style="color:#85dd13;">o</span><span style="color:#4cdd13;">r</span><span style="color:#13dd13;">c</span><span style="color:#13dd50;">e</span><span style="color:#13dd89;">m</span><span style="color:#13ddc2;">e</span><span style="color:#13bedd;">n</span><span style="color:#1385dd;">t </span></span><span><span style="color:#134cdd;">L</span><span style="color:#1313dd;">e</span><span style="color:#5013dd;">a</span><span style="color:#8913dd;">r</span><span style="color:#c213dd;">n</span><span style="color:#dd13be;">i</span><span style="color:#dd1385;">n</span><span style="color:#dd134c;">g</span></span></span> !
<br>
<br>
</tbody></table>
<table width="100%" border="0" cellspacing="15" cellpadding="10">
<heading><b class="term">How to learn a decision making process ?</b></heading>
<tr>
<br>
<br>
The idea is to train a model using RL to learn a smart <b class="term">Value Selection Heuristic</b> by solving successively thousands of similar problems drawn for a specific distribution and evaluate the performance of this heuristic on problems never seen during training. The resolution process still relies on CP, which guarantees the validity of the returned solutions, while the agent is in charge of branching during the search. We propose a <b class="term">generic approach</b> where a single agent could learn problems of different nature, for this purpose we present a generic graph representation of any problems using the characteristics of CP problem definition.
<br>
<br>
The RL algorithm used is <a href="https://www.researchgate.net/figure/The-Q-learning-algorithm-taken-from-Sutton-Barto-1998_fig1_337089781" target="_blank">Deep Q-Learning </a> (DQN). The associated <b class="term">reward</b> is based on the objective function/quality of the solution returned.
<h3>Generic Graph representation :</h3>
Every CP problems is defined by a set of variable, values that can be taken by these variables and a set of Constraints on these variables. The idea is to encode each of these entities as node in a graph and connect these nodes according to whether :
<ul>
<li>A <b class="term">Value</b> is part of a <b class="term">Variable</b>'s domain of definition</li>
<li>A <b class="term">Variable</b> is involved in a <b class="term">Constraint</b></li>
</ul>
This graph is naturally dynamically updated throughout the resolution of instances during the training process. Here is a little example just to get a sense of things :
</tr>
<br>
<br>
<tr>
<div style="text-align: center;">
<img src="images/tripartite.png" alt="nthu" width="600" height="190">
</div>
</tr>
<br>
<tr>
The advantage of this method is that it allows the entire information to be encoded in a structure (a graph) that can be processed by a Neural Network. Each node comes with node embeddings allowing to identify -among others- the type of constraints of a Constraint node or the value of a Value node.
<h3>Neural Pipeline for Variable/Value assignation :</h3>
Now that we defined our input, recall that our goal is to infer a variable/value assignation. Let's consider for the moment that the variable selection heuristic is deterministic, so that the input is the graph representation <b class="term">and</b> a variable on which to branch. This is where we are :
</tr>
<br>
<br>
<tr>
<div style="text-align: center;">
<img src="images/scheme.drawio.png" alt="nthu" width="650" height="240">
</div>
</tr>
<br>
<tr>
Given a state, the RL agent is parameterized by a DQN-network that outputs the Q-values associated with every possible value selection in the domain of the selected variable $v$. The tripartite state is fed into a neural approximator model, the <b class="term">learner</b> $\hat{Q}$, which consists of two parts : an <b class="term">encoder</b> for learning contextual embeddings of the nodes of the graph representation, and a <b class="term">decoder</b> which, given these node embeddings, estimates the optimal policy to design a powerful heuristic.
<h4>Graph neural network encoder :</h4>
Graph convolutional networks (GCN) constitute a very convenient solution to learn contextual node embeddings, and have been largely used for this purpose in reinforcement learning for discrete optimization. Due to the heterogeneous nature of our representation we opted for a heterogeneous GCN composed of several Graph convolutional layers.
<br>
<br>
Considering a variable $v_i$, a constraint $c_j$ and a value $u_k$, they are respectively defined by raw feature $V_i, C_j, U_k$, with respective dimension $d_v, d_c, d_u$ . First, a type-wise linear combination of raw features compute the input features of dimension $d$ for the GNN such that : $\mathbf{h}_{v_{i}}^{0} = V_{i}w_v$, $\mathbf{h}_{c_{j}}^{0} = C_{j}w_v$ and $\mathbf{h}_{u_{k}}^{0} = U_{k}w_v$.
Then, we perform recursively $N$ operations of graph convolution on the nodes of the graph representation. At step $t$, a convolution can be formulated as :
<br>
\begin{align*}
\mathbf{h}_{v_{i}}^{t+1}&=\phi_{v}\left(\mathbf{V}_{i}: \mathbf{h}_{v_{i}}^{t}: \bigoplus_{c_{j} \in \mathcal{N}_{c}\left(v_{i}\right)} \mathbf{h}_{c_{j}}^{t}: \bigoplus_{u_{k} \in \mathcal{N}_{u}\left(v_{i}\right)} \mathbf{h}_{u_{k}}^{t}V\right)\\
\mathbf{h}_{c_{j}}^{t+1}&=\phi_{c}\left(\mathbf{C}_{j}: \mathbf{h}_{c_{j}}^{t}: \bigoplus_{v_{i} \in \mathcal{N}_{v}\left(v_{i}\right)} \mathbf{h}_{v_{i}}^{t}\right)\\
\mathbf{h}_{u_{k}}^{t+1}&=\phi_{u}\left(\mathbf{U}_{k}: \mathbf{h}_{u_{k}}^{t}: \bigoplus_{v_{i} \in \mathcal{N}_{v}\left(v_{i}\right)} \mathbf{h}_{v_{i}}^{t}\right)
\end{align*}
Where $\phi_{v}, \phi_{c}, \phi_{u}$ are one-layer perceptron, composed by an affine transformation followed by an activation function, $:$ is the concatenation operation, $\mathcal{N}_{v},\mathcal{N}_{c},\mathcal{N}_{u}$ represents the type-specific neighborhood, and $\bigoplus$ is the feature-wise aggregation function.
<br>
<br>
<h4>Downstream neural network decoder :</h4>
Once the contextual node embeddings are computed, a decoder should be used to convert them into an actionable policy.
<br>
<br>
Our architecture consists in first transforming the embeddings of the variable $v$ currently branched on and all the possible values by feeding them into two different multi-layer perceptrons. The transformed embedding of each value $u \in \mathcal{D}_v$ concatenated with the transformed embedding of the branching variable $v$ passes trough a last multi-layer perceptron that outputs the approximated Q-value for each pair ($v$,$u$).
<br>
<br>
\begin{equation}
\hat{Q}(S_t,a) = \hat{Q}\left(\{s_t,v\},u) = \phi_q(\phi_v( \mathbf{h}_{v}^{N}):\phi_u( \mathbf{h}_{u}^{N})\right)
\end{equation}
where $\phi_{q}, \phi_{u}, \phi_{v}$ are multi-layer perceptron.
<br>
<br>
Once the Q-values are approximated, the <b class="term">explorer</b> can exploit the learned values and voraciously choose the best action or decide otherwise (for example, a random action with probability $\epsilon$). This tradeoff between exploitation and exploration is necessary in early learning when the estimate of Q-values is very poor and many states have never been visited before.
<br>
<br>
Here is a summary of the value decision pipeline:
</tr>
<br>
<br>
<tr>
<div style="text-align: center;">
<img src="images/full_pipeline_updated.png" alt="nthu" width="800" height="530">
</div>
</tr>
<br>
<tr>
<br>
<br>
My website is still under construction. I will add the end of the explanations soon :) !
</tr>
</tr>
</table>
<table style="width:100%;border:0px;border-spacing:0px;border-collapse:separate;margin-right:auto;margin-left:auto;"><tbody>
<tr>
<td style="padding:0px">
<br>
<p style="text-align:left;font-size:small;">
Updated on: 14th September, 2022
<span style="float:right;">
Merci, <a href="https://jonbarron.info/" target="_blank">Jon Barron</a> et <a href="https://github.com/digantamisra98" target="_blank">Diganta Misra</a> !
</span>
</p>
</td>
</tr>
</tbody></table>
</td>
</tr>
</table>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
</body>
</html>