---
jupyter: julia-1.10
execute:
freeze: true # re-render only when source changes
cache: true
warning: false
---
# Getting started
Now that you are convinced that Julia is nice and fast, and that topological data analysis is a unique tool, it is time to start applying it!
## Persistent homology
Let's quickly review what persistent homology is and why it is useful.
Let $M$ be a finite metric space.
### Creating simplicial complexes
For each $\epsilon > 0$ we create a simplicial complex $K_\epsilon$ in such a way that $\epsilon_1 < \epsilon_2$ implies $K_{\epsilon_1} \subseteq K_{\epsilon_2}$.
As you can guess, there are several ways to "undiscretize" a finite metric space into a simplicial complex.
::: {.callout-note title="" appearance="simple"}
One famous construction is the Vietoris-Rips complex:
$$
VR_\epsilon(M) = \{ [x_1, \cdots, x_n] \subset M \; \text{s.t.} \; d(x_i, x_j) < \epsilon, \; \forall i, j \}
$$
that is: if we put a ball $B_i$ of radius $\epsilon / 2$ around each point $x_i$, then
$$
[x_1, \cdots, x_n] \in VR_\epsilon(M) \Leftrightarrow B_i \cap B_j \neq \emptyset, \quad \forall i, j.
$$
:::
![The Vietoris-Rips filtration of a metric space (on the left of each panel). As $\epsilon$ increases, so does the number of simplices. [@vr].](images/getting-started/vr.png)
There are other simplicial complex constructions, such as the Alpha complex and the Čech complex, but the Vietoris-Rips complex admits particularly efficient algorithms to compute its homology.
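To make the construction concrete, here is a minimal sketch in plain Julia (no packages required; `vr_simplices` is a name invented for this example, not a library function) that enumerates the Vietoris-Rips simplices of dimension at most 2 directly from a distance matrix:

```{julia}
#| eval: true
# List the Vietoris-Rips simplices of dimension <= 2 of a finite metric
# space given by its distance matrix D, using the rule
# d(x_i, x_j) < eps for every pair of vertices of a simplex.
function vr_simplices(D::AbstractMatrix, eps::Real)
    n = size(D, 1)
    verts = [[i] for i in 1:n]  # every point is a 0-simplex
    edges = [[i, j] for i in 1:n for j in i+1:n if D[i, j] < eps]
    tris  = [[i, j, k] for i in 1:n for j in i+1:n for k in j+1:n
             if D[i, j] < eps && D[i, k] < eps && D[j, k] < eps]
    return verts, edges, tris
end

D = [0 1 2; 1 0 1; 2 1 0]
vr_simplices(D, 1.5)  # two edges appear, but no triangle yet
```

With a small `eps`, only the short edges enter the complex; raising it to `2.5` fills in the triangle `[1, 2, 3]`, exactly the "more simplices as $\epsilon$ grows" behaviour of the filtration.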
### Creating sequences of vector spaces
Applying the homology functor with field coefficients
$$
V_\epsilon = H_n(K_\epsilon)
$$
on each of the simplicial complexes
$$
\{ K_\epsilon \subseteq K_{\epsilon'} \; \text{s.t.} \; \epsilon \leq \epsilon' \}
$$
we obtain a sequence of vector spaces together with linear transformations
$$
\mathbb{V}(M) = \{ V_\epsilon \to V_{\epsilon'} \; \text{s.t.} \; \epsilon \leq \epsilon' \}
$$
called a _persistent vector space_.
### Simplifying
Some very nice theorems [see @oudot2017persistence for a comprehensive treatment] prove that a persistent vector space can be decomposed as a sum of _interval modules_, which are the fundamental building blocks of persistent vector spaces. Each of these blocks represents the birth and death of a generator. Thus, $\mathbb{V}(M)$ can be summarised in two equivalent ways:
- as a _barcode_: a (multi)set of real intervals $\{ [a_i, b_i) \subset \mathbb{R}, \; i \in I \}$.
- as a _persistence diagram_: a subset of the real plane above the diagonal, of the form $\{ (a_i, b_i) \in \mathbb{R}^2, \; i \in I \}$.
Each pair $(a_i, b_i)$ can be interpreted as follows: $a_i$ is the value of $\epsilon$ at which a feature (i.e. a generator of $H_n(K_\epsilon)$) was "born", and this generator persisted until $\epsilon$ reached $b_i$. See the following image for the representation of a barcode.
We will use both these representations many times.
![A Vietoris-Rips filtration and its respective barcode. Vertical lines represent "slices" at some values of $\epsilon$. [Source: @ghrist2008barcodes].](images/getting-started/ph.png)
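The birth/death interpretation is easiest to see in degree 0, where the barcode can be computed by single-linkage merging: every point is born at $\epsilon = 0$, and a bar dies when its connected component merges into an older one. A plain-Julia sketch (`h0_barcode` is a name invented here, and ties between equal edge lengths are broken arbitrarily):

```{julia}
#| eval: true
# Degree-0 barcode of a finite metric space given by a distance matrix.
# Components are tracked with a union-find structure; each merge at
# distance d kills one bar, and one component lives forever.
function h0_barcode(D::AbstractMatrix)
    n = size(D, 1)
    parent = collect(1:n)
    find(i) = parent[i] == i ? i : (parent[i] = find(parent[i]))
    edges = sort!([(D[i, j], i, j) for i in 1:n for j in i+1:n])
    bars = Tuple{Float64, Float64}[]
    for (d, i, j) in edges
        ri, rj = find(i), find(j)
        ri == rj && continue
        parent[rj] = ri                  # merge: one component dies at d
        push!(bars, (0.0, Float64(d)))
    end
    push!(bars, (0.0, Inf))              # one component persists forever
    return bars
end

h0_barcode([0 1 3; 1 0 2; 3 2 0])
```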
### Distances on persistence diagrams and stability
The _bottleneck distance_ can be defined on the set of persistence diagrams. Intuitively, it measures the "effort" needed to move the points of one diagram onto the points of the other, or to collapse unmatched points onto the diagonal.
It is important to note that points very close to the diagonal (or, equivalently, very short intervals in a barcode) can be seen as "noise" or non-relevant features: they represent features that were born and lived for only a short time.
![The bottleneck distance between two persistence diagrams (in blue and red). Source: [https://gudhi.inria.fr/doc/latest/group__bottleneck__distance.html](https://gudhi.inria.fr/doc/latest/group__bottleneck__distance.html)](images/getting-started/bottleneck.png)
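For small diagrams, the definition can be checked by brute force. The sketch below (plain Julia; exponential in the diagram size, so for illustration only — real applications should use an optimized implementation such as the one in PersistenceDiagrams.jl) tries every matching, allowing each point to be sent to its diagonal projection at cost half its persistence:

```{julia}
#| eval: true
# Cost of matching two diagram points: L-infinity distance.
linf((b1, d1), (b2, d2)) = max(abs(b1 - b2), abs(d1 - d2))
# Cost of collapsing a point to the diagonal: half its persistence.
diag_cost((b, d)) = (d - b) / 2

# Brute-force bottleneck distance between two small diagrams,
# given as vectors of (birth, death) tuples.
function bottleneck(P, Q)
    isempty(P) && return isempty(Q) ? 0.0 : maximum(diag_cost, Q)
    p, rest = P[1], P[2:end]
    best = max(diag_cost(p), bottleneck(rest, Q))   # send p to the diagonal
    for (k, q) in enumerate(Q)                      # or match p with q
        c = max(linf(p, q), bottleneck(rest, [Q[1:k-1]; Q[k+1:end]]))
        best = min(best, c)
    end
    return best
end

bottleneck([(0.0, 4.0)], [(0.0, 3.0)])
```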
### Stability of the bottleneck distance
This distance is especially useful because of the _stability theorem_: metric spaces that are close to each other yield barcodes that are close to each other:
$$
d_b(\mathbb{V}(M), \mathbb{V}(N)) \leq 2 \, d_{GH}(M, N)
$$
where $M, N$ are finite metric spaces, $d_b$ is the bottleneck distance and $d_{GH}$ is the Gromov-Hausdorff distance on the set of metric spaces.
::: {.callout-note title="Why is stability useful?"}
Suppose we have several metric spaces $X_1, \ldots, X_n$ (say, photos or 3d-objects) and we want to group these spaces by similarity (cats in one group, dogs in another). Calculating the Gromov-Hausdorff distance is _very expensive_!^[see [this excellent article by Matt Piekenbrock](https://peekxc.github.io/Mapper/articles/ShapeRecognition.html#gromov-hausdorff-distance) to get an idea of the complexity involved.] Instead, we can compare the barcodes $\mathbb{V}(X_i)$ with the bottleneck distance, using each barcode as a proxy for the geometry of $X_i$.
So, if our barcode retains enough information about $X_i$, we can use
$$
d_b(\mathbb{V}(X_i), \mathbb{V}(X_j)) \quad \text{as a good approximation to} \quad d_{GH}(X_i, X_j).
$$
:::
The following beautiful GIFs come from the documentation of [Ripserer.jl](https://mtsch.github.io/Ripserer.jl/dev/generated/stability/), a Julia package that implements an efficient algorithm to compute barcodes using the Vietoris-Rips filtration.
![Taking random samples of the circle results in very similar persistence diagrams.](images/getting-started/rips1.gif)
![Adding some noise to the circle does not change the persistence diagram much.](images/getting-started/rips2.gif)
![The persistence diagram deforms slowly as we add more noise to the circle.](images/getting-started/rips3.gif)
![Collapsing the circle also makes the 1-dimensional persistence diagram approach the diagonal.](images/getting-started/rips4.gif)
## Some classic examples in topology
Let's start exploring some common objects in topology.
```{julia}
#| eval: true
using MetricSpaces
import CairoMakie as gl
import Ripserer
import Plots

# plot the persistence diagram and the barcode side by side
function plot_barcode(bc)
    Plots.plot(
        Plots.plot(bc),
        Ripserer.barcode(bc)
    )
end;
```
### Torus
```{julia}
#| eval: true
X = torus(1500)
gl.scatter(X)
```
```{julia}
#| eval: true
bc = Ripserer.ripserer(X, dim_max = 1, threshold = 4)
plot_barcode(bc)
```
### Circle
```{julia}
#| eval: true
X = sphere(300, dim = 2)
gl.scatter(X)
```
```{julia}
#| eval: true
bc = Ripserer.ripserer(X, dim_max = 1)
plot_barcode(bc)
```
### Sphere
```{julia}
#| eval: true
X = sphere(300, dim = 3)
gl.scatter(X)
```
```{julia}
#| eval: true
bc = Ripserer.ripserer(X, dim_max = 2)
plot_barcode(bc)
```
### A square with a hole
```{julia}
#| eval: true
X = rand(gl.Point2, 1500)
filter!(x -> (x[1]-0.5)^2 + (x[2]-0.5)^2 > 0.03, X)
gl.scatter(X)
```
```{julia}
#| eval: true
bc = Ripserer.ripserer(X, dim_max = 1, verbose = true)
plot_barcode(bc)
```
### Two circles
```{julia}
#| eval: true
X = vcat(
    sphere(150, dim = 2),
    sphere(150, dim = 2) .|> x -> (x .+ (1, 1))
)
gl.scatter(X)
```
```{julia}
#| eval: true
bc = Ripserer.ripserer(X, dim_max = 1, verbose = true)
plot_barcode(bc)
```
## A better way to represent barcodes
Even though the bottleneck distance is easier to calculate than the Gromov-Hausdorff distance, it is still a bit expensive for large barcodes.
Barcodes are nice but wild objects: they have variable length. To use them in machine learning algorithms, we need to represent them as vectors or matrices of fixed size.
There are several ways to do that!
### Persistence images
Persistence images are a technique that transforms a set of barcodes into $n \times n$ matrices with values from 0 to 1 [see @adams2017persistence for more details].
The idea is the following:
- Plot the persistence diagram;
- Place Gaussians around each point;
- Pixelate the result.
![The idea behind a persistence image. Source: @adams2017persistence.](images/getting-started/image.png)
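These steps can be sketched in a few lines of plain Julia. The grid bounds, resolution, and the linear-in-persistence weight below are arbitrary choices for illustration (`persistence_image` here is a toy function written for this sketch, not the `PersistenceImage` from PersistenceDiagrams.jl):

```{julia}
#| eval: true
# Rasterize a diagram, given as (birth, death) tuples, into a small
# "persistence image": sum one Gaussian per point in (birth, persistence)
# coordinates, weighted by persistence, over a fixed grid.
function persistence_image(diagram; res = 5, sigma = 0.5, lo = 0.0, hi = 3.0)
    xs = range(lo, hi; length = res)
    img = zeros(res, res)
    for (b, d) in diagram
        pers = d - b
        for (i, x) in enumerate(xs), (j, y) in enumerate(xs)
            img[i, j] += pers * exp(-((x - b)^2 + (y - pers)^2) / (2 * sigma^2))
        end
    end
    return img
end

persistence_image([(0.0, 1.0)])
```

Whatever the weighting choice, the output always has the same fixed size, which is exactly what makes it usable as a machine learning feature.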
For example, the two barcodes below
```{julia}
using PersistenceDiagrams, Plots
diag_1 = PersistenceDiagram([(0, 1), (0, 1.5), (1, 2)]);
diag_2 = PersistenceDiagram([(1, 2), (1, 1.5)]);
image = PersistenceImage([diag_1, diag_2])
Plots.plot(
    Plots.plot(diag_1),
    Plots.plot(diag_2)
)
```
are transformed into these matrices (plotted as heatmaps):
```{julia}
Plots.plot(
    heatmap(image(diag_1)),
    heatmap(image(diag_2))
)
```