-
Notifications
You must be signed in to change notification settings - Fork 0
/
introduction.qmd
180 lines (113 loc) · 6.09 KB
/
introduction.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
jupyter: julia-1.10
---
# Introduction
## Why Julia?
In a world ruled by Python (and, in some areas, R), why use an exquisite and new language like [Julia](https://julialang.org/) in this workshop? You can find some good reasons in the book [Julia Data Science](https://juliadatascience.io/programmers), but I will give some in the context of topology:
### Speed
::: {.callout-tip title="" appearance="simple"}
Julia is fast. You won't need to use another language to make some expensive computation.
:::
Let's say you read about a new algorithm to calculate the Vietoris-Rips. If you are a R/Python user, *your algorithm won't be written in R/Python* simply because R/Python is slow. Fast code in R/Python is written in C, C++, Fortran or Rust; this is called the [two language problem](https://juliadatascience.io/julia_accomplish#sec:two_language). Julia solves this because with enought knowledge about the compiler and the language, you will get performance nearly as good as if it was written in C. Optimizing Julia code is not a trivial task, but is way easier than learning another language just to get good performance in some functions.
![The famous deep learning package [`torch`](https://github.com/pytorch/pytorch) in Python has its core written mostly in C++. Python is just a "glue" interface.](images/intro/torch.png)
![The deep learning package [`Flux.jl`](https://github.com/FluxML/Flux.jl) is 100% written in Julia.](images/intro/flux.png)
### Use other languages
::: {.callout-tip title="" appearance="simple"}
It is easy to use another language inside Julia.
:::
Even though Julia is fast and has a robust ecosystem, Python and R have many more packages already good-to-go. You can use them easily with tools like [PythonCall](https://github.com/JuliaPy/PythonCall.jl) or [RCall](https://github.com/JuliaInterop/RCall.jl).
### Mathematical syntax
::: {.callout-tip title="" appearance="simple"}
Julia looks like mathematics and has an elegant syntax.
:::
#### LaTeX symbols
Being able to mix LaTeX symbols with code can make the code way more readable.
You can find many nice examples on [BeautifulAlgorithms.jl](https://github.com/mossr/BeautifulAlgorithms.jl). Below are some common Julia code:
```{julia}
# calculate the intersection of two vectors/sets
[1, 2] ∩ [2, 3, 4]
```
```{julia}
# check if a value is in a vector/set
1 ∉ [2, 3]
```
```{julia}
# define a function in one line
f(r) = π*r^2
f(3)
```
```{julia}
# Euler's identity
ℯ^(im * π) + 1 |> round
```
```{julia}
# calculating the pairwise-distance between points in a set
X = [1, 2, 3, 4]
d(x, y) = abs(x - y)
[d(xᵢ, xⱼ) for xᵢ ∈ X, xⱼ ∈ X]
```
#### Broadcasting
Easily apply a function to all elements of a vector/set:
```{julia}
x = [1, 2, 3, 4]
# broadcast
sin.(x)
```
```{julia}
# mapping
map(sin, x)
```
```{julia}
# list comprehension
[sin(x_i) for x_i ∈ x]
```
#### Piping
You can compose functions in the reading order. Instead of writing:
```{julia}
sin(cos(1))
```
you can "pipe" the functions as in:
```{julia}
1 |> cos |> sin
```
#### Functional programming
Julia is a [functional programming language](https://en.wikipedia.org/wiki/Functional_programming) with type hierarchy, which means that we have "categories" (ie. types) and "functors" (ie. functions) mapping between the types; moreover, its polimorphism means that the functions depend on the type of its arguments.
For example, suppose you want to define the norm of a vector:
```{julia}
x = [1, 2, 3]
# define the norm of a Vector of Numbers
norm(x::Vector{<:Number}) = x.^2 |> sum |> sqrt
norm(x)
```
You can also define the norm of a function $f$ as the approximate integral of $|f|$ on the interval $[0, 1]$:
```{julia}
# norm on [0, 1]
norm(f::Function; step_size = 0.0001) =
[f(x) * step_size for x ∈ 0:step_size:1] .|>
abs |> sum
square = x -> x^2
norm(square)
```
which is very close to the real definite integral.
PS: plotting is also as easy as:
```{julia}
using Plots;
plot(square, 0:0.1:1)
```
Why not define the norm of a text as the amount of characters?
```{julia}
norm(s::AbstractString) = length(s)
norm("Hello!")
```
## Why TDA?
Topological Data Analysis is a eccentric field of mathematics that applies tools from topology and algebraic topology on the study of datasets. By "datasets", we almost always mean "finite metric space".
These tools can be divided in some broad categories:
### Homological methods
Persistence homology calculates the "shape" (homology groups) of a object on different scales, and compact this information into objects called "barcodes". This barcode can be vectorized and then inserted into machine learning models.
![The shape of data can be summarised as a barcode.](images/getting-started/ph.png)
### Graph reduction methods
The Mapper algorithm [see @singh2007topological] is the discrete version of the [Reeb graph](https://en.wikipedia.org/wiki/Reeb_graph). It captures the geometry of a dataset with respect to a given function (called "filter") and produce a graph of its pre-images clustered. There is also the Ball Mapper variant [see @dlotko2019ball], which does not need a filter function and resembles the Vietoris-Rips filtration. These methods are useful to get a glimpse of the geometry (flares, holes) and areas of interest. For example: if your metric space is a dataset of measurements of patients with diabets, the Mapper graph can represent the patients and the shape of this graph can give insights about the type of diabetes one has.
![The Mapper graph of some 3d-shapes retain some of the original geometry.](images/intro/mapper.png)
### Clustering methods
Algorithms like ToMATo [see @chazal2013persistence] are _clustering algorithms_: we have a metric space as input, and return a (disjoint) partition of this dataset. Clustering is useful whenever we need to "group a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters)"^[[https://en.wikipedia.org/wiki/Cluster_analysis](https://en.wikipedia.org/wiki/Cluster_analysis)].
![ToMATo is a persistence-based clustering algorithm.](images/intro/tomato.png)