-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathTutorial_DL_UAV.Rmd
1722 lines (1137 loc) · 91.1 KB
/
Tutorial_DL_UAV.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Introduction to Deep Learning in R for the Analysis of UAV-based Remote Sensing Data"
author: "OpenGeoHub summer school 2020, Christian Knoth"
date: "*Aug 17, 2020*"
output:
html_document:
highlight: tango
code_folding: show
theme: flatly
toc: yes
toc_depth: 2
toc_float:
collapsed: no
smooth_scroll: no
editor_options:
chunk_output_type: console
bibliography: references.bib
link-citations: yes
nocite: '@*'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Setting up your environment
```{r env, results=F,warning=F,message=F}
library(keras)
library(tensorflow)
library(tfdatasets)
library(purrr)
library(ggplot2)
library(rsample)
library(stars)
library(raster)
library(reticulate)
library(mapview)
#If not done yet, download and unzip the tutorial data:
#download.file("https://uni-muenster.sciebo.de/s/SOjP7fRGBRztn9z/download",destfile = "tutorial_data.zip")
#unzip("./tutorial_data.zip")
#setwd("./tutorial_data/")
```
# Introduction
In this tutorial we aim to get a basic understanding of the key practical steps for applying deep learning for object detection in remote sensing images, and how to do them in R.
These steps are:
- Building your model
- Preparing your data
- Training your model
- Predicting with your model
In part I, we consider a very basic workflow. We´re going build a rather puristic model for image classification and train it from scratch on a small amount of training data. Our model will be able to predict whether or not a target is present in an image.
In part II we move on to aspects that are important in spatial applications of CNNs like the one in our scenario: we´ll develop a model for pixel-by-pixel classification applying an architecture called U-net [@Ronneberger2015]. As a result, our model will be able to predict where in an image the target is present. In addition, we´ll address the practical question of how to get from our remote sensing images to a bunch of very small subsets that can be processed by CNNs, and how to reassemble the resulting predictions back to a map of our test area.
We will also look at two strategies/tools for tackling small data problems, which specifically useful when working with UAV data (or other data that is rather specific to a certain use case). These are transfer learning (using pretrained networks) and data augmentation (increasing the amount of training data).
Finally we´ll briefly touch on the apsect of inspecting a trained model more closely by visualizing the spatial patterns it has learned.
Many parts of what is explained here are based on the book _Deep Learning with R_ by Francois Chollet [@Chollet2018], which is a great resource for getting a (thorough) introduction into the field along with practical advice (R-code examples).
# Scenario
Our task in this tutorial is to develop a CNN for mapping the extent of wildrye (_Elymus athericus_) growth on the Hallig (seasonally flooded island) "Nordstrandischmoor" in Northern Germany.
_Elymus athericus_ is a grass species common to salt marshes in central Europe. When becoming dominant, it can negatively impact the biodiversity of an ecosystem. Therefore, it requires monitoring and management.
This scenario presents a typical ecological monitoring task. Applications like this have been one of the reasons why UAV-based remote sensing has experienced great interest within the research community of landscape ecology.
In this tutorial we will work on a very simple and very small example.
This is what our data looks like. The test area, in which we will detect the extent of wildrye to test our model, is located in western part of the Hallig The training data was collected in two small training areas in the north of the Hallig (not the whole of the images are used for training, only a few subsets).
```{r test_area,echo=F,warning=F,message=F,eval=T,out.width="100%"}
input_img <- stack("./testarea_part1/testarea.tif")
training1 <- stack("./training_part1/quecken_img_1.tif")
training2 <- stack("./training_part1/quecken_img_3.tif")
viewRGB(input_img,quantiles = c(0,1),r=1,g=2,b=3,layer.name = "test area", map.types = c("OpenStreetMap","Esri.WorldImagery"))+viewRGB(training1,quantiles = c(0,1),r=1,g=2,b=3)+viewRGB(training2,quantiles = c(0,1),r=1,g=2,b=3)
```
# Part I: Basic Workflow
## Building your model {#build_model}
As we have seen in the plenary, our models consist of a number of layers consecutively transforming the input data into a number of (hopefully useful) representations of that data. At the end of this chain, there is a layer that outputs the prediction result.
```{r,echo=F}
magick::image_draw(magick::image_read("./figures/CNN-sketch.png"))
dev.off()
```
In this part of the tutorial, the model we´re going to build will do image classification.
It will take an image as input and transform it to one number between 0 and 1, indicating the probability of a target being present in that image.
The `keras` package offers two methods for building networks: the sequential model created by `keras_model_sequential()` and the Keras functional API where models are built using the `keras_model()` function.
As the name suggests `keras_model_sequential()` creates _sequential_ models with only one input, one output, and a directed linear stack of layers. The functional API allows to create more complex architectures of models. These models can allow for multiple inputs and outputs or having a non-linear topology. We will look at the functional API later, when we need it for building a U-net.
For now, we only need `keras_model_sequential()`. Building networks using this method is pretty straighforward: We initiate an empty network and then consecutively add the layers we need. When adding the first layer, we need to specify the shape of the input that our network expects.
```{r sequential_model}
#initiate an empty model
first_model <- keras_model_sequential()
#add first layer, the expected input is of shape 128 by 128 on three channels (we will be dealing with RGB images)
layer_conv_2d(first_model,filters = 32,kernel_size = 3, activation = "relu",input_shape = c(128,128,3))
```
Let´s have a look at our model.
```{r}
summary(first_model)
```
We now have a model with one layer (plus the input layer), in this case a convolutional layer. We can add further layers just the same way by using corresponding `keras` functions. Some examples are:
- `layer_conv_2d()`: adds a convolutional layer
- `layer_dense()`: adds a dense layer, i.e. fully connected layer
- `layer_max_pooling_2d()`: adds a maxpooling layer
- `layer_flatten()`: adds a flattening layer
- ...
Before we look at the layers more closely, let´s finish our model by adding some more of those layers.
```{r simple model}
layer_max_pooling_2d(first_model, pool_size = c(2, 2))
layer_conv_2d(first_model, filters = 64, kernel_size = c(3, 3), activation = "relu")
layer_max_pooling_2d(first_model, pool_size = c(2, 2))
layer_conv_2d(first_model, filters = 128, kernel_size = c(3, 3), activation = "relu")
layer_max_pooling_2d(first_model, pool_size = c(2, 2))
layer_conv_2d(first_model, filters = 128, kernel_size = c(3, 3), activation = "relu")
layer_max_pooling_2d(first_model, pool_size = c(2, 2))
layer_flatten(first_model)
layer_dense(first_model, units = 256, activation = "relu")
layer_dense(first_model, units = 1, activation = "sigmoid")
```
As you might have noticed, we do not need to reassign the output of functions like `layer_dense()` back to the variable `first_model` or use pipe operators. This is because keras modifies models in-place (for further info, see side note on [model instances](#modelinstances))
The final model now looks like this:
```{r}
summary(first_model)
```
We have now built our first small CNN. Before we continue with [preparing the data](#dataprep) and feeding our network, let´s have a quick look at the layers we have added.
#### Dense Layers
A dense layer basically computes a weighted sum of the feature values of each input and adds a certain value (bias). The result of this computation is then processed by an activation function (see below) to produce the final output of the layer. This is done once for each `unit` in the layer. In case of an image, the feature values are the pixel values. If we feed an image into a dense layer with 64 units, the result would consist of 64 numbers (resulting from 64 weighted sums of all pixel values of that image + 64 biases). In other words, the dense layer would transform the input image into 64 different representations, which could in turn be fed into the next layer of the network. Stacking multiple dense layers together enables the network to produce increasingly complex representations of the input data, and thereby tackle increasingly complex classification tasks.
The final dense layer in our network has only one unit. It produces only one output indicating whether the input image contains the target (in that case the output is 1) or not (in that case the output is 0). How does it get to a number between 0 and 1? This is because of the activation function used here, we´ll see that in a moment.
We could in principle build a network of only dense layers to analyze images. However, these layers only consider patterns in the whole image at once. They look for "global patterns" in the sequence of pixels, and the location of pixels within the sequence is important. They are not able to detect local patterns in different parts of an image. This is what convolutional layers are for.
#### Convolutional Layers
In contrast to dense layers, convolutional layers are able to detect local patterns within images independent of where in that image the patterns are present.
What convolutional layers do is essentially not much different from what you probably already know and might have used earlier for feature extraction in remote sensing images. For instance, edge detection (e.g. Sobel filter) is done using convolutional kernels sliding over your input image to extract (enhance) certain features (e.g. edges) in those images and suppress others. During the convolution, the filter slides a window of certain size (e.g. 3x3 pixels) over the input image centering it around each input pixel. It then computes a weighted sum of all pixels within that window, and the result of that sum is the new pixel value at this location in the output. The weights of this weighted sum determine the convolution. The figure below shows the principle of a convolution on a 1-band input. The same also works for 3-band images as in our case. The filter (kernel) is then 3-dimensional (e.g. 3x3x3 pixels) and the weighted sum is calculated over all pixel within the "sliding box".
```{r,echo=F,out.width="60%",fig.align="center"}
magick::image_draw(magick::image_read("./figures/conv_sketch.png"))
dev.off()
```
One of the specifics of convolutional neural networks is that the weights of the filters are not predefined by the analyst (by choosing a specific filter), but instead learned by the network during training. You only specify the properties of the filters such as the kernel size, 'sliding speed', or the behavior at image borders. With each convolutional layer you add multiple filters (equivalent to the `units` of dense layers). The first convolutional layer of our example has 64 filters. Such a layer transforms the image from 3 dimensions (the RGB bands) to 64 dimensions, so the output called _feature map_ has 64 different bands. Each band represents a response that was extracted from the image with one of the filters. The more convolutional layers are stacked up, the more complex and abstract the extracted features become.
Similar to the dense layer, the output of the convolution is fed into an activation function to produce the final output of the layer. The figure below shows the result of two of the filters of a trained model on an input image (we are going to see how to produce such plots later).
In case you wonder what happens to pixels on the edges of input images, please see [this side note on padding](#padding)
```{r echo=F, fig.align="center",out.width="100%", fig.asp=0.3}
#adapted from: Chollet (2018): Deep learning with R (see references)
plot_layer_activations <- function(img_path, model, activations_layers,channels, out_path=NULL){
if(is.null(out_path)){
out_path <- getwd()
}
model_input_size <- c(model$input_shape[[2]], model$input_shape[[3]])
img <- image_load(img_path, target_size = model_input_size) %>%
image_to_array() %>%
array_reshape(dim = c(1, model_input_size[1], model_input_size[2], 3)) %>%
imagenet_preprocess_input()
layer_outputs <- lapply(model$layers[activations_layers], function(layer) layer$output)
activation_model <- keras_model(inputs = model$input, outputs = layer_outputs)
activations <- predict(activation_model,img)
if(!is.list(activations)){
activations <- list(activations)
}
plot_channel <- function(channel,layer_name,channel_name) {
rotate <- function(x) t(apply(x, 2, rev))
image(rotate(channel), axes = FALSE, asp = 1,
col = terrain.colors(12),main=paste("conv. channel:",channel_name))
}
for (i in 1:length(activations)) {
layer_activation <- activations[[i]]
layer_name <- model$layers[[activations_layers[i]]]$name
n_features <- dim(layer_activation)[[4]]
for (c in channels){
channel_image <- layer_activation[1,,,c]
plot_channel(channel_image,layer_name,c)
}
}
}
par(mfrow=c(1,3),mar=c(1,1,0,1),mai=c(0.1,0.1,0.3,0.1),cex=0.8)
plotRGB(stack("./testarea_unet/subsets/25.jpg"), margins=T,main="input image")
pretrained_unet <- load_model_hdf5("./pretrained_unet.h5")
plot_layer_activations(img_path = "./testarea_unet/subsets/25.jpg", model=pretrained_unet ,activations_layers = 2, channels = c(1,2))
```
Both, dense and convolutional layers transform the input data. The form and impact of these transformations depend on the weights of the weighted sums (dense) or kernels (convolutional) as well as the biases. These are the parameters that a network learns during training. They carry the knowledge of the network. The parameters of a layer are also called its _state_. For more information on how to calculate the number of parameters of a certain layer, see [side note on layer parameters](#layer_parameters).
In addition to dense layers and convolutional layers, we use two other layer types in our `first_model`. These layer types have no learned parameters and are therefore called _stateless_.
#### Max Pooling Layers
A maxpooling layer is used for downsampling input data to a coarser resolution. Similar to a convolutional layer, it applies a sliding window over the input. However, the result in this case is simply the maximum, i.e. the highest number of all pixels within the window.
```{r,echo=F,out.width="60%", fig.align="center"}
magick::image_draw(magick::image_read("./figures/maxpool_sketch.png"))
dev.off()
```
Maxpooling layers are used mainly for two reasons. First, the reduction of the resolution is meant to limit the size of the feature maps (as the number of features increases). Second, subsequent convolutional layers processing the output of a maxpooling layer see a condensed version of the feature maps of earlier convolutional layers that have been the input of that maxpooling layer. Stacking up combinations of convolutional layers followed by maxpooling layers enables the later convolutional layers to "see" increasingly larger spatial patterns.
#### Flatten Layers
The function of this layer is to transform the output of the convolutional or maxpooling layers into something that the following dense layer can process to classify. As mentioned before, the dense layer simply takes all pixel values and computes a weighted sum. Technically these values need to be in the form of 1-D tensors (vectors) per input. The flatten layer simply prepares the data so that each 3D input is transferred to a vector by chaining together all pixels (in our example below the `(6,6,128)` is flattened to a vector of length 4608).
#### Activation functions
As mentioned earlier, a CNN comprises a series of layers doing transformations (convolutions, weighted sums) on an image that result in some (hopefully) meaningful representation of that image (the prediction result).
All the transformations applied through the different layers are so far linear. If we would only apply linear transformations, the chain of all the transformations would basically still be one linear transformation. We would not gain anything by adding multiple layers.
In order for the network to be able to learn non-linear relationships between inputs and outputs, activation functions are used.
Below are three examples of activation functions. For more details, see [side note on activation functions](#activation_functions).
```{r acti_funct_figure, out.width="100%",fig.asp=0.4, echo=F}
par(mfrow=c(1,3),mar=c(5,2,5,2))
input <- seq(-8,8,0.001)
#step
step_output <-lapply(input, FUN=function(x){
if(x>=0){return(1)}
else{return(0)}
})
plot(input,step_output, type="l", col="blue", ylab="step(input)",main="step activation function")
abline(v=0,lty=3,lwd=2, col="grey")
#sigmoid
sig_output <- 1/(1+exp(-input))
plot(input,sig_output, type="l", col="blue", ylab="sig(input)",main="sigmoid activation function")
abline(v=0,lty=3,lwd=2, col="grey")
#relu
relu_output <-lapply(input, FUN=function(x)max(0,x))
plot(input,relu_output, type="l", col="blue", ylab="reLU(input)",main="reLU activation function")
abline(v=0,lty=3,lwd=2, col="grey")
```
There are quite a few activation functions to choose from and there are several things to keep in mind when choosing one of them for a specific layer.
As for this tuturial we will just keep in mind that reLu is one of the most popular activation functions for hidden layers in computer vision applications and is also used in all our examples. This means, the output of a convolutional layer for one pixel on a 1-band image as in the figure above would rather look like this: $o_1= relu(w_1i_1+w_2i_2+\dots+w_9i_9)$.
For the output layer, the activation function can be chosen in order to fit the classification problem at hand. Since we are mainly looking at binary classification problems but also interested in the degree to which a subset fits the class "wildrye", the sigmoid activation function is most appropriate.
### Summary
In summary: In CNNs, blocks of combined convolutional and maxpooling layers extract (hopefully meaningful) features from the images, those features becoming more complex and abstract with increasing number of blocks stacked together. This is the feature extraction part. The dense layers then perform the actual classification based on those extracted features. The flatten layer is merely the connection between convolutional layer and dense layer.
Activation functions are use to enable the model to learn non-linear representations of the data.
The following figure shows a sketch of the architecture. The arrows indicate the different layers, grey boxes symbolize the outputs (representation) produced by the layers. The small numbers on top of the boxes indicate the `depth` of the feature maps (i.e. number of features), the numbers to the side of the boxes indicate the spatial extent (rows and coloumns). The last three layers produce one-dimensional outputs, the numbers here indicate the number of values.
```{r,out.width="100%",echo=F}
magick::image_draw(magick::image_read("./figures/first_model.png"))
dev.off()
```
## Preparing your data {#dataprep}
At this point, we´re going to look at the basic steps needed in order to read images and convert them into the format and structure expected by models in Keras. We will assume that the data is already sitting on our disk as subsets (images containing the targets are in folder `\true`, those without targets are in folder `\false`). We´ll later expand on this and have a look at how to get a larger remote sensing image into a model, how to work with masks instead of labels (for pixel-by-pixel classification), and how to include data augmentation to increase the training data set.
For our initial example we only need to read in the training images and assign one label per image (1-> target is present, 0 -> target is not present).
Before actually reading the image files, we organize our data into training and validation data. The validation data will not be used for training (will not influence the adaption of parameters), but only for validating the results. We will use `initial_split()` function from the `rsample()` package and we will do the split on the level of file paths.
```{r organize_data}
# get all file paths of the images containing our target
subset_list <- list.files("./training_part1/true", full.names = T)
# create a data.frame with two coloumns: file paths, and the labels (1)
data_true <- data.frame(img=subset_list,lbl=rep(1L,length(subset_list)))
# get all file paths of the images containing non-targets
subset_list <- list.files("./training_part1/false", full.names = T)
#creating a data.frame with two coloumns: file paths, and the labels (0)
data_false <- data.frame(img=subset_list,lbl=rep(0L,length(subset_list)))
#merge both data.frames
data <- rbind(data_true,data_false)
# randomly split data set into training (~75%) and validation data (~25%)
# use `lbl` as stratum, so that the split is being done proportional for
# targets and non-targets
set.seed(2020)
data <- initial_split(data,prop = 0.75, strata = "lbl")
```
We now have an `rsplit` object that organizes the training and validation data for us.
We can use `training()` to access the training data and `testing()` to access the validation data, each in the form of a `data.frame` that holds the file paths of the images and the corresponding labels.
```{r r-split}
data
head(training(data)) # returns the the first few entries in the training data.frame
head(testing(data)) # returns the first few entries of the data set aside for validation
# compare the number of files in the training data, that show non-targets vs, those that
# show targets -> should be similiar
c(nrow(training(data)[training(data)$lbl==0,]), nrow(training(data)[training(data)$lbl==1,]))
```
Now we can read the data, converting it into the format we need for feeding our network.
In `keras` and `tensorflow` we work with so called "tensors" as the basic data structure. Tensors are "a generalization of vectors and matrices to an arbitrary number of dimensions" [@Chollet2018]. In R we could 'translate' tensors of different dimensions (ranks) to our familiar data types like this:
- 0D tensor: scalar (just one number)
- 1D tensor: vector
- 2D tensor: matrix
- 3D tensor and any higher-dimensional tensor: multidimensional array
Please note, that multiple samples (in our case images) are usually collected in one tensor (e.g. one tensor holds all images of one batch according to batch size, see below). The first dimension of the tensor is the number of samples. This means that when analyzing images, we are mainly dealing with 4D-tensors, the dimensions being:
- 1st dimension: no. of samples
- 2nd dimension: no. of rows (y)
- 3rd dimension: no. of coloumns (x)
- 4th dimension: no. of channels/bands.
For example a tensor of shape `(2,448,448,3)` holds two images of size 448 by 448 pixels and 3 bands. We will check this again later when we have imported some data.
Since our data is usually in some image file format (tif, jpg...), we have to read and convert it into tensors as expected by our model. In addition, we have to do some preprocessing. There are different tools available to help us do that. Here we will use the `tfdatasets` package for organizing our preprocessing pipeline.
The first function we use is `tensor_slices_dataset()`. This creates a dataset that has one element for each pair of images and labels.
```{r prepare_data 1}
#prepare training dataset
training_dataset <- tensor_slices_dataset(training(data))
```
This has created a `tf_dataset`, that has one element for each training data pair (with components image and corresponding label). Accordingly, each element is a list with two items, "img" and "lbl".
Up to now, the image component is only the file path of the image (a tensor of data type string), the label component is the label (tensor of type "int").
If we would like to inspect the elements of our `tf_dataset`, we can do that like this:
```{r}
#if you want to get a list of all tensors, you can use the as_iterator() and iterate() functions
dataset_iterator <- as_iterator(training_dataset)
dataset_list <- iterate(dataset_iterator)
#check the first few items of that list, each item one is again a list with two items: img and lbl
head(dataset_list)
```
This might seem a bit strange to you. We need to follow the "iterator" concept of Python if we want to access elements of a `tf_dataset`.
For more info on this, see [side note on iterators](#iterators).
Now that we have a `tf_dataset`, we can use the `dataset_map()` function to apply methods to all elements of the dataset. This makes the following preprocessing steps easier. We will use it to transform the image components of our dataset (currently only file paths) into tensors of suitable shape.
The concept that we use to do this in the following code chunk is:
- We use `dataset_map()` to apply a function on each element of the dataset
- Since each element is a list of two tensors, the function we apply is `list_modify()`
- `list_modify()` in turn modifies list items
- we use it to apply functions to the `img` list items, i.e. all our images (not the labels, later -when working with masks- we can modify them simultaneously, which is useful when doing [data augmentation](#data_augmentation))
The functions that we apply will transform the images to the desired tensors in three steps:
1. read the images from the specific file paths (decoding the jpgs into tensors)
2. convert data type to floating point values from 0 to 1 (see [side note on value normalization](#value_normalization))
3. resize the images to the size expected by the model (128 by 128 pixels), in case they are of different size
These functions are available through the `tf.image` and `tf.io` python modules of the Tensoflow Python API. Since the `tensorflow` package provides access to this API, we can access them using the `$` operator, e.g.: `tf$image$resize` uses the `resize` method from the `tf.image` module.
```{r prepare_data_2}
#get input shape expected by first_model
subset_size <- first_model$input_shape[2:3]
# apply function on each dataset element: function is list_modify()
#->modify list item "img" three times:
# 1 read decode jpeg
training_dataset <-
dataset_map(training_dataset, function(.x)
list_modify(.x, img = tf$image$decode_jpeg(tf$io$read_file(.x$img))))
# 2 convert data type
training_dataset <-
dataset_map(training_dataset, function(.x)
list_modify(.x, img = tf$image$convert_image_dtype(.x$img, dtype = tf$float32)))
# 3 resize to the size expected by model
training_dataset <-
dataset_map(training_dataset, function(.x)
list_modify(.x, img = tf$image$resize(.x$img, size = shape(subset_size[1], subset_size[2]))))
```
Now the data is almost ready. We just need to shuffle the data using the function `dataset_shuffle()` and make batches using the function `dataset_batch()`. Since our samples are still ordered by class, the shuffling is done to make sure we don´t start training only on images with targets. The batches are needed because the learning is not done on the whole dataset at once, but in batches of data (see below). As the last step we remove the names of the list elements ("img" and "lbl").
```{r prepare_data_3}
training_dataset <- dataset_shuffle(training_dataset, buffer_size = 10L*128)
training_dataset <- dataset_batch(training_dataset, 10L)
training_dataset <- dataset_map(training_dataset, unname)
```
If we now look at a single element of our dataset, we can see how the data was modified. Let´s look at the first tensor holding the first batch of images:
```{r}
dataset_iterator <- as_iterator(training_dataset)
dataset_list <- iterate(dataset_iterator)
dataset_list[[1]][[1]]
```
This line results in a preview of the tensor, showing (a excerpt from) its content, the shape and the dtype.
This is a 4D tensor of shape `(10,128,128,3)` so there are 10 images, each of size 128*128 px, and with 3 channels (RGB), the dtype is float. As mentioned before, the first dimension is the samples or batch dimension. Since we defined a batch size of 10, this tensor has a size of 10 in the first dimension.
We can also directly output the shape to limit the size of the output:
```{r}
dataset_list[[1]][[1]]$shape
```
The corresponding label tensor looks like this:
```{r}
dataset_list[[1]][[2]]
```
It merely holds the then digits indicating the labels (one per image), thus it only has one dimension of size 10.
Our training data is now ready.
Let´s prepare the validation dataset. We apply the same steps as to the training dataset (except the shuffling which is not necessary for the validation dataset).
```{r prepare_val_dat}
#validation
validation_dataset <- tensor_slices_dataset(testing(data))
validation_dataset <-
dataset_map(validation_dataset, function(.x)
list_modify(.x, img = tf$image$decode_jpeg(tf$io$read_file(.x$img))))
validation_dataset <-
dataset_map(validation_dataset, function(.x)
list_modify(.x, img = tf$image$convert_image_dtype(.x$img, dtype = tf$float32)))
validation_dataset <-
dataset_map(validation_dataset, function(.x)
list_modify(.x, img = tf$image$resize(.x$img, size = shape(subset_size[1], subset_size[2]))))
validation_dataset <- dataset_batch(validation_dataset, 10L)
validation_dataset <- dataset_map(validation_dataset, unname)
```
We now have two `tf_dataset` objects containing our training and validation data, respectively, as tensors. These datasets can be fed into our model for training.
## Training your model {#train}
The first step is to configure/prepare the model for training. We do this using the `compile()` function on our model.
This is how we compile the network:
```{r training, eval=T}
compile(
first_model,
optimizer = optimizer_rmsprop(lr = 5e-5),
loss = "binary_crossentropy",
metrics = "accuracy"
)
```
This function allows us to specify some parameters for the training. Here we choose the `optimizer` (including learning rate `lr`), the `loss` function, and one or more accuracy metrics for measuring/visualizing the models performance (metrics are not used for training).
The key words `optimizer`, `learning rate` and `loss` are associated with the general workflow that is used during training:
(1) The network runs on a randomly drawn batch of the training data and computes the outputs/predictions.
(2) The outputs are compared to the corresponding label and the loss (how far off are our predictions with respect to the labels) is calculated using the defined `loss` function.
(3) The parameters of the layers are then adjusted a little bit to reduce the loss on the current batch, using the `optimizer` and the learning rate `lr`.
After step (3), the network processes the next batch of data with the new parameters. This is repeated for as long as we determine in the `fit` method (see below).
This workflow is called _(mini-batch) stochastic gradient descent_ (mini-batch SGD) and there many different variants. The `optimizer` specifies which one to use.
```{r,echo=F,out.width="80%",fig.align="center"}
magick::image_draw(magick::image_read("./figures/training_workflow.png"))
dev.off()
```
Let´s briefly look into the arguments that we specify when calling `compile()`:
**Loss function:** The loss function specifies in what way the loss of the network is computed, i.e. how we measure the differences between prediction and label. For binary classification problems, the binary crossentropy is a commonly used loss function, which we will also stick to throughout this tutorial (for a bit more info, see side note on [binary crossentropy](#binary_crossentropy))
**Optimizer:** The optimizer is the specific algorithm used for adapting the parameters during each training iteration such that the loss is reduced. It specifies _how_ the parameters are adjusted.
In this tutorial, we will stick to `rmsprop` (for a short explanation, see side note on [optimizers and gradient descent](#optimizers))
**Learning rate:** The learning rate specifies the size of the steps taken in adapting the parameters during each iteration, i.e. _how strongly_ the weights are being adjusted. A larger learning rate means faster training, but if it is too large this can result in unstable training.
**Metric**: This values signals us after each epoch the accuracy of our model on the validation data. Here we use `accuracy` which simply calculates the fraction of samples that are classified correctly into 0 or 1 (automatically applying a threshold of 0.5 for the decision between 0 and 1).
Note, that assigning the result of `compile()` again to `first_model` is not necessary because of the in-place-modification.
The next step is to start the actual training. This is done using the function `fit()` on our model.
```{r fit, eval=T}
diagnostics <- fit(first_model,
training_dataset,
epochs = 15,
validation_data = validation_dataset)
plot(diagnostics)
```
Here we defined three arguments: the training and validation datasets that we have just created, and the number of epochs, i.e. how often the network will iterate over all training data: When all data has been processed batch by batch following the workflow mentioned above, one `epoch` is finished. This whole process will be repeated for as many times as specified by the argument `epochs`.
Assigning the result of `fit()` to a variable (here `diagnostics`) causes the training history (loss and accuracy metrics after each iteration) to be stored in that variable. We can then e.g. plot the training curves.
## Predicting with your model
Once we have finished the training, we can use our trained model for predictions on new data.
This is done using the `predict()` function. It needs as arguments the model to use for prediction and the dataset on which to predict. Let´s first try it on our validation data.
```{r predict_validation_dataset}
predictions <- predict(first_model,validation_dataset)
head(predictions)
tail(predictions)
```
Since we only predict one label per image (target or no target), the result (`predictions`) is a matrix with only one coloumn and 57 rows (the number of images we have processed). Each number represents the calculated probability for a target.
Let´s randomly visualize some results on the validation data set:
```{r show_results, message=F, out.width="100%", fig.asp=0.4, echo=T}
par(mfrow=c(1,3),mai=c(0.1,0.1,0.3,0.1),cex=0.8)
for(i in 1:3){
sample <- floor(runif(n = 1,min = 1,max = 56))
img_path <- as.character(testing(data)[[sample,1]])
img <- stack(img_path)
plotRGB(img,margins=T,main = paste("prediction:",round(predictions[sample],digits=3)," | ","label:",as.character(testing(data)[[sample,2]])))
}
```
We can now try to predict on the actual test area. Again, we will for now assume the data to already sit on our disk as subsets, so we just have to import the data run the model on it.
```{r prepare_testarea}
# get all file paths of the images of our test area
subset_list <- list.files(path="./testarea_part1/subsets/", full.names = T)
```
We will use a similar workflow as shown in the [Preparing your data](#dataprep) section, but we can skip some of the steps when doing prediction (e.g. shuffling, assigning labels).
```{r hidden_order, echo=F}
o <- order(as.numeric(tools::file_path_sans_ext(list.files("./testarea_part1/subsets/"))))
subset_list <- list.files("./testarea_part1/subsets/", full.names = T)[o]
```
```{r predict_testarea}
dataset <- tensor_slices_dataset(subset_list)
dataset <- dataset_map(dataset, function(.x)
tf$image$decode_jpeg(tf$io$read_file(.x)))
dataset <- dataset_map(dataset, function(.x)
tf$image$convert_image_dtype(.x, dtype = tf$float32))
dataset <- dataset_map(dataset, function(.x)
tf$image$resize(.x, size = shape(128, 128)))
dataset <- dataset_batch(dataset, 10L)
dataset <- dataset_map(dataset, unname)
predictions <- predict(first_model, dataset)
```
Instead of checking single tiles, let´s look at how the result looks when reassembled on the map (no worries, youl´ll see one way how to do the reassembling in part II).
Here, all tiles with probability for wildrye of $<0.5$ have been removed.
```{r hidden_rebuild_map, echo = F, warning=F,results=F,message=F}
predictions <- array(data= rep(predictions,128*128),dim = c(length(predictions),128,128,1))
input_img <- stack("./testarea_part1/testarea_crop.tif")
rebuild_img <- function(subsets,out_path,target_rst){
require(raster)
require(gdalUtils)
require(stars)
subset_pixels_x <- ncol(subsets[1,,,])
subset_pixels_y <- nrow(subsets[1,,,])
tiles_rows <- nrow(target_rst)/subset_pixels_y
tiles_cols <- ncol(target_rst)/subset_pixels_x
# load target image to determine dimensions
target_stars <- st_as_stars(target_rst,proxy=F)
#prepare subfolder for output
result_folder <- paste0(out_path,"out")
if(dir.exists(result_folder)){
unlink(result_folder,recursive = T)
}
dir.create(path = result_folder)
#for each tile, create a stars from corresponding predictions,
#assign dimensions using original/target image, and save as tif:
for (crow in 1:tiles_rows){
for (ccol in 1:tiles_cols){
i <- (crow-1)*tiles_cols + (ccol-1) +1
dimx <- c(((ccol-1)*subset_pixels_x+1),(ccol*subset_pixels_x))
dimy <- c(((crow-1)*subset_pixels_y+1),(crow*subset_pixels_y))
cstars <- st_as_stars(t(subsets[i,,,1]))
attr(cstars,"dimensions")[[2]]$delta=-1
#set dimensions using original raster
st_dimensions(cstars) <- st_dimensions(target_stars[,dimx[1]:dimx[2],dimy[1]:dimy[2]])[1:2]
write_stars(cstars,dsn = paste0(result_folder,"/_out_",i,".tif"))
}
}
#mosaic the created tifs
starstiles <- as.vector(list.files(result_folder,full.names = T),mode = "character")
gdalbuildvrt(starstiles,paste0(result_folder,"/mosaic.vrt"))
gdalwarp(paste0(result_folder,"/mosaic.vrt"), paste0(result_folder,"/mosaic.tif"))
}
rebuild_img(predictions,out_path = "./testarea_part1/",target_rst = input_img)
result_map <- raster("./testarea_part1/out/mosaic.tif")%>%readAll()
result_map[result_map[[1]]<0.5] <- NA
agg <- suppressMessages(aggregate(result_map[[1]],c(128,128),fun="max"))
result_scratch <- suppressMessages(rasterToPolygons(agg))
```
```{r hidden_map, echo = F, warning=F,results=T,message=F,out.width="100%"}
viewRGB(input_img,layer.name = "Input image", quantiles = c(0,1),r=1,g=2,b=3)+mapview(result_scratch,layer.name="wildrye prediction",alpha.regions=0.4,na.alpha=0,col.regions =c("blue","red","yellow"))
```
If you want to save a model, you can do that by:
```{r save_model}
save_model_hdf5(first_model,filepath = "./my_first_model.h5")
```
Given the rather puristic model and low number of training samples, the result doesn´t look too bad. However, there are some issues with it. The most prominent one is that we are predicting on the level of image tiles. Although the size of the tiles is already pretty small, there are still a number of mixed tiles (target and non-target). Those tiles are harder to train on and to predict, and the prediction is more likely to be wrong. In addition, even if it correctly identifies the presence of the target, this result is of course only true for parts of the such a tile. We cannot easily tell where and to what degree the target is present in the subset.
While we could further reduce the tile size and thereby increase the spatial "resolution" of the result, this would also decrease the degree to which spatial context can be taken into account, and it would increase the technical overhead of managing a lot of image subsets. For spatial applications, it would be much more appealing to receive a result on a pixel-basis, where our level of detail is not depending on the tile size.
In part II, we will look at a model architecture called "U-net" which allows our model to make pixel-by-pixel predictions within image tiles while still taking the spatial context of pixels into account.
In addition, we will look at a few common techniques (transfer learning, data augmentation), that can improve our training and prediction. Finally, we will have a look at how to get from a remote sensing image to the subsets on which neural networks train, and from the resulting prediction tiles back to a prediction map of the area.
# Part II: Extensions
## Using pretrained networks
One big advantage of deep learning is the portability of learned patterns. As mentioned above, the first layers of a CNN mainly extract generic features and patterns (edges of various forms and directions, patterns and arrangements of pixel values). Even if it was trained on a completely different dataset, these feature extractors may be useful for our use case as well. In the following example, we use a CNN called "VGG16" [@Simonyan2014], that was trained on the ["ImageNet"](http://image-net.org/index) dataset.
```{r,echo=F}
unet_scratch <- magick::image_read("./figures/first_model.png")
unet_pretrained<- magick::image_read("./figures/pretrained_model.png")
magick::image_animate(c(unet_scratch,unet_pretrained),fps=1)
```
This data set consists of animals and everyday objects like cats, bikes, balloons and so on. However, using the first layers for feature extraction works for us too in extracting features from our images that we can use to distinguish between targets and non-targets. We will use the first 15 layers as a basis and just add "our own" dense layers for conducting the actual classification. For some further thoughts on this form of transfer learning, [see this side note](#transfer_learning)
```{r pretrained_image_rec}
# load vgg16 as basis for feature extraction
vgg16_feat_extr <- application_vgg16(include_top = F,input_shape = c(128,128,3),weights = "imagenet")
#freeze weights
freeze_weights(vgg16_feat_extr)
#use only layers 1 to 15
pretrained_model <- keras_model_sequential(vgg16_feat_extr$layers[1:15])
# add flatten and dense layers for classification
# -> these dense layers are going to be trained on our data only
pretrained_model <- layer_flatten(pretrained_model)
pretrained_model <- layer_dense(pretrained_model,units = 256,activation = "relu")
pretrained_model <- layer_dense(pretrained_model,units = 1,activation = "sigmoid")
pretrained_model
```
The function `freeze_weights()` can be used to freeze the parameters of (parts of) or model, so they don´t get updated during training. As you can see, this model is more complex, but since we will only train the dense layers on our data, we don´t have to learn all the parameters of the model.
Let´s see how training works on this one.
```{r fit_pretrained_image_rec}
compile(
pretrained_model,
optimizer = optimizer_rmsprop(lr = 1e-5),
loss = "binary_crossentropy",
metrics = c("accuracy")
)
diagnostics <- fit(pretrained_model,
training_dataset,
epochs = 6,
validation_data = validation_dataset)
plot(diagnostics)
```
We can see, that now the training curve flattens at a high accuracy already after ~3-5 epochs.
The metrics are:
```{r}
diagnostics$metrics
```
Predicting the test area with this model gives:
```{r predtict_pretrained_image_rec, echo=F, warning=F,results=F}
predictions <- predict(pretrained_model,dataset)
predictions <- array(data= rep(predictions,128*128),dim = c(length(predictions),128,128,1))
if(!dir.exists("./testarea_part1/pretrained/")){dir.create("./testarea_part1/pretrained/")}
rebuild_img(predictions,out_path = "./testarea_part1/pretrained/",target_rst = input_img)
result_map <- raster("./testarea_part1/pretrained/out/mosaic.tif")%>%readAll()
result_map[result_map[[1]]<0.5] <- NA
agg <- suppressMessages(aggregate(result_map[[1]],c(128,128),fun="max"))
result_pretr <- suppressMessages(rasterToPolygons(agg))
```
```{r show_result_pretrained, echo=F, warning=F,results=T, out.width="100%"}
viewRGB(input_img,layer.name = "input image",quantiles = c(0,1),r=1,g=2,b=3)+mapview(result_pretr,layer.name="wildrye prediction", alpha.regions=0.4,na.alpha=0,col.regions =c("blue","red","yellow"))
```
## From image to pixel-by-pixel classification
Up to now, we have conducted a simple binary image classification. In the next step we will build a model that is able of a pixel-by-pixel classification. Instead of telling us _if_ there is a target object present in an image subset it tells us _where_ in the image a target is present. To build such a model we will adopt the model architecture of a U-net, proposed by [@Ronneberger2015]
The typical architecture is shown in the figure below. It as a sketch of a largely simplified version only for illustration. The original has a lot more layers
```{r,echo=F, fig.align="center"}
magick::image_draw(magick::image_read("./figures/unet_small.png"))
dev.off()
```
The idea is the following: The network starts with a sequence of convolutional and maxpooling layers. As we have seen before, this sequence extracts features and spatial patterns of growing complexity (with growing number of convolutional/maxpooling blocks). This part is also called the _contracting path_ of the network. The extraction of ever more complex and abstract features happens 'at the cost' of downsampling the resolution, thus loosing information about where exactly certain features are.
This path is not much different from the network architecture we have looked at so far. After the last maxpooling layer we could simply insert a flatten layer and one or two dense layers and we would end up with a CNN doing image classification for us. The network would be more complex than the one we have built before (in that it has more convolutional layers and therefore learns more complex spatial structures) but the idea would still be the same.
The U-net architecture does not stop here, but instead adds another part (the second half of the "U"), which is also called the _expansive path_.
In this expansive path of the network, the feature maps are being resampled back to the same resolution as the input image. This is done by a sequence of upsampling and convolutional layers. The important trick here is that the feature maps are not just simply upsampled, but also combined with the outputs of the corresponding (in terms of resolution) layers in the contracting path that carry the high-resolution features that were originally extracted there (pictured by the grey arrows). This is done by concatenating the layers. As a result, the successive convolutional layers take into account the information on complex spatial patterns aggregated in the whole contracting path, _but also_ the high-resolution features originally extracted at the level of the corresponding intermediate layer of the contracting path. Thus, it is able to more precisely localize the targets within the image subset.
Let´s try to build a simple U-net on our own. The architecture will be a small, slightly adapted version of the one proposed [@Ronneberger2015] , here with only 2 convolutional and 2 upsampling blocks, as shown in the figure above.
```{r simple_Unet}
## we start with the "contratcing path"##
# input
input_tensor <- layer_input(shape = c(448,448,3))
#conv block 1
unet_tensor <- layer_conv_2d(input_tensor,filters = 64,kernel_size = c(3,3), padding = "same",activation = "relu")
conc_tensor2 <- layer_conv_2d(unet_tensor,filters = 64,kernel_size = c(3,3), padding = "same",activation = "relu")
unet_tensor <- layer_max_pooling_2d(conc_tensor2)
#conv block 2
unet_tensor <- layer_conv_2d(unet_tensor,filters = 128,kernel_size = c(3,3), padding = "same",activation = "relu")
conc_tensor1 <- layer_conv_2d(unet_tensor,filters = 128,kernel_size = c(3,3), padding = "same",activation = "relu")
unet_tensor <- layer_max_pooling_2d(conc_tensor1)
#"bottom curve" of unet
unet_tensor <- layer_conv_2d(unet_tensor,filters = 256,kernel_size = c(3,3), padding = "same",activation = "relu")
unet_tensor <- layer_conv_2d(unet_tensor,filters = 256,kernel_size = c(3,3), padding = "same",activation = "relu")
## this is where the expanding path begins ##
# upsampling block 1
unet_tensor <- layer_conv_2d_transpose(unet_tensor,filters = 128,kernel_size = c(2,2),strides = 2,padding = "same")
unet_tensor <- layer_concatenate(list(conc_tensor1,unet_tensor))
unet_tensor <- layer_conv_2d(unet_tensor, filters = 128, kernel_size = c(3,3),padding = "same", activation = "relu")
unet_tensor <- layer_conv_2d(unet_tensor, filters = 128, kernel_size = c(3,3),padding = "same", activation = "relu")
# upsampling block 2
unet_tensor <- layer_conv_2d_transpose(unet_tensor,filters = 64,kernel_size = c(2,2),strides = 2,padding = "same")
unet_tensor <- layer_concatenate(list(conc_tensor2,unet_tensor))
unet_tensor <- layer_conv_2d(unet_tensor, filters = 64, kernel_size = c(3,3),padding = "same", activation = "relu")
unet_tensor <- layer_conv_2d(unet_tensor, filters = 64, kernel_size = c(3,3),padding = "same", activation = "relu")
# output
unet_tensor <- layer_conv_2d(unet_tensor,filters = 1,kernel_size = 1, activation = "sigmoid")
# combine final unet_tensor (carrying all the transformations applied through the layers)
# with input_tensor to create model
unet_model <- keras_model(inputs = input_tensor, outputs = unet_tensor)
```
Looking at the code chunk above, you will have noticed that we use a slightly different way of stacking layers together in this example. This is because now we are not using the sequential model as before, but the _functional API_ in Keras, which is a more general and more flexible way to build models. As mentioned in [the section on building a model](#build_model) the sequential model using `keras_model_sequential()` works just fine for many use cases. In this example however, we need a more flexible solution, because we are not building a directed linear stack of layers, but try to reinject intermediate layer outputs into further downstream layers of the model. I´ll leave you with this short description for now, for more info on how to use the functional API, see [the corresponding side note](#functional_API).
For the upsampling, [transposed convolution](#transp_conv) is used.
### Using pretrained layers in a U-net
We can adopt the strategy of using parts of a pretrained model also in our U-net. As mentioned before, the contracting path is very similar to image recognition models like the one we have built in part I. It is even more similar to VGG16. What if we could use some of the first layers of the pretrained VGG16 model for feature extraction in the contracting path of our U-net, and just add the expanding path?
The following code shows how to do that:
```{r pretrained_unet}
## load pretrained vgg16 and use part of it as contracting path (feature extraction) ##
vgg16_feat_extr <- application_vgg16(weights = "imagenet", include_top = FALSE, input_shape = c (448,448,3))
# optionally freeze first layers to prevent changing of their weights, either whole convbase or only certain layers
# freeze_weights(vgg16_feat_extr) #or:
# freeze_weights(vgg16_feat_extr, to = "block1_pool")
# we'll not use the whole model but only up to layer 15
unet_tensor <- vgg16_feat_extr$layers[[15]]$output
## add the second part of 'U' for segemntation ##
# "bottom curve" of U-net
unet_tensor <- layer_conv_2d(unet_tensor, filters = 1024, kernel_size = 3, padding = "same", activation = "relu")
unet_tensor <- layer_conv_2d(unet_tensor, filters = 1024, kernel_size = 3, padding = "same", activation = "relu")
# upsampling block 1
unet_tensor <- layer_conv_2d_transpose(unet_tensor, filters = 512, kernel_size = 2, strides = 2, padding = "same")
unet_tensor <- layer_concatenate(list(vgg16_feat_extr$layers[[14]]$output, unet_tensor))
unet_tensor <- layer_conv_2d(unet_tensor, filters = 512, kernel_size = 3, padding = "same", activation = "relu")
unet_tensor <- layer_conv_2d(unet_tensor, filters = 512, kernel_size = 3, padding = "same", activation = "relu")
# upsampling block 2
unet_tensor <- layer_conv_2d_transpose(unet_tensor, filters = 256, kernel_size = 2, strides = 2, padding = "same")
unet_tensor <- layer_concatenate(list(vgg16_feat_extr$layers[[10]]$output, unet_tensor))
unet_tensor <- layer_conv_2d(unet_tensor,filters = 256, kernel_size = 3, padding = "same", activation = "relu")
unet_tensor <- layer_conv_2d(unet_tensor,filters = 256, kernel_size = 3, padding = "same", activation = "relu")
# upsampling block 3
unet_tensor <- layer_conv_2d_transpose(unet_tensor, filters = 128, kernel_size = 2, strides = 2, padding = "same")
unet_tensor <- layer_concatenate(list(vgg16_feat_extr$layers[[6]]$output, unet_tensor))
unet_tensor <- layer_conv_2d(unet_tensor, filters = 128, kernel_size = 3, padding = "same", activation = "relu")
unet_tensor <- layer_conv_2d(unet_tensor, filters = 128, kernel_size = 3, padding = "same", activation = "relu")
# upsampling block 4
unet_tensor <- layer_conv_2d_transpose(unet_tensor, filters = 64, kernel_size = 2, strides = 2, padding = "same")
unet_tensor <- layer_concatenate(list(vgg16_feat_extr$layers[[3]]$output, unet_tensor))
unet_tensor <- layer_conv_2d(unet_tensor, filters = 64, kernel_size = 3, padding = "same", activation = "relu")
unet_tensor <- layer_conv_2d(unet_tensor, filters = 64, kernel_size = 3, padding = "same", activation = "relu")
# final output
unet_tensor <- layer_conv_2d(unet_tensor, filters = 1, kernel_size = 1, activation = "sigmoid")
# create model from tensors
pretrained_unet <- keras_model(inputs = vgg16_feat_extr$input, outputs = unet_tensor)
```
```{r,echo=F}
unet_scratch <- magick::image_read("./figures/unet.png")
unet_pretrained<- magick::image_read("./figures/pretrained_unet.png")
magick::image_animate(c(unet_scratch,unet_pretrained),fps=1)
```
Before we can train our new U-net and see how it performs on the data, we have to prepare our training samples again. This time we have to prepare them differently because we are now not aiming for an image classification (a prediction between "0" and "1" per image) but for a pixel-by-pixel classification. Accordingly, we now need training data with "labels" on a pixel basis, too. We therefore need masks that we get from digitizing the training data and that need to be prepared as image subsets like the training images, one mask subset for each image subset. Each masks consists of pixels of value 1 (target) or zero (non-target/background).
While preparing this new data, we also want to adopt one more strategy that can help tackle small-data problems: data augmentation.
## Data augmentation {#data_augmentation}
One popular way to synthetically increase the amount of training data is to use data augmentation. This means applying image transformations and manipulations to the training data, that result in synthetic new images to train on. `keras` provides quite a few image augmentation techniques as part of its image preprocessing pipeline. In our tutorial we do not use this pipeline, but instead apply the `tfdatasets` package. Luckily, the tensorflow module itself also provides functions for simple image augmentation. Examples are:
- `tf$image$flip_left_right()`
- `tf$image$flip_up_down()`
In addition to geometric manipulations like flipping or rotating images, we can make spectral manipulations to simulate different light conditions, e.g.:
- `tf$image$random_brightness()`
- `tf$image$random_saturation()`
The following function (adapted from: https://blogs.rstudio.com/ai/posts/2019-08-23-unet/ (accessed 2020-08-12)) applies small random changes to brightness, saturation and hue of images.
```{r spectral_augmentation}
spectral_augmentation <- function(img) {
img <- tf$image$random_brightness(img, max_delta = 0.3)
img <- tf$image$random_contrast(img, lower = 0.8, upper = 1.2)
img <- tf$image$random_saturation(img, lower = 0.8, upper = 1.2)
# make sure we still are between 0 and 1
img <- tf$clip_by_value(img,0, 1)
}
```
Now we have some methods at hand to augment our small training dataset. Since we will use masks as reference for our next model, we have to keep in mind to also change those masks whenever we apply geometric transformations (because otherwise the masks won´t fit anymore). Here, the approach using `tfdatasets` from above comes in handy, because it allows to systematically apply functions to `dataset` elements or parts of these elements and helps us to simultaneously make the same geometric manipulations to images and the corresponding masks.
The following function prepares the dataset and -in case of training data- applies three different augmentations on the input:
- flip images (and masks) left to right