diff --git a/TAA1/lecture.md b/TAA1/lecture.md
index 803f746..f4e049c 100644
--- a/TAA1/lecture.md
+++ b/TAA1/lecture.md
@@ -13,71 +13,29 @@ kernelspec:
name: python3
---
-# Lecture: Earth Observation & Google Earth Engine
+# Lecture: Introduction to Statistical Learning on Geospatial Data
+We will start off this course with a primer on statistical learning for geospatial data. You will become acquainted with the basic principles of statistical learning, and we will run a couple of models on data collected from Google Earth Engine.
-This week we will focus on what we can do with earth observation data and learn how to use the google earth engine to unleash the potential of earth observation data.
-
-`````{admonition} Learning objectives week 4
+`````{admonition} Learning objectives
:class: important
-- Gain a basic understanding of the Google Earth Engine.
-- Understand and know how you can use the Google Earth engine in Python.
-- Understand the differences between supervised and unsupervised learning
-- Know how to apply a supervised classification algorithm.
-- Understand how droughts can be detected through remote sensing.
+- Gain a basic understanding of Google Earth Engine's Python bindings.
+- Become acquainted with common terms and methods for supervised statistical learning.
+- Learn the basic workflow of supervised statistical learning for classification tasks.
+- Learn some tricks to more efficiently train and evaluate models.
+- Understand some of the nuances of applying models on geopatial data.
`````
+In this tutorial we only consider `supervised learning`, which is the process of learning from labelled reference data in order to predict. In TAA3, you will become aquainted with `unsupervised learning` methods as well, such as clustering.
-## Supervised learning
-Supervised learning uses a training set to teach models to yield the desired output. This training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.
-
-Supervised learning can be separated into two types of problems when data mining—classification and regression:
-
-- **Classification** uses an algorithm to accurately assign test data into specific categories. It recognizes specific entities within the dataset and attempts to draw some conclusions on how those entities should be labeled or defined. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor, and random forest, which are described in more detail below.
-
-- **Regression** is used to understand the relationship between dependent and independent variables. It is commonly used to make projections, such as for sales revenue for a given business. Linear regression, logistical regression, and polynomial regression are popular regression algorithms.
-
-## Supervised learning algorithms
-Various algorithms and computation techniques are used in supervised machine learning processes. We already discussed **Random Forests** and **Neural Networks** in detail during the lectures. Below you can find an overview of some of the other algorithms that you can apply within our tutorial.
-
-**Naive Bayes**
-Naive Bayes is classification approach that adopts the principle of class conditional independence from the Bayes Theorem. This means that the presence of one feature does not impact the presence of another in the probability of a given outcome, and each predictor has an equal effect on that result. There are three types of Naïve Bayes classifiers: Multinomial Naïve Bayes, Bernoulli Naïve Bayes, and Gaussian Naïve Bayes. This technique is primarily used in text classification, spam identification, and recommendation systems.
-
-**Support vector machine (SVM)**
-A support vector machine is a popular supervised learning model developed by Vladimir Vapnik, used for both data classification and regression. That said, it is typically leveraged for classification problems, constructing a hyperplane where the distance between two classes of data points is at its maximum. This hyperplane is known as the decision boundary, separating the classes of data points (e.g., oranges vs. apples) on either side of the plane.
-
-**K-nearest neighbor**
-K-nearest neighbor, also known as the KNN algorithm, is a non-parametric algorithm that classifies data points based on their proximity and association to other available data. This algorithm assumes that similar data points can be found near each other. As a result, it seeks to calculate the distance between data points, usually through Euclidean distance, and then it assigns a category based on the most frequent category or average.
-
-Its ease of use and low calculation time make it a preferred algorithm by data scientists, but as the test dataset grows, the processing time lengthens, making it less appealing for classification tasks. KNN is typically used for recommendation engines and image recognition.
-
-## Unsupervised learning
-Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled data sets. These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”).
-
-Unsupervised learning models are used for three main tasks: clustering, association and dimensionality reduction:
-
-- **Clustering** is a data mining technique for grouping unlabeled data based on their similarities or differences. For example, K-means clustering algorithms assign similar data points into groups, where the K value represents the size of the grouping and granularity. This technique is helpful for market segmentation, image compression, etc.
-- **Association** is another type of unsupervised learning method that uses different rules to find relationships between variables in a given dataset. These methods are frequently used for market basket analysis and recommendation engines, along the lines of “Customers Who Bought This Item Also Bought” recommendations.
-- **Dimensionality** reduction is a learning technique used when the number of features (or dimensions) in a given dataset is too high. It reduces the number of data inputs to a manageable size while also preserving the data integrity. Often, this technique is used in the preprocessing data stage, such as when autoencoders remove noise from visual data to improve picture quality.
+## Reading
+The book [An Introduction to Statistical Learning](https://www.stat.berkeley.edu/users/rabbee/s154/ISLR_First_Printing.pdf) by James, Witten, Hastie, and Tibshirani provides an excellent introduction to statistical learning. For this tutorial, we suggest to read sections 2.1 and 2.2 in its entirety, to gain a foundational understanding of statistical learning. Our goal in this tutorial is to provide the basic *why* understanding of statistical learning, because at the end of the day, modelling is a mindset, and individual models are just tools to support this mindset.
-## The main differences between supervised and unsupervised learning: Labeled data
-
-The main distinction between the two approaches is the use of labeled datasets. To put it simply, supervised learning uses labeled input and output data, while an unsupervised learning algorithm does not.
-
-In supervised learning, the algorithm “learns” from the training dataset by iteratively making predictions on the data and adjusting for the correct answer. While supervised learning models tend to be more accurate than unsupervised learning models, they require upfront human intervention to label the data appropriately. For example, a supervised learning model can predict how long your commute will be based on the time of day, weather conditions and so on. But first, you’ll have to train it to know that rainy weather extends the driving time.
-
-Unsupervised learning models, in contrast, work on their own to discover the inherent structure of unlabeled data. Note that they still require some human intervention for validating output variables. For example, an unsupervised learning model can identify that online shoppers often purchase groups of products at the same time. However, a data analyst would need to validate that it makes sense for a recommendation engine to group baby clothes with an order of diapers, applesauce and sippy cups.
-
-## Other key differences between supervised and unsupervised learning
-
-**Goals**: In supervised learning, the goal is to predict outcomes for new data. You know up front the type of results to expect. With an unsupervised learning algorithm, the goal is to get insights from large volumes of new data. The machine learning itself determines what is different or interesting from the dataset.
-
-**Applications**: Supervised learning models are ideal for spam detection, sentiment analysis, weather forecasting and pricing predictions, among other things. In contrast, unsupervised learning is a great fit for anomaly detection, recommendation engines, customer personas and medical imaging.
-
-**Complexity**: Supervised learning is a simple method for machine learning, typically calculated through the use of statistical software like R or Python. In unsupervised learning, you need powerful tools for working with large amounts of unclassified data. Unsupervised learning models are computationally complex because they need a large training set to produce intended outcomes.
-
-**Drawbacks**: Supervised learning models can be time-consuming to train, and the labels for input and output variables require expertise. Meanwhile, unsupervised learning methods can have wildly inaccurate results unless you have human intervention to validate the output variables.
+With that in mind, the following sections cover parts of the tutorial that may be useful for further reading, but are entirely optional.
+- 4.3, for a better understanding of the logistic regression
+- 8.1.2, for a primer on decision tree models
+- 8.2.2, for more infomation about random forest models
## Google Earth Engine
-Google Earth Engine is a cloud-based platform for planetary-scale geospatial analysis that brings Google's computational capabilities to bear on a variety of high-impact societal issues including deforestation, drought, disaster, disease, food security, water management, climate monitoring and environmental protection.
+Google Earth Engine (often abbreviated to GEE) is a cloud-based platform for planetary-scale geospatial analysis that brings Google's computational capabilities to bear on a variety of high-impact societal issues including deforestation, drought, disaster, disease, food security, water management, climate monitoring and environmental protection.
The Google Earth Engine consists of a multi-petabyte analysis-ready data catalog co-located with a high-performance, intrinsically parallel computation service. It is accessed and controlled through an Internet-accessible application programming interface (API) and an associated web-based interactive development environment (IDE) that enables rapid prototyping and visualization of results.
@@ -85,13 +43,7 @@ The data catalog houses a large repository of publicly available geospatial data
Users can access and analyze data from the public catalog as well as their own private data using a library of operators provided by the Earth Engine API. These operators are implemented in a large parallel processing system that automatically subdivides and distributes computations, providing high-throughput analysis capabilities. Users access the API either through a thin client library or through a web-based interactive development environment built on top of that client library.
-## How to gain access to the Google Earth Engine
+## How to gain access to Google Earth Engine
To be able to use the Google Earth Engine during the tutorial, we will need to sign up.
-The first step is to register. To do so, you have to go to the [Sign-up page](https://signup.earthengine.google.com/#!/). You can use your university account to register.
-
-
-### Sources
-
-- [IBM Supervised Learning Explained](https://www.ibm.com/cloud/learn/supervised-learning)
-- [IBM Supervised vs Unsupervised Learning Explained](https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning)
\ No newline at end of file
+The first step is to register. To do so, you have to go to the [Sign-up page](https://signup.earthengine.google.com/#!/). You can use your university account to register.
\ No newline at end of file
diff --git a/TAA1/tutorial.ipynb b/TAA1/tutorial.ipynb
index c4ffa70..636a28c 100644
--- a/TAA1/tutorial.ipynb
+++ b/TAA1/tutorial.ipynb
@@ -1,2357 +1,1432 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "TCDarmwZ0B_P",
- "tags": []
- },
- "source": [
- "# TAA1: Land-cover classification\n",
- "\n",
- "In this tutorial we are going to explore the power of the **Google Earth Engine** through classifying land-use categories on satelite imagery. In particular, we will focus on supervised classification. Supervised classification refers to the process of using a training dataset with known labels to guide a mathematical classifier in the task of labeling spectral space. They key characteristic is that the training dataset guides (or “supervises”) the labeling.\n",
- "\n",
- "The `Classifier` package within the **Google Earth Engine** handles supervised classification by traditional ML algorithms running in Earth Engine. These classifiers include CART, RandomForest, NaiveBayes and SVM. The general workflow for classification is:\n",
- "\n",
- " 1. Collect training data. Assemble features which have a property that stores the known class label and properties storing numeric values for the predictors.\n",
- " 2. Instantiate a classifier. Set its parameters if necessary.\n",
- " 3. Train the classifier using the training data.\n",
- " 4. Classify an image or feature collection.\n",
- " 5. Estimate the classification error with independent validation data.\n",
- " \n",
- "## Important before we start\n",
- "
\n",
- "Make sure that you save this file before you continue, else you will lose everything. To do so, go to Bestand/File and click on Een kopie opslaan in Drive/Save a Copy on Drive!\n",
- "\n",
- "Now, rename the file into Week4_Tutorial1.ipynb. You can do so by clicking on the name in the top of this screen."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "hhGbHqRr0B_V"
- },
- "source": [
- "## Learning Objectives\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "d0cuIiyw0B_V",
- "tags": []
- },
- "source": [
- "- To understand how one can extract data from the **Google Earth Engine**.\n",
- "- To know how to visualize data from the **Google Earth Engine** in Python.\n",
- "- To understand and apply a supervised classification algorithm.\n",
- "- To be able to create training data and a simple classifier.\n",
- "- To classify Landsat-8 data using various training datasets.\n",
- "- To be able to compare and judge the performance of a land-use classification."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "6Q_vJ5ZE0B_W"
- },
- "source": [
- "Tutorial Outline
\n",
- "
\n",
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "q8qSTJEg0B_X",
- "tags": []
- },
- "source": [
- "## 1. Introducing the packages\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "T2Ta4PUs0B_X"
- },
- "source": [
- "Within this tutorial, we are going to make use of the following packages: \n",
- "\n",
- "[**ee**](https://developers.google.com/earth-engine/guides/python_install) is a Python package to use the the Google Earth Engine.\n",
- "\n",
- "[**geemap**](https://geemap.org/) is a Python package for interactive mapping with the Google Earth Engine.\n",
- "\n",
- "*We will first need to install these packages in the cell below. Uncomment them to make sure we can pip install them*"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 11735,
- "status": "ok",
- "timestamp": 1675255686011,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "M1onhJMh0B_Y",
- "outputId": "3caeae50-d278-42fa-cd67-b443d59e525a"
- },
- "outputs": [],
- "source": [
- "!pip install rioxarray"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we will import these packages in the cell below:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 2849,
- "status": "ok",
- "timestamp": 1675255702820,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "wvqt9J2P0B_Z"
- },
- "outputs": [],
- "source": [
- "import ee\n",
- "import geemap\n",
- "import matplotlib.pyplot as plt\n",
- "import xarray as xr"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "B7c9PTZs0B_a"
- },
- "source": [
- "## 2.Extracting and exploring the Landsat-8 data\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "sehHohh50B_b"
- },
- "source": [
- "Unfortunately it is not as simple as just running the code below. As soon as you run it, you will notice that you need to authorize ourselves to be able to use the Google Earth Engine.\n",
- "\n",
- "To do so, we will first have to register ourselves. on the website of the [Google Earth Engine](https://earthengine.google.com/). \n",
- "\n",
- "You can choose any location in Europe, but the code below will show the coordinates we have selected for the Netherlands. Feel free to change them."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 821,
- "referenced_widgets": [
- "006dc05428bf468e9724f0e0fcd388bd",
- "95d94c3ab3094ddb880ef0b9a2dce01b",
- "952b82e6c713459eb6ace9f47ada219a",
- "3d9a59a8a0a44f55ab359b66edbe3f9d",
- "036d4dad1c8f4dd3bcd37d3154fdf452",
- "ba860e12989c41e8b39524f943f8fc51",
- "402c171b86c9479f9d3cf2743b9055ff",
- "cf561e18f0dc4e578e23de278524ab78",
- "5eb406428f4d4fd2903c67dd9b9ce343",
- "c3e40ca8e6924930b1367da0f89045a3",
- "f15f98b1745248c7ae46f849862d6c9d",
- "a017674caf934e5ca12b9cf9cd77f2d1",
- "c462bdeb7af24cfd86fd72ac62e04fc7",
- "edcb631dedd44674b28172fa8d16da9b",
- "146fca8030fe4bd6b740d679d52c4838",
- "c2128a11f9bb4eb894374538f1201791",
- "2a12629894de48ce857509fbc3f44854",
- "09cfd28117b746d883b843c26ac8049c",
- "5f0e409a882f4506b42b2caaa442cc33",
- "e5e131b331024a248fe58b7ed690b170",
- "13369112cc4f4a00b2e66fd71fd18efd",
- "46d8a348d25b4416a46977c3fc2f290d",
- "259e9270784d43dd972a4fa676ed8444",
- "4cbf17f1bcf2440093c3609646365ee6",
- "741b145db27c41e4a098b52276a2a0e1",
- "0da7def710c2472ab80647a297cc6223",
- "ef2ab37274534537a07a30a7d56c5ea4",
- "e93fa2d603b04a1faafb9b1c28333856",
- "61b61a156a3644c19edec2945fab21b8",
- "a0bc2372a05943c19c357d3f7198ce30",
- "fc17ea0440bb4c1cbe6d3e4b98e0d1c7",
- "a4a869ed419c43839645d2835664636c",
- "4d10f43d2f6e46ab84d70132eaff4d60",
- "11dccc576b7345d98440c9df0d96c777",
- "557d923cb8e84c06b3b56629f6684900",
- "53e0928ed0914f6491e91a215b33573d",
- "082b7daf697b4de3befccfdc1f163478"
- ]
- },
- "executionInfo": {
- "elapsed": 1822,
- "status": "ok",
- "timestamp": 1675255707897,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "wGpIK8FO0B_b",
- "outputId": "153fad6a-b184-42c4-ba0c-ae060f901902"
- },
- "outputs": [],
- "source": [
- "Map = geemap.Map(height=800,width=700,center=[52.37,4.5],zoom=7)\n",
- "Map"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Frgjg-VU0B_c"
- },
- "source": [
- "Now we have succesfully managed to see a map of the Netherlands, let's add Landsat data to the map"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 569,
- "status": "ok",
- "timestamp": 1675255710789,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "NioWCK5v0B_d"
- },
- "outputs": [],
- "source": [
- "point = ee.Geometry.Point([5.0, 51.37])\n",
- "\n",
- "image = (\n",
- " ee.ImageCollection('LANDSAT/LC08/C01/T1_SR')\n",
- " .filterBounds(point)\n",
- " .filterDate('2020-01-01', '2020-12-31')\n",
- " .sort('CLOUD_COVER')\n",
- " .first()\n",
- " .select('B[1-7]')\n",
- ")\n",
- "\n",
- "vis_params = {'min': 0, 'max': 3000, 'bands': ['B4', 'B3', 'B2']}\n",
- "\n",
- "Map.centerObject(point, 8)\n",
- "Map.addLayer(image, vis_params, \"Landsat-8\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ZKCZxS9R0B_d"
- },
- "source": [
- "It worked! As we have specified that we wanted landsat data with as little clouds as possible, let's check the data and the actual cloud cover:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 36
- },
- "executionInfo": {
- "elapsed": 22,
- "status": "ok",
- "timestamp": 1675255713107,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "O4kmjkoM0B_e",
- "outputId": "1b801d42-590f-4781-e21c-bbfb94c1df98"
- },
- "outputs": [],
- "source": [
- "ee.Date(image.get('system:time_start')).format('YYYY-MM-dd').getInfo()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 29,
- "status": "ok",
- "timestamp": 1675255714335,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "fFTjGmvV0B_e",
- "outputId": "ca10a056-5993-4b76-a2f4-a4ec5e375101"
- },
- "outputs": [],
- "source": [
- "image.get('CLOUD_COVER').getInfo()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "pVHCpgVoi1O_"
- },
- "source": [
- "\n",
- "Question 1: Please provide the date and the cloud cover of your map. What does this cloud cover value tell you. Do you think it is good enough? Or do you maybe want to pick a different area. If so, where did you end up eventually (and with which cloud cover)?\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ZfU9Ih5W0B_f"
- },
- "source": [
- "## 3.Create training data using Corine Land Cover\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "IzJAm-n_0B_f"
- },
- "source": [
- "Training data is instrumental to supervised image classification. The training dataset is a labeled set of data that is used to inform or “train” a classifier. The trained classifier can then be applied to new data to create a classification. For example, land cover training data will contain examples of each class in the study’s legend. Based on these labels, the classifier can predict the most likely land cover class for each pixel in an image. This is an example of a categorical classification and the training labels are therefore categorical. By contrast, a continuous variable (e.g. percent tree cover) can be predicted using continuous training labels.\n",
- "\n",
- "Let us first specify the region we are interested in! There are several ways you can create a region for generating the training dataset.\n",
- "\n",
- "- Draw a shape (e.g., rectangle) on the map and the use `region = Map.user_roi`\n",
- "- Define a geometry, such as `region = ee.Geometry.Rectangle([-122.6003, 37.4831, -121.8036, 37.8288])`\n",
- "- Create a buffer zone around a point, such as `region = ee.Geometry.Point([-122.4439, 37.7538]).buffer(10000)`\n",
- "- If you don't define a region, it will use the image footprint by default"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 259,
- "status": "ok",
- "timestamp": 1675255718783,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "lOTsj73S0B_f"
- },
- "outputs": [],
- "source": [
- "region = ee.Geometry.Point([4.5, 52.37]).buffer(10000)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "fGt7peIO0B_g"
- },
- "source": [
- "Now let's explore the Corine Land Cover data.\n",
- "\n",
- "As you will see in the cell below, we use the path where the data is located within the **Google Earth Engine** and specify that we want to see the landcover. \n",
- "\n",
- "Within that same line, we also use the `.clip()` function in which we define that we specifically only want to see the same area as the Landsat-8 image we already selected."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 208,
- "status": "ok",
- "timestamp": 1675255720845,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "hH0QFPo30B_g"
- },
- "outputs": [],
- "source": [
- "CLC = ee.Image('COPERNICUS/CORINE/V20/100m/2012').select('landcover').clip(image.geometry())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "E-mSLVjz0B_g"
- },
- "source": [
- "And let's view it on a map again!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 821,
- "referenced_widgets": [
- "006dc05428bf468e9724f0e0fcd388bd",
- "95d94c3ab3094ddb880ef0b9a2dce01b",
- "952b82e6c713459eb6ace9f47ada219a",
- "3d9a59a8a0a44f55ab359b66edbe3f9d",
- "036d4dad1c8f4dd3bcd37d3154fdf452",
- "ba860e12989c41e8b39524f943f8fc51",
- "402c171b86c9479f9d3cf2743b9055ff",
- "cf561e18f0dc4e578e23de278524ab78",
- "5eb406428f4d4fd2903c67dd9b9ce343",
- "c3e40ca8e6924930b1367da0f89045a3",
- "f15f98b1745248c7ae46f849862d6c9d",
- "a017674caf934e5ca12b9cf9cd77f2d1",
- "c462bdeb7af24cfd86fd72ac62e04fc7",
- "0da7def710c2472ab80647a297cc6223",
- "ef2ab37274534537a07a30a7d56c5ea4",
- "edcb631dedd44674b28172fa8d16da9b",
- "c2128a11f9bb4eb894374538f1201791",
- "2a12629894de48ce857509fbc3f44854",
- "09cfd28117b746d883b843c26ac8049c",
- "5f0e409a882f4506b42b2caaa442cc33",
- "e5e131b331024a248fe58b7ed690b170",
- "13369112cc4f4a00b2e66fd71fd18efd",
- "46d8a348d25b4416a46977c3fc2f290d",
- "259e9270784d43dd972a4fa676ed8444",
- "4cbf17f1bcf2440093c3609646365ee6",
- "741b145db27c41e4a098b52276a2a0e1",
- "e93fa2d603b04a1faafb9b1c28333856",
- "61b61a156a3644c19edec2945fab21b8",
- "a0bc2372a05943c19c357d3f7198ce30",
- "fc17ea0440bb4c1cbe6d3e4b98e0d1c7",
- "a4a869ed419c43839645d2835664636c",
- "4d10f43d2f6e46ab84d70132eaff4d60",
- "11dccc576b7345d98440c9df0d96c777",
- "557d923cb8e84c06b3b56629f6684900",
- "53e0928ed0914f6491e91a215b33573d",
- "082b7daf697b4de3befccfdc1f163478"
- ]
- },
- "executionInfo": {
- "elapsed": 751,
- "status": "ok",
- "timestamp": 1675255723365,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "_QfHCJwW0B_h",
- "outputId": "4c695b9b-2ab9-4fca-aa1f-376b9c3f6e3e"
- },
- "outputs": [],
- "source": [
- "Map.addLayer(CLC, {}, 'CLC')\n",
- "Map"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "VLhHhXqK0B_h"
- },
- "source": [
- "Now we are going to make the training dataset.\n",
- "\n",
- "In the `sample()` function, we have to specify multiple arguments:\n",
- "\n",
- "- Through the `region`argument we specify the geographic boundary of our training dataset.\n",
- "- Through the `scale` argument we specify the size of the points for our sample (30 would indicate 30m).\n",
- "- Through the `numPixels` argument we specify how many pixels we want to extract. I suggest to not go above 5000.\n",
- "- Through the `seed` argument we specify which random selection order we want. Through using a seed, you can always reproduce your results.\n",
- "- Through the `geometries` argument we specify whether we want to extract the geometries as well. This should always be set to True.\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 2,
- "status": "ok",
- "timestamp": 1675255725378,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "P0C5_sCX0B_h"
- },
- "outputs": [],
- "source": [
- "points = CLC.sample(\n",
- " **{\n",
- " 'region': image.geometry(),\n",
- " 'scale': XXX, #check what the resolution of corine land cover is\n",
- " 'numPixels': XXX, \n",
- " 'seed': 0,\n",
- " 'geometries': True, # Set this to False to ignore geometries\n",
- " }\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "BI1EbbpgjYs5"
- },
- "source": [
- "\n",
- "Question 2: Describe the `.sample()` function in your own words. Which values have you chosen? And why?\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "DdvxczSf0B_h"
- },
- "source": [
- "Let's check how many points we got for our sample set."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 559,
- "status": "ok",
- "timestamp": 1675255729643,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "ttf-3K0T0B_i",
- "outputId": "94f9ae69-250d-4d15-a3d1-bd1400fdfcfa"
- },
- "outputs": [],
- "source": [
- "print(points.size().getInfo())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "uiV3u8I50B_i"
- },
- "source": [
- "And what is the information within each of the points:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 332,
- "status": "ok",
- "timestamp": 1675255731328,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "mORPLblO0B_i",
- "outputId": "81c0b9ea-64f9-40d1-e971-ec0e0aaa5cef",
- "tags": []
- },
- "outputs": [],
- "source": [
- "print(points.first().getInfo())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Sv7ywcDg0B_i"
- },
- "source": [
- "## 4.-Train the classifier using Corine Land Cover\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "T0ox2M7J0B_i"
- },
- "source": [
- "Now that we have created the points, we need to sample the Landsat-8 imagery using `.sampleRegions()`. This function will extract the reflectance in the designated bands for each of the points you have created. \n",
- "\n",
- "We will use reflectance from the optical, NIR, and SWIR bands (B2 - B7)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 213,
- "status": "ok",
- "timestamp": 1675255734314,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "2IS74IjT0B_i"
- },
- "outputs": [],
- "source": [
- "# Use these bands for prediction.\n",
- "bands = ['B2', 'B3', 'B4', 'B5', 'B6', 'B7']\n",
- "\n",
- "# This property of the table stores the land cover labels.\n",
- "label = 'landcover'\n",
- "\n",
- "# Overlay the points on the imagery to get training.\n",
- "training = image.select(bands).sampleRegions(\n",
- " **{'collection': points, 'properties': [label], 'scale': 100}\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "-jpbuNBc0B_j"
- },
- "source": [
- "A very important part of training a machine learning algorithm is that you ensure that there is some part of your training data left to validate your results. Here we split approximately 80% of the features into a training set and 20% into a validation set. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 207,
- "status": "ok",
- "timestamp": 1675255735649,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "ygy3v8Vl0B_j"
- },
- "outputs": [],
- "source": [
- "sample = points.randomColumn();\n",
- "trainingSample = sample.filter('random <= 0.8');\n",
- "validationSample = sample.filter('random > 0.8');"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "UdkbPQd10B_j"
- },
- "source": [
- "And finally we extract the pixels from the LandSat-8 imagery a well to finalize our training data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 330,
- "status": "ok",
- "timestamp": 1675255737244,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "O1pPDUhm0B_j"
- },
- "outputs": [],
- "source": [
- "# Overlay the points on the imagery to get training.\n",
- "training = image.select(bands).sampleRegions(\n",
- " **{'collection': trainingSample, 'properties': [label], 'scale': 100}\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 5,
- "status": "ok",
- "timestamp": 1675255738102,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "DK2Pn7VF0B_j"
- },
- "outputs": [],
- "source": [
- "# Overlay the points on the imagery to get training.\n",
- "validation = image.select(bands).sampleRegions(\n",
- " **{'collection': validationSample, 'properties': [label], 'scale': 100}\n",
- ")"
- ]
- },
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "Question 3: Describe in your own words why we want to split our data into training data and test data. Why is this important for the evaluation of our results? And what is the downside of this relative simple 80/20 split? Please refer back to some of the validation methods we have discussed in week 3.\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "JB9ZstM10B_k",
- "tags": []
- },
- "source": [
- "## 5. Classify and visualise the Landsat-8 using Corine Land Cover\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "qB4em_8m0B_k"
- },
- "source": [
- "We are going to try various algorithms to see how they perform when we want to classify our LandSat image into land-use categories.\n",
- "\n",
- "To assess the accuracy of a classifier, we use a `ConfusionMatrix`. The `sample()` method generated two random samples from the input data: one for training and one for validation. The training sample is used to train the classifier. You can get resubstitution accuracy on the training data from `classifier.confusionMatrix()`. To get validation accuracy, we classify the validation data. This adds a `classification` property to the validation `FeatureCollection`. We will call `errorMatrix()` on the classified `FeatureCollection` to get a confusion matrix representing validation (expected) accuracy. Don't worry if things are bit unclear, we will do this step-by-step."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "LpBDrhs60B_k"
- },
- "source": [
- "### Random Forests\n",
- "---\n",
- "Let's first use Random Forests to classify our satellite image. The inputs are specified in the [documentation](https://developers.google.com/earth-engine/apidocs/ee-classifier-smilerandomforest) of the Google Earth Engine. As you can see from the documentation, the most important input is to specify the amount of trees we are going to use. We will start with **1** and let's have a look how are model performs."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 200,
- "status": "ok",
- "timestamp": 1675255742180,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "OC8gLHPo0B_k"
- },
- "outputs": [],
- "source": [
- "trained_RF = ee.Classifier.smileRandomForest(1).train(training, label, bands)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "its-lAp50B_k"
- },
- "source": [
- "Let's first have a look at the accuracy of our training data. Accuracy represents the number of correctly classified data instances over the total number of data instances. We can calculate this by estimating a `confusion Matrix` and using the `accuracy()` function. The accuracy value ranges between 0 and 1. A value of 1 indicates that the model has fitted everything correctly.\n",
- "\n",
- "The `confusionMatrix()` computes a 2D confusion matrix for a classifier based on its training data (ie: resubstitution error). Axis 0 of the matrix correspond to the input classes (i.e., reference data), and axis 1 to the output classes (i.e., classification data). The rows and columns start at class 0 and increase sequentially up to the maximum class value.\n",
- "\n",
- "The overall Accuracy essentially tells us out of all of the reference sites what proportion were mapped correctly. The overall accuracy is usually expressed as a percent, with 100% accuracy being a perfect classification where all reference site were classified correctly. Overall accuracy is the easiest to calculate and understand but ultimately only provides the map user and producer with basic accuracy information."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 920,
- "status": "ok",
- "timestamp": 1675255744186,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "EjcgN80h0B_l",
- "outputId": "42f92fae-6445-422e-a605-4024993b94fc"
- },
- "outputs": [],
- "source": [
- "trainAccuracy = trained_RF.confusionMatrix()\n",
- "trainAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "IMpOi0rk0B_l"
- },
- "source": [
- "This is not very high. Let's use 10 trees and see how the results compare."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 489,
- "status": "ok",
- "timestamp": 1675255745440,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "qrEIkvVD0B_l",
- "outputId": "b7bd320b-b20f-4e22-fd4e-a642d619d85c"
- },
- "outputs": [],
- "source": [
- "trained_RF = ee.Classifier.smileRandomForest(XXX).train(training, label, bands)\n",
- "trainAccuracy = trained_RF.confusionMatrix()\n",
- "trainAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "_mCSUjeB0B_l",
- "tags": []
- },
- "source": [
- "This is more like it! What if we increase this towards 100:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 247,
- "status": "ok",
- "timestamp": 1675255746923,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "Pw5r3-9b0B_m",
- "outputId": "8d788365-e19e-4e42-dc89-7e293917fd95"
- },
- "outputs": [],
- "source": [
- "trained_RF = ee.Classifier.smileRandomForest(XXXX).train(training, label, bands)\n",
- "trainAccuracy = trained_RF.confusionMatrix()\n",
- "trainAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "TSQe7kyF0B_m"
- },
- "source": [
- "Almost everything is now classified correctly. can we reach 100% if we would use double the amount of trees?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 947,
- "status": "ok",
- "timestamp": 1675255748667,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "6PJ0d-sg0B_m",
- "outputId": "8208313b-411d-4191-ba13-0397e28f371e"
- },
- "outputs": [],
- "source": [
- "trained_RF = ee.Classifier.smileRandomForest(XXXX).train(training, label, bands)\n",
- "trainAccuracy = trained_RF.confusionMatrix()\n",
- "trainAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "BoMaUIoc0B_n"
- },
- "source": [
- "As you noticed, it also took notable longer to get the results. And we didnt gain too much either. It would be interesting to plot these results and see how the results converge. Let's collect all the accuracies for different amount of trees:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 2450,
- "status": "ok",
- "timestamp": 1675255752012,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "D-oYmE2n0B_n"
- },
- "outputs": [],
- "source": [
- "trees = [XX,XX,XX,XX,XX,XX,XXX]\n",
- "collect_accuracies = []\n",
- "for tree in trees:\n",
- " collect_accuracies.append(ee.Classifier.smileRandomForest(tree).train(training, label, bands).confusionMatrix().accuracy().getInfo())"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 282
- },
- "executionInfo": {
- "elapsed": 46,
- "status": "ok",
- "timestamp": 1675255752012,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "5snBKKjH0B_n",
- "outputId": "c75fc8e0-2d2e-437c-dac0-872096ba7c53"
- },
- "outputs": [],
- "source": [
- "plt.plot(trees,collect_accuracies)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "Question 4: Please describe the figure of your Random Forest Classifier progress. What do you think would be a good value to use, and why?\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "FDuRbAI_0B_n"
- },
- "source": [
- "The **Kappa Coefficient** is generated from a statistical test to evaluate the accuracy of a classification. **Kappa** essentially evaluates how well the classification performed as compared to just randomly assigning values, i.e. did the classification do better than random. The **Kappa Coefficient** can range from -1 to 1. A value of 0 indicated that the classification is no better than a random classification. A negative number indicates the classification is significantly worse than random. A value close to 1 indicates that the classification is significantly better than random."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 1168,
- "status": "ok",
- "timestamp": 1675255754998,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "mhiFrddp0B_n",
- "outputId": "eafb84b9-834a-4809-de5d-d930df4e9460"
- },
- "outputs": [],
- "source": [
- "trainAccuracy.kappa().getInfo()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "FddqpU1L0B_n"
- },
- "source": [
- "Now let's have a look at some validation of our results. Through the using the 20% we reserved for validation, we can see how the model performs for the data that we did not use to train our model. `errorMatrix` computes a 2D error matrix for a collection by comparing two columns of a collection: one containing the actual values, and one containing predicted values."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 2556,
- "status": "ok",
- "timestamp": 1675255757856,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "Ijg0C-P60B_n",
- "outputId": "722297a3-8abd-49c0-9372-edd337b36f25"
- },
- "outputs": [],
- "source": [
- "validationSample = validation.classify(trained_RF)\n",
- "validationAccuracy = validationSample.errorMatrix(label, 'classification')\n",
- "validationAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "yO9NB-Hglp2g"
- },
- "source": [
- "\n",
- "Question 5: Which values did you obtain for Kappa and your validation sample. Please interpret them.\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "StLnH3510B_o"
- },
- "source": [
- "### CART Classifier\n",
- "---\n",
- "Next, we use a **CART** classifier to find the best method to use the spectral values to separate the labels. The classifiers known as Classification and Regression Trees (CART) partition the spectral data space successive binary splits arranged in a tree form. Graphically, classification trees identify lines that successively split the data space to separate the training points into their classes."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 7,
- "status": "ok",
- "timestamp": 1675255757857,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "O02Zvc6c0B_o"
- },
- "outputs": [],
- "source": [
- "trained_CART = ee.Classifier.smileCart().train(training, label, bands)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "r8yr2-6C0B_o"
- },
- "source": [
- "Get a confusion matrix and overall accuracy for the training sample."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 223,
- "status": "ok",
- "timestamp": 1675255759740,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "DesfJjTu0B_o",
- "outputId": "2f9a5f63-ac94-4164-b876-49d5a1b9e05a"
- },
- "outputs": [],
- "source": [
- "trainAccuracy = trained_CART.confusionMatrix()\n",
- "trainAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "9mvK8a2q0B_o"
- },
- "source": [
- "Get a confusion matrix and overall accuracy for the validation sample."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 1562,
- "status": "ok",
- "timestamp": 1675255762999,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "kknOXH-i0B_o",
- "outputId": "0cddeebd-448a-41ac-c2ad-47563340e2d2"
- },
- "outputs": [],
- "source": [
- "validationSample = validation.classify(trained_CART)\n",
- "validationAccuracy = validationSample.errorMatrix(label, 'classification')\n",
- "validationAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "LWUF3S4I0B_p"
- },
- "source": [
- "### Naive Bayes\n",
- "---\n",
- "Now let's try to use the **Naive Bayes** classifier. This is classification approach that adopts the principle of class conditional independence from the Bayes Theorem. This means that the presence of one feature does not impact the presence of another in the probability of a given outcome, and each predictor has an equal effect on that result. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 6,
- "status": "ok",
- "timestamp": 1675255763000,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "n2Ya_jcd0B_p"
- },
- "outputs": [],
- "source": [
- "trained_NaiveBayes = ee.Classifier.smileNaiveBayes().train(training, label, bands)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "lUOIAQnL0B_p"
- },
- "source": [
- "And let's get the overall accuracy of the training data:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 609,
- "status": "ok",
- "timestamp": 1675255765049,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "zsKWKl3q0B_p",
- "outputId": "c10b2899-6b93-4d47-f490-4598918462fd"
- },
- "outputs": [],
- "source": [
- "trainAccuracy = trained_NaiveBayes.confusionMatrix()\n",
- "trainAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "coPmQpky0B_q"
- },
- "source": [
- "And the results of our validation data:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 1636,
- "status": "ok",
- "timestamp": 1675255767872,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "67FsgzXR0B_q",
- "outputId": "0f75177d-4c88-4077-aaf0-a943e61a3523"
- },
- "outputs": [],
- "source": [
- "validationSample = validation.classify(trained_NaiveBayes)\n",
- "validationAccuracy = validationSample.errorMatrix(label, 'classification')\n",
- "validationAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "3XX9MZw1mImP"
- },
- "source": [
- "\n",
- "Question 6: Describe the results of the CART classifier and the Naive Bayes. How did they perform and why do you think the results are different between them?\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "hOFlaXjj0B_q",
- "tags": []
- },
- "source": [
- "### Support Vector Machine\n",
- "---\n",
- "And Finally, we will use a **Support Vector Machine**. This is a popular supervised learning model developed by Vladimir Vapnik, used for both data classification and regression. That said, it is typically leveraged for classification problems, constructing a hyperplane where the distance between two classes of data points is at its maximum. This hyperplane is known as the decision boundary, separating the classes of data points (e.g., oranges vs. apples) on either side of the plane.\n",
- "\n",
- "As you can see in the cell below, to make sure we can run the **Support Vector Machine**, we substantially reduce our training data to only 20%, instead of 80%. We do this, because running the classifier on the entire training dataset would take an extremely long time to run."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 2,
- "status": "ok",
- "timestamp": 1675255768743,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "kw8RfLdp0B_q"
- },
- "outputs": [],
- "source": [
- "trainingSample_SVM = sample.filter('random <= 0.2')\n",
- "validationSample_SVM = sample.filter('random > 0.8')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 4,
- "status": "ok",
- "timestamp": 1675255769980,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "d4uhCNTY0B_q"
- },
- "outputs": [],
- "source": [
- "# Overlay the points on the imagery to get training.\n",
- "training_SVM = image.select(bands).sampleRegions(\n",
- " **{'collection': trainingSample_SVM, 'properties': [label], 'scale': 100}\n",
- ")\n",
- "\n",
- "# Overlay the points on the imagery to get training.\n",
- "validation_SVM = image.select(bands).sampleRegions(\n",
- " **{'collection': validationSample_SVM, 'properties': [label], 'scale': 100}\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 202,
- "status": "ok",
- "timestamp": 1675255771142,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "8TBp0sI00B_r"
- },
- "outputs": [],
- "source": [
- "trained_SVM = ee.Classifier.libsvm().train(training_SVM, label, bands)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "TY7oY8J30B_r"
- },
- "source": [
- "Now let's have a look at the accuracy again of our training data. **NOTE: This could really take some time to run! You may want to get a drink**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 531,
- "status": "ok",
- "timestamp": 1675255773006,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "RoudSzHZ0B_r",
- "outputId": "c99cce22-dcba-487e-a535-932e0ab6e887"
- },
- "outputs": [],
- "source": [
- "trainAccuracy = trained_SVM.confusionMatrix()\n",
- "trainAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "oBE1whvp0B_r"
- },
- "source": [
- "And the results of our validation data:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 2265,
- "status": "ok",
- "timestamp": 1675255776484,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "3htkGlHk0B_r",
- "outputId": "c6a7584b-9ba3-4125-d358-3d146a7329ef"
- },
- "outputs": [],
- "source": [
- "validationSample = validation_SVM.classify(trained_SVM)\n",
- "validationAccuracy = validationSample.errorMatrix(label, 'classification')\n",
- "validationAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "iDziSfz2mc5N"
- },
- "source": [
- "\n",
- "Question 7: Describe the results of the Support Vector Machine. Are you surprised by the outcome?\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "iBw4Qxo20B_r"
- },
- "source": [
- "### Classify and visualize output\n",
- "---\n",
- "Now let's classify the image with the same bands used for training, using the Random Forest model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 2,
- "status": "ok",
- "timestamp": 1675255777327,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "juCVFrd30B_r"
- },
- "outputs": [],
- "source": [
- "CLC_Classified = image.select(bands).classify(trained_RF)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "dRKv-tlT0B_s"
- },
- "source": [
- "Let's look at the results on a map."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 821,
- "referenced_widgets": [
- "006dc05428bf468e9724f0e0fcd388bd",
- "95d94c3ab3094ddb880ef0b9a2dce01b",
- "952b82e6c713459eb6ace9f47ada219a",
- "3d9a59a8a0a44f55ab359b66edbe3f9d",
- "036d4dad1c8f4dd3bcd37d3154fdf452",
- "ba860e12989c41e8b39524f943f8fc51",
- "402c171b86c9479f9d3cf2743b9055ff",
- "cf561e18f0dc4e578e23de278524ab78",
- "5eb406428f4d4fd2903c67dd9b9ce343",
- "c3e40ca8e6924930b1367da0f89045a3",
- "f15f98b1745248c7ae46f849862d6c9d",
- "a017674caf934e5ca12b9cf9cd77f2d1",
- "c462bdeb7af24cfd86fd72ac62e04fc7",
- "0da7def710c2472ab80647a297cc6223",
- "ef2ab37274534537a07a30a7d56c5ea4",
- "e93fa2d603b04a1faafb9b1c28333856",
- "edcb631dedd44674b28172fa8d16da9b",
- "c2128a11f9bb4eb894374538f1201791",
- "2a12629894de48ce857509fbc3f44854",
- "09cfd28117b746d883b843c26ac8049c",
- "5f0e409a882f4506b42b2caaa442cc33",
- "e5e131b331024a248fe58b7ed690b170",
- "13369112cc4f4a00b2e66fd71fd18efd",
- "46d8a348d25b4416a46977c3fc2f290d",
- "259e9270784d43dd972a4fa676ed8444",
- "4cbf17f1bcf2440093c3609646365ee6",
- "741b145db27c41e4a098b52276a2a0e1",
- "61b61a156a3644c19edec2945fab21b8",
- "a0bc2372a05943c19c357d3f7198ce30",
- "fc17ea0440bb4c1cbe6d3e4b98e0d1c7",
- "a4a869ed419c43839645d2835664636c",
- "4d10f43d2f6e46ab84d70132eaff4d60",
- "11dccc576b7345d98440c9df0d96c777",
- "557d923cb8e84c06b3b56629f6684900",
- "53e0928ed0914f6491e91a215b33573d",
- "082b7daf697b4de3befccfdc1f163478"
- ]
- },
- "executionInfo": {
- "elapsed": 1668,
- "status": "ok",
- "timestamp": 1675255784495,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "2O0OhpVV0B_s",
- "outputId": "eb70a56a-9881-4855-c72a-ab368a2de9fb"
- },
- "outputs": [],
- "source": [
- "Map.addLayer(CLC_Classified.randomVisualizer(), {}, 'classified')\n",
- "Map"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "JHbF3IbH0B_s"
- },
- "source": [
- "To render a categorical map, we can set two image properties: `landcover_class_values` and `landcover_class_palette`. We can use the same style as the CLC so that it is easy to compare the two maps. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 315,
- "status": "ok",
- "timestamp": 1675255787083,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "qZ3FoUmk0B_s"
- },
- "outputs": [],
- "source": [
- "class_values = CLC.get('landcover_class_values').getInfo()\n",
- "class_palette = CLC.get('landcover_class_palette').getInfo()\n",
- "\n",
- "landcover = CLC_Classified.set('classification_class_values', class_values)\n",
- "landcover = landcover.set('classification_class_palette', class_palette)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "FcgvM7830B_s"
- },
- "source": [
- "Now we can plot the results again but use the Corine Land Cover legend and colorscheme"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 821,
- "referenced_widgets": [
- "006dc05428bf468e9724f0e0fcd388bd",
- "95d94c3ab3094ddb880ef0b9a2dce01b",
- "952b82e6c713459eb6ace9f47ada219a",
- "3d9a59a8a0a44f55ab359b66edbe3f9d",
- "036d4dad1c8f4dd3bcd37d3154fdf452",
- "ba860e12989c41e8b39524f943f8fc51",
- "402c171b86c9479f9d3cf2743b9055ff",
- "cf561e18f0dc4e578e23de278524ab78",
- "5eb406428f4d4fd2903c67dd9b9ce343",
- "a0bc2372a05943c19c357d3f7198ce30",
- "c3e40ca8e6924930b1367da0f89045a3",
- "f15f98b1745248c7ae46f849862d6c9d",
- "a017674caf934e5ca12b9cf9cd77f2d1",
- "c462bdeb7af24cfd86fd72ac62e04fc7",
- "0da7def710c2472ab80647a297cc6223",
- "ef2ab37274534537a07a30a7d56c5ea4",
- "e93fa2d603b04a1faafb9b1c28333856",
- "61b61a156a3644c19edec2945fab21b8",
- "edcb631dedd44674b28172fa8d16da9b",
- "c2128a11f9bb4eb894374538f1201791",
- "2a12629894de48ce857509fbc3f44854",
- "fc17ea0440bb4c1cbe6d3e4b98e0d1c7",
- "09cfd28117b746d883b843c26ac8049c",
- "5f0e409a882f4506b42b2caaa442cc33",
- "e5e131b331024a248fe58b7ed690b170",
- "13369112cc4f4a00b2e66fd71fd18efd",
- "a4a869ed419c43839645d2835664636c",
- "46d8a348d25b4416a46977c3fc2f290d",
- "259e9270784d43dd972a4fa676ed8444",
- "4cbf17f1bcf2440093c3609646365ee6",
- "741b145db27c41e4a098b52276a2a0e1",
- "4d10f43d2f6e46ab84d70132eaff4d60",
- "11dccc576b7345d98440c9df0d96c777",
- "557d923cb8e84c06b3b56629f6684900",
- "53e0928ed0914f6491e91a215b33573d",
- "082b7daf697b4de3befccfdc1f163478"
- ]
- },
- "executionInfo": {
- "elapsed": 1175,
- "status": "ok",
- "timestamp": 1675255789623,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "LdR9Ss3S0B_s",
- "outputId": "0cd71778-21b8-4074-e861-5a1207c29eb6"
- },
- "outputs": [],
- "source": [
- "Map.addLayer(landcover, {}, 'Land cover')\n",
- "Map.add_legend(builtin_legend='COPERNICUS/CORINE/V20/100m')\n",
- "Map"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "NA-L_Q9z0B_s",
- "tags": []
- },
- "source": [
- "## 6. Classify the Landsat-8 using ESA WorldCover\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "3--dkQDS0B_t"
- },
- "source": [
- "While the Corine Land Cover data provides us with some promising results, it would be interesting to see if we can do something similar using a different data source. To do so, we are going to make use of the ESA WorldCover data.\n",
- "\n",
- "Let's start again with exploring the data:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 821,
- "referenced_widgets": [
- "006dc05428bf468e9724f0e0fcd388bd",
- "95d94c3ab3094ddb880ef0b9a2dce01b",
- "952b82e6c713459eb6ace9f47ada219a",
- "3d9a59a8a0a44f55ab359b66edbe3f9d",
- "036d4dad1c8f4dd3bcd37d3154fdf452",
- "ba860e12989c41e8b39524f943f8fc51",
- "402c171b86c9479f9d3cf2743b9055ff",
- "cf561e18f0dc4e578e23de278524ab78",
- "5eb406428f4d4fd2903c67dd9b9ce343",
- "a0bc2372a05943c19c357d3f7198ce30",
- "c3e40ca8e6924930b1367da0f89045a3",
- "f15f98b1745248c7ae46f849862d6c9d",
- "a017674caf934e5ca12b9cf9cd77f2d1",
- "c462bdeb7af24cfd86fd72ac62e04fc7",
- "0da7def710c2472ab80647a297cc6223",
- "ef2ab37274534537a07a30a7d56c5ea4",
- "e93fa2d603b04a1faafb9b1c28333856",
- "61b61a156a3644c19edec2945fab21b8",
- "53e0928ed0914f6491e91a215b33573d",
- "edcb631dedd44674b28172fa8d16da9b",
- "c2128a11f9bb4eb894374538f1201791",
- "2a12629894de48ce857509fbc3f44854",
- "fc17ea0440bb4c1cbe6d3e4b98e0d1c7",
- "09cfd28117b746d883b843c26ac8049c",
- "5f0e409a882f4506b42b2caaa442cc33",
- "e5e131b331024a248fe58b7ed690b170",
- "13369112cc4f4a00b2e66fd71fd18efd",
- "a4a869ed419c43839645d2835664636c",
- "46d8a348d25b4416a46977c3fc2f290d",
- "259e9270784d43dd972a4fa676ed8444",
- "4cbf17f1bcf2440093c3609646365ee6",
- "741b145db27c41e4a098b52276a2a0e1",
- "4d10f43d2f6e46ab84d70132eaff4d60",
- "11dccc576b7345d98440c9df0d96c777",
- "557d923cb8e84c06b3b56629f6684900",
- "082b7daf697b4de3befccfdc1f163478"
- ]
- },
- "executionInfo": {
- "elapsed": 544,
- "status": "ok",
- "timestamp": 1675255793325,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "eOtQqnxb0B_t",
- "outputId": "d720089b-faa8-4e91-bbcb-9fdcf642c21b"
- },
- "outputs": [],
- "source": [
- "ESA = ee.Image('ESA/WorldCover/v100/2020').clip(image.geometry())\n",
- "Map.addLayer(ESA, {}, 'ESA')\n",
- "Map"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ojprtQ2V0B_t"
- },
- "source": [
- "And now we generate the training set again:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 685,
- "status": "ok",
- "timestamp": 1675255796417,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "C3OPr-x_0B_t"
- },
- "outputs": [],
- "source": [
- "classValues = [10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 100]\n",
- "remapValues = ee.List.sequence(0, 10)\n",
- "label = 'lc'\n",
- "lc = ESA.remap(classValues, remapValues).rename(label).toByte()\n",
- "\n",
- "sample = image.addBands(lc).stratifiedSample(\n",
- " **{\n",
- " 'numPoints': 5000,\n",
- " 'classBand': label,\n",
- " 'region': image.geometry(),\n",
- " 'scale': 100,\n",
- " 'geometries': True\n",
- "})"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "o97yQzRS0B_t"
- },
- "source": [
- "And again, we split the data into training and validation."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 276,
- "status": "ok",
- "timestamp": 1675255799082,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "1FIql2VI0B_t"
- },
- "outputs": [],
- "source": [
- "sample = sample.randomColumn()\n",
- "trainingSample = sample.filter('random <= 0.8')\n",
- "validationSample = sample.filter('random > 0.8')"
- ]
- },
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {
- "id": "nWPpXPtr0B_t"
- },
- "source": [
- "As the **CART** classifier performed reasonably well for CLC, we will re-use this one to train the algorithm through using the ESA WorldCover data."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 302,
- "status": "ok",
- "timestamp": 1675255801416,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "MKhxhX7W0B_u"
- },
- "outputs": [],
- "source": [
- "trainedClassifier = ee.Classifier.smileCart().train(\n",
- " **{\n",
- " 'features': trainingSample,\n",
- " 'classProperty': label,\n",
- " 'inputProperties': image.bandNames()\n",
- "})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 68
- },
- "executionInfo": {
- "elapsed": 274,
- "status": "ok",
- "timestamp": 1675255802491,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "93o_CbXT0B_u",
- "outputId": "88639c57-3e96-4d0f-fb9c-a291c9788673"
- },
- "outputs": [],
- "source": [
- "trainAccuracy = trainedClassifier.confusionMatrix()\n",
- "trainAccuracy.accuracy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "4GYp6Xee0B_u"
- },
- "source": [
- "Resulting in our newly classified map"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 208,
- "status": "ok",
- "timestamp": 1675255803868,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "wMxX-9QT0B_u"
- },
- "outputs": [],
- "source": [
- "ESAWorldCover_classified = image.classify(trainedClassifier)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "kUUEh3MH0B_u"
- },
- "source": [
- "And let's visualize this!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 821,
- "referenced_widgets": [
- "006dc05428bf468e9724f0e0fcd388bd",
- "95d94c3ab3094ddb880ef0b9a2dce01b",
- "952b82e6c713459eb6ace9f47ada219a",
- "3d9a59a8a0a44f55ab359b66edbe3f9d",
- "036d4dad1c8f4dd3bcd37d3154fdf452",
- "ba860e12989c41e8b39524f943f8fc51",
- "402c171b86c9479f9d3cf2743b9055ff",
- "cf561e18f0dc4e578e23de278524ab78",
- "5eb406428f4d4fd2903c67dd9b9ce343",
- "a0bc2372a05943c19c357d3f7198ce30",
- "c3e40ca8e6924930b1367da0f89045a3",
- "f15f98b1745248c7ae46f849862d6c9d",
- "a017674caf934e5ca12b9cf9cd77f2d1",
- "c462bdeb7af24cfd86fd72ac62e04fc7",
- "0da7def710c2472ab80647a297cc6223",
- "ef2ab37274534537a07a30a7d56c5ea4",
- "e93fa2d603b04a1faafb9b1c28333856",
- "61b61a156a3644c19edec2945fab21b8",
- "53e0928ed0914f6491e91a215b33573d",
- "082b7daf697b4de3befccfdc1f163478",
- "edcb631dedd44674b28172fa8d16da9b",
- "c2128a11f9bb4eb894374538f1201791",
- "2a12629894de48ce857509fbc3f44854",
- "fc17ea0440bb4c1cbe6d3e4b98e0d1c7",
- "09cfd28117b746d883b843c26ac8049c",
- "5f0e409a882f4506b42b2caaa442cc33",
- "e5e131b331024a248fe58b7ed690b170",
- "13369112cc4f4a00b2e66fd71fd18efd",
- "a4a869ed419c43839645d2835664636c",
- "46d8a348d25b4416a46977c3fc2f290d",
- "259e9270784d43dd972a4fa676ed8444",
- "4cbf17f1bcf2440093c3609646365ee6",
- "741b145db27c41e4a098b52276a2a0e1",
- "4d10f43d2f6e46ab84d70132eaff4d60",
- "11dccc576b7345d98440c9df0d96c777",
- "557d923cb8e84c06b3b56629f6684900"
- ]
- },
- "executionInfo": {
- "elapsed": 834,
- "status": "ok",
- "timestamp": 1675255805994,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "IQ_dprZZ0B_u",
- "outputId": "ed6fdf4d-c601-491d-996a-db4e100a9e6d"
- },
- "outputs": [],
- "source": [
- "classVis = {\n",
- " 'min': 0,\n",
- " 'max': 10,\n",
- " 'palette': ['006400' ,'ffbb22', 'ffff4c', 'f096ff', 'fa0000', 'b4b4b4',\n",
- " 'f0f0f0', '0064c8', '0096a0', '00cf75', 'fae6a0']\n",
- "};\n",
- "\n",
- "#Map.addLayer(lc, classVis, 'lc');\n",
- "Map.addLayer(ESAWorldCover_classified, classVis, 'Classified')\n",
- "Map"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "Question 8: Interpret the results of the classification with ESA WorldCover. How has this model performed compared to using Corine Land Cover?
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "CHWkPEIQ0B_v"
- },
- "source": [
- "## 7. Analyze and assess your Landsat-8 land cover map\n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "eAYl8Ru80B_v"
- },
- "source": [
- "Now we have two land cover maps generated for most part of the Netherlands, let's analyze and judge their quality.\n",
- "\n",
- "To do so, you can either zoom in on the same area you used last week to learn OpenStreetMap, or specify a different area when it is outside the bounds of our land cover map.\n",
- "\n",
- "We do so, by creating a bounding box. You can use [this website](https://boundingbox.klokantech.com/) to create a bounding box. Once you have created a bounding box, you can select the 'CSV' tab and copy paste the coordinates in the cell below."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 220,
- "status": "ok",
- "timestamp": 1675255808972,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "Gqi_VPSH0B_v"
- },
- "outputs": [],
- "source": [
- "bbox = [4.248882,51.853884,4.542965,52.018829]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "XKuThEge0B_v"
- },
- "source": [
- "And now we can translate this into a rectangle to be used with the **Google Earth Engine**:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "executionInfo": {
- "elapsed": 198,
- "status": "ok",
- "timestamp": 1675255810822,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "MuMIBg9z0B_v"
- },
- "outputs": [],
- "source": [
- "focus_region = ee.Geometry.Rectangle([bbox[0], \n",
- " bbox[1],\n",
- " bbox[2],\n",
- " bbox[3]])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "RNcZTd7_0B_v"
- },
- "source": [
- "### Corine Land Cover\n",
- "\n",
- "We will start with identifying how well Corine Land Cover data has been classified with our classifier. To do so, we will first download the images from the **Google Earth Engine**. To do that, we will use the `ee_export_image()` function. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 7798,
- "status": "ok",
- "timestamp": 1675255820686,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "qqWgZUGg0B_v",
- "outputId": "0d6bab8f-1a30-46cb-f98f-e61732fa4f05"
- },
- "outputs": [],
- "source": [
- "geemap.ee_export_image(\n",
- " CLC, filename='CLC.tif', scale=100, file_per_band=False,region=focus_region\n",
- ")\n",
- "\n",
- "geemap.ee_export_image(\n",
- " CLC_Classified, filename='CLC_Classified.tif', scale=100, file_per_band=False,region=focus_region\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "EIu-PJxT0B_w"
- },
- "source": [
- "Now we can use **xarray** to open the geotiffs and prepare them for visualisation."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 291
- },
- "executionInfo": {
- "elapsed": 236,
- "status": "ok",
- "timestamp": 1675255852115,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "q2Hpvsc00B_w",
- "outputId": "23961362-a12b-44eb-98f9-4f9f70adfb61"
- },
- "outputs": [],
- "source": [
- "CLC_original = xr.open_dataset('CLC.tif', engine = 'rasterio')\n",
- "CLC_original = CLC_original.rename({'x': 'lat','y': 'lon'})\n",
- "CLC_original.rio.set_spatial_dims(x_dim=\"lat\",y_dim=\"lon\", inplace=True)\n",
- "\n",
- "CLC_classified = xr.open_dataset('CLC_Classified.tif', engine = 'rasterio')\n",
- "CLC_classified = CLC_classified.rename({'x': 'lat','y': 'lon'})\n",
- "CLC_classified.rio.set_spatial_dims(x_dim=\"lat\",y_dim=\"lon\", inplace=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Ai7YYvmy0B_w"
- },
- "source": [
- "Let's plot the original map (left) and the classified map (right)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TCDarmwZ0B_P",
+ "tags": []
+ },
+ "source": [
+ "# Tutorial 1: Land-cover classification\n",
+ "\n",
+ "In this tutorial we are going to explore the power of the **Google Earth Engine** through classifying land-use categories on satelite imagery. In particular, we will focus on supervised classification using methods from frequentist statistics. Supervised classification refers to the process of using a training dataset with known labels to guide a mathematical classifier in the task of labeling spectral space. They key characteristic is that the training dataset guides (or “supervises”) the labeling.\n",
+ "\n",
+ "The `Classifier` package within the **Google Earth Engine** handles supervised classification by traditional ML algorithms running in Earth Engine. We will be looking at three classifiers, namely logistical regression, decision trees, and random forests. The workflow for classification is as follows:\n",
+ "\n",
+ " 1. Collect training data. Assemble features which have a property that stores the known class label and properties storing numeric values for the predictors.\n",
+ " 2. Instantiate a classifier. Set its parameters if necessary.\n",
+ " 3. Train the classifier using the training data.\n",
+ " 4. Classify an image or feature collection.\n",
+ " 5. Estimate the classification error with independent validation data.\n",
+ " 6. Test the trained and validated classifier on new, unseen data\n",
+ " \n",
+ "## Important before we start\n",
+ "
\n",
+ "Make sure that you save this file before you continue, else you will lose everything. To do so, go to Bestand/File and click on Een kopie opslaan in Drive/Save a Copy on Drive!\n",
+ "\n",
+ "Now, rename the file into Week4_Tutorial1.ipynb. You can do so by clicking on the name in the top of this screen."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "hhGbHqRr0B_V"
+ },
+ "source": [
+ "## Learning Objectives\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "d0cuIiyw0B_V",
+ "tags": []
+ },
+ "source": [
+ "- Understand how one can extract data from **Google Earth Engine**.\n",
+ "- Know how to visualize data from **Google Earth Engine** in Python.\n",
+ "- Understand and apply a supervised classification algorithm.\n",
+ "- Create training data and a simple classifier.\n",
+ "- Compare and judge the performance of a land-use classification model.\n",
+ "\n",
+ "P.S. we will mark important machine learning so you can keep track of the most important terms."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6Q_vJ5ZE0B_W"
+ },
+ "source": [
+ "Tutorial Outline
\n",
+ "
\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "q8qSTJEg0B_X",
+ "tags": []
+ },
+ "source": [
+ "## 1. Introducing the packages\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "T2Ta4PUs0B_X"
+ },
+ "source": [
+ "Within this tutorial, we are going to make use of the following main packages:\n",
+ "\n",
+ "[**ee**](https://developers.google.com/earth-engine/guides/python_install) is a Python package to use the the Google Earth Engine.\n",
+ "\n",
+ "[**geemap**](https://geemap.org/) is a Python package for interactive mapping with the Google Earth Engine."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PVyp9IqKv10G"
+ },
+ "source": [
+ "Now we will import these packages in the cell below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "wvqt9J2P0B_Z"
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import random\n",
+ "import numpy as np\n",
+ "\n",
+ "import ee\n",
+ "import geemap\n",
+ "import matplotlib.pyplot as plt"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Finally, let's fix all of our random number generators. This will make our experiments reproducible."
+ ],
+ "metadata": {
+ "id": "z4Av_NOXm5JF"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "### Set global seeds ###\n",
+ "seed = 123\n",
+ "np.random.seed(seed)\n",
+ "random.seed(seed)\n",
+ "os.environ['PYTHONHASHSEED'] = str(seed)"
+ ],
+ "metadata": {
+ "id": "3gUXlHbnm_L5"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "B7c9PTZs0B_a"
+ },
+ "source": [
+ "## 2.Accessing Landsat-8 data from Google Earth Engine\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "sehHohh50B_b"
+ },
+ "source": [
+ "You will notice that you need to authorize ourselves to be able to use the Google Earth Engine. To do so, we will first have to register ourselves. on the website of the [Google Earth Engine](https://earthengine.google.com/). Here, make a project and remember the project ID.\n",
+ "\n",
+ "Next, you will have to add it to your Colab notebook as a secret:\n",
+ "1. Cick the `key` button on the left-hand side of the Colab panel\n",
+ "2. Press *add new secret*\n",
+ "3. Allow notebook access, set the Name to `EE_PROJECT_ID`, and the Value to the name of your EE project ID."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Alternatively, you can authorize your project as follows, but then be aware that anyone observing the notebook can see your project name, which is potentially a security issue."
+ ],
+ "metadata": {
+ "id": "Q0LfPa3ekekv"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "ee.Authenticate()\n",
+ "ee.Initialize(project=\"YOUR_PROJECT_NAME\")"
+ ],
+ "metadata": {
+ "id": "oDq0Z2iukjh1"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Then, run the code below to load a basemap and to test if you set your credentials right. You can choose any location in Europe, but the code below will show the coordinates we have selected for the Netherlands. Feel free to change them."
+ ],
+ "metadata": {
+ "id": "qP9FDYmSlTmm"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "wGpIK8FO0B_b"
+ },
+ "outputs": [],
+ "source": [
+ "map = geemap.Map(height=800,width=700,center=[52.37,4.5],zoom=7)\n",
+ "map"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Frgjg-VU0B_c"
+ },
+ "source": [
+ "Now that we have succesfully managed to see a map of the Netherlands, let's load a Landsat-8 satellite image to our project. Landsat-8 is the 8th iteration of the American Landsat project, which has continuous records going back 40 years (!), which makes it the longest continuous satellite mission.\n",
+ "\n",
+ "The code below executes the following steps:\n",
+ "1. Look for all images that intersect our point in the Netherlands\n",
+ "2. Filter it by the year 2015\n",
+ "3. Sort by cloud cover, lower = better\n",
+ "4. Select the least cloudy image, and load all bands"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "NioWCK5v0B_d"
+ },
+ "outputs": [],
+ "source": [
+ "point = ee.Geometry.Point([5.0, 51.37])\n",
+ "\n",
+ "image = (\n",
+ " ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')\n",
+ " .filterBounds(point)\n",
+ " .filterDate('2015-01-01', '2015-12-31')\n",
+ " .sort('CLOUD_COVER')\n",
+ " .first()\n",
+ " .select('SR_B[1-7]')\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "\n",
+ "The values in each band are between 1 and 65,535, the 16-bit unsigned integer limit. The values represent the `surface reflection`, the fraction of light that is reflected at each pixel location.\n",
+ "\n",
+ "**Q1: Why is the `surface reflection` given as an integer value, rather than a floating point (decimal) value? What are the practical benefits of storing satellite image data this way?**\n",
+ "\n",
+ "
\n",
+ "\n",
+ "Now, let's visualize this image in colours that are natural to the human eye, using Red-Green-Blue (RGB) bands.\n",
+ "\n",
+ "**Q2: Which band names should be added to the list to visualize the image with in red-green-blue colours?**\n",
+ "\n",
+ "HINT: [This page contains the documentation, where you can find the answer](https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LC08_C02_T1_L2)"
+ ],
+ "metadata": {
+ "id": "TGCFX-bBtzmx"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "bands = ['SR_B4', 'SR_B3', 'SR_B2'] # Fill in this list yourself\n",
+ "vis_params = {'max': 25000, 'bands':bands} # Limit upper range so you can see detail\n",
+ "\n",
+ "map.centerObject(point, 8)\n",
+ "map.addLayer(image, vis_params, \"Landsat-8\")\n",
+ "map"
+ ],
+ "metadata": {
+ "id": "I4QjsgkMtyXK"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZKCZxS9R0B_d"
+ },
+ "source": [
+ "As we have specified that we wanted landsat data with as little clouds as possible, let's check the data and the actual cloud cover:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "O4kmjkoM0B_e"
+ },
+ "outputs": [],
+ "source": [
+ "print(ee.Date(image.get('system:time_start')).format('YYYY-MM-dd').getInfo())\n",
+ "print(\"cloud cover:\", image.get('CLOUD_COVER').getInfo())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now that we have the image, let's add some indices to add some extra information for the model. Together with the raw spectral values, these will be our input variables, from which the model will learn how to fit to the reference values."
+ ],
+ "metadata": {
+ "id": "_pbcU0H5gMP3"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Add the bands you need yoursef in the select functions\n",
+ "dvi = image.select('...').subtract(image.select('...')).rename('DVI')\n",
+ "image = image.addBands(dvi)\n",
+ "\n",
+ " # Compute the NDVI yourself - check the docs if needed\n",
+ "ndvi = ...\n",
+ "image = image.addBands(ndvi)"
+ ],
+ "metadata": {
+ "id": "jWO-e2IXgfvO"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Q3: Which other indices like the NDVI could be useful to add here, and why? Add at least one more to the image.**\n",
+ "\n",
+ "**!!! Don't forget to rename the index, and to add it later in the exercise !!!**"
+ ],
+ "metadata": {
+ "id": "ehSJNOL5g39l"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "my_index = ...\n",
+ "image = image.addBands(my_index)"
+ ],
+ "metadata": {
+ "id": "auXaN6aZhDd1"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "UdkbPQd10B_j"
+ },
+ "source": [
+ "For the modelling, we will use reflectance from the optical, NIR, and SWIR bands (SR_B2 - SR_B7), as well as the indices we derived."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "O1pPDUhm0B_j"
+ },
+ "outputs": [],
+ "source": [
+ "# Use these bands for prediction\n",
+ "bands = ['SR_B2', 'SR_B3', 'SR_B4', 'SR_B5', 'SR_B6', 'SR_B7']\n",
+ "indices = ['NDVI', 'DVI'] # Add your index(es) band here\n",
+ "img_bands = [*bands, *indices]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZfU9Ih5W0B_f"
+ },
+ "source": [
+ "## 3.Create reference data using Corine Land Cover\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IzJAm-n_0B_f"
+ },
+ "source": [
+ "Now that we have had a look at the images we will be working with, we can define the reference data (also referred to as *labelled data* and *ground truth* data) that we're going to use to train classification models. The training data is used to fit models. A fitted or trained models can then be used to predict on a new dataset to determine how well it works on unseen data.\n",
+ "\n",
+ "For example, land cover training data will contain examples of each class in the study’s legend. Based on these labels, the classifier can predict the most likely land cover class for each pixel in an image. This is an example of a categorical classification and the training labels are therefore categorical. By contrast, a continuous variable (e.g. percent tree cover) can be predicted using continuous training labels.\n",
+ "\n",
+ "There are several ways you can create a region to fence off your dataset:\n",
+ "\n",
+ "- Draw a shape (e.g., rectangle) on the map and the use `region = Map.user_roi`\n",
+ "- Define a geometry, such as `region = ee.Geometry.Rectangle([-122.6003, 37.4831, -121.8036, 37.8288])`\n",
+ "- Create a buffer zone around a point, such as `region = ee.Geometry.Point([-122.4439, 37.7538]).buffer(10000)`\n",
+ "- If you don't define a region, it will use the image footprint by default\n",
+ "\n",
+ "In our case, we use the complete extent of the image for training and evaluating our models."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "fGt7peIO0B_g"
+ },
+ "source": [
+ "Our reference data is the CORINE Land Cover dataset (CLC), a pan-European dataset made from a combination of harmonized national land cover inventories. Fortunately, we can use GEE to directly access CLC data. We also use the `.clip()` function in which we define that we specifically only want to see the same area as the Landsat-8 image we already selected, rather than the area of our map interface."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "hH0QFPo30B_g"
+ },
+ "outputs": [],
+ "source": [
+ "CLC = ee.Image('COPERNICUS/CORINE/V20/100m/2012').select('landcover').clip(image.geometry())\n",
+ "map.addLayer(CLC, {}, 'CLC')\n",
+ "map"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VLhHhXqK0B_h"
+ },
+ "source": [
+ "Now we are going to make the training dataset. We do so by randomly sampling pixels from the image array that coincide with a CLC reference pixel.\n",
+ "\n",
+ "In the `sample()` function, we have to specify multiple arguments:\n",
+ "\n",
+ "- Through the `region`argument we specify the geographic boundary of our training dataset.\n",
+ "- Through the `scale` argument we specify the size of the points for our sample (30 would indicate 30m).\n",
+ "- Through the `numPixels` argument we specify how many pixels we want to extract. More pixels means more training time needed, but a higher expected accuracy. Wouldn't recommend setting this to more than 10k.\n",
+ "- Through the `seed` argument we specify which random selection order we want. Through using a seed, you can always reproduce your results.\n",
+ "- Through the `geometries` argument we specify whether we want to extract the geometries as well. In most cases you want to retain the geometry, so specify `true`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "P0C5_sCX0B_h"
+ },
+ "outputs": [],
+ "source": [
+ "# Fill in the arguments here yourself\n",
+ "lc_points = CLC.sample(\n",
+ " **{\n",
+ " 'region': image.geometry(),\n",
+ " ...\n",
+ " }\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "DdvxczSf0B_h"
+ },
+ "source": [
+ "Let's check how many points we got for our sample set, and which information a single point contains."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ttf-3K0T0B_i"
+ },
+ "outputs": [],
+ "source": [
+ "print(\"total number of points:\", lc_points.size().getInfo())\n",
+ "print(lc_points.first().getInfo())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "uiV3u8I50B_i"
+ },
+ "source": [
+ "Take note of the final column, the *'landcover'* class. It contains the land cover associated with each individual pixel. Try looking up some information about it to understand what it is that we are trying to classify.\n",
+ "\n",
+ "**Q4: Which land cover class is `211`, and which 'superclasses' (more general classes) does it fall under? Do you think the LC classes of CORINE could be globally applicable, why/why not?**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "While the third level of LC classes are very detailed, they contain a lot of uncertainty, and they are difficult for a model to learn. Instead, let's first generalize the land cover classes to the second level of the hierarchy. For this, we will finish the following function that we will apply to the column."
+ ],
+ "metadata": {
+ "id": "D8L-sJnnWTma"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "mORPLblO0B_i",
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "def generalize_clc_class(feature):\n",
+ " lc_value = feature.get('...') # Which feature property needs to be accessed?\n",
+ " l2_hierarchy_lc = ee.String(lc_value).slice(...) # TODO: How should this be sliced to retain the second-level hierarchy?\n",
+ " return feature.set(..., l2_hierarchy_lc) # which feature property needs to be set?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# TODO: Look up the GEE function to run the function over each points\n",
+ "lc_reference_pts = lc_points.map(generalize_clc_class) # Students do this themselves"
+ ],
+ "metadata": {
+ "id": "uyxS7KVHQsbS"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Sv7ywcDg0B_i"
+ },
+ "source": [
+ "## 4.Split dataset into folds\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "-jpbuNBc0B_j"
+ },
+ "source": [
+ "In the previous step we made our co-located dataset consisting of reference data (CLC land cover at the 2nd level hierarchy), and Landsat-8 pixels with additional calculated values. The next step is to partiton the dataset into training and validation splits, so that we can train and evaluate models. In short, the training split is used for a model to fit to the input data, while the validation split is used to test if the learned model can predict well on new, unseen data.\n",
+ "\n",
+ " Later in the notebook there will also be a test split on which we evaluate the final performance of models."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# This property of the table stores the land cover labels\n",
+ "label_col = 'landcover'\n",
+ "\n",
+ "# Sample points for validation / training split datasets\n",
+ "sample = lc_reference_pts.randomColumn();\n",
+ "training_sample = sample.filter('random <= 0.8');\n",
+ "validation_sample = sample.filter('random > 0.8');\n",
+ "\n",
+ "train_data = image.select(img_bands).sampleRegions(\n",
+ " **{'collection': training_sample, 'properties': [label_col], 'scale': 100}\n",
+ ")\n",
+ "\n",
+ "val_data = image.select(img_bands).sampleRegions(\n",
+ " **{'collection': validation_sample, 'properties': [label_col], 'scale': 100}\n",
+ ")"
+ ],
+ "metadata": {
+ "id": "SlTEbyCpofAu"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "PAW_rLopv10O"
+ },
+ "source": [
+ "**Q4: What are the downsides of using a sample randomized 80/20 training/validation split like this? Can you think of other ways to partition the dataset which would be better?**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Finally, let's visualize a single imge pixel to make sure that everything is included as expected. Contrary to what some might say, preparing and organizing your dataset takes up the bulk of your time when working with machine learning models. **Always double-check your dataset before training, or you'll find yourself wasting a lot of time**!"
+ ],
+ "metadata": {
+ "id": "VXSi1VbXl8-X"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Print the first feature in the training collection\n",
+ "print(train_data.first().getInfo())"
+ ],
+ "metadata": {
+ "id": "HCxNoOqzkmBQ"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "As can be seen, each datapoint contains reflectance values from the Landsat-8 image, the indices we defined, and a land cover reference label."
+ ],
+ "metadata": {
+ "id": "fc3_UqRPpo62"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "#### Converting to scikit-learn compatible data\n",
+ "The GEE package provides built-in functions for a couple of popular models, so in principle we can continue working entirely within GEE. However, we're interested in teaching you re-usable machine learning skills. Therefore, we run our models using scikit-learn, Python's most popular package for standard ML models. For this purpose, we should convert our data into a data structure that is compatible with this package."
+ ],
+ "metadata": {
+ "id": "hg35-o14SBdU"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "n2Ya_jcd0B_p"
+ },
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "def feature_collection_to_lists(features, input_features, label_col):\n",
+ " # Convert data to Pandas for use with scikit\n",
+ " features_list = features.toList(features.size()).getInfo()\n",
+ " features_list = [feature['properties'] for feature in features_list] # Unpack from dictionary\n",
+ " features_df = pd.DataFrame(features_list)\n",
+ "\n",
+ " inputs = features_df[input_features] # Image pixel values for training\n",
+ " labels = features_df[label_col] # Land cover values for validation\n",
+ " return inputs, labels\n",
+ "\n",
+ "x_train, y_train = feature_collection_to_lists(train_data, img_bands, label_col)\n",
+ "x_val, y_val = feature_collection_to_lists(val_data, img_bands, label_col)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "JB9ZstM10B_k",
+ "tags": []
+ },
+ "source": [
+ "## 5. Training models\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qB4em_8m0B_k"
+ },
+ "source": [
+ "Now that we have our dataset split into training, validation, and testing samples, we can begin training models. But wait up! Before we start throwing increasingly sophisticated models at the problem, we have to understand how statistical learning works in the first place. We'll consider a number of frequentist statistical models of increasing complexity. We'll first take a look at a basic model, namely logistical regression. Next, we consider decision trees, and finally we consider a more sophisticated tree-based classifier, namely random forest models."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### 5a. Logistic Regression\n",
+ "---\n",
+ "Let's first consider what is possibly the simplest approach in frequentist statistics for this task - Logistic Regression. Similar to a *Linear Regression* model, a logistic regression model fits each of the input features to the reference data based on the total set of datapoints in the training dataset. It outputs the probability that a given datapoint belongs to a given label class, e.g. the probability that a given sample belongs to the `urban fabric` class. It creates a distinct *'S-shaped'* probability distribution, courtesy of the sigmoid function used to bound it.\n",
+ "\n",
+ "data:image/s3,"s3://crabby-images/26fdd/26fddcb6854f62430e5793b6e2d0c207490b6743" alt="Exam_pass_logistic_curve.svg.png"\n",
+ "\n",
+ "(Image courtesy of Wikimedia)"
+ ],
+ "metadata": {
+ "id": "OXFAonsI0eYo"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Q5: Why can't we use a regular linear regression to regress the probabilities of each class, why do we have to use a sigmoid function to bound our predictions? Which kinds of problem(s) could using a linear regression model cause for this case?**"
+ ],
+ "metadata": {
+ "id": "k5p-61m6Pm3z"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Let's fit the model to our training data and evaluate how well it performs on the training split using a straightforward overall accuracy measures, where we simply count the ratio of correct positives. Under the hood, we assign a `1` or a `0` to each class, which we call one-hot encoding (as in - only one of the labels is *hot*, with a value of `1`). The model is then optimized to predict a confidence for each class separately which indicates if the given class is present or not. In the scikit implementation, the most confident class (closest to `1`) gets assigned to the datapoint. Let's try it out!"
+ ],
+ "metadata": {
+ "id": "pGWTKZDXRszD"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.metrics import accuracy_score, cohen_kappa_score\n",
+ "\n",
+ "# Train a logistic regression model\n",
+ "model = LogisticRegression()\n",
+ "model.fit(x_train.values, y_train.values)\n",
+ "\n",
+ "# Predict and evaluate\n",
+ "train_y_pred = model.predict(x_train)\n",
+ "train_accuracy = accuracy_score(train_y_pred, y_train)\n",
+ "print(\"Training accuracy:\", round(train_accuracy, 3))"
+ ],
+ "metadata": {
+ "id": "E6LOxKyjFLfh"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Okay, a little under 50% accurate, that's reasonably good. But is it better than random guessing? For this, we use Cohen's Kappa. The **Kappa Coefficient** is generated from a statistical test to evaluate the accuracy of a classification. **Kappa** essentially evaluates how well the classification performed as compared to just randomly assigning values, i.e. did the classification do better than random. The **Kappa Coefficient** can range from -1 to 1. A value of 0 indicated that the classification is no better than a random classification. A negative number indicates the classification is significantly worse than random guessing. A value of to 1 indicates that the predicted values are equal to the ground truth, therefore not being the product of any random guessing.\n",
+ "\n",
+ "By the way, in professional parlay, we refer to measures as the `accuracy` and `Cohen's Kappa` as performance metrics."
+ ],
+ "metadata": {
+ "id": "YWYPUzDpTWE5"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "train_kappa = cohen_kappa_score(train_y_pred, y_train)\n",
+ "print(\"Training Kappa:\", round(train_kappa, 3))"
+ ],
+ "metadata": {
+ "id": "1GPPPS0dR7ee"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Looking good. It's not a perfect performance, but the model outperforms random guessing by a significant margin. Not bad for the simplest possible model with a fairly large number of target classes.\n",
+ "\n",
+ "Now, let's see if it manages to predict well on pixels which it hasn't been trained for."
+ ],
+ "metadata": {
+ "id": "UL_g9aFxSubH"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "val_y_pred = model.predict(x_val)\n",
+ "\n",
+ "val_accuracy = accuracy_score(y_val, val_y_pred)\n",
+ "print(\"Validation accuracy:\", round(val_accuracy, 3))\n",
+ "\n",
+ "val_kappa = cohen_kappa_score(val_y_pred, y_val)\n",
+ "print(\"Validation Kappa:\", round(val_kappa, 3))"
+ ],
+ "metadata": {
+ "id": "TX2MyP5vBmsP"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Q6: Interpret the results. What do you think of the performance on the validation set compared to training, is this an expected outcome, why/why not? Considering the random split we did, do you think this model could suitable for predicting in new, unseen regions?**"
+ ],
+ "metadata": {
+ "id": "A1am1vKbVUDC"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Before we move on, let's make two quick functions that we can re-use whenever we want to evaluate our models quickly. It will save us a lot of code re-use when iterating through candidate models."
+ ],
+ "metadata": {
+ "id": "nm7CpdAmfO5U"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "DesfJjTu0B_o"
+ },
+ "outputs": [],
+ "source": [
+ "def train_model(model, x_val, y_val):\n",
+ " ## Training ##\n",
+ " model.fit(x_train.values, y_train.values)\n",
+ "\n",
+ " train_y_pred = model.predict(x_train)\n",
+ "\n",
+ " train_accuracy = accuracy_score(train_y_pred, y_train)\n",
+ " print(\"Training accuracy:\", round(train_accuracy, 3))\n",
+ "\n",
+ " train_kappa = cohen_kappa_score(train_y_pred, y_train)\n",
+ " print(\"Training Kappa:\", round(train_kappa, 3))\n",
+ " return model\n",
+ "\n",
+ "def validate_model(...):\n",
+ " # Fill in and finish this function based on the above examples\n",
+ " # Make sure to report both metrics"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "StLnH3510B_o"
+ },
+ "source": [
+ "### 5b. Decision tree classifiers\n",
+ "---\n",
+ "Now that we have some experience with the simplest possible approach, let's talk about models which are a bit more clever - tree-based classfiers. First, we'll take a look at a standard decision tree model. It learns the best places to cut off the data to create the best separation between classes. As such, contrary to a logistic regression, a decision tree can approximate complex decision boundaries. Graphically, classification trees identify lines that successively split the data space to separate the training points into their classes, as shown below.\n",
+ "\n",
+ "data:image/s3,"s3://crabby-images/7bde6/7bde62443dd5c8fda8a84b9efd3b24f4106bafa6" alt="An-example-of-a-decision-tree-left-with-the-decision-boundary-for-two-features-X1.png"\n",
+ "\n",
+ "([Image courtesy of Clayton Miller](https://www.researchgate.net/publication/313720565_Screening_meter_data_Characterization_of_temporal_energy_data_from_large_groups_of_non-residential_buildings))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Let's repeat the process of training a model by using a decision tree classifier. This is a parametrized model, which means that there are certain decisions to make, which will affect the performance of the model. First, let's run it with the default settings, and you'll quickly see why it's important to pay attention to the parameters of your models."
+ ],
+ "metadata": {
+ "id": "P_Vj87ExYbwg"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "O02Zvc6c0B_o"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.tree import DecisionTreeClassifier\n",
+ "\n",
+ "decision_tree_model = DecisionTreeClassifier()\n",
+ "decision_tree_model = train_model(decision_tree_model, x_train, y_train)\n",
+ "validate_model(decision_tree_model, x_val, y_val)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This results in a very big contrast in the model's performance. How can it be that the model almost performs perfectly on the training dataset, but does poorly on the validation dataset? For the answer, [take a look at the documentation of the model](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). In particular, look at the parameters `max_depth` and `min_samples_split`. Use the docs and any other means at your disposal to answer the following question.\n",
+ "\n",
+ "**Q7: In your own words, what is the effect of changing `max depth` and `min_samples_split`? How could these two parameters result in a perfect fit for the model on the training data?**"
+ ],
+ "metadata": {
+ "id": "55KDeyABdsaL"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "9mvK8a2q0B_o"
+ },
+ "source": [
+ "So now that we know about parameters, let's try setting these two parameters such that the model doesn't become hyper-specialized on the training dataset, at the cost of performance on unseen data. Experiment with different parameter settings in the code box below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "kknOXH-i0B_o"
+ },
+ "outputs": [],
+ "source": [
+ "# Try it yourself a couple of times: Instantiate the model and change the parameters.\n",
+ "decision_tree_model = ...\n",
+ "decision_tree_model = train_model(decision_tree_model, x_train, y_train)\n",
+ "validate_model(decision_tree_model, x_val, y_val)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Q8: Report the settings of the best-performing model on the validation dataset, along with the validation accuracy and Kappa. Why do you think the parameters you chose ended up working better than the default settings, which produced a perfect fit on the training dataset?**"
+ ],
+ "metadata": {
+ "id": "y1XZI7AbhSDi"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "LpBDrhs60B_k"
+ },
+ "source": [
+ "### 5c. Random Forests\n",
+ "---\n",
+ "\n",
+ "> *Do you know what is better than one tree? Two trees! And what is better than two trees? A forest!*\n",
+ "\n",
+ "*Tin Kam Ho, inventor of Random Forest models, 1995* (probably).\n",
+ "\n",
+ "While the quote above is a tongue-in-cheek fake quote, the author of Random Forest (RF) models clearly was onto something. What if we use multiple of decision trees, then take their average prediction? This is an ensemble approach, where the rationale is that we can achieve a better performance by averaging multiple prediction models. In the case of RF, we ensemble individual decision trees, and through some tricks (i.e. randomly not giving trees access to some input variables), we produce trees which have slightly different prediction characteristics.\n",
+ "\n",
+ "This intuitive approach has been a mainstay of machine learning for remote sensing for decades, with good performance even on small datasets. In fact, up until very recently they were still preferred over sophisticated deep learning model for niche tasks with a small amount of datapoints, due to their ease of use and efficiency! Bigger is certainly not always better. So let's try them out!\n",
+ "\n",
+ "data:image/s3,"s3://crabby-images/43a07/43a07231f19ce5d359bb59dcb91e752655311e69" alt="Artboard-1-copy-2-80-1.jpg"\n",
+ "\n",
+ "([Courtesy of Sruthi E.R.](https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "[First, take a look at the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), where you'll see that (logically) it is a parametrized model too. To keep it simple, let's focus on two important parameters while fixing the rest. We focus on `min_samples_split`, and `n_estimators`. You know the first one, and by its name you can reasonably figure out that `n_estimators` is the number of trees we'd like to use.\n",
+ "\n",
+ "Rather than guessing which parameters are best like we did before, we will write a function that will help us to find the best parameters based on a grid search. With a grid search optimization approach, we iterate over all combinations of parameters to find the best one of the set.\n",
+ "\n",
+ "data:image/s3,"s3://crabby-images/3ead8/3ead8fd2914a4965d2f84155eea2aa621305879b" alt="gridsearch.png"\n",
+ "\n",
+ "(A visual example of grid search, where parameter searching occurs in a grid with regular intervals. Adapted from [Pilario, Cao, and Shafiee](https://www.researchgate.net/publication/341691661_A_Kernel_Design_Approach_to_Improve_Kernel_Subspace_Identification))"
+ ],
+ "metadata": {
+ "id": "AcA_nTNEphG7"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "EjcgN80h0B_l"
+ },
+ "outputs": [],
+ "source": [
+ "# RF Parameter choices. Change them as you see fit.\n",
+ "n_estimator_ranges = [10, 25, 50, 100, 250]\n",
+ "min_sample_ranges = [2, 5, 10, 25, 50]\n",
+ "total_combinations = len(n_estimator_ranges) * len(min_sample_ranges)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The following loop goes through all of the parameter combinations to measure their performance. In a nutshell, the code below performs the following steps:\n",
+ "\n",
+ "\n",
+ "1. Instantiate two matrices, one for tracking performance, the other for time taken\n",
+ "2. Iterate over the number of trees in the outer loop (estimators)\n",
+ "3. Iterate over the number of splits per sample in the inner loop\n",
+ "4. In the inner loop, we train our model on the dataset, measure time taken and performance, and insert these in the matrices\n",
+ "\n",
+ "It can take a minute or two to complete running, so have a sip of coffee and wait patiently - after you finish filling in the code to complete it!"
+ ],
+ "metadata": {
+ "id": "XGvysGuTsObl"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from datetime import datetime\n",
+ "from tqdm import tqdm # fancy progress bar package\n",
+ "from IPython.display import clear_output\n",
+ "\n",
+ "import numpy as np\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "\n",
+ "## Instantiate performance and time-keeping matrices\n",
+ "kappa_paramsearch = np.ndarray([len(n_estimator_ranges), len(min_sample_ranges)])\n",
+ "time_paramsearch = np.ndarray([len(n_estimator_ranges), len(min_sample_ranges)])\n",
+ "\n",
+ "# Time-keeping function to track if the code runs correctly.\n",
+ "pbar = tqdm(total=total_combinations)\n",
+ "\n",
+ "## Perform the parameter searching and keep track of performance/time\n",
+ "## We loop over all possible combinations of both parameters\n",
+ "for n_trees_index, n_trees in enumerate(n_estimator_ranges):\n",
+ " for min_leaves_index, min_samples in enumerate(min_sample_ranges):\n",
+ " ## Update our time keeping function\n",
+ " pbar.set_description(f\"n_trees: {n_trees}, min_leaves: {min_samples}\")\n",
+ "\n",
+ " ## Initialize the random forest classifier - set maximum depth to 25\n",
+ " model = ...\n",
+ "\n",
+ " ## Fit the classifier to the training data/labels and measure time taken\n",
+ " start = datetime.now()\n",
+ " model.fit(x_train.values, y_train.values)\n",
+ " end = datetime.now()\n",
+ " time_taken = float(f\"{(end - start).seconds}.{round((end - start).microseconds, 2)}\")\n",
+ "\n",
+ " ## Use the model to predict all of the validation pixels and calculate the Kappa\n",
+ " val_y_pred = ...\n",
+ " val_kappa = ...\n",
+ "\n",
+ " ## Store accuracies and time taken for this parameter combination\n",
+ " kappa_paramsearch[n_trees_index, min_leaves_index] = round(val_kappa, 3)\n",
+ " time_paramsearch[n_trees_index, min_leaves_index] = round(time_taken, 3)\n",
+ "\n",
+ " pbar.update()\n",
+ "\n",
+ "# Clear the progress bar print(s) once it's done running\n",
+ "clear_output()"
+ ],
+ "metadata": {
+ "id": "SoQTiliasbY_"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Let's have a look at which parameter combination provides the best performance, and how long it took to run each model."
+ ],
+ "metadata": {
+ "id": "oz4gOjMQsJ4M"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "## Dirty work-around for first label not showing when plotting\n",
+ "n_trees_labels = [0] + n_estimator_ranges\n",
+ "min_leaves_labels = [0] + min_sample_ranges\n",
+ "\n",
+ "fig, (ax_acc, ax_time) = plt.subplots(1, 2, figsize=(total_combinations,total_combinations))\n",
+ "\n",
+ "## Plot accuracy\n",
+ "ax_acc.set_title(\"Cohen's Kappa\", weight='bold')\n",
+ "ax_acc.set(xlabel='Number of trees', ylabel='Min samples per leaf')\n",
+ "ax_acc.set_xticklabels(n_trees_labels)\n",
+ "ax_acc.set_yticklabels(min_leaves_labels)\n",
+ "ax_acc.matshow(kappa_paramsearch.T, cmap=plt.get_cmap('coolwarm_r'))\n",
+ "\n",
+ "## Plot time taken\n",
+ "ax_time.set_title(\"Time Taken in Seconds\", weight='bold')\n",
+ "ax_time.set(xlabel='Number of trees', ylabel='Min samples per leaf')\n",
+ "ax_time.set_xticklabels(n_trees_labels)\n",
+ "ax_time.set_yticklabels(min_leaves_labels)\n",
+ "ax_time.matshow(time_paramsearch.T, cmap=plt.get_cmap('coolwarm'))\n",
+ "\n",
+ "## Add text to accuracy plot\n",
+ "for t_i, t_val in enumerate(n_estimator_ranges):\n",
+ " for l_i, l_val in enumerate(min_sample_ranges):\n",
+ " ax_acc.text(t_i, l_i, s=kappa_paramsearch[t_i, l_i], ha='center')\n",
+ "\n",
+ "## Add text to time plot\n",
+ "for t_i, t_val in enumerate(n_estimator_ranges):\n",
+ " for l_i, l_val in enumerate(min_sample_ranges):\n",
+ " ax_time.text(t_i, l_i, s=time_paramsearch[t_i, l_i], ha='center')"
+ ],
+ "metadata": {
+ "id": "_PLD8spawTUb"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "IMpOi0rk0B_l"
+ },
+ "source": [
+ "**Q9: Interpret and discuss the performance matrices above. What do you believe are the best parameters for the model, and why?**"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Let's use one more trick to evaluate the performance of our model - a confusion matrix. It is a simple table which compares the reference labels with the model's predictions by looking at the mistakes made between all reference class combinations to determine which classes are most often confused.\n",
+ "\n",
+ "It is evaluated through four metrics:\n",
+ "\n",
+ "* True Positives (TP): The number of times the model correctly predicted the positive class.\n",
+ "* True Negatives (TN): The number of times the model correctly predicted the negative class.\n",
+ "* False Positives (FP): The number of times the model incorrectly predicted the positive class when it was actually negative (also called a \"Type I error\").\n",
+ "* False Negatives (FN): The number of times the model incorrectly predicted the negative class when it was actually positive (also called a \"Type II error\").\n",
+ "\n",
+ "By looking at these values, you can understand how well your model is performing for each class, as well as which classes are confused most often."
+ ],
+ "metadata": {
+ "id": "X-_K6F73DoCU"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
+ "labels = sorted(set(y_val))\n",
+ "cm = ... # Call the confusion_matrix here - figure out how to using the documentation\n",
+ "disp = ConfusionMatrixDisplay(confusion_matrix=cm,\n",
+ " display_labels=labels)\n",
+ "disp.plot()\n",
+ "plt.show()"
+ ],
+ "metadata": {
+ "id": "kgPBK3cnDnTn"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Q10: Which classes are most often confused, and why do you think this is the case?**\n",
+ "\n",
+ "(look at the CLC class labels for a more insightful discussion)"
+ ],
+ "metadata": {
+ "id": "YGTfcnNsFh36"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Finally, let's consider the topic of *interpretability*. That is, how easy is it to understand the reasoning behind a model's decisions. When you use a large number of trees, it is technically possible to understand exactly how decisions are made by analyzing the splits in each trees. However, this is a very time-consuming task, and not realistic to do in almost any scenario. How then can we make sense of our predictions, while still making use of its complex, non-linear performance benefits?\n",
+ "\n",
+ "RF models have a built-in trick\n",
+ "\n",
+ "Entropy is based on information theory. It quantifies the uncertainty or impurity in a set of data. A node with high entropy has a mix of different classes (high disorder), while a node with low entropy is more homogeneous (low disorder). In this context, a higher value means that removing the variable causes more disorder, so therefore the class is more important.\n",
+ "\n",
+ "Gini coefficient (or Gini impurity) measures the probability of incorrectly classifying a randomly chosen element if it was randomly labeled according to the distribution of classes in the node. Like entropy, a higher Gini impurity for a given variable indicates that it is more important for the predictions of the model.\n",
+ "\n",
+ "For more information, please have a look at the book [Introduction to Statistical Learning](https://www.stat.berkeley.edu/users/rabbee/s154/ISLR_First_Printing.pdf), page 312 discusses these metrics."
+ ],
+ "metadata": {
+ "id": "WGmpkc-2k55h"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "## We train two models with the same parameter settings - one for entropy, and one for the GINI coefficient.\n",
+ "entropy_model = RandomForestClassifier(n_estimators=100, min_samples_split=10, max_depth=25, criterion='entropy')\n",
+ "entropy_model.fit(x_train.values, y_train.values)\n",
+ "features_entropy = entropy_model.feature_importances_\n",
+ "\n",
+ "gini_model = RandomForestClassifier(n_estimators=n_trees, min_samples_split=min_samples, max_depth=25, criterion='gini')\n",
+ "gini_model.fit(x_train.values, y_train.values)\n",
+ "features_gini = gini_model.feature_importances_"
+ ],
+ "metadata": {
+ "id": "b1jFOOvOl6fd"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Let's plot the results for the model trained using both of these metrics."
+ ],
+ "metadata": {
+ "id": "rQ7Nabb-3-PX"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "## Make bar plots\n",
+ "x_axis = range(len(features_gini))\n",
+ "fig, (ax_gini, ax_entropy) = plt.subplots(1, 2, figsize=(10,10))\n",
+ "ax_gini.bar(x_axis, features_gini)\n",
+ "ax_gini.set_title(\"GINI coefficient per variable\", weight='bold')\n",
+ "ax_entropy.bar(x_axis, features_entropy, color='darkgreen')\n",
+ "ax_entropy.set_title(\"Entropy per variable\", weight='bold')\n",
+ "\n",
+ "## Add labels to the bars\n",
+ "for index, label in enumerate(img_bands):\n",
+ " ax_gini.text(index, y=features_gini[index]+0.003, s=label, ha='center', rotation=90)\n",
+ "for index, label in enumerate(img_bands):\n",
+ " ax_entropy.text(index, y=features_entropy[index]+0.003, s=label, ha='center', rotation=90)"
+ ],
+ "metadata": {
+ "id": "Z-0SD4Efk4dB"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Q10: What is depicted in the plots above? What do the metrics tell us about the variables we used? How would you interpret the results of these plots, and based on them, would you remove any variables?**\n"
+ ],
+ "metadata": {
+ "id": "N3f4lE4ox6dn"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "iBw4Qxo20B_r"
+ },
+ "source": [
+ "# 6. Measuring test set performance\n",
+ "---\n",
+ "We've trained and validated several models, now it's time to find out how they perform when we provide it with new data one last time. This is called model testing. We go through this step only once, and this is generally the metrics that we add to the paper. It's because we used the validation set to find a set of parameters that should work well on new data, so we shouldn't then fine-tune on top of the test set.\n",
+ "\n",
+ "So let's re-make our dataset, this time with an image taken more recently in 2018, and with CLC reference data from 2018. This is a fairly substantial challenge, because we're trying to see if we can use the model to update our land cover maps. Let's see how it goes..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "## Initialize the random forest classifier\n",
+ "model = ...\n",
+ "\n",
+ "## Fit the classifier to the training data/labels and measure time taken\n",
+ "..."
+ ],
+ "metadata": {
+ "id": "DNXuuh7dA6ki"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Sample new Landsat-8 image from 2018 in the same area"
+ ],
+ "metadata": {
+ "id": "c_iY9qUj7WJq"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "test_image = (\n",
+ " ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')\n",
+ " .filterBounds(point)\n",
+ " .filterDate('2018-01-01', '2018-12-31')\n",
+ " .sort('CLOUD_COVER')\n",
+ " .first()\n",
+ " .select('SR_B[1-7]')\n",
+ ")\n",
+ "map.centerObject(point, 8)\n",
+ "map.addLayer(test_image, vis_params, \"Landsat-8\")\n",
+ "map"
+ ],
+ "metadata": {
+ "id": "52VYuUJC686F"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Add indices - you can add your own indices if you like."
+ ],
+ "metadata": {
+ "id": "-ygW3_nQ8gk0"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "dvi = test_image.select('SR_B5').subtract(test_image.select('SR_B4')).rename('DVI') # Add the bands you need yoursef\n",
+ "test_image = test_image.addBands(dvi)\n",
+ "\n",
+ "ndvi = test_image.normalizedDifference(['SR_B5', 'SR_B4']).rename('NDVI') # Add the bands you need yoursef\n",
+ "test_image = test_image.addBands(ndvi)"
+ ],
+ "metadata": {
+ "id": "itsBGwNb8htp"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Sample CLC data in new area"
+ ],
+ "metadata": {
+ "id": "cvoffvJO7qF2"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "CLC = ee.Image('COPERNICUS/CORINE/V20/100m/2018').select('landcover').clip(test_image.geometry())\n",
+ "### Use the same settings as during training. Take care not to load the 2012 version above here!\n",
+ "test_lc_points = CLC.sample(\n",
+ " **{\n",
+ " 'region': test_image.geometry(),\n",
+ " ...\n",
+ " }\n",
+ ")\n",
+ "test_lc_reference_pts = test_lc_points.map(generalize_clc_class) # Students do this themselves"
+ ],
+ "metadata": {
+ "id": "R6YPrXKM7xMd"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Make test dataset"
+ ],
+ "metadata": {
+ "id": "WrBS4G9E7JM_"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "test_data = test_image.select(img_bands).sampleRegions(\n",
+ " **{'collection': test_lc_reference_pts, 'properties': [label_col], 'scale': 100}\n",
+ ")\n",
+ "x_test, y_test = feature_collection_to_lists(test_data, img_bands, label_col)"
+ ],
+ "metadata": {
+ "id": "Bzkx126168PK"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We already have our trained model, and we're not supposed to re-train it on this image. Therefore, all that's left to do is to run this model and test its performance on this new dataset."
+ ],
+ "metadata": {
+ "id": "50uZymf0_ENJ"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "## Use the model to predict all of the validation pixels\n",
+ "test_y_pred = model.predict(x_test)\n",
+ "test_accuracy = accuracy_score(test_y_pred, y_test)\n",
+ "print(\"Test set accuracy:\", round(test_accuracy, 3))\n",
+ "\n",
+ "test_kappa = cohen_kappa_score(test_y_pred, y_test)\n",
+ "print(\"Test Kappa:\", round(test_kappa, 3))"
+ ],
+ "metadata": {
+ "id": "hzY4Lgff_J_T"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "In principle, the model's results on the test set are what we would report in a paper. Depending on your expectations, you might be surprised with the results of the model compared to the test set. However, it's important to remember a few things:\n",
+ "1. We haven't played around much with the input variabes. There are more tricks that we could tap into that we haven't discussed (e.g. think about how important it is to see what happens around a pixel in order to determine what's happening inside it).\n",
+ "2. There's a total of 15 output classes. This is a **hard** task for a simple model.\n",
+ "3. We are not consistent with the time of year at which we're predicting, which for some land cover classes has a large impact on how it's perceived.\n",
+ "4. We're training it on just 10'000 pixels. This is a limited amount for a hard task, and usually we'd prefer to use much more data.\n",
+ "\n",
+ "And there's many more reasons we could list. The reality is that modelling is a hard task, and a lot of the trial-and-error and failed models get underreported in papers. By showing you the harsh realities of real-life, difficult modelling problems, we hope to have demonstrated that this type of modelling has strong potential, but also that it doesn't come easy. With that in mind, let's reflect a little bit on how we can do better. Please answer the question below, and we will provide feedback and discuss some of the answers."
+ ],
+ "metadata": {
+ "id": "9xYDtaSU7-rk"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Q11: What are your thoughts on the final test set accuracy? Which factors of the image, reference data, and our workflow do you think affected the final performance of the model the most? How do you think the model workflow can be improved to provide better predictons?**"
+ ],
+ "metadata": {
+ "id": "0KLl5eZRCpK0"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Wrapping up\n",
+ "That's it for this practical. There's many more nuances we could get into, but we'd need another lecture or two if we want to cover them all. Hopefully this practical gave you some insights into what it means to run a machine learning pipeline. We also hope that you see now that machine learning isn't a fully-automatic magical problem-solving box, but that there are quite a lot of steps involved and decisions to make. It's up to you, the modeller, to make the right choices with regards to data set-up and model parameter choices. Since this tutorial is getting lengthy already, we unfortunately can't go into more detail on how to improve the model further, especially on the test set. We will provide feedback based on the questions, and provide general tips on what makes a successful ML pipeline. We also hope that this tutorial inspires you to look deeper into modelling, while retaining a critical outlook on what it is and what it can do!\n",
+ "\n",
+ "and, remember...\n",
+ "\n",
+ "> *All models are wrong, some are useful*\n",
+ "\n",
+ "*George Box, 1976*\n"
+ ],
+ "metadata": {
+ "id": "rucZKCH0F_b2"
+ }
+ }
+ ],
+ "metadata": {
"colab": {
- "base_uri": "https://localhost:8080/",
- "height": 570
- },
- "executionInfo": {
- "elapsed": 1274,
- "status": "ok",
- "timestamp": 1675255871504,
- "user": {
- "displayName": "RA Odongo",
- "userId": "17326618845752559881"
- },
- "user_tz": -60
- },
- "id": "0t0mwCj20B_w",
- "outputId": "e82e9792-657c-4f2b-bfab-44c75b6642d2"
- },
- "outputs": [],
- "source": [
- "fig, axes = plt.subplots(1, 2,figsize=(28,10))\n",
- "\n",
- "class_palette_coded = ['#'+x for x in class_palette]\n",
- "\n",
- "CLC_original[\"band_data\"].plot(ax=axes[0],levels=len(class_palette_coded),colors=class_palette_coded)\n",
- "CLC_classified[\"band_data\"].plot(ax=axes[1],levels=len(class_palette_coded),colors=class_palette_coded)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "Question 9: Describe the results of the Corine Land Cover classification. How do you think the model has performed when you zoom in on a specific area. Do you notice specific land-use classes to be worse than expected? Or better than expected? Please elaborate. \n",
- "
"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "Question 10: Obtain the Corine Land Cover results once more by using a different classification model. How do the results change. Have they become better, or worse? Please elaborate.\n",
- "
"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "colab": {
- "provenance": [
- {
- "file_id": "https://github.com/ElcoK/BigData_AED/blob/main/week5/tutorial1.ipynb",
- "timestamp": 1674849567568
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.11"
+ },
+ "vscode": {
+ "interpreter": {
+ "hash": "f323064ae63d54ed8d769390a968e914fbf7abacffc63e116cd2e04a08ed2d24"
+ }
}
- ]
- },
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.11"
},
- "vscode": {
- "interpreter": {
- "hash": "f323064ae63d54ed8d769390a968e914fbf7abacffc63e116cd2e04a08ed2d24"
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file