diff --git a/TAA1/tutorial.ipynb b/TAA1/tutorial.ipynb index dcbc9df..1424d54 100644 --- a/TAA1/tutorial.ipynb +++ b/TAA1/tutorial.ipynb @@ -18,13 +18,22 @@ " 3. Train the classifier using the training data.\n", " 4. Classify an image or feature collection.\n", " 5. Estimate the classification error with independent validation data.\n", - " 6. Test the trained and validated classifier on new, unseen data\n", + " 6. Test the trained and validated classifier on new, unseen data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WedLwKLSV-Nr" + }, + "source": [ " \n", - "## Important before we start\n", + "## Important before we start!\n", "
\n", - "Make sure that you save this file before you continue, else you will lose everything. To do so, go to Bestand/File and click on Een kopie opslaan in Drive/Save a Copy on Drive!\n", "\n", - "Now, rename the file into Week4_Tutorial1.ipynb. You can do so by clicking on the name in the top of this screen." + "⚠️**Questions are indicated in full bold face.** They should be filled in on Canvas. There are 13 questions in total, all with open answers. We suggest keeping answers brief, the questions are mostly to guide your thinking about what is important in each section!\n", + "\n", + "⚠️ Make sure that you save this file before you continue, else you will lose everything. To do so, go to Bestand/File and click on Een kopie opslaan in Drive/Save a Copy on Drive! Now, rename the file into Week4_Tutorial1.ipynb. You can do so by clicking on the name in the top of this screen." ] }, { @@ -53,24 +62,6 @@ "P.S. we will mark important machine learning so you can keep track of the most important terms." ] }, - { - "cell_type": "markdown", - "metadata": { - "id": "6Q_vJ5ZE0B_W" - }, - "source": [ - "

Tutorial Outline

\n", - "
\n", - "
" - ] - }, { "cell_type": "markdown", "metadata": { @@ -92,7 +83,9 @@ "\n", "[**ee**](https://developers.google.com/earth-engine/guides/python_install) is a Python package to use the the Google Earth Engine.\n", "\n", - "[**geemap**](https://geemap.org/) is a Python package for interactive mapping with the Google Earth Engine." + "[**geemap**](https://geemap.org/) is a Python package for interactive mapping with the Google Earth Engine.\n", + "\n", + "[**scikit-learn**](https://scikit-learn.org/stable/) is a Python package for statistical learning that is built on the general NumPy ecosystem. It is the most-used package for classical machine learning." ] }, { @@ -123,27 +116,27 @@ }, { "cell_type": "markdown", - "source": [ - "Finally, let's fix all of our random number generators. This will make our experiments reproducible." - ], "metadata": { "id": "z4Av_NOXm5JF" - } + }, + "source": [ + "Finally, let's fix all of our random number generators. This will make our experiments reproducible." + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3gUXlHbnm_L5" + }, + "outputs": [], "source": [ "### Set global seeds ###\n", "seed = 123\n", "np.random.seed(seed)\n", "random.seed(seed)\n", "os.environ['PYTHONHASHSEED'] = str(seed)" - ], - "metadata": { - "id": "3gUXlHbnm_L5" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -171,33 +164,33 @@ }, { "cell_type": "markdown", - "source": [ - "Alternatively, you can authorize your project as follows, but then be aware that anyone observing the notebook can see your project name, which is potentially a security issue." - ], "metadata": { "id": "Q0LfPa3ekekv" - } + }, + "source": [ + "Alternatively, you can authorize your project as follows, but then be aware that anyone observing the notebook can see your project name, which is potentially a security issue." 
+ ] }, { "cell_type": "code", - "source": [ - "ee.Authenticate()\n", - "ee.Initialize(project=\"YOUR_PROJECT_NAME\")" - ], + "execution_count": null, "metadata": { "id": "oDq0Z2iukjh1" }, - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "ee.Authenticate()\n", + "ee.Initialize(project=\"YOUR_PROJECT_NAME\")" + ] }, { "cell_type": "markdown", - "source": [ - "Then, run the code below to load a basemap and to test if you set your credentials right. You can choose any location in Europe, but the code below will show the coordinates we have selected for the Netherlands. Feel free to change them." - ], "metadata": { "id": "qP9FDYmSlTmm" - } + }, + "source": [ + "Then, run the code below to load a basemap and to test if you set your credentials right. You can choose any location in Europe, but the code below will show the coordinates we have selected for the Netherlands. Feel free to change them." + ] }, { "cell_type": "code", @@ -248,6 +241,9 @@ }, { "cell_type": "markdown", + "metadata": { + "id": "TGCFX-bBtzmx" + }, "source": [ "\n", "The values in each band are between 1 and 65,535, the 16-bit unsigned integer limit. 
The values represent the `surface reflection`, the fraction of light that is reflected at each pixel location.\n", @@ -261,13 +257,15 @@ "**Q2: Which band names should be added to the list to visualize the image with in red-green-blue colours?**\n", "\n", "HINT: [This page contains the documentation, where you can find the answer](https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LC08_C02_T1_L2)" - ], - "metadata": { - "id": "TGCFX-bBtzmx" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "I4QjsgkMtyXK" + }, + "outputs": [], "source": [ "bands = ['SR_B4', 'SR_B3', 'SR_B2'] # Fill in this list yourself\n", "vis_params = {'max': 25000, 'bands':bands} # Limit upper range so you can see detail\n", @@ -275,12 +273,7 @@ "map.centerObject(point, 8)\n", "map.addLayer(image, vis_params, \"Landsat-8\")\n", "map" - ], - "metadata": { - "id": "I4QjsgkMtyXK" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -305,15 +298,20 @@ }, { "cell_type": "markdown", - "source": [ - "Now that we have the image, let's add some indices to add some extra information for the model. Together with the raw spectral values, these will be our input variables, from which the model will learn how to fit to the reference values." - ], "metadata": { "id": "_pbcU0H5gMP3" - } + }, + "source": [ + "Now that we have the image, let's add some indices to add some extra information for the model. Together with the raw spectral values, these will be our input variables, from which the model will learn how to fit to the reference values." 
+ ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jWO-e2IXgfvO" + }, + "outputs": [], "source": [ "# Add the bands you need yoursef in the select functions\n", "dvi = image.select('...').subtract(image.select('...')).rename('DVI')\n", @@ -322,35 +320,30 @@ " # Compute the NDVI yourself - check the docs if needed\n", "ndvi = ...\n", "image = image.addBands(ndvi)" - ], - "metadata": { - "id": "jWO-e2IXgfvO" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "ehSJNOL5g39l" + }, "source": [ "**Q3: Which other indices like the NDVI could be useful to add here, and why? Add at least one more to the image.**\n", "\n", "**!!! Don't forget to rename the index, and to add it later in the exercise !!!**" - ], - "metadata": { - "id": "ehSJNOL5g39l" - } + ] }, { "cell_type": "code", - "source": [ - "my_index = ...\n", - "image = image.addBands(my_index)" - ], + "execution_count": null, "metadata": { "id": "auXaN6aZhDd1" }, - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "my_index = ...\n", + "image = image.addBands(my_index)" + ] }, { "cell_type": "markdown", @@ -495,12 +488,12 @@ }, { "cell_type": "markdown", - "source": [ - "While the third level of LC classes are very detailed, they contain a lot of uncertainty, and they are difficult for a model to learn. Instead, let's first generalize the land cover classes to the second level of the hierarchy. For this, we will finish the following function that we will apply to the column." - ], "metadata": { "id": "D8L-sJnnWTma" - } + }, + "source": [ + "While the third level of LC classes are very detailed, they contain a lot of uncertainty, and they are difficult for a model to learn. Instead, let's first generalize the land cover classes to the second level of the hierarchy. For this, we will finish the following function that we will apply to the column." 
+ ] }, { "cell_type": "code", @@ -519,15 +512,15 @@ }, { "cell_type": "code", - "source": [ - "# TODO: Look up the GEE function to run the function over each points\n", - "lc_reference_pts = lc_points.map(generalize_clc_class) # Students do this themselves" - ], + "execution_count": null, "metadata": { "id": "uyxS7KVHQsbS" }, - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "# TODO: Look up the GEE function to run the function over each points\n", + "lc_reference_pts = lc_points.map(generalize_clc_class) # Students do this themselves" + ] }, { "cell_type": "markdown", @@ -552,6 +545,11 @@ }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SlTEbyCpofAu" + }, + "outputs": [], "source": [ "# This property of the table stores the land cover labels\n", "label_col = 'landcover'\n", @@ -568,12 +566,7 @@ "val_data = image.select(img_bands).sampleRegions(\n", " **{'collection': validation_sample, 'properties': [label_col], 'scale': 100}\n", ")" - ], - "metadata": { - "id": "SlTEbyCpofAu" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -581,48 +574,48 @@ "id": "PAW_rLopv10O" }, "source": [ - "**Q4: What are the downsides of using a sample randomized 80/20 training/validation split like this? Can you think of other ways to partition the dataset which would be better?**" + "**Q5: What are the downsides of using a sample randomized 80/20 training/validation split like this? Can you think of other ways to partition the dataset which would be better?**" ] }, { "cell_type": "markdown", - "source": [ - "Finally, let's visualize a single imge pixel to make sure that everything is included as expected. Contrary to what some might say, preparing and organizing your dataset takes up the bulk of your time when working with machine learning models. **Always double-check your dataset before training, or you'll find yourself wasting a lot of time**!" 
- ], "metadata": { "id": "VXSi1VbXl8-X" - } + }, + "source": [ + "Finally, let's visualize a single imge pixel to make sure that everything is included as expected. Contrary to what some might say, preparing and organizing your dataset takes up the bulk of your time when working with machine learning models. **Always double-check your dataset before training, or you'll find yourself wasting a lot of time**!" + ] }, { "cell_type": "code", - "source": [ - "# Print the first feature in the training collection\n", - "print(train_data.first().getInfo())" - ], + "execution_count": null, "metadata": { "id": "HCxNoOqzkmBQ" }, - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "# Print the first feature in the training collection\n", + "print(train_data.first().getInfo())" + ] }, { "cell_type": "markdown", - "source": [ - "As can be seen, each datapoint contains reflectance values from the Landsat-8 image, the indices we defined, and a land cover reference label." - ], "metadata": { "id": "fc3_UqRPpo62" - } + }, + "source": [ + "As can be seen, each datapoint contains reflectance values from the Landsat-8 image, the indices we defined, and a land cover reference label." + ] }, { "cell_type": "markdown", + "metadata": { + "id": "hg35-o14SBdU" + }, "source": [ "#### Converting to scikit-learn compatible data\n", "The GEE package provides built-in functions for a couple of popular models, so in principle we can continue working entirely within GEE. However, we're interested in teaching you re-usable machine learning skills. Therefore, we run our models using scikit-learn, Python's most popular package for standard ML models. For this purpose, we should convert our data into a data structure that is compatible with this package." - ], - "metadata": { - "id": "hg35-o14SBdU" - } + ] }, { "cell_type": "code", @@ -670,6 +663,9 @@ }, { "cell_type": "markdown", + "metadata": { + "id": "OXFAonsI0eYo" + }, "source": [ "### 5a. 
Logistic Regression\n", "---\n", @@ -678,31 +674,33 @@ "![Exam_pass_logistic_curve.svg.png]()\n", "\n", "(Image courtesy of Wikimedia)" - ], - "metadata": { - "id": "OXFAonsI0eYo" - } + ] }, { "cell_type": "markdown", - "source": [ - "**Q5: Why can't we use a regular linear regression to regress the probabilities of each class, why do we have to use a sigmoid function to bound our predictions? Which kinds of problem(s) could using a linear regression model cause for this case?**" - ], "metadata": { "id": "k5p-61m6Pm3z" - } + }, + "source": [ + "**Q6: Why can't we use a regular linear regression to regress the probabilities of each class, why do we have to use a sigmoid function to bound our predictions? Which kinds of problem(s) could using a linear regression model cause for this case?**" + ] }, { "cell_type": "markdown", - "source": [ - "Let's fit the model to our training data and evaluate how well it performs on the training split using a straightforward overall accuracy measures, where we simply count the ratio of correct positives. Under the hood, we assign a `1` or a `0` to each class, which we call one-hot encoding (as in - only one of the labels is *hot*, with a value of `1`). The model is then optimized to predict a confidence for each class separately which indicates if the given class is present or not. In the scikit implementation, the most confident class (closest to `1`) gets assigned to the datapoint. Let's try it out!" - ], "metadata": { "id": "pGWTKZDXRszD" - } + }, + "source": [ + "Let's fit the model to our training data and evaluate how well it performs on the training split using a straightforward overall accuracy measures, where we simply count the ratio of correct positives. Under the hood, we assign a `1` or a `0` to each class, which we call one-hot encoding (as in - only one of the labels is *hot*, with a value of `1`). 
The model is then optimized to predict a confidence for each class separately which indicates if the given class is present or not. In the scikit implementation, the most confident class (closest to `1`) gets assigned to the datapoint. Let's try it out!" + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "E6LOxKyjFLfh" + }, + "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score, cohen_kappa_score\n", @@ -715,49 +713,49 @@ "train_y_pred = model.predict(x_train)\n", "train_accuracy = accuracy_score(train_y_pred, y_train)\n", "print(\"Training accuracy:\", round(train_accuracy, 3))" - ], - "metadata": { - "id": "E6LOxKyjFLfh" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "YWYPUzDpTWE5" + }, "source": [ "Okay, a little under 50% accurate, that's reasonably good. But is it better than random guessing? For this, we use Cohen's Kappa. The **Kappa Coefficient** is generated from a statistical test to evaluate the accuracy of a classification. **Kappa** essentially evaluates how well the classification performed as compared to just randomly assigning values, i.e. did the classification do better than random. The **Kappa Coefficient** can range from -1 to 1. A value of 0 indicated that the classification is no better than a random classification. A negative number indicates the classification is significantly worse than random guessing. A value of to 1 indicates that the predicted values are equal to the ground truth, therefore not being the product of any random guessing.\n", "\n", "By the way, in professional parlay, we refer to measures as the `accuracy` and `Cohen's Kappa` as performance metrics." 
- ], - "metadata": { - "id": "YWYPUzDpTWE5" - } + ] }, { "cell_type": "code", - "source": [ - "train_kappa = cohen_kappa_score(train_y_pred, y_train)\n", - "print(\"Training Kappa:\", round(train_kappa, 3))" - ], + "execution_count": null, "metadata": { "id": "1GPPPS0dR7ee" }, - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "train_kappa = cohen_kappa_score(train_y_pred, y_train)\n", + "print(\"Training Kappa:\", round(train_kappa, 3))" + ] }, { "cell_type": "markdown", + "metadata": { + "id": "UL_g9aFxSubH" + }, "source": [ "Looking good. It's not a perfect performance, but the model outperforms random guessing by a significant margin. Not bad for the simplest possible model with a fairly large number of target classes.\n", "\n", "Now, let's see if it manages to predict well on pixels which it hasn't been trained for." - ], - "metadata": { - "id": "UL_g9aFxSubH" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TX2MyP5vBmsP" + }, + "outputs": [], "source": [ "val_y_pred = model.predict(x_val)\n", "\n", @@ -766,30 +764,25 @@ "\n", "val_kappa = cohen_kappa_score(val_y_pred, y_val)\n", "print(\"Validation Kappa:\", round(val_kappa, 3))" - ], - "metadata": { - "id": "TX2MyP5vBmsP" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", - "source": [ - "**Q6: Interpret the results. What do you think of the performance on the validation set compared to training, is this an expected outcome, why/why not? Considering the random split we did, do you think this model could suitable for predicting in new, unseen regions?**" - ], "metadata": { "id": "A1am1vKbVUDC" - } + }, + "source": [ + "**Q7: Interpret the results. What do you think of the performance on the validation set compared to training, is this an expected outcome, why/why not? 
Considering the random split we did, do you think this model could suitable for predicting in new, unseen regions?**" + ] }, { "cell_type": "markdown", - "source": [ - "Before we move on, let's make two quick functions that we can re-use whenever we want to evaluate our models quickly. It will save us a lot of code re-use when iterating through candidate models." - ], "metadata": { "id": "nm7CpdAmfO5U" - } + }, + "source": [ + "Before we move on, let's make two quick functions that we can re-use whenever we want to evaluate our models quickly. It will save us a lot of code re-use when iterating through candidate models." + ] }, { "cell_type": "code", @@ -834,12 +827,12 @@ }, { "cell_type": "markdown", - "source": [ - "Let's repeat the process of training a model by using a decision tree classifier. This is a parametrized model, which means that there are certain decisions to make, which will affect the performance of the model. First, let's run it with the default settings, and you'll quickly see why it's important to pay attention to the parameters of your models." - ], "metadata": { "id": "P_Vj87ExYbwg" - } + }, + "source": [ + "Let's repeat the process of training a model by using a decision tree classifier. This is a parametrized model, which means that there are certain decisions to make, which will affect the performance of the model. First, let's run it with the default settings, and you'll quickly see why it's important to pay attention to the parameters of your models." + ] }, { "cell_type": "code", @@ -858,14 +851,14 @@ }, { "cell_type": "markdown", + "metadata": { + "id": "55KDeyABdsaL" + }, "source": [ "This results in a very big contrast in the model's performance. How can it be that the model almost performs perfectly on the training dataset, but does poorly on the validation dataset? For the answer, [take a look at the documentation of the model](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). 
In particular, look at the parameters `max_depth` and `min_samples_split`. Use the docs and any other means at your disposal to answer the following question.\n", "\n", - "**Q7: In your own words, what is the effect of changing `max depth` and `min_samples_split`? How could these two parameters result in a perfect fit for the model on the training data?**" - ], - "metadata": { - "id": "55KDeyABdsaL" - } + "**Q8: In your own words, what is the effect of changing `max depth` and `min_samples_split`? How could these two parameters result in a perfect fit for the model on the training data?**" + ] }, { "cell_type": "markdown", @@ -892,12 +885,12 @@ }, { "cell_type": "markdown", - "source": [ - "**Q8: Report the settings of the best-performing model on the validation dataset, along with the validation accuracy and Kappa. Why do you think the parameters you chose ended up working better than the default settings, which produced a perfect fit on the training dataset?**" - ], "metadata": { "id": "y1XZI7AbhSDi" - } + }, + "source": [ + "**Q9: Report the settings of the best-performing model on the validation dataset, along with the validation accuracy and Kappa. Why do you think the parameters you chose ended up working better than the default settings, which produced a perfect fit on the training dataset?**" + ] }, { "cell_type": "markdown", @@ -923,6 +916,9 @@ }, { "cell_type": "markdown", + "metadata": { + "id": "AcA_nTNEphG7" + }, "source": [ "[First, take a look at the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), where you'll see that (logically) it is a parametrized model too. To keep it simple, let's focus on two important parameters while fixing the rest. We focus on `min_samples_split`, and `n_estimators`. 
You know the first one, and by its name you can reasonably figure out that `n_estimators` is the number of trees we'd like to use.\n", "\n", @@ -931,10 +927,7 @@ "![gridsearch.png]()\n", "\n", "(A visual example of grid search, where parameter searching occurs in a grid with regular intervals. Adapted from [Pilario, Cao, and Shafiee](https://www.researchgate.net/publication/341691661_A_Kernel_Design_Approach_to_Improve_Kernel_Subspace_Identification))" - ], - "metadata": { - "id": "AcA_nTNEphG7" - } + ] }, { "cell_type": "code", @@ -952,6 +945,9 @@ }, { "cell_type": "markdown", + "metadata": { + "id": "XGvysGuTsObl" + }, "source": [ "The following loop goes through all of the parameter combinations to measure their performance. In a nutshell, the code below performs the following steps:\n", "\n", @@ -962,13 +958,15 @@ "4. In the inner loop, we train our model on the dataset, measure time taken and performance, and insert these in the matrices\n", "\n", "It can take a minute or two to complete running, so have a sip of coffee and wait patiently - after you finish filling in the code to complete it!" - ], - "metadata": { - "id": "XGvysGuTsObl" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SoQTiliasbY_" + }, + "outputs": [], "source": [ "from datetime import datetime\n", "from tqdm import tqdm # fancy progress bar package\n", @@ -1012,24 +1010,24 @@ "\n", "# Clear the progress bar print(s) once it's done running\n", "clear_output()" - ], - "metadata": { - "id": "SoQTiliasbY_" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", - "source": [ - "Let's have a look at which parameter combination provides the best performance, and how long it took to run each model." - ], "metadata": { "id": "oz4gOjMQsJ4M" - } + }, + "source": [ + "Let's have a look at which parameter combination provides the best performance, and how long it took to run each model." 
+ ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_PLD8spawTUb" + }, + "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", @@ -1062,12 +1060,7 @@ "for t_i, t_val in enumerate(n_estimator_ranges):\n", " for l_i, l_val in enumerate(min_sample_ranges):\n", " ax_time.text(t_i, l_i, s=time_paramsearch[t_i, l_i], ha='center')" - ], - "metadata": { - "id": "_PLD8spawTUb" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -1075,11 +1068,14 @@ "id": "IMpOi0rk0B_l" }, "source": [ - "**Q9: Interpret and discuss the performance matrices above. What do you believe are the best parameters for the model, and why?**" + "**Q10: Interpret and discuss the performance matrices above. What do you believe are the best parameters for the model, and why?**" ] }, { "cell_type": "markdown", + "metadata": { + "id": "X-_K6F73DoCU" + }, "source": [ "Let's use one more trick to evaluate the performance of our model - a confusion matrix. It is a simple table which compares the reference labels with the model's predictions by looking at the mistakes made between all reference class combinations to determine which classes are most often confused.\n", "\n", @@ -1091,13 +1087,15 @@ "* False Negatives (FN): The number of times the model incorrectly predicted the negative class when it was actually positive (also called a \"Type II error\").\n", "\n", "By looking at these values, you can understand how well your model is performing for each class, as well as which classes are confused most often." 
- ], - "metadata": { - "id": "X-_K6F73DoCU" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kgPBK3cnDnTn" + }, + "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", "labels = sorted(set(y_val))\n", @@ -1106,26 +1104,24 @@ " display_labels=labels)\n", "disp.plot()\n", "plt.show()" - ], - "metadata": { - "id": "kgPBK3cnDnTn" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "YGTfcnNsFh36" + }, "source": [ - "**Q10: Which classes are most often confused, and why do you think this is the case?**\n", + "**Q11: Which classes are most often confused, and why do you think this is the case?**\n", "\n", "(look at the CLC class labels for a more insightful discussion)" - ], - "metadata": { - "id": "YGTfcnNsFh36" - } + ] }, { "cell_type": "markdown", + "metadata": { + "id": "WGmpkc-2k55h" + }, "source": [ "Finally, let's consider the topic of *interpretability*. That is, how easy is it to understand the reasoning behind a model's decisions. When you use a large number of trees, it is technically possible to understand exactly how decisions are made by analyzing the splits in each trees. However, this is a very time-consuming task, and not realistic to do in almost any scenario. How then can we make sense of our predictions, while still making use of its complex, non-linear performance benefits?\n", "\n", @@ -1136,13 +1132,15 @@ "Gini coefficient (or Gini impurity) measures the probability of incorrectly classifying a randomly chosen element if it was randomly labeled according to the distribution of classes in the node. 
Like entropy, a higher Gini impurity for a given variable indicates that it is more important for the predictions of the model.\n", "\n", "For more information, please have a look at the book [Introduction to Statistical Learning](https://www.stat.berkeley.edu/users/rabbee/s154/ISLR_First_Printing.pdf), page 312 discusses these metrics." - ], - "metadata": { - "id": "WGmpkc-2k55h" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "b1jFOOvOl6fd" + }, + "outputs": [], "source": [ "## We train two models with the same parameter settings - one for entropy, and one for the GINI coefficient.\n", "entropy_model = RandomForestClassifier(n_estimators=100, min_samples_split=10, max_depth=25, criterion='entropy')\n", @@ -1152,24 +1150,24 @@ "gini_model = RandomForestClassifier(n_estimators=n_trees, min_samples_split=min_samples, max_depth=25, criterion='gini')\n", "gini_model.fit(x_train.values, y_train.values)\n", "features_gini = gini_model.feature_importances_" - ], - "metadata": { - "id": "b1jFOOvOl6fd" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", - "source": [ - "Let's plot the results for the model trained using both of these metrics." - ], "metadata": { "id": "rQ7Nabb-3-PX" - } + }, + "source": [ + "Let's plot the results for the model trained using both of these metrics." + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Z-0SD4Efk4dB" + }, + "outputs": [], "source": [ "## Make bar plots\n", "x_axis = range(len(features_gini))\n", @@ -1184,21 +1182,16 @@ " ax_gini.text(index, y=features_gini[index]+0.003, s=label, ha='center', rotation=90)\n", "for index, label in enumerate(img_bands):\n", " ax_entropy.text(index, y=features_entropy[index]+0.003, s=label, ha='center', rotation=90)" - ], - "metadata": { - "id": "Z-0SD4Efk4dB" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", - "source": [ - "**Q10: What is depicted in the plots above? 
What do the metrics tell us about the variables we used? How would you interpret the results of these plots, and based on them, would you remove any variables?**\n" - ], "metadata": { "id": "N3f4lE4ox6dn" - } + }, + "source": [ + "**Q12: What is depicted in the plots above? What do the metrics tell us about the variables we used? How would you interpret the results of these plots, and based on them, would you remove any variables?**\n" + ] }, { "cell_type": "markdown", @@ -1215,30 +1208,35 @@ }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DNXuuh7dA6ki" + }, + "outputs": [], "source": [ "## Initialize the random forest classifier\n", "model = ...\n", "\n", "## Fit the classifier to the training data/labels and measure time taken\n", "..." - ], - "metadata": { - "id": "DNXuuh7dA6ki" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", - "source": [ - "Sample new Landsat-8 image from 2018 in the same area" - ], "metadata": { "id": "c_iY9qUj7WJq" - } + }, + "source": [ + "Sample new Landsat-8 image from 2018 in the same area" + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "52VYuUJC686F" + }, + "outputs": [], "source": [ "test_image = (\n", " ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')\n", @@ -1251,48 +1249,48 @@ "map.centerObject(point, 8)\n", "map.addLayer(test_image, vis_params, \"Landsat-8\")\n", "map" - ], - "metadata": { - "id": "52VYuUJC686F" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", - "source": [ - "Add indices - you can add your own indices if you like." - ], "metadata": { "id": "-ygW3_nQ8gk0" - } + }, + "source": [ + "Add indices - you can add your own indices if you like." 
+ ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "itsBGwNb8htp" + }, + "outputs": [], "source": [ "dvi = test_image.select('SR_B5').subtract(test_image.select('SR_B4')).rename('DVI') # Add the bands you need yoursef\n", "test_image = test_image.addBands(dvi)\n", "\n", "ndvi = test_image.normalizedDifference(['SR_B5', 'SR_B4']).rename('NDVI') # Add the bands you need yoursef\n", "test_image = test_image.addBands(ndvi)" - ], - "metadata": { - "id": "itsBGwNb8htp" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", - "source": [ - "Sample CLC data in new area" - ], "metadata": { "id": "cvoffvJO7qF2" - } + }, + "source": [ + "Sample CLC data in new area" + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "R6YPrXKM7xMd" + }, + "outputs": [], "source": [ "CLC = ee.Image('COPERNICUS/CORINE/V20/100m/2018').select('landcover').clip(test_image.geometry())\n", "### Use the same settings as during training. Take care not to load the 2012 version above here!\n", @@ -1303,47 +1301,47 @@ " }\n", ")\n", "test_lc_reference_pts = test_lc_points.map(generalize_clc_class) # Students do this themselves" - ], - "metadata": { - "id": "R6YPrXKM7xMd" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", - "source": [ - "Make test dataset" - ], "metadata": { "id": "WrBS4G9E7JM_" - } + }, + "source": [ + "Make test dataset" + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Bzkx126168PK" + }, + "outputs": [], "source": [ "test_data = test_image.select(img_bands).sampleRegions(\n", " **{'collection': test_lc_reference_pts, 'properties': [label_col], 'scale': 100}\n", ")\n", "x_test, y_test = feature_collection_to_lists(test_data, img_bands, label_col)" - ], - "metadata": { - "id": "Bzkx126168PK" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", - "source": [ - "We already have our trained model, and we're not 
supposed to re-train it on this image. Therefore, all that's left to do is to run this model and test its performance on this new dataset." - ], "metadata": { "id": "50uZymf0_ENJ" - } + }, + "source": [ + "We already have our trained model, and we're not supposed to re-train it on this image. Therefore, all that's left to do is to run this model and test its performance on this new dataset." + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hzY4Lgff_J_T" + }, + "outputs": [], "source": [ "## Use the model to predict all of the validation pixels\n", "test_y_pred = model.predict(x_test)\n", @@ -1352,41 +1350,41 @@ "\n", "test_kappa = cohen_kappa_score(test_y_pred, y_test)\n", "print(\"Test Kappa:\", round(test_kappa, 3))" - ], - "metadata": { - "id": "hzY4Lgff_J_T" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "yQCOZhlo0N_P" + }, "source": [ "Lastly, let's quickly take a look at the classification for the entire area. For this, we re-train the model using packages built into GEE, such that we don't have to convert data back and forth.\n", "\n", "This will also show you how much Google Earth Engine has streamlined the pipeline of working with satellite data!" 
- ], - "metadata": { - "id": "yQCOZhlo0N_P" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KXpWvNd0JSbo" + }, + "outputs": [], "source": [ "def convert_to_float(feature):\n", " return feature.set(label_col, ee.Number.parse(feature.get(label_col)))\n", "\n", "train_data = train_data.map(convert_to_float)\n", "test_data = test_data.map(convert_to_float)" - ], - "metadata": { - "id": "KXpWvNd0JSbo" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MNvycQ_tFTW3" + }, + "outputs": [], "source": [ "# Train the model\n", "classifier = ee.Classifier.smileRandomForest(numberOfTrees=250, minLeafPopulation=2, maxNodes=25)\n", @@ -1398,15 +1396,15 @@ "# Calculate the confusion matrix on test set\n", "confusion_matrix = classified_test.errorMatrix(label_col, 'classification')\n", "print('Test Accuracy:', confusion_matrix.accuracy().getInfo())" - ], - "metadata": { - "id": "MNvycQ_tFTW3" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QxzzxUhaFUCG" + }, + "outputs": [], "source": [ "# Classify the test image\n", "classified_image = image.classify(trained_classifier)\n", @@ -1429,36 +1427,30 @@ " 51: '#00FFFF', # Inland waters (Aqua)\n", " 52: '#E0FFFF' # Marine waters (Light Cyan)\n", "}\n", + "palette = [clc_colors[label] for label in clc_colors.keys()]\n", "\n", "# Convert string labels to numeric codes\n", "def classify_to_numeric(image):\n", - " # Create a dictionary that maps string labels to numeric values\n", " label_to_numeric = {label: index for index, label in enumerate(clc_colors.keys())}\n", "\n", - " # Convert string label to numeric value\n", " return image.remap(\n", " list(label_to_numeric.keys()),\n", " list(label_to_numeric.values())\n", " )\n", - "\n", - "# Convert the classified image\n", "numeric_classified_image = classify_to_numeric(classified_image)\n", 
"\n", - "# Generate a palette for visualization\n", - "palette = [clc_colors[label] for label in clc_colors.keys()]\n", - "\n", "# Add the numeric classified image to the map\n", "map.addLayer(numeric_classified_image, {'palette': palette, 'min': 0, 'max': len(clc_colors) - 1}, 'Classified Image')\n", "map" - ], - "metadata": { - "id": "QxzzxUhaFUCG" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7Q3AlI3XFWs1" + }, + "outputs": [], "source": [ "# In case you're curious, you can see approximations of your unique values with this script:\n", "unique_values = classified_image.reduceRegion(\n", @@ -1468,15 +1460,13 @@ " maxPixels=1e8\n", ").getInfo()\n", "unique_values" - ], - "metadata": { - "id": "7Q3AlI3XFWs1" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "9xYDtaSU7-rk" + }, "source": [ "In principle, the model's results on the test set are what we would report in a paper. Depending on your expectations, you might be surprised with the results of the model compared to the test set. However, it's important to remember a few things:\n", "1. We haven't played around much with the input variabes. There are more tricks that we could tap into that we haven't discussed (e.g. think about how important it is to see what happens around a pixel in order to determine what's happening inside it).\n", @@ -1485,22 +1475,22 @@ "4. We're training it on just 10'000 pixels. This is a limited amount for a hard task, and usually we'd prefer to use much more data.\n", "\n", "And there's many more reasons we could list. The reality is that modelling is a hard task, and a lot of the trial-and-error and failed models get underreported in papers. By showing you the harsh realities of real-life, difficult modelling problems, we hope to have demonstrated that this type of modelling has strong potential, but also that it doesn't come easy. 
With that in mind, let's reflect a little bit on how we can do better. Please answer the question below, and we will provide feedback and discuss some of the answers." - ], - "metadata": { - "id": "9xYDtaSU7-rk" - } + ] }, { "cell_type": "markdown", - "source": [ - "**Q11: What are your thoughts on the final test set results? Which factors of the image, reference data, and our workflow do you think affected the final performance of the model the most? How do you think the model workflow can be improved to provide better predictons?**" - ], "metadata": { "id": "0KLl5eZRCpK0" - } + }, + "source": [ + "**Q13: What are your thoughts on the final test set results? Which factors of the image, reference data, and our workflow do you think affected the final performance of the model the most? How do you think the model workflow can be improved to provide better predictions?**" + ] }, { "cell_type": "markdown", + "metadata": { + "id": "rucZKCH0F_b2" + }, "source": [ "## Wrapping up\n", "That's it for this practical. There's many more nuances we could get into, but we'd need another lecture or two if we want to cover them all. Hopefully this practical gave you some insights into what it means to run a machine learning pipeline. We also hope that you see now that machine learning isn't a fully-automatic magical problem-solving box, but that there are quite a lot of steps involved and decisions to make. It's up to you, the modeller, to make the right choices with regards to data set-up and model parameter choices. Since this tutorial is getting lengthy already, we unfortunately can't go into more detail on how to improve the model further, especially on the test set. We will provide feedback based on the questions, and provide general tips on what makes a successful ML pipeline.
We also hope that this tutorial inspires you to look deeper into modelling, while retaining a critical outlook on what it is and what it can do!\n", @@ -1510,10 +1500,7 @@ "> *All models are wrong, some are useful*\n", "\n", "*George Box, 1976*\n" - ], - "metadata": { - "id": "rucZKCH0F_b2" - } + ] } ], "metadata": {