Deforestation Prediction Project
-
Project Overview This project aims to leverage machine learning to predict and monitor deforestation in vital tropical rainforests. By analyzing a rich dataset of satellite imagery and environmental factors, the model seeks to identify areas at high risk of future forest loss. The ultimate goal is to create a proactive tool that can help conservation organizations and policymakers make informed decisions to protect these critical ecosystems.
-
The Problem of Deforestation Deforestation is a critical environmental issue with far-reaching consequences, including biodiversity loss, disruption of water cycles, soil erosion, and a significant contribution to global climate change. Traditional methods of monitoring are often reactive. This project takes a proactive approach, aiming to forecast deforestation events before they occur.
-
Objective The primary objective is to build a robust predictive model that accurately classifies areas as being at high or low risk of a deforestation event. This involves:
Preprocessing and cleaning diverse geospatial and environmental data.
Analyzing the key drivers and indicators of deforestation.
Training and evaluating various machine learning models to find the most effective one.
Creating a system that can provide a "Predicted Risk Score" for a given geographical tile.
- The Dataset The analysis is based on the deforestation_sample_1100.csv dataset. This dataset contains 1100 unique observations, each representing a specific geographical tile at a particular point in time.
The data is split into train, validation, and test sets.
Key Data Features Include: Geospatial Data: Latitude, Longitude, Elevation (m), Slope (°)
Climatic Data: Rainfall (mm), Temperature (°C), Cloud Cover (%)
Satellite Indices: NDVI (Vegetation Health), NDMI (Moisture Index), EVI (Enhanced Vegetation Index)
Forest Metrics: Tree Cover (%), Canopy Height (m), Forest Loss Last 3Y (%)
Human Activity Indicators: Distance to Road (km), Distance to Settlement (km), Population Density, Fire Alerts (7d)
Land Use Data: Protected Area, Logging Concession
Target Variable: Deforestation Event (Yes=1, No=0)
- Project Workflow Data Loading: The initial dataset is loaded using Python's pandas library.
Data Preprocessing:
Column Name Cleaning: Standardized column names by removing special characters and spaces (e.g., Tree Cover (%) becomes tree_cover_percent).
Type Conversion: The Date column was converted to a datetime object for time-series analysis.
Categorical Encoding: Features like Region and Country were one-hot encoded to be used in the machine learning model.
Exploratory Data Analysis (EDA): In-depth analysis to understand the relationships between different features and their correlation with deforestation events.
Model Building & Training: Different classification models (e.g., Logistic Regression, Random Forest, Gradient Boosting) will be trained on the preprocessed data.
Model Evaluation: The models will be evaluated based on metrics such as Accuracy, Precision, Recall, and F1-Score to determine the best-performing algorithm.
This project utilizes a Logistic Regression model, a robust and highly interpretable machine learning algorithm ideal for binary classification tasks. After experimenting with other models, Logistic Regression was chosen for its excellent performance and reliability in this context.
What the Model Predicts The model is trained to solve a specific classification problem: "Will a deforestation event occur in this specific area?"
It analyzes 30 different input features for a geographical tile and produces two key outputs:
A binary prediction:
1 if a deforestation event is likely.
0 if a deforestation event is not likely.
A confidence score: The probability of the prediction being correct, which helps in assessing the risk level.
How the Model Makes Predictions The model learns patterns from historical data to identify which factors are most strongly associated with forest loss. The analysis revealed that the most influential features for its predictions are:
Historical Forest Loss: Cumulative Deforested Area (%) and Forest Loss Last 3Y (%) are the strongest predictors.
Human Activity: Population Density (per km²), Distance to Road (km), and the presence of a Logging Concession.
Environmental Factors: Vegetation health indices like NDVI and EVI.
Essentially, the model has learned that areas with a history of deforestation that are close to human infrastructure are at the highest risk.
Final Model Performance The model demonstrates a high degree of accuracy and reliability on unseen test data, making it a trustworthy tool for an early-warning system.
Overall Accuracy: 97.04%
Recall (for "Deforestation"): 92% - The model successfully identifies 92% of all actual deforestation events, meaning very few critical events are missed.
Precision (for "Deforestation"): 96% - When the model raises an alarm, it is correct 96% of the time, leading to very few false alarms.
This strong balance between recall and precision makes the model highly effective and suitable for real-world deployment.