Harmful Algal Blooms (HABs) pose significant ecological, economic, and public health challenges. This project investigates the use of synthetic data augmentation, specifically using Gaussian Copulas, to enhance machine learning models for early HAB detection. By augmenting real-world datasets with synthetic data, the study aims to address the scarcity of high-quality datasets and improve predictive accuracy for HAB early warning systems.
This repository is an official implementation of "Synthetic Data Augmentation for Enhancing Harmful Algal Bloom Detection with Machine Learning." It includes all the code and data required to replicate the study's results, which assess the impact of synthetic data volume on model performance.
This project leverages Gaussian Copulas to generate synthetic data that preserves the interdependencies of environmental features such as:
- Water Temperature (°C)
- Salinity (PSU)
- UVB Radiation (mW/m²)
The target variable, Corrected Chlorophyll-a Concentration (µg/L), is a well-established indicator of HAB risk. By systematically analyzing synthetic data volumes ranging from 100 to 1000 samples, the study evaluates their impact on ML model performance.
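For illustration, the Gaussian Copula step could be implemented with the SDV library. The following is a minimal sketch assuming SDV 1.x and the repository's Dataset.xlsx; it is not necessarily the exact code used in this project:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Load the real observations (file name per the repository tree)
real_df = pd.read_excel("Dataset.xlsx")

# Learn per-column distributions plus the copula's correlation structure
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)
synth = GaussianCopulaSynthesizer(metadata)
synth.fit(real_df)

# Draw synthetic rows that preserve the feature interdependencies
synthetic_df = synth.sample(num_rows=250)
```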
Follow the steps below to set up the repository:
- Python >= 3.8
- pip for package installation
- Recommended: a virtual environment created with venv or conda
- Clone the repository:
  git clone https://github.com/Tonyhrule/Synthetic-HAB-ML-Augmentation.git
- Navigate into the project directory:
  cd hab-detection
- Install the required dependencies:
  pip install -r requirements.txt

The repository is organized as follows:

hab-detection/
├── evaluation/ # Scripts for evaluating models
│ ├── CV_eval.py # Cross-validation metrics evaluation
│ ├── percent_error_eval.py # Percent error evaluation script
│ └── values_eval.py # General evaluation metrics
├── figures/ # Contains generated figures for results
├── output/ # Contains processed data and scaler files
├── Dataset.xlsx # Original dataset for preprocessing
├── LICENSE # License for the project
├── preprocess_basic.py # Preprocessing for baseline dataset
├── preprocess_synthetic.py # Preprocessing for synthetic-augmented dataset
├── README.md # Project README
├── requirements.txt # Python dependencies
└── train.py # Script for training the models
The preprocess_basic.py and preprocess_synthetic.py scripts handle preprocessing for the baseline and synthetic-augmented datasets, respectively. Run these scripts to clean, scale, and prepare the data for training and evaluation.
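Assuming neither script requires command-line arguments, preprocessing reduces to:

python preprocess_basic.py
python preprocess_synthetic.py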
Use train.py to train new models or fine-tune existing ones. The script supports hyperparameter tuning and saves trained models to the models/ directory.
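A plain invocation (any tuning options are defined inside train.py itself):

python train.py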
The evaluation scripts (CV_eval.py, percent_error_eval.py, and values_eval.py) compute several metrics (a minimal computation sketch follows this list), including:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- Percent Error
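For reference, all four metrics can be computed from model predictions as in the sketch below; the exact percent-error convention used by the repository's scripts may differ (for example, in how near-zero targets are handled):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def regression_report(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute MSE, RMSE, MAE, and mean absolute percent error."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": mean_absolute_error(y_true, y_pred),
        # Percent error relative to true chlorophyll-a; assumes no zeros in y_true
        "Percent Error": float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100),
    }
```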
The original dataset includes environmental features such as:
- Water temperature (°C)
- Salinity (PSU)
- UVB radiation (mW/m²)
Target variable: Corrected chlorophyll-a concentration (µg/L).

The following preprocessing was applied (sketched after this list):
- Imputation of missing values.
- Standardization of features.
- Polynomial feature expansion.
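A minimal scikit-learn version of these three steps might look as follows; the imputation strategy and polynomial degree are assumptions, not values confirmed by the repository:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),                  # fill missing values
    ("scale", StandardScaler()),                                 # zero mean, unit variance
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # feature expansion
])

# Toy rows: water temperature (°C), salinity (PSU), UVB radiation (mW/m²)
X_raw = np.array([[18.2, 35.1, 120.0], [19.5, np.nan, 140.0], [21.0, 34.7, 160.0]])
X_model = preprocess.fit_transform(X_raw)  # shape: (3, 9) after expansion
```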
The dataset was sourced from publicly available environmental data and is described in detail in the PLoS ONE publication "Water temperature drives phytoplankton blooms in coastal waters".
Synthetic data was generated using Gaussian Copulas at varying volumes (100, 250, 500, 750, 1000 rows) to analyze its impact on model performance.
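Continuing the SDV sketch from the overview (with synth already fitted on real_df), the volume sweep could look like this:

```python
import pandas as pd

# One augmented training set per synthetic volume studied
volumes = [100, 250, 500, 750, 1000]
augmented = {
    n: pd.concat([real_df, synth.sample(num_rows=n)], ignore_index=True)
    for n in volumes
}
```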
Models were evaluated using the metrics listed above: MSE, RMSE, MAE, and percent error.
Cross-validation results and error-distribution plots are available in the figures/ directory for further analysis.
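For context, a typical scikit-learn cross-validation run is sketched below; the model class and fold count are assumptions, and X and y stand in for the preprocessed features and chlorophyll-a targets:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# X, y: preprocessed features and targets from the steps above (placeholders)
model = RandomForestRegressor(random_state=42)  # model choice is an assumption
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print(f"CV MSE: {-scores.mean():.3f} (+/- {scores.std():.3f})")
```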
Findings from the study include:
- Improved Accuracy: Models trained with 100–250 synthetic rows achieved the lowest mean percent error (7.16–7.21%), significantly better than the baseline model (10.17%).
- Noise from Excessive Synthetic Data: Models with 1000 synthetic rows showed reduced accuracy due to noise and overfitting.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.