diff --git a/README.md b/README.md
index d372d36..08225a3 100644
--- a/README.md
+++ b/README.md
@@ -1,40 +1,49 @@
-# Time Series Analysis Using Machine Learning
+# 🌊 Time Series Analysis Using Machine Learning
-## Project Overview
-This repository contains the code and data for a time series forecasting project aimed at predicting heavy metal concentrations in industrial wastewater. The study combines machine learning and statistical modeling techniques, including ARIMA (AutoRegressive Integrated Moving Average) and PSO-LSTM (Particle Swarm Optimization - Long Short-Term Memory), to accurately forecast heavy metal levels based on input features such as pH, chemical dosage, redox potential, and conductivity.
+---
-The goal of this project is to develop robust predictive models to optimize wastewater treatment processes, enabling better control of heavy metal removal and improving overall operational efficiency.
+## 📌 Project Overview
+This repository focuses on predicting **heavy metal concentrations** in industrial wastewater using **time series forecasting**. Models like **ARIMA** and **PSO-LSTM** combine statistical and machine learning techniques to:
-Below is a flowchart illustrating the stages of wastewater treatment and measurement:
+- Forecast heavy metal levels based on features like **pH**, **chemical dosage**, **redox potential**, and **conductivity**.
+- Optimize wastewater treatment for better efficiency and control.
+
+### 🎯 Goal
+Build reliable models to improve wastewater treatment processes and operational performance.
+
+---
+
+### 💡 Workflow
+Stages of wastewater treatment and measurement:
 ![Wastewater Treatment Workflow](results/processflow.png)
-## Models Implemented
+---
-### 1. **ARIMA (AutoRegressive Integrated Moving Average)**
-  - Used to model and forecast heavy metal concentration in wastewater treatment.
-  - Model parameters were tuned using grid search to find the optimal p, d, q values.
-  - The following pseudocode illustrates the algorithmic approach:
+## 🧠 Models Implemented
-### Algorithm: AutoRegressive Integrated Moving Average (ARIMA)
+### 1. **ARIMA (AutoRegressive Integrated Moving Average)**
+  - Forecasts heavy metal concentrations in wastewater.
+  - Parameters (p, d, q) tuned via grid search.
+  - Key steps:
+    1. Identify parameters (p, d, q) using ACF and PACF plots.
+    2. Make the series stationary (if needed).
+    3. Fit the ARIMA model.
+    4. Forecast future values.
+    5. Evaluate performance (MSE, MAE).
-![ARIMA Algorithm](results/ARIMA_algorithm.png)
+  **Algorithm Overview**:
-  - Input: Time series data
-  - Output: Predicted future values
+  ![ARIMA Algorithm](results/ARIMA_algorithm.png)
-  1. Identify the model parameters (p, d, q) using ACF and PACF plots
-  2. Perform differencing on the time series to make it stationary (if needed)
-  3. Fit the ARIMA model with the identified parameters
-  4. Generate forecasts using the fitted model
-  5. Evaluate model performance using error metrics like MSE, MAE
+---
 ### 2. **PSO-LSTM (Particle Swarm Optimization - Long Short-Term Memory)**
-  - A hybrid model combining LSTM networks for capturing long-term dependencies in time series data and PSO for optimizing model hyperparameters.
-  - Features such as pH, chemical dosage, redox potential, and conductivity were used as input variables to predict future concentrations of heavy metals.
-  - The PSO algorithm helps to find optimal hyperparameters, including learning rate, number of neurons, and dropout rate, to improve the LSTM's performance.
+  - Combines **LSTM** for long-term dependencies with **PSO** for hyperparameter optimization.
+  - Input features: **pH**, **chemical dosage**, **redox potential**, and **conductivity**.
+  - PSO optimizes key hyperparameters (learning rate, neurons, dropout rate) for better model performance.
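Review note: the PSO-LSTM bullets say that PSO searches over hyperparameters such as learning rate and dropout rate. The snippet below is a minimal sketch of a basic global-best PSO loop, not the repository's implementation. The names `pso_minimize` and `toy_validation_loss`, the swarm settings, and the search bounds are all assumptions made for illustration. The toy objective stands in for "train an LSTM and return its validation loss"; its minimum is placed at learning rate 0.1 and dropout 0.7 only so the result is easy to check, and it says nothing about the real loss surface.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=60,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` inside the box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)

    # Random starting positions inside the bounds, zero starting velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers its own best position; the swarm shares one global best.
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the search bounds.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_validation_loss(params):
    """Hypothetical stand-in for training an LSTM and returning validation loss."""
    learning_rate, dropout = params
    return (learning_rate - 0.1) ** 2 + (dropout - 0.7) ** 2


best_params, best_loss = pso_minimize(
    toy_validation_loss,
    bounds=[(1e-4, 1.0), (0.0, 0.9)],  # assumed ranges: learning rate, dropout
)
print(best_params, best_loss)
```

In a real run each call to the objective would train and validate a model, so the swarm size and iteration count would likely need to be much smaller than in this toy setting.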
-## Repository Structure
+## 📂 Repository Structure
 time-series-forecasting/
 ├── .github/ # GitHub Actions workflows for CI/CD (if applicable)
 │   └── workflows/
@@ -74,7 +83,7 @@ Below is a flowchart illustrating the stages of wastewater treatment and measure
 ├── requirements.txt # Python dependencies
 └── LICENSE # License for the project (if applicable)
-## How to Run the Code
+## ⚙️ How to Run the Code
 1. Clone the repository:
   - `git clone https://github.com/yasirusama61/Time-Series-Analysis.git`
   - `cd Time-Series-Analysis`
@@ -94,7 +103,7 @@ Below is a flowchart illustrating the stages of wastewater treatment and measure
 5. Hyperparameter Optimization
   - `python scripts/hyperparameter_optimization.py`
-## Data
+## 📊 Data
 The data used in this project was collected from an industrial wastewater treatment facility in collaboration with a company specializing in environmental protection and energy-saving technologies. Due to confidentiality agreements, the original dataset cannot be publicly shared. However, the analysis conducted in this project utilized features such as:
@@ -108,180 +117,218 @@ The data used in this project was collected from an industrial wastewater treatm
 For privacy reasons, certain sensitive details have been anonymized or modified in the dataset used for analysis. The original raw data is not included in the repository. Only code and scripts for data processing, model training, and evaluation are provided.
-### Optimization Techniques
+### 🛠️ Optimization Techniques
-#### Particle Swarm Optimization (PSO)
+#### 🐦 Particle Swarm Optimization (PSO)
+PSO is a population-based optimization algorithm inspired by the social behavior of birds or fish. Each particle represents a potential solution and adjusts its position based on its own experience and neighboring particles' performance.
-PSO is a population-based optimization algorithm inspired by the social behavior of bird flocking or fish schooling.
In PSO, each particle represents a potential solution and adjusts its position based on its own experience and that of neighboring particles.
+**Algorithm Overview**:
+![PSO Algorithm](results/pso_algorithm.png)
-### Algorithm: Particle Swarm Optimization
+---
-![PSO Algorithm](results/pso_algorithm.png)
+## 📊 Evaluation Metrics
+The models were evaluated using the following metrics:
+- **Mean Squared Error (MSE)**
+- **Mean Absolute Error (MAE)**
+- **Mean Squared Logarithmic Error (MSLE)**
+- **R-Squared (R²)**
-## Evaluation Metrics
+---
-The models are evaluated using the following metrics:
-  - Mean Squared Error (MSE)
-  - Mean Absolute Error (MAE)
-  - Mean Squared Logarithmic Error (MSLE)
-  - R-Squared (R²)
+## 📈 Results
+Key results, including performance metrics and visualizations, are saved in the `results/` folder.
-## Results
-Results, including performance metrics and plots, are stored in the `results/` folder:
+### Highlights:
+- **ARIMA Forecast**: Demonstrates trends and seasonality predictions.
+- **PSO-LSTM Performance**: Includes loss curves and predicted vs. actual plots.
+- **Evaluation Metrics**: Summary of MSE, MAE, and R² for both models.
+- **Sensitivity Analysis**: Highlights influential features affecting predictions.
+- **Batch Size Tuning**: Displays the effect of batch sizes on RMSE.
+- **Block Number Tuning**: Shows how hidden layer block counts impact RMSE.
-  - **ARIMA Forecast**: Shows the model's ability to capture trends and seasonality.
-  - **PSO-LSTM Performance**: Visualizations such as training loss curves and predicted vs. actual plots.
-  - **Evaluation Metrics**: A file summarizing the MSE, MAE, and R² scores for both models.
-  - **Sensitivity Analysis**: A plot showing the influential parameters identified through sensitivity analysis.
-  - **Batch Size Tuning**: Shows the effect of different batch sizes on RMSE.
-  - **Block Number Tuning**: Illustrates the impact of varying the number of blocks in the hidden layer on RMSE.
+---
-## Hyperparameter Optimization
+### 🎛️ Hyperparameter Optimization
 ![Batch Size Tuning](results/batch_size_rmse.png)
 ![Block Number Tuning](results/block_number_rmse.png)
-During the development of the LSTM model, hyperparameter tuning was performed to achieve the optimal settings for better prediction accuracy. The table below summarizes the optimal hyperparameter values used in the final model:
+#### Optimized Hyperparameters for PSO-LSTM:
-| Hyperparameters | Optimal Settings |
-|-------------------------------------|------------------|
-| Number of Epochs | 500 |
-| Batch Size | 2 |
-| Number of Blocks per Hidden Layer | 1 |
-| Dense Layer | 1 |
-| Learning Rate | 0.1 |
-| Dropout Ratio | 0.7 |
-| Optimizer | Adam |
-| Activation Function | Hyperbolic Tangent|
-| Training Loss | 0.0153 |
-| Validation Loss | 0.0198 |
+| Hyperparameter | Optimal Setting |
+|------------------------------------|------------------|
+| **Number of Epochs** | 500 |
+| **Batch Size** | 2 |
+| **Blocks per Hidden Layer** | 1 |
+| **Dense Layer** | 1 |
+| **Learning Rate** | 0.1 |
+| **Dropout Ratio** | 0.7 |
+| **Optimizer** | Adam |
+| **Activation Function** | Hyperbolic Tangent |
+| **Training Loss** | 0.0153 |
+| **Validation Loss** | 0.0198 |
-### Model Loss Curve
+---
-The following plot shows the Training and Validation loss over 500 epochs:
+### 📉 Model Loss Curve
+The plot below shows **Training** and **Validation Loss** over 500 epochs:
 ![Model Loss Curve](results/pso_lstm_loss_curve.png)
-### Insights
+#### Key Insights:
+- **Convergence:** Loss decreases sharply during initial epochs, indicating effective learning.
+- **Stabilization:** Loss stabilizes after ~40 epochs, suggesting the model has converged.
+- **Generalization:** Minimal gap between training and validation loss, indicating low overfitting.
+- **Validation Trends:** Slight fluctuations in validation loss, but consistent overall.
-- **Convergence:** Both the training and validation loss decrease significantly during the initial epochs, indicating that the model is learning effectively and reducing errors.
-- **Stabilization:** After around 30-40 epochs, the losses begin to stabilize, suggesting that the model has reached a plateau and is no longer making substantial improvements.
-- **Close Gap Between Training and Validation Loss:** The training and validation losses remain close throughout the training process, indicating good generalization and minimal overfitting.
-- **Validation Loss Trends:** The slight fluctuations in the validation loss suggest some variations in performance on the validation set, but the overall trend remains consistent with the training loss.
+---
-These observations suggest that the model is well-tuned and exhibits a good balance between fitting the training data and generalizing to unseen validation data.
+### 🔍 LSTM Model Comparison
+The plot below compares predictions from **Univariate** and **Multivariate LSTM models** against actual heavy metal concentrations:
-### LSTM Model Comparison
+![LSTM Comparison](results/lstm_comparison.png)
-The figure below compares the predictions made by Univariate and Multivariate LSTM models against the actual heavy metal concentration.
+---
-![LSTM Comparison](results/lstm_comparison.png)
+### ⚔️ ARIMA vs. LSTM Performance
+#### Key Observations:
+- **ARIMA**:
+  - Excels at capturing **linear trends** and **seasonality**.
+  - Struggles with rapid fluctuations and non-linear relationships.
+- **LSTM**:
+  - Handles **non-linear dependencies** and **abrupt changes** better.
+  - Slightly outperforms ARIMA during rapid concentration changes.
-### ARIMA vs.
LSTM Performance
-In this project, both ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short-Term Memory) models were employed to forecast the heavy metal concentration in wastewater. The results were compared to evaluate the effectiveness of each model in time series forecasting:
+#### Comparison Plot:
+The following plot compares ARIMA and LSTM predictions against actual values:
-- **ARIMA Model**: The ARIMA model demonstrated the capability to capture linear patterns in the time series data. It performed well for predicting the general trend and seasonality of the heavy metal concentration. However, ARIMA struggled with rapid fluctuations and non-linear relationships within the dataset, leading to some inaccuracies during sudden changes.
+![ARIMA vs. LSTM Comparison](results/arima_vs_lstm_comparison.png)
-- **LSTM Model**: The LSTM model, with its ability to learn long-term dependencies and handle non-linear relationships, was better suited for capturing abrupt changes in heavy metal concentration. It provided more accurate predictions during periods of rapid change. However, the overall performance was similar to ARIMA in terms of capturing the main trend and seasonal variations.
+- **Conclusion:** Both models closely follow trends, but LSTM demonstrates an edge in handling rapid changes, making it more suitable for dynamic datasets.
-- **Comparison Plot**: The plot below compares the predictions of ARIMA and LSTM against the actual heavy metal concentration values. Both models closely follow the actual trends, but LSTM shows a slight edge in accuracy, especially during rapid changes in concentration.
+# 📊 Performance Metrics
-![ARIMA vs.
LSTM Comparison](results/arima_vs_lstm_comparison.png)
+---
+
+## 🔍 Predictive Index of Heavy Metal Concentration
-# Performance Metrics
+| Methodology | Prediction Average | Prediction MSE | Prediction MAE | Prediction MSLE | R² |
+|------------------|--------------------|----------------|----------------|-----------------|----------|
+| **PSO-LSTM** | 0.048 | 0.021 | 0.064 | 0.0069 | 85% |
+| **LSTM** | 0.049 | 0.010 | 0.096 | 0.0025 | 85% |
+| **PSO-ARIMA** | 0.125 | 0.053 | 0.025 | 0.0373 | 90% |
+| **Univariate LSTM** | 0.120 | 0.020 | 0.070 | 0.0015 | 98% |
-## Predictive Index of Heavy Metal Concentration
+### ✨ Key Observations:
+- **PSO-LSTM** and **LSTM** models exhibit comparable R² values of **85%**, showing similar ability to explain variance in the data. However, **LSTM** achieves a lower MSE, reflecting smaller average prediction errors.
+- **PSO-ARIMA** performs better in terms of **R² (90%)**, but its higher MSE and MSLE suggest larger prediction errors, especially during abrupt changes.
+- **Univariate LSTM** has the highest **R² (98%)**, indicating that it explains nearly all the variability in the data. It also achieves the lowest MSLE, making it the best performer in terms of logarithmic error.
-  | Methodology | Prediction Average | Prediction MSE | Prediction MAE | Prediction MSLE | R Square |
-  |------------------|--------------------|----------------|----------------|-----------------|----------|
-  | PSO-LSTM | 0.048 | 0.021 | 0.064 | 0.0069 | 85% |
-  | LSTM | 0.049 | 0.01 | 0.096 | 0.0025 | 85% |
-  | PSO-ARIMA | 0.125 | 0.053 | 0.025 | 0.0373 | 90% |
-  | Univariate LSTM | 0.120 | 0.02 | 0.07 | 0.0015 | 98% |
+---
-### Key Observations:
-  - PSO-LSTM and LSTM models show similar R² values of 85%, indicating comparable accuracy in explaining the variance in the data. However, the LSTM has a lower MSE, suggesting it has smaller prediction errors on average.
-  - PSO-ARIMA performs better in terms of R² (90%) but shows a higher MSE and MSLE, suggesting that while it captures the trend well, its prediction errors may be larger than the LSTM models.
-  - Univariate LSTM has the highest R² at 98%, indicating that it explains almost all the variability in the data. It also has the lowest MSLE, making it the best performer in terms of logarithmic error.
+## ⚖️ Performance Metric Comparison Across Methods
-## Performance Metric of Comparison Between All Methods
+| Analytical Methods | Training MSE | Testing MSE | MAE | MSLE |
+|--------------------|--------------|-------------|-------|--------|
+| **PSO-LSTM** | 0.048 | 0.020 | 0.064 | 0.0069 |
+| **PSO-ARIMA** | 0.238 | 0.238 | 0.025 | 0.0373 |
+| **Grid Search** | 0.203 | 0.139 | 0.096 | 0.0025 |
+| **LSTM** | 0.203 | 0.139 | 0.096 | 0.0025 |
-  | Analytical Methods | Training MSE | Testing MSE | MAE | MSLE |
-  |--------------------|--------------|-------------|-------|--------|
-  | PSO-LSTM | 0.048 | 0.020 | 0.064 | 0.0069 |
-  | PSO-ARIMA | 0.238 | 0.238 | 0.025 | 0.0373 |
-  | Grid Search | 0.203 | 0.139 | 0.096 | 0.0025 |
-  | LSTM | 0.203 | 0.139 | 0.096 | 0.0025 |
+### ✨ Key Observations:
+- **PSO-LSTM** achieves the **lowest testing MSE (0.020)**, demonstrating its strong generalization capability to unseen data.
+- **PSO-ARIMA** exhibits a significantly higher MSE in both training and testing, indicating possible overfitting or difficulty in handling complex data patterns.
+- **Grid Search** and plain **LSTM models** show similar performance, with moderate testing MSE and MAE. Both models excel with the lowest MSLE, highlighting their effectiveness in minimizing relative errors.
-### Key Observations:
-  - PSO-LSTM shows the lowest testing MSE (0.020), indicating strong generalization to new data compared to other methods.
-  - PSO-ARIMA has a significantly higher MSE, both in training and testing, which may indicate overfitting or difficulty in capturing the complexity of the data.
-  - Grid Search LSTM and the plain LSTM models exhibit similar performance, with moderate testing MSE and MAE values. Both models have the lowest MSLE among all methods, highlighting their ability to handle smaller relative errors effectively.
+---
-![ Performance Metric Comparison](results/metrics.png)
+### 📊 Performance Metric Comparison
+![Performance Metric Comparison](results/metrics.png)
-## Sensitivity Analysis
+## 🔍 Sensitivity Analysis
-The following plot shows the impact of different features on predicting heavy metal concentration, measured by the change in Mean Squared Error (MSE) when each feature is excluded:
+The plot below shows the impact of excluding each feature on the prediction of heavy metal concentrations, measured by changes in **Mean Squared Error (MSE)**:
 ![Sensitivity Analysis Plot](results/sensitivity_analysis_plot.png)
-### Insights
+---
+
+### ✨ Insights
+
+- **Key Influential Features**:
+  - **Electrical Conductivity**: Highest impact with an MSE change of ~0.35, indicating it is a critical predictor strongly correlated with heavy metal concentrations.
+  - **Chemical A**: Significant impact with an MSE change of ~0.30, highlighting its importance in influencing predictions.
+
+- **Moderate Impact Features**:
+  - **pH_ORP**: MSE change of ~0.22, reflecting a moderate contribution to the model’s performance.
+  - **Chemical B**: MSE change of ~0.18, suggesting relevance but less impactful than top factors.
-- **Key Influential Features:**
-  - **Electrical Conductivity** has the highest impact, with a change in MSE of approximately 0.35. This indicates that it is a crucial predictor and highly correlated with the heavy metal concentration.
-  - **Chemical A** is also significant, showing a change in MSE around 0.30.
This suggests that its concentration or dosage greatly influences the prediction.
+- **Less Influential Features**:
+  - **Heavy Metal Input Concentration**: MSE change of ~0.12, contributing to predictions but with a smaller impact.
+  - **pH**: Least impact with an MSE change of ~0.08, showing a minor role compared to other features.
-- **Moderate Impact Features:**
-  - **pH_ORP** and **Chemical A_ORP** have a moderate effect on the prediction, with changes in MSE of approximately 0.22 and 0.15, respectively.
-  - **Chemical B** shows a change in MSE of about 0.18, indicating it is an important predictor but less impactful than the top factors.
+---
-- **Less Influential Features:**
-  - **Heavy Metal Input concentration** and **pH** show smaller changes in MSE, around 0.12 and 0.08, respectively. While they still contribute to the prediction, their impact is comparatively lower.
+### 🛠️ Recommendations
-### Recommendations
+1. **Prioritize Key Features**:
+  - Focus on monitoring and controlling **Electrical Conductivity** and **Chemical A** to improve predictive accuracy.
-- **Focus on Key Features:** Given the significant influence of electrical conductivity and Chemical A, these variables should be prioritized for monitoring and control to improve predictive accuracy.
-- **Feature Engineering:** Consider adding interaction terms or derived features involving electrical conductivity, Chemical A, and pH_ORP to better capture complex relationships.
-- **Further Investigation:** Explore why features like Heavy Metal Input concentration and pH have a lower impact, as this could reveal additional insights into the process.
+2. **Feature Engineering**:
+  - Create interaction terms or derived features (e.g., Electrical Conductivity × pH_ORP) to capture complex relationships.
-These insights can guide the selection and engineering of features to enhance the model's predictive performance.
+3.
**Further Analysis**:
+  - Investigate why features like **pH** and **Heavy Metal Input Concentration** have lower impacts, potentially uncovering additional process insights.
-### Conclusion
-While both models demonstrated similar overall accuracy, the LSTM model's ability to handle non-linear relationships and rapid changes in data makes it a slightly better choice for this time series forecasting task. The sensitivity analysis revealed that certain features, such as electrical conductivity and Chemical A concentration, have a significant impact on the prediction of heavy metal concentration. These findings suggest that focusing on the most influential features could further improve model accuracy. Incorporating interaction terms or engineered features based on these key variables may also enhance predictive performance.
+---
-For future work, combining the strengths of ARIMA and LSTM could be a promising approach. While ARIMA is effective for modeling linear trends and seasonality, LSTM excels in capturing complex patterns and sudden changes. A hybrid model leveraging these complementary strengths, along with the insights gained from sensitivity analysis, may yield even better forecasting results.
+## 🧪 Conclusion
-## Recommendations for Real-World Application
+- **Model Comparison**:
+  - LSTM handles non-linear relationships and abrupt changes better than ARIMA, making it more suitable for dynamic datasets.
+  - ARIMA is effective for linear trends and seasonality but struggles with rapid fluctuations.
-The PSO-LSTM model can be effectively used in industrial wastewater treatment to optimize processes and enhance decision-making. Here are some key recommendations:
+- **Sensitivity Analysis Insights**:
+  - **Electrical Conductivity** and **Chemical A** are the most influential features.
+  - Incorporating interaction terms or engineered features can further enhance predictions.
+
+- **Future Work**:
+  - Explore a hybrid ARIMA-LSTM model to combine the strengths of both approaches. ARIMA can handle trends and seasonality, while LSTM excels at capturing non-linear patterns and sudden changes.
+
+---
+
+## 🌟 Recommendations for Real-World Applications
 ### 1. **Real-Time Monitoring and Control**
-  - **Integration with Monitoring Systems:** Use the model for real-time predictions of heavy metal concentrations to dynamically adjust treatment parameters, such as chemical dosage and flow rate.
-  - **Automated Control Systems:** Enable automated process adjustments to maintain compliance with environmental standards and optimize treatment efficiency.
+  - Integrate the model with monitoring systems for dynamic adjustments to chemical dosage and flow rates.
+  - Automate process adjustments to maintain compliance and optimize efficiency.
 ### 2. **Early Warning System**
-  - **Anomaly Detection:** Identify unexpected spikes in heavy metal levels to prevent potential regulatory violations or equipment failures.
-  - **Compliance Monitoring:** Predict when heavy metal concentrations approach regulatory limits and take preventive action.
+  - Detect anomalies and spikes in heavy metal levels to prevent regulatory violations or equipment issues.
+  - Proactively manage compliance thresholds with predictive alerts.
 ### 3. **Optimizing Chemical Usage**
-  - **Predictive Dosing Optimization:** Forecast the necessary chemical dosage to maintain target heavy metal levels, reducing chemical costs.
-  - **Focus on Key Influencers:** Utilize sensitivity analysis results to prioritize influential parameters (e.g., electrical conductivity) for better control.
+  - Use predictions to optimize chemical dosages, reducing costs while maintaining target levels.
+  - Focus on influential parameters like **Electrical Conductivity** to improve control.
 ### 4.
**Scenario Analysis**
-  - **Simulate Operating Conditions:** Evaluate the impact of different treatment strategies and operating conditions on heavy metal removal efficiency.
-  - **Assess Upgrades:** Predict the effects of potential system upgrades or changes in the treatment process.
+  - Simulate different operating conditions to evaluate treatment efficiency.
+  - Predict the impact of system upgrades or process changes on performance.
 ### 5. **Deployment Considerations**
-  - **Model Retraining:** Periodically update the model with new data to maintain accuracy.
-  - **Cloud vs. Edge Deployment:** Choose deployment based on latency requirements: cloud for centralized processing or edge for faster local predictions.
+  - Retrain the model periodically with updated data to maintain accuracy.
+  - Select deployment strategy:
+    - **Cloud Deployment**: For centralized processing.
+    - **Edge Deployment**: For low-latency local predictions.
+
+---
-These recommendations help guide the practical use of the PSO-LSTM model for optimizing wastewater treatment and ensuring compliance with regulatory standards.
+## 📜 License
+This project is licensed under the MIT License. See the `LICENSE` file for details.
-## License
-This project is licensed under the MIT License - see the LICENSE file for details.
+---
-## Acknowledgments
-This project was conducted as part of a master's thesis in collaboration with a leading company in the environmental protection and energy-saving sector, and with guidance from Professor Huang Hao, Yuan Ze University.
-The original dataset was modified to protect proprietary information, following advice from the project supervisor.
+## 🙏 Acknowledgments
+This project was part of a master’s thesis conducted in collaboration with a leading environmental protection and energy-saving company. Special thanks to **Professor Huang Hao** of Yuan Ze University for guidance.
The original dataset was modified to protect proprietary information, following recommendations from the project supervisor.
\ No newline at end of file
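Review note: the README lists four evaluation metrics (MSE, MAE, MSLE, R²) without showing how they are computed. The snippet below is a minimal sketch using their standard definitions, not the repository's evaluation script. The function names and the sample values in `actual` and `predicted` are made up for illustration and are not results from the project. MSLE is written with `log1p`, which assumes non-negative values such as concentrations.

```python
import math


def mse(y_true, y_pred):
    """Mean Squared Error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def msle(y_true, y_pred):
    """Mean Squared Logarithmic Error: MSE computed on log(1 + value)."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R²: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up example values, not data from the project.
actual = [0.10, 0.20, 0.30, 0.40]
predicted = [0.10, 0.25, 0.25, 0.40]

print("MSE :", mse(actual, predicted))
print("MAE :", mae(actual, predicted))
print("MSLE:", msle(actual, predicted))
print("R2  :", r_squared(actual, predicted))
```

Note that `r_squared` returns a fraction, whereas the README's tables report R² as a percentage.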