Codecademy's Machine Learning Career Path - Final Project (End-to-end ML Pipeline)
- Implemented a neural network model to improve prediction capability using deep learning.
- Trained a multi-layer perceptron (MLP) with ReLU activation functions and a linear output layer.
- Achieved a test MSE of 0.0215 and a test R² of 0.9931, competitive with the Random Forest model.
- Compared the Random Forest and neural network models on accuracy and computational efficiency.
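
A minimal sketch of such a network, assuming scikit-learn's `MLPRegressor` (ReLU hidden layers, identity output). The framework, layer sizes, and data actually used in the project are not stated here, so the synthetic data and `hidden_layer_sizes` below are placeholders:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wave dataset (hypothetical features and target).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 2.0 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training split only, so no test statistics leak in.
scaler = StandardScaler().fit(X_train)

# ReLU in the hidden layers; MLPRegressor's output layer is linear (identity).
mlp = MLPRegressor(
    hidden_layer_sizes=(32, 16),  # assumed sizes, not taken from the project
    activation="relu",
    max_iter=1000,
    random_state=0,
)
mlp.fit(scaler.transform(X_train), y_train)

y_pred = mlp.predict(scaler.transform(X_test))
mse_test = mean_squared_error(y_test, y_pred)
r2_test = r2_score(y_test, y_pred)
print(f"MSE test: {mse_test:.4f}, R2 test: {r2_test:.4f}")
```

The scores printed here describe the synthetic data only and are not comparable to the figures reported above.
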
- Predict the height of ocean waves using machine learning techniques.
- Source: Global Ocean Waves Analysis and Forecast
- Variable of Interest: Sea surface wave maximum height (VCMX) in meters.
- Build an end-to-end ML pipeline to predict the height of waves based on oceanographic and meteorological features.
- Preprocessing:
  - Features extracted from the dataset: latitude, longitude, significant wave height, swell characteristics, wind wave characteristics, etc.
  - Handled missing values and standardized the data.
- Dimensionality Reduction:
  - Applied PCA to reduce the number of features while retaining most of the variance.
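
The preprocessing and PCA steps above could be chained roughly as follows. This is a sketch under assumptions: median imputation and a 95% explained-variance target are illustrative choices, not settings confirmed by the project, and the data is synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix with one nearly redundant column and some gaps.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 5] = X[:, 0] + 0.01 * rng.normal(size=200)
X[rng.random(X.shape) < 0.05] = np.nan

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # assumed imputation strategy
    ("scale", StandardScaler()),                   # zero mean, unit variance
    ("pca", PCA(n_components=0.95)),               # keep >= 95% of the variance
])

X_reduced = preprocess.fit_transform(X)
explained = preprocess.named_steps["pca"].explained_variance_ratio_.sum()
print(X_reduced.shape, round(float(explained), 3))
```

Passing a float to `n_components` lets PCA pick the smallest number of components whose cumulative explained variance exceeds that fraction.
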
- Model:
  - Original Approach: Random Forest Regressor with hyperparameter tuning using GridSearchCV with cross-validation folds.
  - Improved Approach: Transitioned to `RandomForestLearner` from YDF (Yggdrasil Decision Forests, the Google library that underlies TensorFlow Decision Forests) for better efficiency and compatibility with large datasets.
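
For reference, the original scikit-learn approach might look like the sketch below. The parameter grid, fold count, and data are hypothetical, since the grid actually searched is not listed here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the wave features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=300)

# Hypothetical grid; the values searched in the project are not documented.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [5, 20],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,          # 3-fold cross-validation for every combination
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Every combination is refit once per fold, which is why an exhaustive grid becomes expensive on large datasets.
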
- Optimized Hyperparameters:
  - After extensive hyperparameter tuning, we determined the following best parameters for the Random Forest Learner: `{ 'num_trees': 50, 'max_depth': 20, 'min_examples': 2 }`
- Avoided Redundant Training:
  - By using these hyperparameters directly, we skipped retraining for less promising combinations, significantly reducing computation time.
- Transition to YDF:
  - Replaced scikit-learn's Random Forest implementation with `ydf.RandomForestLearner` for compatibility with large datasets and efficient tree-based modeling.
- Improved Pipeline:
  - Modified the ML pipeline to include PCA and scaling while adapting the training process to work seamlessly with YDF.
- Previous Model:
  - Simple Linear Regression: R² = 0.6079
  - Random Forest Regressor (scikit-learn implementation): R² = 0.9999
- Current Model:
  - `RandomForestLearner`: R² = 0.99996, achieving near-perfect predictions with reduced training time.
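
For readers unfamiliar with the metrics quoted above, MSE and R² can be computed from predictions as in this small sketch. The sample values are made up for illustration and are not project results:

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical wave heights in meters.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print(f"MSE = {mse(actual, predicted):.3f}, R2 = {r2(actual, predicted):.3f}")
```

An R² near 1 means the residual error is small relative to the variance of the target, which is why values such as 0.9999 and 0.99996 differ mainly in the remaining unexplained fraction.
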
- Predicted vs. Actual Values:
  - Visualized the relationship between predicted and actual wave heights, showing a strong correlation.
- Feature Importance:
  - The model identified significant wave height and wind wave characteristics as key predictors.
- Integrate GPU acceleration for larger datasets.
- Experiment with other tree-based algorithms such as Gradient Boosted Trees (GBT) in YDF.
- Automate hyperparameter tuning using Bayesian Optimization or similar techniques.