Machine Learning

Pre-processing

Handling Missing Data: Filling missing values (e.g., using mean, median, mode, or interpolation).
Data Cleaning: Removing duplicates, fixing incorrect labels, correcting inconsistencies.
Scaling/Normalization: Standardizing or normalizing numerical features to ensure consistency.
Encoding Categorical Variables: Converting categorical data into numerical form (e.g., one-hot encoding, label encoding).
Handling Outliers: Removing or transforming extreme values that may distort the model.
Splitting Data: Dividing data into training, validation, and test sets.
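
As a rough sketch of the splitting step, assuming NumPy arrays and a 70/15/15 split (the proportions and the helper name `train_val_test_split` are illustrative choices, not something these notes prescribe):

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the row indices once, then cut them into three disjoint parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_frac))
    n_val = int(round(len(X) * val_frac))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (
        (X[train_idx], y[train_idx]),
        (X[val_idx], y[val_idx]),
        (X[test_idx], y[test_idx]),
    )

# Synthetic data: 100 rows, 2 features.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.arange(100, dtype=float)
(X_train, y_train), (X_val, y_val), (X_test, y_test) = train_val_test_split(X, y)
print(X_train.shape, X_val.shape, X_test.shape)
```

Shuffling before cutting matters when the rows are stored in some meaningful order (by date, by class, and so on).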

Scaling

Use separate scalers for X and Y
- X and Y have different distributions (different scales and meanings)
- You can scale Y if it's a regression problem. Don't scale if it's a classification problem, since it's categorical
- Tree-based models like XGBoost, Decision Trees, or Random Forests usually don't need scaling because these models are not sensitive to feature scaling
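
A minimal sketch of keeping separate scalers for X and Y in a regression setting. The `Standardizer` class is a hypothetical helper written for this example (it mimics the fit/transform pattern common in ML libraries), and the data is synthetic:

```python
import numpy as np

class Standardizer:
    """Hypothetical helper: remembers the mean/std from fit() and reuses them."""

    def fit(self, a):
        self.mean_ = a.mean(axis=0)
        self.std_ = a.std(axis=0)
        return self

    def transform(self, a):
        return (a - self.mean_) / self.std_

    def inverse_transform(self, a):
        return a * self.std_ + self.mean_

rng = np.random.default_rng(0)
# Two features on very different scales, plus a continuous target.
X_train = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(50, 2))
y_train = 3.0 * X_train[:, 0] + 0.01 * X_train[:, 1] + rng.normal(scale=0.1, size=50)

# One scaler per array: X and y keep their own statistics.
x_scaler = Standardizer().fit(X_train)
y_scaler = Standardizer().fit(y_train)
X_scaled = x_scaler.transform(X_train)
y_scaled = y_scaler.transform(y_train)

# Anything predicted in scaled space is mapped back with the y scaler only.
y_restored = y_scaler.inverse_transform(y_scaled)
```

Keeping the y scaler around is what lets predictions be reported in the original units of the target.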

Data Leakage

Split the data into training and test sets before scaling the features
- If you scale first, the mean and standard deviation used for scaling are computed from the entire dataset.
- This means that information from the test set indirectly influences the training data.
- Your model learns from statistics that it would not have access to in a real-world scenario.
- This can lead to overfitting, overly optimistic test scores, and poor generalization.
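
The order of operations can be sketched as follows, assuming plain NumPy standardization on synthetic data (the 140/60 split is an arbitrary choice for the example):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# 1) Split first.
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:140], idx[140:]
X_train, X_test = X[train_idx], X[test_idx]

# 2) Compute the scaling statistics from the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# 3) Apply those same training statistics to both sets.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

The scaled test set will generally not have exactly zero mean and unit variance, and that is expected: it is being treated like unseen data.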

Feature Engineering

PCA

Use PCA to reduce dimensionality
- Always scale the predictors before applying PCA
- PCA relies on the variance of the data to identify the principal components. If your predictors are on different scales, PCA may disproportionately weigh the features with larger scales
What's a covariance matrix?
- A covariance matrix is a square matrix that contains the covariances between pairs of variables in a dataset.
- Covariance measures the degree to which two variables change together
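
To make the link between scaling, the covariance matrix, and PCA concrete, here is a from-scratch sketch in NumPy. It assumes synthetic data and keeps k = 2 components; a library implementation would normally be used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three synthetic features; the second is correlated with the first
# and lives on a much larger scale.
a = rng.normal(size=300)
X = np.column_stack([
    a,
    1000.0 * (a + 0.3 * rng.normal(size=300)),
    rng.normal(size=300),
])

# Scale first so no feature dominates just because of its units.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix: entry (i, j) is the covariance of feature i and feature j.
cov = np.cov(X_std, rowvar=False)

# Principal components are the eigenvectors of the covariance matrix,
# ordered by decreasing eigenvalue (the variance each one explains).
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
X_reduced = X_std @ eigvecs[:, :k]
explained_ratio = eigvals[:k].sum() / eigvals.sum()
print(X_reduced.shape, explained_ratio)
```

Without the scaling step, the second feature would account for nearly all of the variance simply because of its units.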

Model Training

Analysis

Model fits the training data well but fails to generalize to new examples
- The cost is low for the training set because it fits well, but the cost for the test set will be high because it doesn't generalize well
- Split the dataset into two parts
  - 70%: training set - fit the model on this data
  - 30%: test set - evaluate the model on this data

Model Selection

Which model is better? It depends on the problem at hand. If the relationship between the features and the response is well approximated by a linear model, then an approach such as linear regression will likely work well and will outperform a method such as a regression tree, which does not exploit this linear structure. If instead there is a highly non-linear and complex relationship between the features and the response, then decision trees may outperform classical approaches.

Model Performance

Prefer choosing models that have good cross-validation and test accuracy
- The test cost estimates how well the model generalizes to new data (compared to the training cost)
- training/cross-validation/test
  - the cross-validation set is also called the dev or validation set
  - It improves the robustness and reliability of your model evaluation and hyperparameter tuning process
  - Cross-validation involves splitting your training data into multiple subsets (folds). The model is trained on a subset of these folds and then evaluated on the remaining fold. This process is repeated multiple times, with each fold serving as the validation set once. This gives you multiple performance estimates on different "held-out" portions of your training data.
  - By averaging the performance across all the validation folds, you get a more stable and less biased estimate of how well your model is likely to generalize to unseen data compared to relying on a single test set evaluation during development.
- Good Cross-Validation Accuracy: a good cross-validation accuracy indicates good stability and generalization across different subsets of data
- Good Test Accuracy: the model generalizes well on unseen data
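
The fold rotation described above can be sketched like this. It assumes a least-squares linear model and mean squared error as the cost; `k_fold_mse` is a hypothetical helper name and the data is synthetic:

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Return the validation MSE of a least-squares linear model on each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add a column of ones so the model has an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        w, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        scores.append(np.mean((y[val_idx] - A_val @ w) ** 2))
    return np.array(scores)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + rng.normal(scale=0.1, size=100)

scores = k_fold_mse(X, y, k=5)
print(scores.mean(), scores.std())
```

The mean of the fold scores is the cross-validation estimate; the spread across folds gives a rough sense of how stable that estimate is.
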
Bias/Variance tradeoff
- High bias: underfit
  - Simple model
  - If the cost of the training set is high, the costs of cross validation and test sets will also be high
  - Collecting more data doesn't help: the model is too simple to learn more from it
- High variance: overfit
  - Complex model
  - The training cost will be low and the cross validation and test costs will be high
  - Increasing the training set size can help: it tends to lower the cross validation error and bring it closer to the training error
- Balanced bias/variance: optimal
  - The costs of training, cross validation, and test will be low: it performs well
- Model complexity vs Cost
  - Training cost: as the degree of the polynomial (or the model complexity) increases, the cost decreases
  - Cross validation cost: as model complexity increases, the cost decreases up to a point; beyond it the model overfits and the cost starts to increase again
- Regularization influence in bias/variance
  - Regularization adds a penalty to the cost function that discourages the model from learning overly complex patterns and prevents overfitting
  - As the lambda increases, the bias gets higher
  - As the lambda decreases, the variance gets higher
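
A small sketch of how lambda moves a model along the bias/variance axis, assuming L2 (ridge) regularization on degree-8 polynomial features. The choice of ridge, the degree, and the lambda values are assumptions made for this example, and for simplicity the intercept is penalized too:

```python
import numpy as np

def fit_ridge(A, y, lam):
    """Closed-form ridge regression: minimize ||A w - y||^2 + lam * ||w||^2."""
    n_features = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n_features), A.T @ y)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1.0, 1.0, size=30))
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=30)

# Degree-8 polynomial features: flexible enough to overfit 30 noisy points.
A = np.vander(x, N=9, increasing=True)

lambdas = [0.001, 0.1, 10.0, 1000.0]
train_costs, weight_norms = [], []
for lam in lambdas:
    w = fit_ridge(A, y, lam)
    train_costs.append(np.mean((A @ w - y) ** 2))
    weight_norms.append(np.linalg.norm(w))

for lam, cost, norm in zip(lambdas, train_costs, weight_norms):
    print(f"lambda={lam:<8} train cost={cost:.4f}  ||w||={norm:.4f}")
```

As lambda grows, the weights shrink and the training cost rises (more bias); as lambda shrinks, the model is free to chase the noise (more variance).
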
Establishing a baseline level of performance
- Human error (or competing algorithm or guess based on prior experience) as the baseline vs Training Error vs Cross validation error: analyse gaps between these errors
- High variance: 0.2% gap between baseline and training / 4% gap between training and cross-validation (overfitting to the training data)
  - baseline: 10.6%
  - training: 10.8%
  - cross-validation: 14.8%
- High bias: 4.4% gap between baseline and training (not performing well) / 0.5% gap between training and cross-validation (performing similarly in training and cross validation)
  - baseline: 10.6%
  - training: 15%
  - cross-validation: 15.5%
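
The gap analysis above can be written as a small check. The `diagnose` helper and its 2-percentage-point tolerance are assumptions made for this sketch; what counts as a "large" gap depends on the problem:

```python
def diagnose(baseline_err, train_err, cv_err, tol=0.02):
    """Label the dominant problem from the two error gaps (errors as fractions)."""
    bias_gap = train_err - baseline_err      # baseline vs training
    variance_gap = cv_err - train_err        # training vs cross-validation
    problems = []
    if bias_gap > tol:
        problems.append("high bias")
    if variance_gap > tol:
        problems.append("high variance")
    return problems or ["looks balanced"]

# The two examples from the notes.
print(diagnose(0.106, 0.108, 0.148))  # ['high variance']
print(diagnose(0.106, 0.150, 0.155))  # ['high bias']
```

A model can also show both problems at once, in which case the helper returns both labels.
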
Debugging a learning algorithm
- Get more training examples -> fixes high variance
- Try smaller set of features -> fixes high variance
- Try getting additional features -> fixes high bias
- Try adding polynomial features -> fixes high bias
- Try decreasing the regularization term lambda -> fixes high bias
- Try increasing the regularization term lambda -> fixes high variance
In classification models, the way to measure performance is based on accuracy, precision, recall (sensitivity), specificity, and f1 score
- Precision: Out of all the instances that the model predicted as positive, how many were actually positive?
  - Precision = TP / (TP + FP)
    - TP = True positive
    - FP = False positive
  - High Precision: Indicates that when the model predicts a positive class, it is often correct. This is crucial in applications where the cost of a false positive is high.
  - Low Precision: Suggests that the model frequently predicts positive incorrectly, leading to many false alarms.
  - e.g. Cancer tumor is malignant
    - High precision: when the model predicts that a tumor is malignant, it's often correct. There's a high chance the person has malignant cancer
    - Low precision: the model predicting that a person has malignant cancer is probably incorrect, leading to false alarms, and in this particular case, anxiety
- Recall (Sensitivity): Measures the proportion of actual positives that were correctly identified.
  - Recall = TP / (TP + FN)
  - True positive: correctly identified as positive
  - False negative: incorrectly identified as negative (it's actually positive)
- Precision-Recall tradeoff
  - The higher the threshold, the higher the precision and the lower the recall
    - Predict Y=1 only if very confident, e.g. for a very rare disease
  - The lower the threshold, the higher the recall and the lower the precision
    - Avoid missing too many cases of the rare disease
  - We need to specify the threshold point
- F1 Score: The "harmonic mean" of precision and recall, providing a balance between the two.
  - F1 Score = 2 x (Precision x Recall / (Precision + Recall))
- Importance in applications: In medical diagnosis, high precision is essential for diseases where a false positive can cause unnecessary stress or treatment.
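
The formulas and the threshold tradeoff above can be checked on a toy example. The labels and scores below are made up for illustration, and `classification_metrics` is a hypothetical helper, not a library function:

```python
import numpy as np

def classification_metrics(y_true, scores, threshold=0.5):
    """Precision, recall and F1 for binary labels, predicting 1 when score >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    # Guard against division by zero when nothing is predicted/actually positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10, 0.05, 0.02]

print(classification_metrics(y_true, scores, threshold=0.5))   # precision 0.75, recall 0.75
print(classification_metrics(y_true, scores, threshold=0.9))   # precision 1.0, recall 0.25
print(classification_metrics(y_true, scores, threshold=0.25))  # precision ~0.67, recall 1.0
```

Raising the threshold from 0.5 to 0.9 removes the false positive but misses three of the four actual positives; lowering it to 0.25 catches every positive at the cost of two false alarms.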

MSE

import numpy as np

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

R²

R² (coefficient of determination): measures how well your model explains the variance in the target variable

def r2_score(Y_true, Y_pred):
    residual_sum_of_squares = np.sum((Y_true - Y_pred) ** 2)
    total_sum_of_squares = np.sum((Y_true - np.mean(Y_true)) ** 2)
    return 1 - (residual_sum_of_squares / total_sum_of_squares)

Transfer Learning

Learn parameters with an ML model on a given dataset
Download the pre-trained parameters
Train/fine-tune the model on the new data
- If the model was first trained on a large dataset, fine-tuning can be done with a smaller dataset
Training the model
- Train all model parameters
- Train only the output layer parameters, leaving the other parameters of the model fixed
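
A toy sketch of the second option (training only the output parameters), using NumPy. Everything here is a stand-in: the "pre-trained" hidden layer is just random weights, the new dataset is synthetic, and a least-squares fit replaces gradient-based fine-tuning. A real workflow would load actual pre-trained weights through a deep learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for downloaded pre-trained parameters (random here, purely illustrative).
W_hidden = rng.normal(size=(4, 16))
b_hidden = rng.normal(size=16)
W_hidden_before = W_hidden.copy()

def features(X):
    """Frozen part of the network: these parameters are never updated."""
    return np.tanh(X @ W_hidden + b_hidden)

# Small new dataset for the target task (synthetic).
X_new = rng.normal(size=(40, 4))
y_new = np.sin(X_new[:, 0]) + 0.5 * X_new[:, 1]

# Train only the output parameters: a linear layer (plus bias) on frozen features.
H = np.column_stack([features(X_new), np.ones(len(X_new))])
w_out, *_ = np.linalg.lstsq(H, y_new, rcond=None)

train_mse = np.mean((H @ w_out - y_new) ** 2)
print(train_mse)
```

Training all model parameters instead would mean also updating `W_hidden` and `b_hidden`, which usually calls for more data on the new task.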