
Conversation

@davidancor

UofT-DSI | LCR- Assignment 1

What changes are you trying to make? (e.g. Adding or removing code, refactoring existing code, adding reports)

We're completing assignment 1: data inspection, standardization and data-splitting, model initialization and cross-validation, and model evaluation.

What did you learn from the changes you have made?

I learned the importance of data standardization and proper data splitting. Splitting ensures the model is trained only on the training set and tuned against the validation set; only once that is complete is the final model evaluated against the test set. I also learned that GridSearchCV can automate cross-validation to identify the best number of neighbours k for the data set (in this case, best k = 7), and that knn.score() returns accuracy (not recall), while accuracy_score(y_true, y_pred) is the explicit way to calculate the same thing from predictions.
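For context, a rough sketch of that workflow (standardize the predictors, split, and let GridSearchCV pick k), using placeholder file and column names rather than the assignment's actual dataset:

```python
# Minimal sketch of the workflow; "data.csv" and the "class" column are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("data.csv")
X = df.drop(columns=["class"])   # predictors only
y = df["class"]                  # class column left untouched

# Hold out a final test set first so tuning never sees it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize the predictors (scaler fit on the training data only).
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Cross-validated grid search over k to find the best number of neighbours.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train_scaled, y_train)
print(grid.best_params_)  # e.g. {'n_neighbors': 7}
```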

Was there another approach you were thinking about making? If so, what approach(es) were you thinking of?

When looking at the assignment I saw the predictors were standardized but the classes were not. My concern was an index mismatch when recombining the class column with the standardized predictors; I solved this by standardizing only the predictor columns and keeping the class column untouched.
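A minimal illustration of that fix, assuming df is the loaded DataFrame and "class" stands in for the label column: scaling only the predictor columns and writing the result back with the original index keeps everything aligned.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Placeholder names: df is the loaded DataFrame, "class" is the label column.
predictor_cols = [c for c in df.columns if c != "class"]

scaled = df.copy()
# Re-wrapping the scaled array in a DataFrame with the original index
# keeps the rows lined up with the untouched class column.
scaled[predictor_cols] = pd.DataFrame(
    StandardScaler().fit_transform(df[predictor_cols]),
    columns=predictor_cols,
    index=df.index,
)
```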

Also, I initially used knn.score() to measure model accuracy; however, since we were told to use "accuracy_score", I reworked that step. I also considered using precision, but the assignment specifically asks for accuracy_score.

Were there any challenges? If so, what issue(s) did you face? How did you overcome it?

At first it wasn’t super obvious why we don’t just evaluate everything on the test set once it exists. I realized the point of the test set is that it’s the “final exam” — you shouldn’t touch it during tuning, or you’ll start picking k based on it (aka leakage). So: tune using CV on training, test only once at the end.
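Continuing the hypothetical variables from the sketch above, that ordering looks roughly like this; the test set is only touched in the last line.

```python
# GridSearchCV has already done the tuning on the training data via cross-validation;
# best_estimator_ is refit on the full training set by default.
best_knn = grid.best_estimator_

# The held-out test set is evaluated exactly once, at the very end.
print(f"Final test accuracy: {best_knn.score(X_test_scaled, y_test):.3f}")
```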

I assumed knn.score() was mixing accuracy and recall or doing something more “advanced.” After looking into it, I learned knn.score() is just accuracy for classification; accuracy_score is the same idea, calculated explicitly from y_true and y_pred.
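A quick check that the two agree, reusing the hypothetical fitted model from above:

```python
from sklearn.metrics import accuracy_score

# For classifiers, .score() computes accuracy internally;
# accuracy_score makes the same calculation explicit from predictions.
y_pred = best_knn.predict(X_test_scaled)
assert best_knn.score(X_test_scaled, y_test) == accuracy_score(y_test, y_pred)
```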

How were these changes tested?

Printed/checked the dataset after loading (head/info/shape) to make sure it looks right.

Confirmed the predictors were standardized and that the class column's index still lined up after standardization.
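In code, those checks amount to something like the following (same placeholder names as the earlier sketches):

```python
# Basic inspection after loading.
print(df.shape)
print(df.head())
df.info()

# After standardization the predictor columns should have mean ~0 and std ~1,
# and the index should still match the untouched class column.
print(scaled[predictor_cols].mean().round(3))
print(scaled[predictor_cols].std().round(3))
assert scaled.index.equals(df.index)
```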

A reference to a related issue in your repository (if applicable)

Checklist

  • [x] I can confirm that my changes are working as intended

@juliagallucci
Collaborator

accidentally sent PR to main
