UofT-DSI | LCR- Assignment 1 #319
What changes are you trying to make? (e.g. Adding or removing code, refactoring existing code, adding reports)
We're completing assignment 1: data inspection, standardization and data-splitting, model initialization and cross-validation, and model evaluation.
What did you learn from the changes you have made?
I learned the importance of data standardization and of proper data splitting. Splitting ensures the model is trained only on the training set and validated against the validation set; only once that is complete is the final model evaluated against the test data. I also learned that GridSearchCV can automate cross-validation to identify the best number of neighbours k for the data set (in this case, best k = 7); a sketch of that step is below. Finally, I learned that knn.score() returns accuracy (not recall), and that accuracy_score(y_true, y_pred) is the explicit way to calculate the same thing from predictions.
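A minimal sketch of that GridSearchCV step, using synthetic stand-in data (the assignment's actual data set and the candidate k range are assumptions here, not taken from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the (already standardized) training predictors and classes.
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=42)

# GridSearchCV runs 5-fold cross-validation for each candidate k on the training set only.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 16))},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

print(grid.best_params_)  # the k with the highest mean CV accuracy, e.g. {'n_neighbors': 7}
print(grid.best_score_)   # that k's mean cross-validated accuracy
```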
Was there another approach you were thinking about making? If so, what approach(es) were you thinking of?
When looking at the assignment I saw that the predictors were standardized but the classes were not. I was concerned about an index mismatch when recombining the class column with the standardized predictors, but I solved this by standardizing only the predictor columns and leaving the class column untouched (sketched below).
I also initially used knn.score() to measure model accuracy; however, since we were told to use accuracy_score, I redid that step. I considered using precision as well, but the assignment specifically asks for accuracy_score.
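Roughly what the predictors-only standardization looks like, on a toy frame (the column names and values here are made up, not the assignment's data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the assignment's data: predictor columns plus a "class" label.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [10.0, 20.0, 30.0, 40.0],
    "class":     ["yes", "no", "yes", "no"],
})

predictor_cols = df.columns.drop("class")

# Standardize only the predictors; rebuilding with the original index keeps the
# untouched class column aligned row-for-row when the pieces are recombined.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[predictor_cols]),
    columns=predictor_cols,
    index=df.index,
)
standardized_df = pd.concat([scaled, df["class"]], axis=1)
print(standardized_df)
```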
Were there any challenges? If so, what issue(s) did you face? How did you overcome it?
At first it wasn't obvious why we don't just evaluate everything on the test set once it exists. I realized the test set is the "final exam": you shouldn't touch it during tuning, or you'll start picking k based on it (i.e. leakage). So: tune using cross-validation on the training set, and test only once at the end (see the sketch after the next paragraph).
I assumed knn.score() was mixing accuracy and recall or doing something more "advanced". After looking into it, I learned that for classification knn.score() is just accuracy; accuracy_score is the same idea, calculated explicitly from y_true and y_pred.
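A small sketch of both points, again on synthetic stand-in data and with k = 7 carried over from the tuning step above: the test set is split off first and scored exactly once, and knn.score() agrees with accuracy_score() computed from explicit predictions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data; the assignment's real data set is assumed here.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Carve off the "final exam" set up front; it plays no part in choosing k.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit with the k chosen earlier via cross-validation on the training data.
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

# Touch the test set once; for classifiers, .score() is mean accuracy,
# so it matches accuracy_score on the same predictions.
print(knn.score(X_test, y_test))
print(accuracy_score(y_test, knn.predict(X_test)))
```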
How were these changes tested?
Printed and checked the dataset after loading (head/info/shape) to make sure it looked right.
Confirmed the standardized predictors and the class column remained index-aligned after standardization (see the checks sketched below).
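Concretely, continuing from the toy standardization sketch above (the names df and standardized_df come from that sketch, not the actual notebook), the checks amounted to something like:

```python
# Inspect the loaded frame.
print(df.shape)
print(df.head())
df.info()

# Confirm standardization did not reorder rows or alter the class column.
assert standardized_df.index.equals(df.index)
assert standardized_df["class"].equals(df["class"])
```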