forked from Ibra57/2023-assignment-sklearn
Commit
0 parents
commit a1957bc
Showing 7 changed files with 412 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# This workflow will install Python dependencies, run tests and lint with a single version of Python | ||
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions | ||
|
||
name: Assignment Validation | ||
|
||
on: | ||
push: | ||
branches: | ||
- 'main' | ||
|
||
pull_request: | ||
|
||
jobs: | ||
test: | ||
name: Test Code | ||
runs-on: ubuntu-latest | ||
steps: | ||
- uses: actions/checkout@v2 | ||
- name: Set up Python 3.8 | ||
uses: actions/setup-python@v2 | ||
with: | ||
python-version: 3.8 | ||
- name: Install dependencies | ||
run: | | ||
python -m pip install --upgrade pip | ||
pip install pytest | ||
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi | ||
- name: Test with pytest | ||
run: pytest -v | ||
|
||
flake8: | ||
name: Check code style | ||
runs-on: ubuntu-latest | ||
steps: | ||
- uses: actions/checkout@v2 | ||
- name: Set up Python 3.8 | ||
uses: actions/setup-python@v2 | ||
with: | ||
python-version: 3.8 | ||
- name: Install dependencies | ||
run: | | ||
pip install flake8 | ||
- name: Lint with flake8 | ||
run: | | ||
# stop the build if there are Python syntax errors or undefined names | ||
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics | ||
# fail on any remaining style error: complexity above 10 or lines longer than 80 chars | ||
flake8 . --count --max-complexity=10 --max-line-length=80 --statistics | ||
check-doc: | ||
name: Check doc style | ||
runs-on: ubuntu-latest | ||
steps: | ||
- uses: actions/checkout@v2 | ||
- name: Set up Python 3.8 | ||
uses: actions/setup-python@v2 | ||
with: | ||
python-version: 3.8 | ||
- name: Install dependencies | ||
run: | | ||
pip install pydocstyle | ||
- name: Check doc style with pydocstyle | ||
run: pydocstyle |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
__pycache__ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
BSD 2-Clause License | ||
|
||
Copyright (c) 2021, Thomas Moreau | ||
All rights reserved. | ||
|
||
Redistribution and use in source and binary forms, with or without | ||
modification, are permitted provided that the following conditions are met: | ||
|
||
1. Redistributions of source code must retain the above copyright notice, this | ||
list of conditions and the following disclaimer. | ||
|
||
2. Redistributions in binary form must reproduce the above copyright notice, | ||
this list of conditions and the following disclaimer in the documentation | ||
and/or other materials provided with the distribution. | ||
|
||
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | ||
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | ||
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE | ||
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE | ||
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | ||
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | ||
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | ||
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | ||
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE | ||
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
# Assignment 2 for the Advanced ML training @ BCG Gamma: scikit-learn API | ||
|
||
## What this assignment checks that you know how to do: | ||
|
||
- Use Git and GitHub | ||
- Work with Python files (and not just notebooks!) | ||
- Do a pull request on a GitHub repository | ||
- Format your code properly using standard Python conventions | ||
- Make your code pass tests run automatically on a continuous integration system (GitHub Actions) | ||
- Understand how to code scikit-learn compatible objects. | ||
|
||
## How? | ||
|
||
- Fork the repository by clicking the `Fork` button in the upper-right corner | ||
- Clone the repository of your fork with: `git clone https://github.com/MYLOGIN/assignment_sklearn` (replace MYLOGIN with your GitHub login) | ||
- Create a branch called `myassignment-$MYLOGIN` using `git checkout -b myassignment-$MYLOGIN` | ||
- Make the changes to complete the assignment. You have to modify the files that contain `questions` in their name. Do not modify the files that start with `test_`. | ||
- Open the pull request on GitHub | ||
- Keep pushing to your branch until the continuous integration system is green. | ||
- When it is green, notify the instructors on Slack that you are done. | ||
|
||
## Getting Help | ||
|
||
If you need help, ask on the training's Slack. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
numpy | ||
scipy | ||
scikit-learn | ||
pandas |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,182 @@ | ||
"""Assignment - making a sklearn estimator and cv splitter. | ||
The goal of this assignment is to implement by yourself: | ||
- a scikit-learn estimator for the KNearestNeighbors for classification | ||
tasks and check that it is working properly. | ||
- a scikit-learn CV splitter where the splits are based on a pandas | ||
`DatetimeIndex`. | ||
Detailed instructions for question 1: | ||
The nearest neighbor classifier predicts for a point X_i the target y_k of | ||
the training sample X_k which is the closest to X_i. We measure proximity with | ||
the Euclidean distance. The model will be evaluated with the accuracy (fraction | ||
of samples correctly classified). You need to implement the `fit`, | ||
`predict` and `score` methods for this class. The code you write should pass | ||
the tests we implemented. You can run the tests by calling | ||
`pytest test_sklearn_questions.py` at the root of the repo. | ||
Detailed instructions for question 2: | ||
The data to split should have its index or one of its columns in | ||
datetime format. The aim is to split the data into train and test | ||
sets such that, for each pair of successive months, we learn on the first and | ||
predict on the following one. For example, if you have data distributed from | ||
November 2020 to March 2021, you have 4 splits. The first split | ||
allows learning on November data and predicting on December data, the | ||
second split learning on December data and predicting on January data, etc. | ||
We also ask you to respect the PEP 8 convention: https://pep8.org. This will be | ||
enforced with `flake8`. You can check that there are no flake8 errors by | ||
calling `flake8` at the root of the repo. | ||
Finally, you need to write docstrings for the methods you code and for the | ||
class. The docstrings will be checked using `pydocstyle`, which you can also | ||
call at the root of the repo. | ||
Hints | ||
----- | ||
- You can use the function: | ||
from sklearn.metrics.pairwise import pairwise_distances | ||
to compute distances between 2 sets of samples. | ||
""" | ||
import numpy as np | ||
import pandas as pd | ||
|
||
from sklearn.base import BaseEstimator | ||
from sklearn.base import ClassifierMixin | ||
|
||
from sklearn.model_selection import BaseCrossValidator | ||
|
||
from sklearn.utils.validation import check_X_y, check_is_fitted | ||
from sklearn.utils.validation import check_array | ||
from sklearn.utils.multiclass import check_classification_targets | ||
from sklearn.metrics.pairwise import pairwise_distances | ||
|
||
|
||
class KNearestNeighbors(BaseEstimator, ClassifierMixin): | ||
"""KNearestNeighbors classifier.""" | ||
|
||
def __init__(self, n_neighbors=1): # noqa: D107 | ||
self.n_neighbors = n_neighbors | ||
|
||
def fit(self, X, y): | ||
"""Fitting function. | ||
Parameters | ||
---------- | ||
X : ndarray, shape (n_samples, n_features) | ||
training data. | ||
y : ndarray, shape (n_samples,) | ||
target values. | ||
Returns | ||
------- | ||
self : instance of KNearestNeighbors | ||
The current instance of the classifier | ||
""" | ||
return self | ||
|
||
def predict(self, X): | ||
"""Predict function. | ||
Parameters | ||
---------- | ||
X : ndarray, shape (n_test_samples, n_features) | ||
Test data to predict on. | ||
Returns | ||
------- | ||
y : ndarray, shape (n_test_samples,) | ||
Class labels for each test data sample. | ||
""" | ||
y_pred = np.zeros(X.shape[0]) | ||
return y_pred | ||
|
||
def score(self, X, y): | ||
"""Calculate the score of the prediction. | ||
Parameters | ||
---------- | ||
X : ndarray, shape (n_samples, n_features) | ||
Data to score the model on. | ||
y : ndarray, shape (n_samples,) | ||
target values. | ||
Returns | ||
------- | ||
score : float | ||
Accuracy of the model computed for the (X, y) pairs. | ||
""" | ||
return 0. | ||
|
||
|
||
class MonthlySplit(BaseCrossValidator): | ||
"""CrossValidator based on monthly split. | ||
Split data based on the given `time_col` (or default to index). Each split | ||
corresponds to one month of data for the training and the next month of | ||
data for the test. | ||
Parameters | ||
---------- | ||
time_col : str, defaults to 'index' | ||
Column of the input DataFrame that will be used to split the data. This | ||
column should be of type datetime. If split is called with a DataFrame | ||
for which this column is not a datetime, it will raise a ValueError. | ||
To use the index as column just set `time_col` to `'index'`. | ||
""" | ||
|
||
def __init__(self, time_col='index'): # noqa: D107 | ||
self.time_col = time_col | ||
|
||
def get_n_splits(self, X, y=None, groups=None): | ||
"""Return the number of splitting iterations in the cross-validator. | ||
Parameters | ||
---------- | ||
X : array-like of shape (n_samples, n_features) | ||
Training data, where `n_samples` is the number of samples | ||
and `n_features` is the number of features. | ||
y : array-like of shape (n_samples,) | ||
Always ignored, exists for compatibility. | ||
groups : array-like of shape (n_samples,) | ||
Always ignored, exists for compatibility. | ||
Returns | ||
------- | ||
n_splits : int | ||
The number of splits. | ||
""" | ||
return 0 | ||
|
||
def split(self, X, y, groups=None): | ||
"""Generate indices to split data into training and test set. | ||
Parameters | ||
---------- | ||
X : array-like of shape (n_samples, n_features) | ||
Training data, where `n_samples` is the number of samples | ||
and `n_features` is the number of features. | ||
y : array-like of shape (n_samples,) | ||
Always ignored, exists for compatibility. | ||
groups : array-like of shape (n_samples,) | ||
Always ignored, exists for compatibility. | ||
Yields | ||
------ | ||
idx_train : ndarray | ||
The training set indices for that split. | ||
idx_test : ndarray | ||
The testing set indices for that split. | ||
""" | ||
|
||
n_samples = X.shape[0] | ||
n_splits = self.get_n_splits(X, y, groups) | ||
for i in range(n_splits): | ||
idx_train = range(n_samples) | ||
idx_test = range(n_samples) | ||
yield ( | ||
idx_train, idx_test | ||
) |
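The `fit`, `predict` and `score` stubs in `KNearestNeighbors` could be filled in along the following lines. This is only a minimal sketch, not the instructors' reference solution: the class name `KNearestNeighborsSketch`, the stored attributes `X_train_` and `y_train_`, and the tie-breaking rule (smallest label wins) are assumptions made for illustration. It relies on the `pairwise_distances` hint from the module docstring, which uses the Euclidean metric by default.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.utils.multiclass import check_classification_targets
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


class KNearestNeighborsSketch(BaseEstimator, ClassifierMixin):
    """Hypothetical k-nearest-neighbors classifier (illustration only)."""

    def __init__(self, n_neighbors=1):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # Validate the inputs and simply memorize the training set.
        X, y = check_X_y(X, y)
        check_classification_targets(y)
        self.classes_ = np.unique(y)
        self.n_features_in_ = X.shape[1]
        self.X_train_ = X
        self.y_train_ = y
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        # Euclidean distances, shape (n_test_samples, n_train_samples).
        distances = pairwise_distances(X, self.X_train_)
        # Indices of the n_neighbors closest training samples per row.
        nearest = np.argsort(distances, axis=1)[:, :self.n_neighbors]
        neighbor_labels = self.y_train_[nearest]
        y_pred = []
        for labels in neighbor_labels:
            # Majority vote; on a tie, the smallest label is kept.
            values, counts = np.unique(labels, return_counts=True)
            y_pred.append(values[np.argmax(counts)])
        return np.array(y_pred)

    def score(self, X, y):
        # Accuracy: fraction of samples whose prediction matches y.
        return float(np.mean(self.predict(X) == np.asarray(y)))


# Tiny usage example with two well-separated clusters.
X_train = [[0, 0], [0, 1], [5, 5], [5, 6]]
y_train = [0, 0, 1, 1]
X_test = [[0, 0.2], [5, 5.5]]
clf = KNearestNeighborsSketch(n_neighbors=3).fit(X_train, y_train)
y_pred = clf.predict(X_test)
```

Storing the training data in attributes that end with an underscore is what lets `check_is_fitted` detect that `fit` has been called.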
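The month-by-month splitting described for `MonthlySplit` can be sketched as a plain generator. This is an illustrative sketch under stated assumptions, not the expected solution: the function name `monthly_splits` is hypothetical, and "successive months" is taken to mean successive months that actually appear in the data, so a month with no rows would be skipped rather than producing an empty split.

```python
import numpy as np
import pandas as pd


def monthly_splits(X, time_col="index"):
    """Yield (idx_train, idx_test) for each pair of successive months.

    Hypothetical helper: trains on one month, tests on the next one.
    """
    times = X.index if time_col == "index" else X[time_col]
    if not pd.api.types.is_datetime64_any_dtype(times):
        raise ValueError(f"{time_col!r} must be of datetime type")
    # Month label of every row, e.g. Period('2020-11', 'M').
    periods = pd.DatetimeIndex(times).to_period("M")
    # Distinct months present in the data, in chronological order.
    months = periods.unique().sort_values()
    for train_month, test_month in zip(months[:-1], months[1:]):
        # Positional indices of the rows falling in each month.
        idx_train = np.flatnonzero(periods == train_month)
        idx_test = np.flatnonzero(periods == test_month)
        yield idx_train, idx_test


# Daily data from November 2020 to March 2021 (five months).
dates = pd.date_range("2020-11-01", "2021-03-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(dates))}, index=dates)
splits = list(monthly_splits(df))
```

With five months of data the generator yields four splits, and in a class-based splitter `get_n_splits` would return the number of distinct months minus one. The indices are positional, so they stay valid even if the rows are not sorted by date.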