INIT assignment sklearn
tomMoral committed Nov 3, 2021
0 parents commit a1957bc
Showing 7 changed files with 412 additions and 0 deletions.
63 changes: 63 additions & 0 deletions .github/workflows/python-app.yml
@@ -0,0 +1,63 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Assignment Validation

on:
  push:
    branches:
      - 'main'

  pull_request:

jobs:
  test:
    name: Test Code
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.8
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pytest
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
      - name: Test with pytest
        run: pytest -v

  flake8:
    name: Check code style
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.8
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install dependencies
        run: |
          pip install flake8
      - name: Lint with flake8
        run: |
          # stop the build if there are Python syntax errors or undefined names
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
          # fail the build on any remaining style error (complexity above 10 or lines longer than 80 characters)
          flake8 . --count --max-complexity=10 --max-line-length=80 --statistics

  check-doc:
    name: Check doc style
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.8
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install dependencies
        run: |
          pip install pydocstyle
      - name: Check doc style with pydocstyle
        run: pydocstyle
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
__pycache__
25 changes: 25 additions & 0 deletions LICENSE
@@ -0,0 +1,25 @@
BSD 2-Clause License

Copyright (c) 2021, Thomas Moreau
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
24 changes: 24 additions & 0 deletions README.md
@@ -0,0 +1,24 @@
# Assignment 2 for the Advanced ML training @ BCG Gamma: scikit-learn API

## What this assignment checks that you know how to do:

- Use Git and GitHub
- Work with Python files (and not just notebooks!)
- Do a pull request on a GitHub repository
- Format your code properly using standard Python conventions
- Make your code pass tests run automatically on a continuous integration system (GitHub Actions)
- Understand how to code scikit-learn compatible objects.

## How?

- Fork the repository by clicking on the `Fork` button in the upper right corner
- Clone the repository of your fork with: `git clone https://github.com/MYLOGIN/assignment_sklearn` (replace MYLOGIN with your GitHub login)
- Create a branch called `myassignment-$MYLOGIN` using `git checkout -b myassignment-$MYLOGIN`
- Make the changes to complete the assignment. You have to modify the files that contain `questions` in their name. Do not modify the files that start with `test_`.
- Open the pull request on GitHub
- Keep pushing to your branch until the continuous integration system is green.
- When it is green, notify the instructors on Slack that you're done.

## Getting Help

If you need help, ask on the training's Slack.
4 changes: 4 additions & 0 deletions requirements.txt
@@ -0,0 +1,4 @@
numpy
scipy
scikit-learn
pandas
182 changes: 182 additions & 0 deletions sklearn_questions.py
@@ -0,0 +1,182 @@
"""Assignment - making a sklearn estimator and cv splitter.

The goal of this assignment is to implement by yourself:

- a scikit-learn estimator for the KNearestNeighbors for classification
  tasks and check that it is working properly.
- a scikit-learn CV splitter where the splits are based on a Pandas
  DatetimeIndex.

Detailed instructions for question 1:
The nearest neighbor classifier predicts for a point X_i the target y_k of
the training sample X_k which is the closest to X_i. We measure proximity with
the Euclidean distance. The model will be evaluated with the accuracy
(fraction of samples correctly classified). You need to implement the `fit`,
`predict` and `score` methods for this class. The code you write should pass
the tests we implemented. You can run the tests by calling at the root of the
repo `pytest test_sklearn_questions.py`.

Detailed instructions for question 2:
The data to split should have its index or one of its columns in datetime
format. The aim is to split the data between train and test sets such that,
for each pair of successive months, we learn on the first month and predict
on the following one. For example, if you have data distributed from
November 2020 to March 2021, you have 4 splits. The first split allows
learning on November data and predicting on December data, the second split
learning on December data and predicting on January data, etc.

We also ask you to respect the pep8 convention: https://pep8.org. This will be
enforced with `flake8`. You can check that there are no flake8 errors by
calling `flake8` at the root of the repo.

Finally, you need to write docstrings for the methods you code and for the
class. The docstrings will be checked using `pydocstyle`, which you can also
call at the root of the repo.

Hints
-----
- You can use the function:

      from sklearn.metrics.pairwise import pairwise_distances

  to compute distances between 2 sets of samples.
"""
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin

from sklearn.model_selection import BaseCrossValidator

from sklearn.utils.validation import check_X_y, check_is_fitted
from sklearn.utils.validation import check_array
from sklearn.utils.multiclass import check_classification_targets
from sklearn.metrics.pairwise import pairwise_distances


class KNearestNeighbors(BaseEstimator, ClassifierMixin):
    """KNearestNeighbors classifier."""

    def __init__(self, n_neighbors=1):  # noqa: D107
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        """Fitting function.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)
            training data.
        y : ndarray, shape (n_samples,)
            target values.

        Returns
        -------
        self : instance of KNearestNeighbors
            The current instance of the classifier
        """
        return self

    def predict(self, X):
        """Predict function.

        Parameters
        ----------
        X : ndarray, shape (n_test_samples, n_features)
            Test data to predict on.

        Returns
        -------
        y : ndarray, shape (n_test_samples,)
            Class labels for each test data sample.
        """
        y_pred = np.zeros(X.shape[0])
        return y_pred

    def score(self, X, y):
        """Calculate the score of the prediction.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)
            training data.
        y : ndarray, shape (n_samples,)
            target values.

        Returns
        -------
        score : float
            Accuracy of the model computed for the (X, y) pairs.
        """
        return 0.


class MonthlySplit(BaseCrossValidator):
    """CrossValidator based on monthly split.

    Split data based on the given `time_col` (or default to index). Each split
    corresponds to one month of data for the training and the next month of
    data for the test.

    Parameters
    ----------
    time_col : str, defaults to 'index'
        Column of the input DataFrame that will be used to split the data. This
        column should be of type datetime. If split is called with a DataFrame
        for which this column is not a datetime, it will raise a ValueError.
        To use the index as column just set `time_col` to `'index'`.
    """

    def __init__(self, time_col='index'):  # noqa: D107
        self.time_col = time_col

    def get_n_splits(self, X, y=None, groups=None):
        """Return the number of splitting iterations in the cross-validator.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where `n_samples` is the number of samples
            and `n_features` is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.

        Returns
        -------
        n_splits : int
            The number of splits.
        """
        return 0

    def split(self, X, y, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where `n_samples` is the number of samples
            and `n_features` is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.

        Yields
        ------
        idx_train : ndarray
            The training set indices for that split.
        idx_test : ndarray
            The testing set indices for that split.
        """
        n_samples = X.shape[0]
        n_splits = self.get_n_splits(X, y, groups)
        for i in range(n_splits):
            idx_train = range(n_samples)
            idx_test = range(n_samples)
            yield (
                idx_train, idx_test
            )
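
Note on question 2: the month pairing described in the module docstring can be checked on the November 2020 to March 2021 example with a few lines of pandas. This is only a rough sketch under assumptions, not the reference solution: the daily `dates` index is made up for illustration, the variable names are hypothetical, and the pairing assumes every month in the range contains data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily timestamps covering November 2020 to March 2021.
dates = pd.date_range(start="2020-11-01", end="2021-03-31", freq="D")

# Map every timestamp to its calendar month, then list the distinct months
# in chronological order.
periods = dates.to_period("M")
months = periods.unique().sort_values()

# Pair each month (train) with the next month present in the data (test).
# Assumption: no month is missing, so "next present" means "following month".
splits = []
for train_month, test_month in zip(months[:-1], months[1:]):
    idx_train = np.flatnonzero(periods == train_month)
    idx_test = np.flatnonzero(periods == test_month)
    splits.append((idx_train, idx_test))

print(len(splits))  # 4: five months give four pairs of successive months
```

Five months of data yield four (train, test) pairs because the last month can only serve as a test set.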