INIT assignment sklearn

Nico9803 · Nov 3, 2021 · a1957bc
commit a1957bc

Showing 7 changed files with 412 additions and 0 deletions.
diff --git a/.github/workflows/python-app.yml b/.github/workflows/python-app.yml
@@ -0,0 +1,63 @@
+# This workflow will install Python dependencies, run tests and lint with a single version of Python
+# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
+
+name: Assignment Validation
+
+on:
+  push:
+    branches:
+      - 'main'
+
+  pull_request:
+
+jobs:
+  test:
+    name: Test Code
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python 3.8
+      uses: actions/setup-python@v2
+      with:
+        python-version: 3.8
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install pytest
+        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+    - name: Test with pytest
+      run: pytest -v
+
+  flake8:
+    name: Check code style
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python 3.8
+      uses: actions/setup-python@v2
+      with:
+        python-version: 3.8
+    - name: Install dependencies
+      run: |
+        pip install flake8
+    - name: Lint with flake8
+      run: |
+        # stop the build if there are Python syntax errors or undefined names
+        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
+        # fail the build on any style violation, complexity above 10, or lines longer than 80 chars
+        flake8 . --count --max-complexity=10 --max-line-length=80 --statistics
+
+  check-doc:
+    name: Check doc style
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python 3.8
+      uses: actions/setup-python@v2
+      with:
+        python-version: 3.8
+    - name: Install dependencies
+      run: |
+        pip install pydocstyle
+    - name: Check doc style with pydocstyle
+      run: pydocstyle
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+__pycache__
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,25 @@
+BSD 2-Clause License
+
+Copyright (c) 2021, Thomas Moreau
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice, this
+   list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/README.md b/README.md
@@ -0,0 +1,24 @@
+# Assignment 2 for the Advanced ML training @ BCG Gamma: scikit-learn API
+
+## What this assignment checks that you know how to do:
+
+  - Use Git and GitHub
+  - Work with Python files (and not just notebooks!)
+  - Do a pull request on a GitHub repository
+  - Format your code properly using standard Python conventions
+  - Make your code pass tests run automatically on a continuous integration system (GitHub actions)
+  - Understand how to code scikit-learn compatible objects.
+
+## How?
+
+  - Fork the repository by clicking on the `Fork` button in the upper right corner
+  - Clone the repository of your fork with: `git clone https://github.com/MYLOGIN/assignment_sklearn` (replace MYLOGIN with your GitHub login)
+  - Create a branch called `myassignment-$MYLOGIN` using `git checkout -b myassignment-$MYLOGIN`
+  - Make the changes to complete the assignment. You have to modify the files that contain `questions` in their name. Do not modify the files that start with `test_`.
+  - Open the pull request on GitHub
+  - Keep pushing to your branch until the continuous integration system is green.
+  - When it is green, notify the instructors on Slack that you're done.
+
+## Getting Help
+
+If you need help, ask on the training's Slack.
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,4 @@
+numpy
+scipy
+scikit-learn
+pandas
diff --git a/sklearn_questions.py b/sklearn_questions.py
@@ -0,0 +1,182 @@
+"""Assignment - making a sklearn estimator and cv splitter.
+
+The goal of this assignment is to implement by yourself:
+
+- a scikit-learn estimator for the KNearestNeighbors for classification
+  tasks and check that it is working properly.
+- a scikit-learn CV splitter where the splits are based on a pandas
+  DatetimeIndex.
+
+Detailed instructions for question 1:
+The nearest neighbor classifier predicts for a point X_i the target y_k of
+the training sample X_k which is the closest to X_i. We measure proximity with
+the Euclidean distance. The model will be evaluated with the accuracy
+(fraction of samples correctly classified). You need to implement the `fit`,
+`predict` and `score` methods for this class. The code you write should pass
+the tests we implemented. You can run the tests by calling
+`pytest test_sklearn_questions.py` at the root of the repo.
+
+Detailed instructions for question 2:
+The data to split should contain the index or one column in
+datetime format. The aim is to split the data into train and test
+sets such that, for each pair of successive months, we learn on the first
+and predict on the following. For example, if you have data distributed from
+November 2020 to March 2021, you have 4 splits. The first split
+will allow to learn on November data and predict on December data, the
+second split to learn on December and predict on January, etc.
+
+We also ask you to respect the pep8 convention: https://pep8.org. This will be
+enforced with `flake8`. You can check that there are no flake8 errors by
+calling `flake8` at the root of the repo.
+
+Finally, you need to write docstrings for the methods you code and for the
+class. The docstring will be checked using `pydocstyle` that you can also
+call at the root of the repo.
+
+Hints
+-----
+- You can use the function:
+
+from sklearn.metrics.pairwise import pairwise_distances
+
+to compute distances between 2 sets of samples.
+"""
+import numpy as np
+import pandas as pd
+
+from sklearn.base import BaseEstimator
+from sklearn.base import ClassifierMixin
+
+from sklearn.model_selection import BaseCrossValidator
+
+from sklearn.utils.validation import check_X_y, check_is_fitted
+from sklearn.utils.validation import check_array
+from sklearn.utils.multiclass import check_classification_targets
+from sklearn.metrics.pairwise import pairwise_distances
+
+
+class KNearestNeighbors(BaseEstimator, ClassifierMixin):
+    """KNearestNeighbors classifier."""
+
+    def __init__(self, n_neighbors=1):  # noqa: D107
+        self.n_neighbors = n_neighbors
+
+    def fit(self, X, y):
+        """Fitting function.
+
+        Parameters
+        ----------
+        X : ndarray, shape (n_samples, n_features)
+            training data.
+        y : ndarray, shape (n_samples,)
+            target values.
+
+        Returns
+        -------
+        self : instance of KNearestNeighbors
+            The current instance of the classifier
+        """
+        return self
+
+    def predict(self, X):
+        """Predict function.
+
+        Parameters
+        ----------
+        X : ndarray, shape (n_test_samples, n_features)
+            Test data to predict on.
+
+        Returns
+        -------
+        y : ndarray, shape (n_test_samples,)
+            Class labels for each test data sample.
+        """
+        y_pred = np.zeros(X.shape[0])
+        return y_pred
+
+    def score(self, X, y):
+        """Calculate the score of the prediction.
+
+        Parameters
+        ----------
+        X : ndarray, shape (n_samples, n_features)
+            training data.
+        y : ndarray, shape (n_samples,)
+            target values.
+
+        Returns
+        -------
+        score : float
+            Accuracy of the model computed for the (X, y) pairs.
+        """
+        return 0.
+
+
+class MonthlySplit(BaseCrossValidator):
+    """CrossValidator based on monthly split.
+
+    Split data based on the given `time_col` (or default to index). Each split
+    corresponds to one month of data for the training and the next month of
+    data for the test.
+
+    Parameters
+    ----------
+    time_col : str, defaults to 'index'
+        Column of the input DataFrame that will be used to split the data. This
+        column should be of type datetime. If split is called with a DataFrame
+        for which this column is not a datetime, it will raise a ValueError.
+        To use the index as column just set `time_col` to `'index'`.
+    """
+
+    def __init__(self, time_col='index'):  # noqa: D107
+        self.time_col = time_col
+
+    def get_n_splits(self, X, y=None, groups=None):
+        """Return the number of splitting iterations in the cross-validator.
+
+        Parameters
+        ----------
+        X : array-like of shape (n_samples, n_features)
+            Training data, where `n_samples` is the number of samples
+            and `n_features` is the number of features.
+        y : array-like of shape (n_samples,)
+            Always ignored, exists for compatibility.
+        groups : array-like of shape (n_samples,)
+            Always ignored, exists for compatibility.
+
+        Returns
+        -------
+        n_splits : int
+            The number of splits.
+        """
+        return 0
+
+    def split(self, X, y, groups=None):
+        """Generate indices to split data into training and test set.
+
+        Parameters
+        ----------
+        X : array-like of shape (n_samples, n_features)
+            Training data, where `n_samples` is the number of samples
+            and `n_features` is the number of features.
+        y : array-like of shape (n_samples,)
+            Always ignored, exists for compatibility.
+        groups : array-like of shape (n_samples,)
+            Always ignored, exists for compatibility.
+
+        Yields
+        ------
+        idx_train : ndarray
+            The training set indices for that split.
+        idx_test : ndarray
+            The testing set indices for that split.
+        """
+
+        n_samples = X.shape[0]
+        n_splits = self.get_n_splits(X, y, groups)
+        for i in range(n_splits):
+            idx_train = range(n_samples)
+            idx_test = range(n_samples)
+            yield (
+                idx_train, idx_test
+            )
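
The two tasks described in the module docstring of `sklearn_questions.py` could be approached roughly as follows. This is a minimal sketch, not the reference solution. The class names `KNNSketch` and `MonthlySplitSketch` and the helper `_months` are hypothetical. The majority vote for `n_neighbors > 1` is an assumption, since the instructions only describe the single nearest neighbor. The splitter pairs consecutive months that actually appear in the data, which is also an assumption about how gaps should be handled.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.model_selection import BaseCrossValidator
from sklearn.utils.multiclass import check_classification_targets
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


class KNNSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical k-nearest-neighbors classifier (majority vote)."""

    def __init__(self, n_neighbors=1):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # Validate the inputs and memorize the training set.
        X, y = check_X_y(X, y)
        check_classification_targets(y)
        self.classes_ = np.unique(y)
        self.n_features_in_ = X.shape[1]
        self.X_train_, self.y_train_ = X, y
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        # Euclidean distances, shape (n_test_samples, n_train_samples).
        dist = pairwise_distances(X, self.X_train_)
        nearest = np.argsort(dist, axis=1)[:, :self.n_neighbors]
        y_pred = []
        for labels in self.y_train_[nearest]:
            # Assumption: the most frequent label among the neighbors wins.
            values, counts = np.unique(labels, return_counts=True)
            y_pred.append(values[np.argmax(counts)])
        return np.array(y_pred)

    def score(self, X, y):
        # Accuracy: fraction of samples whose prediction matches the target.
        return float(np.mean(self.predict(X) == np.asarray(y)))


class MonthlySplitSketch(BaseCrossValidator):
    """Hypothetical splitter: train on one month, test on the next."""

    def __init__(self, time_col='index'):
        self.time_col = time_col

    def _months(self, X):
        # Hypothetical helper: one monthly period per row, positional index.
        times = X.index if self.time_col == 'index' else X[self.time_col]
        times = pd.Series(times).reset_index(drop=True)
        if not pd.api.types.is_datetime64_any_dtype(times):
            raise ValueError(f"{self.time_col!r} is not of datetime type")
        return times.dt.to_period('M')

    def get_n_splits(self, X, y=None, groups=None):
        # One split per pair of consecutive months present in the data.
        return self._months(X).nunique() - 1

    def split(self, X, y=None, groups=None):
        months = self._months(X)
        ordered = sorted(months.unique())
        for train_month, test_month in zip(ordered[:-1], ordered[1:]):
            idx_train = np.flatnonzero((months == train_month).to_numpy())
            idx_test = np.flatnonzero((months == test_month).to_numpy())
            yield idx_train, idx_test
```

With daily data from November 2020 to March 2021, `MonthlySplitSketch().get_n_splits(df)` counts one split per consecutive pair of months, and the first pair yielded by `split` holds the positional indices of the November rows and the December rows. The splitter returns positions rather than labels so that it can be passed as `cv=` to scikit-learn utilities, which index arrays positionally.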