|
15 | 15 | # * looking at the variables in the dataset, in particular, differentiate |
16 | 16 | # between numerical and categorical variables, which need different |
17 | 17 | # preprocessing in most machine learning workflows; |
18 | | -# * visualizing the distribution of the variables to gain some insights into |
19 | | -# the dataset. |
| 18 | +# * visualizing the distribution of the variables to gain some insights into the |
| 19 | +# dataset. |
20 | 20 |
|
21 | 21 | # %% [markdown] |
22 | 22 | # ## Loading the adult census dataset |
|
50 | 50 | # %% [markdown] |
51 | 51 | # ## The variables (columns) in the dataset |
52 | 52 | # |
53 | | -# The data are stored in a `pandas` dataframe. A dataframe is a type of structured |
54 | | -# data composed of 2 dimensions. This type of data is also referred as tabular |
55 | | -# data. |
| 53 | +# The data are stored in a `pandas` dataframe. A dataframe is a type of |
| 54 | +# structured data composed of 2 dimensions. This type of data is also referred
| 55 | +# to as tabular data.
56 | 56 | # |
57 | 57 | # Each row represents a "sample". In the field of machine learning or |
58 | 58 | # descriptive statistics, commonly used equivalent terms are "record", |
|
71 | 71 | adult_census.head() |
72 | 72 |
|
73 | 73 | # %% [markdown] |
74 | | -# The column named **class** is our target variable (i.e., the variable which |
75 | | -# we want to predict). The two possible classes are `<=50K` (low-revenue) and |
76 | | -# `>50K` (high-revenue). The resulting prediction problem is therefore a |
77 | | -# binary classification problem as `class` has only two possible values. |
78 | | -# We will use the left-over columns (any column other than `class`) as input |
79 | | -# variables for our model. |
| 74 | +# The column named **class** is our target variable (i.e., the variable which we |
| 75 | +# want to predict). The two possible classes are `<=50K` (low-revenue) and |
| 76 | +# `>50K` (high-revenue). The resulting prediction problem is therefore a binary |
| 77 | +# classification problem as `class` has only two possible values. We will use
| 78 | +# the remaining columns (any column other than `class`) as input variables for
| 79 | +# our model.
80 | 80 |
|
81 | 81 | # %% |
82 | 82 | target_column = "class" |
83 | 83 | adult_census[target_column].value_counts() |
84 | 84 |
|
85 | 85 | # %% [markdown] |
86 | 86 | # ```{note} |
87 | | -# Here, classes are slightly imbalanced, meaning there are more samples of one or |
88 | | -# more classes compared to others. In this case, we have many more samples with |
89 | | -# `" <=50K"` than with `" >50K"`. Class imbalance happens often in practice |
| 87 | +# Here, classes are slightly imbalanced, meaning that some classes have more
| 88 | +# samples than others. In this case, we have many more samples with
| 89 | +# `" <=50K"` than with `" >50K"`. Class imbalance happens often in practice
90 | 90 | # and may need special techniques when building a predictive model. |
91 | 91 | # |
92 | | -# For example in a medical setting, if we are trying to predict whether |
93 | | -# subjects will develop a rare disease, there will be a lot more healthy |
94 | | -# subjects than ill subjects in the dataset. |
| 92 | +# For example, in a medical setting, if we are trying to predict whether
| 93 | +# subjects will develop a rare disease, there will be a lot more healthy
| 94 | +# subjects than ill subjects in the dataset.
95 | 95 | # ``` |
96 | 96 |
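The class proportions described in the note above can be inspected directly with `value_counts(normalize=True)`; a minimal sketch using a small synthetic stand-in for the target column (the notebook itself would call this on `adult_census["class"]`):

```python
import pandas as pd

# Synthetic stand-in for the "class" column of the adult census dataset;
# the real notebook would use adult_census[target_column] instead.
target = pd.Series([" <=50K"] * 76 + [" >50K"] * 24, name="class")

# normalize=True turns raw counts into proportions, which makes the
# imbalance between the two classes immediately visible.
proportions = target.value_counts(normalize=True)
print(proportions)
```

On the real dataset the proportions differ, but the same call shows the majority class dominating in the same way.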
|
97 | 97 | # %% [markdown] |
|
197 | 197 | # real life setting. |
198 | 198 | # |
199 | 199 | # We recommend that our readers refer to [fairlearn.org](https://fairlearn.org)
200 | | -# for resources on how to quantify and potentially mitigate fairness |
201 | | -# issues related to the deployment of automated decision making |
202 | | -# systems that rely on machine learning components. |
| 200 | +# for resources on how to quantify and potentially mitigate fairness issues |
| 201 | +# related to the deployment of automated decision making systems that rely on |
| 202 | +# machine learning components. |
203 | 203 | # |
204 | 204 | # Studying why the data collection process of this dataset led to such an
205 | 205 | # unexpected gender imbalance is beyond the scope of this MOOC but we should |
|
211 | 211 | adult_census["education"].value_counts() |
212 | 212 |
|
213 | 213 | # %% [markdown] |
214 | | -# As noted above, `"education-num"` distribution has two clear peaks around 10 and |
215 | | -# 13. It would be reasonable to expect that `"education-num"` is the number of |
216 | | -# years of education. |
| 214 | +# As noted above, `"education-num"` distribution has two clear peaks around 10 |
| 215 | +# and 13. It would be reasonable to expect that `"education-num"` is the number |
| 216 | +# of years of education. |
217 | 217 | # |
218 | 218 | # Let's look at the relationship between `"education"` and `"education-num"`. |
219 | 219 | # %% |
220 | | -pd.crosstab(index=adult_census["education"], columns=adult_census["education-num"]) |
| 220 | +pd.crosstab( |
| 221 | + index=adult_census["education"], columns=adult_census["education-num"] |
| 222 | +) |
221 | 223 |
|
222 | 224 | # %% [markdown] |
223 | 225 | # For every entry in `"education"`, there is only one corresponding
224 | | -# value in `"education-num"`. This shows that `"education"` and `"education-num"`
225 | | -# give you the same information. For example, `"education-num"=2` is equivalent to |
226 | | -# `"education"="1st-4th"`. In practice that means we can remove |
227 | | -# `"education-num"` without losing information. Note that having redundant (or |
228 | | -# highly correlated) columns can be a problem for machine learning algorithms. |
| 226 | +# value in `"education-num"`. This shows that `"education"` and
| 227 | +# `"education-num"` give you the same information. For example, |
| 228 | +# `"education-num"=2` is equivalent to `"education"="1st-4th"`. In practice that |
| 229 | +# means we can remove `"education-num"` without losing information. Note that |
| 230 | +# having redundant (or highly correlated) columns can be a problem for machine |
| 231 | +# learning algorithms. |
229 | 232 |
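The redundancy claim can also be checked programmatically rather than by reading the crosstab: if every `"education"` level maps to exactly one `"education-num"` code, the number of unique codes per level is 1 everywhere. A sketch on a tiny synthetic excerpt (the real notebook would run the same `groupby` on `adult_census`):

```python
import pandas as pd

# Tiny synthetic excerpt mirroring the education / education-num pairing;
# the values here are illustrative, not the full set of categories.
df = pd.DataFrame({
    "education": ["1st-4th", "1st-4th", "HS-grad", "HS-grad", "Bachelors"],
    "education-num": [2, 2, 9, 9, 13],
})

# One unique code per education level means the two columns carry
# the same information, so one of them can be dropped.
codes_per_level = df.groupby("education")["education-num"].nunique()
print((codes_per_level == 1).all())
```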
|
230 | 233 | # %% [markdown] |
231 | 234 | # ```{note} |
|
299 | 302 | plt.axvline(x=age_limit, ymin=0, ymax=1, color="black", linestyle="--") |
300 | 303 |
|
301 | 304 | hours_per_week_limit = 40 |
302 | | -plt.axhline(y=hours_per_week_limit, xmin=0.18, xmax=1, color="black", linestyle="--") |
| 305 | +plt.axhline( |
| 306 | + y=hours_per_week_limit, xmin=0.18, xmax=1, color="black", linestyle="--" |
| 307 | +) |
303 | 308 |
|
304 | 309 | plt.annotate("<=50K", (17, 25), rotation=90, fontsize=35) |
305 | 310 | plt.annotate("<=50K", (35, 20), fontsize=35) |
|
322 | 327 | # will choose the "best" splits based on data without human intervention or |
323 | 328 | # inspection. Decision trees will be covered in more detail in a future module.
324 | 329 | # |
325 | | -# Note that machine learning is often used when creating rules by hand |
326 | | -# is not straightforward. For example because we are in high dimension (many |
327 | | -# features in a table) or because there are no simple and obvious rules that |
328 | | -# separate the two classes as in the top-right region of the previous plot. |
| 330 | +# Note that machine learning is often used when creating rules by hand is not
| 331 | +# straightforward, for example because the data is high-dimensional (many
| 332 | +# features in a table) or because there are no simple and obvious rules that
| 333 | +# separate the two classes, as in the top-right region of the previous plot.
329 | 334 | # |
330 | 335 | # To sum up, the important thing to remember is that in a machine-learning |
331 | 336 | # setting, a model automatically creates the "rules" from the existing data in |
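As an illustration, the kind of hand-crafted rule discussed above can be written as a tiny classifier. The thresholds below (`age > 27`, `hours-per-week > 40`) are hypothetical values eyeballed from a plot like the one in the notebook, not thresholds a learned model would necessarily pick:

```python
import pandas as pd

# Synthetic samples; the real notebook would apply the rule to adult_census.
samples = pd.DataFrame({
    "age": [20, 45, 60, 30],
    "hours-per-week": [20, 50, 45, 38],
})

def predict_by_rule(df, age_limit=27, hours_limit=40):
    # Hypothetical hand-crafted rule: predict high revenue only when
    # both thresholds are exceeded.
    high = (df["age"] > age_limit) & (df["hours-per-week"] > hours_limit)
    return high.map({True: " >50K", False: " <=50K"})

print(predict_by_rule(samples).tolist())
# → [' <=50K', ' >50K', ' >50K', ' <=50K']
```

A decision tree effectively searches for such thresholds automatically, which is why it serves as the contrast to hand-written rules here.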
|