
Commit a3f96ad

ArturoAmorQ and glemaitre authored
MNT Use lint and black format (#693)
Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
1 parent ccaee25 commit a3f96ad

File tree: 85 files changed (+3377 / -2597 lines)

Note: large commits have some content hidden by default, so only a subset of the 85 changed files is shown below.


.github/workflows/formatting.yml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+name: Formatting
+
+on:
+  push:
+    branches:
+      - "main"
+
+  pull_request:
+    branches:
+      - '*'
+
+jobs:
+  run-linters:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+
+      - name: Set up Python 3.11
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.11"
+          allow-prereleases: true
+
+      - name: Run the linters via pre-commit
+        run: |
+          python -m pip install pre-commit
+          # only run pre-commit on the folder `python_scripts`
+          pre-commit run --files python_scripts/*
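
The last workflow step can be reproduced locally. A minimal sketch, assuming pre-commit ends up on PATH after installation; the glob call stands in for the shell expansion of `python_scripts/*` done by the CI runner:

# Hypothetical local reproduction of the CI lint step above.
import glob
import subprocess
import sys

# Install pre-commit into the current environment, as the workflow does.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "pre-commit"], check=True
)

# Run pre-commit only on the `python_scripts` folder; check=True raises
# CalledProcessError if any hook fails, mirroring a red CI job.
files = glob.glob("python_scripts/*")
subprocess.run(["pre-commit", "run", "--files", *files], check=True)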

.pre-commit-config.yaml

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.4.0
+    hooks:
+      - id: check-yaml
+        exclude: doc/
+      - id: end-of-file-fixer
+        exclude: doc/
+      - id: trailing-whitespace
+        exclude: doc/
+  - repo: https://github.com/psf/black
+    rev: 23.1.0
+    hooks:
+      - id: black
+        exclude: doc/
+  - repo: https://github.com/pycqa/flake8
+    rev: 4.0.1
+    hooks:
+      - id: flake8
+        entry: pflake8
+        additional_dependencies: [pyproject-flake8]
+        types: [file, python]
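
For contributors, a sketch of the usual one-time setup that makes these hooks run on every `git commit` (assuming pre-commit is already installed; `install` and `run --all-files` are standard pre-commit subcommands):

# Hypothetical one-time setup so the hooks above run automatically on commit.
import subprocess

# Register the git pre-commit hook defined by .pre-commit-config.yaml.
subprocess.run(["pre-commit", "install"], check=True)

# Optionally run all hooks once over the whole repository; the per-hook
# `exclude: doc/` patterns above still apply.
subprocess.run(["pre-commit", "run", "--all-files"], check=True)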

pyproject.toml

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+[tool.black]
+line-length = 79
+target_version = ['py38', 'py39', 'py310', 'py311']
+preview = true
+exclude = '''
+/(
+    \.eggs  # exclude a few common directories in the
+  | \.git   # root of the project
+  | \.mypy_cache
+  | \.vscode
+  | build
+  | dist
+)/
+'''
+
+[tool.flake8]
+ignore = [
+    'E402',  # module level import not at top of file
+    'F401',  # imported but unused
+    'E501',  # line too long
+    'E203',  # whitespace before ':'
+    'W503',  # line break before binary operator
+    'W504',  # Line break occurred after a binary operator
+    'E24',
+]
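
A small sketch of what the `[tool.black]` settings do in practice, using black's Python API (assuming black 23.1.0 as pinned in the pre-commit config; `black.Mode` and `black.format_str` are the library's documented entry points):

# Demonstrate the effect of line-length = 79 on a call that appears below.
import black

mode = black.Mode(line_length=79, preview=True)
src = (
    'pd.crosstab(index=adult_census["education"],'
    ' columns=adult_census["education-num"])\n'
)
# The call exceeds 79 characters, so black splits it over several lines;
# this matches the reformatting visible in the diff of
# python_scripts/01_tabular_data_exploration.py below.
print(black.format_str(src, mode=mode))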

python_scripts/01_tabular_data_exploration.py

Lines changed: 39 additions & 34 deletions
@@ -15,8 +15,8 @@
 # * looking at the variables in the dataset, in particular, differentiate
 #   between numerical and categorical variables, which need different
 #   preprocessing in most machine learning workflows;
-# * visualizing the distribution of the variables to gain some insights into
-#   the dataset.
+# * visualizing the distribution of the variables to gain some insights into the
+#   dataset.
 
 # %% [markdown]
 # ## Loading the adult census dataset
@@ -50,9 +50,9 @@
 # %% [markdown]
 # ## The variables (columns) in the dataset
 #
-# The data are stored in a `pandas` dataframe. A dataframe is a type of structured
-# data composed of 2 dimensions. This type of data is also referred as tabular
-# data.
+# The data are stored in a `pandas` dataframe. A dataframe is a type of
+# structured data composed of 2 dimensions. This type of data is also referred
+# as tabular data.
 #
 # Each row represents a "sample". In the field of machine learning or
 # descriptive statistics, commonly used equivalent terms are "record",
@@ -71,27 +71,27 @@
 adult_census.head()
 
 # %% [markdown]
-# The column named **class** is our target variable (i.e., the variable which
-# we want to predict). The two possible classes are `<=50K` (low-revenue) and
-# `>50K` (high-revenue). The resulting prediction problem is therefore a
-# binary classification problem as `class` has only two possible values.
-# We will use the left-over columns (any column other than `class`) as input
-# variables for our model.
+# The column named **class** is our target variable (i.e., the variable which we
+# want to predict). The two possible classes are `<=50K` (low-revenue) and
+# `>50K` (high-revenue). The resulting prediction problem is therefore a binary
+# classification problem as `class` has only two possible values. We will use
+# the left-over columns (any column other than `class`) as input variables for
+# our model.
 
 # %%
 target_column = "class"
 adult_census[target_column].value_counts()
 
 # %% [markdown]
 # ```{note}
-# Here, classes are slightly imbalanced, meaning there are more samples of one or
-# more classes compared to others. In this case, we have many more samples with
-# `" <=50K"` than with `" >50K"`. Class imbalance happens often in practice
+# Here, classes are slightly imbalanced, meaning there are more samples of one
+# or more classes compared to others. In this case, we have many more samples
+# with `" <=50K"` than with `" >50K"`. Class imbalance happens often in practice
 # and may need special techniques when building a predictive model.
 #
-# For example in a medical setting, if we are trying to predict whether
-# subjects will develop a rare disease, there will be a lot more healthy
-# subjects than ill subjects in the dataset.
+# For example in a medical setting, if we are trying to predict whether subjects
+# will develop a rare disease, there will be a lot more healthy subjects than
+# ill subjects in the dataset.
 # ```
 
 # %% [markdown]
@@ -197,9 +197,9 @@
 # real life setting.
 #
 # We recommend our readers to refer to [fairlearn.org](https://fairlearn.org)
-# for resources on how to quantify and potentially mitigate fairness
-# issues related to the deployment of automated decision making
-# systems that rely on machine learning components.
+# for resources on how to quantify and potentially mitigate fairness issues
+# related to the deployment of automated decision making systems that rely on
+# machine learning components.
 #
 # Studying why the data collection process of this dataset lead to such an
 # unexpected gender imbalance is beyond the scope of this MOOC but we should
@@ -211,21 +211,24 @@
 adult_census["education"].value_counts()
 
 # %% [markdown]
-# As noted above, `"education-num"` distribution has two clear peaks around 10 and
-# 13. It would be reasonable to expect that `"education-num"` is the number of
-# years of education.
+# As noted above, `"education-num"` distribution has two clear peaks around 10
+# and 13. It would be reasonable to expect that `"education-num"` is the number
+# of years of education.
 #
 # Let's look at the relationship between `"education"` and `"education-num"`.
 # %%
-pd.crosstab(index=adult_census["education"], columns=adult_census["education-num"])
+pd.crosstab(
+    index=adult_census["education"], columns=adult_census["education-num"]
+)
 
 # %% [markdown]
 # For every entry in `\"education\"`, there is only one single corresponding
-# value in `\"education-num\"`. This shows that `"education"` and `"education-num"`
-# give you the same information. For example, `"education-num"=2` is equivalent to
-# `"education"="1st-4th"`. In practice that means we can remove
-# `"education-num"` without losing information. Note that having redundant (or
-# highly correlated) columns can be a problem for machine learning algorithms.
+# value in `\"education-num\"`. This shows that `"education"` and
+# `"education-num"` give you the same information. For example,
+# `"education-num"=2` is equivalent to `"education"="1st-4th"`. In practice that
+# means we can remove `"education-num"` without losing information. Note that
+# having redundant (or highly correlated) columns can be a problem for machine
+# learning algorithms.
 
 # %% [markdown]
 # ```{note}
@@ -299,7 +302,9 @@
 plt.axvline(x=age_limit, ymin=0, ymax=1, color="black", linestyle="--")
 
 hours_per_week_limit = 40
-plt.axhline(y=hours_per_week_limit, xmin=0.18, xmax=1, color="black", linestyle="--")
+plt.axhline(
+    y=hours_per_week_limit, xmin=0.18, xmax=1, color="black", linestyle="--"
+)
 
 plt.annotate("<=50K", (17, 25), rotation=90, fontsize=35)
 plt.annotate("<=50K", (35, 20), fontsize=35)
@@ -322,10 +327,10 @@
 # will choose the "best" splits based on data without human intervention or
 # inspection. Decision trees will be covered more in detail in a future module.
 #
-# Note that machine learning is often used when creating rules by hand
-# is not straightforward. For example because we are in high dimension (many
-# features in a table) or because there are no simple and obvious rules that
-# separate the two classes as in the top-right region of the previous plot.
+# Note that machine learning is often used when creating rules by hand is not
+# straightforward. For example because we are in high dimension (many features
+# in a table) or because there are no simple and obvious rules that separate the
+# two classes as in the top-right region of the previous plot.
 #
 # To sum up, the important thing to remember is that in a machine-learning
 # setting, a model automatically creates the "rules" from the existing data in
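
The redundancy argument in the diff above (one `"education-num"` value per `"education"` value) can be checked on a toy example. A minimal sketch with made-up rows; only the `1st-4th`/2 pairing is taken from the text, the other codes are illustrative:

# Toy illustration: when the mapping is one-to-one, every crosstab row has a
# single non-zero entry, so one of the two columns can be dropped.
import pandas as pd

df = pd.DataFrame(
    {
        "education": ["1st-4th", "1st-4th", "HS-grad", "Masters"],
        "education-num": [2, 2, 9, 14],  # hypothetical codes except 1st-4th=2
    }
)
print(pd.crosstab(index=df["education"], columns=df["education-num"]))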

python_scripts/02_numerical_pipeline_cross_validation.py

Lines changed: 13 additions & 10 deletions
@@ -13,7 +13,7 @@
 # We will discuss the practical aspects of assessing the generalization
 # performance of our model via **cross-validation** instead of a single
 # train-test split.
-# 
+#
 # ## Data preparation
 #
 # First, let's load the full adult census dataset.
@@ -79,10 +79,11 @@
 #
 # ```{note}
 # This figure shows the particular case of **K-fold** cross-validation strategy.
-# For each cross-validation split, the procedure trains a clone of model on all the red
-# samples and evaluate the score of the model on the blue samples.
-# As mentioned earlier, there is a variety of different cross-validation
-# strategies. Some of these aspects will be covered in more detail in future notebooks.
+# For each cross-validation split, the procedure trains a clone of model on all
+# the red samples and evaluate the score of the model on the blue samples. As
+# mentioned earlier, there is a variety of different cross-validation
+# strategies. Some of these aspects will be covered in more detail in future
+# notebooks.
 # ```
 #
 # Cross-validation is therefore computationally intensive because it requires
@@ -104,8 +105,10 @@
 # %% [markdown]
 # The output of `cross_validate` is a Python dictionary, which by default
 # contains three entries:
-# - (i) the time to train the model on the training data for each fold, `fit_time`
-# - (ii) the time to predict with the model on the testing data for each fold, `score_time`
+# - (i) the time to train the model on the training data for each fold,
+#   `fit_time`
+# - (ii) the time to predict with the model on the testing data for each fold,
+#   `score_time`
 # - (iii) the default score on the testing data for each fold, `test_score`.
 #
 # Setting `cv=5` created 5 distinct splits to get 5 variations for the training
@@ -144,15 +147,15 @@
 # we can estimate the uncertainty of our model generalization performance. This
 # is the main advantage of cross-validation and can be crucial in practice, for
 # example when comparing different models to figure out whether one is better
-# than the other or whether our measures of the generalization performance of each
-# model are within the error bars of one-another.
+# than the other or whether our measures of the generalization performance of
+# each model are within the error bars of one-another.
 #
 # In this particular case, only the first 2 decimals seem to be trustworthy. If
 # you go up in this notebook, you can check that the performance we get with
 # cross-validation is compatible with the one from a single train-test split.
 
 # %% [markdown]
 # ## Notebook recap
-# 
+#
 # In this notebook we assessed the generalization performance of our model via
 # **cross-validation**.
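
The three-entry dictionary described in this diff can be inspected directly. A minimal sketch on synthetic data; `make_classification` and the logistic regression are only stand-ins for the notebook's actual census data and pipeline:

# Inspect the default output of cross_validate: one value per fold per key.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1_000, random_state=0)
cv_results = cross_validate(LogisticRegression(), X, y, cv=5)

print(sorted(cv_results))        # ['fit_time', 'score_time', 'test_score']
print(cv_results["test_score"])  # five scores, one per split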

python_scripts/02_numerical_pipeline_ex_00.py

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@
 
 # %%
 import pandas as pd
+
 adult_census = pd.read_csv("../datasets/adult-census-numeric.csv")
 data = adult_census.drop(columns="class")
 target = adult_census["class"]

python_scripts/02_numerical_pipeline_ex_01.py

Lines changed: 1 addition & 2 deletions
@@ -49,8 +49,7 @@
 # notebook.
 
 # %%
-numerical_columns = [
-    "age", "capital-gain", "capital-loss", "hours-per-week"]
+numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
 
 data_numeric = data[numerical_columns]
 
