|
15 | 15 | # * looking at the variables in the dataset, in particular, differentiate |
16 | 16 | # between numerical and categorical variables, which need different |
17 | 17 | # preprocessing in most machine learning workflows; |
18 | | -# * visualizing the distribution of the variables to gain some insights into |
19 | | -# the dataset. |
| 18 | +# * visualizing the distribution of the variables to gain some insights into the |
| 19 | +# dataset. |
20 | 20 |
|
21 | 21 | # %% [markdown] |
22 | 22 | # ## Loading the adult census dataset |
|
50 | 50 | # %% [markdown] |
51 | 51 | # ## The variables (columns) in the dataset |
52 | 52 | # |
53 | | -# The data are stored in a `pandas` dataframe. A dataframe is a type of structured |
54 | | -# data composed of 2 dimensions. This type of data is also referred as tabular |
55 | | -# data. |
| 53 | +# The data are stored in a `pandas` dataframe. A dataframe is a type of |
| 54 | +# structured data composed of 2 dimensions. This type of data is also referred
| 55 | +# to as tabular data.
56 | 56 | # |
57 | 57 | # Each row represents a "sample". In the field of machine learning or |
58 | 58 | # descriptive statistics, commonly used equivalent terms are "record", |
|
71 | 71 | adult_census.head() |
72 | 72 |
|
73 | 73 | # %% [markdown] |
74 | | -# The column named **class** is our target variable (i.e., the variable which |
75 | | -# we want to predict). The two possible classes are `<=50K` (low-revenue) and |
76 | | -# `>50K` (high-revenue). The resulting prediction problem is therefore a |
77 | | -# binary classification problem as `class` has only two possible values. |
78 | | -# We will use the left-over columns (any column other than `class`) as input |
79 | | -# variables for our model. |
| 74 | +# The column named **class** is our target variable (i.e., the variable which we |
| 75 | +# want to predict). The two possible classes are `<=50K` (low-revenue) and |
| 76 | +# `>50K` (high-revenue). The resulting prediction problem is therefore a binary |
| 77 | +# classification problem as `class` has only two possible values. We will use
| 78 | +# the remaining columns (any column other than `class`) as input variables for
| 79 | +# our model.
80 | 80 |
|
81 | 81 | # %% |
82 | 82 | target_column = "class" |
83 | 83 | adult_census[target_column].value_counts() |
84 | 84 |
|
85 | 85 | # %% [markdown] |
86 | 86 | # ```{note} |
87 | | -# Here, classes are slightly imbalanced, meaning there are more samples of one or |
88 | | -# more classes compared to others. In this case, we have many more samples with |
89 | | -# `" <=50K"` than with `" >50K"`. Class imbalance happens often in practice |
| 87 | +# Here, classes are slightly imbalanced, meaning that some classes have more
| 88 | +# samples than others. In this case, we have many more samples with
| 89 | +# `" <=50K"` than with `" >50K"`. Class imbalance happens often in practice
90 | 90 | # and may need special techniques when building a predictive model. |
91 | 91 | # |
92 | | -# For example in a medical setting, if we are trying to predict whether |
93 | | -# subjects will develop a rare disease, there will be a lot more healthy |
94 | | -# subjects than ill subjects in the dataset. |
| 92 | +# For example, in a medical setting, if we are trying to predict whether
| 93 | +# subjects will develop a rare disease, there will be a lot more healthy
| 94 | +# subjects than ill subjects in the dataset.
95 | 95 | # ``` |
96 | 96 |
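The class proportions described in the note above can be inspected directly with `value_counts(normalize=True)`; a minimal sketch using a small synthetic stand-in for the target column (the notebook itself would call this on `adult_census["class"]`):

```python
import pandas as pd

# Synthetic stand-in for the "class" column of the adult census dataset;
# the real notebook would use adult_census[target_column] instead.
target = pd.Series([" <=50K"] * 76 + [" >50K"] * 24, name="class")

# normalize=True turns raw counts into proportions, which makes the
# imbalance between the two classes immediately visible.
proportions = target.value_counts(normalize=True)
print(proportions)
```

On the real dataset the proportions differ, but the same call shows the majority class dominating in the same way.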
|
97 | 97 | # %% [markdown] |
|
197 | 197 | # real life setting. |
198 | 198 | # |
199 | 199 | # We recommend that our readers refer to [fairlearn.org](https://fairlearn.org)
200 | | -# for resources on how to quantify and potentially mitigate fairness |
201 | | -# issues related to the deployment of automated decision making |
202 | | -# systems that rely on machine learning components. |
| 200 | +# for resources on how to quantify and potentially mitigate fairness issues |
| 201 | +# related to the deployment of automated decision making systems that rely on |
| 202 | +# machine learning components. |
203 | 203 | # |
204 | 204 | # Studying why the data collection process of this dataset led to such an
205 | 205 | # unexpected gender imbalance is beyond the scope of this MOOC but we should |
|
211 | 211 | adult_census["education"].value_counts() |
212 | 212 |
|
213 | 213 | # %% [markdown] |
214 | | -# As noted above, `"education-num"` distribution has two clear peaks around 10 and |
215 | | -# 13. It would be reasonable to expect that `"education-num"` is the number of |
216 | | -# years of education. |
| 214 | +# As noted above, `"education-num"` distribution has two clear peaks around 10 |
| 215 | +# and 13. It would be reasonable to expect that `"education-num"` is the number |
| 216 | +# of years of education. |
217 | 217 | # |
218 | 218 | # Let's look at the relationship between `"education"` and `"education-num"`. |
219 | 219 | # %% |
220 | | -pd.crosstab(index=adult_census["education"], columns=adult_census["education-num"]) |
| 220 | +pd.crosstab( |
| 221 | + index=adult_census["education"], columns=adult_census["education-num"] |
| 222 | +) |
221 | 223 |
|
222 | 224 | # %% [markdown] |
223 | 225 | # For every entry in `"education"`, there is only one corresponding
224 | | -# value in `"education-num"`. This shows that `"education"` and `"education-num"`
225 | | -# give you the same information. For example, `"education-num"=2` is equivalent to |
226 | | -# `"education"="1st-4th"`. In practice that means we can remove |
227 | | -# `"education-num"` without losing information. Note that having redundant (or |
228 | | -# highly correlated) columns can be a problem for machine learning algorithms. |
| 226 | +# value in `"education-num"`. This shows that `"education"` and
| 227 | +# `"education-num"` give you the same information. For example, |
| 228 | +# `"education-num"=2` is equivalent to `"education"="1st-4th"`. In practice that |
| 229 | +# means we can remove `"education-num"` without losing information. Note that |
| 230 | +# having redundant (or highly correlated) columns can be a problem for machine |
| 231 | +# learning algorithms. |
229 | 232 |
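The redundancy claim can also be checked programmatically rather than by reading the crosstab: if every `"education"` level maps to exactly one `"education-num"` code, the number of unique codes per level is 1 everywhere. A sketch on a tiny synthetic excerpt (the real notebook would run the same `groupby` on `adult_census`):

```python
import pandas as pd

# Tiny synthetic excerpt mirroring the education / education-num pairing;
# the values here are illustrative, not the full set of categories.
df = pd.DataFrame({
    "education": ["1st-4th", "1st-4th", "HS-grad", "HS-grad", "Bachelors"],
    "education-num": [2, 2, 9, 9, 13],
})

# One unique code per education level means the two columns carry
# the same information, so one of them can be dropped.
codes_per_level = df.groupby("education")["education-num"].nunique()
print((codes_per_level == 1).all())
```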
|
230 | 233 | # %% [markdown] |
231 | 234 | # ```{note} |
|
299 | 302 | plt.axvline(x=age_limit, ymin=0, ymax=1, color="black", linestyle="--") |
300 | 303 |
|
301 | 304 | hours_per_week_limit = 40 |
302 | | -plt.axhline(y=hours_per_week_limit, xmin=0.18, xmax=1, color="black", linestyle="--") |
| 305 | +plt.axhline( |
| 306 | + y=hours_per_week_limit, xmin=0.18, xmax=1, color="black", linestyle="--" |
| 307 | +) |
303 | 308 |
|
304 | 309 | plt.annotate("<=50K", (17, 25), rotation=90, fontsize=35) |
305 | 310 | plt.annotate("<=50K", (35, 20), fontsize=35) |
|
322 | 327 | # will choose the "best" splits based on data without human intervention or |
323 | 328 | # inspection. Decision trees will be covered in more detail in a future module.
324 | 329 | # |
325 | | -# Note that machine learning is often used when creating rules by hand |
326 | | -# is not straightforward. For example because we are in high dimension (many |
327 | | -# features in a table) or because there are no simple and obvious rules that |
328 | | -# separate the two classes as in the top-right region of the previous plot. |
| 330 | +# Note that machine learning is often used when creating rules by hand is not
| 331 | +# straightforward, for example because the data is high-dimensional (many
| 332 | +# features in a table) or because there are no simple and obvious rules that
| 333 | +# separate the two classes, as in the top-right region of the previous plot.
329 | 334 | # |
330 | 335 | # To sum up, the important thing to remember is that in a machine-learning |
331 | 336 | # setting, a model automatically creates the "rules" from the existing data in |
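As an illustration, the kind of hand-crafted rule discussed above can be written as a tiny classifier. The thresholds below (`age > 27`, `hours-per-week > 40`) are hypothetical values eyeballed from a plot like the one in the notebook, not thresholds a learned model would necessarily pick:

```python
import pandas as pd

# Synthetic samples; the real notebook would apply the rule to adult_census.
samples = pd.DataFrame({
    "age": [20, 45, 60, 30],
    "hours-per-week": [20, 50, 45, 38],
})

def predict_by_rule(df, age_limit=27, hours_limit=40):
    # Hypothetical hand-crafted rule: predict high revenue only when
    # both thresholds are exceeded.
    high = (df["age"] > age_limit) & (df["hours-per-week"] > hours_limit)
    return high.map({True: " >50K", False: " <=50K"})

print(predict_by_rule(samples).tolist())
# → [' <=50K', ' >50K', ' >50K', ' <=50K']
```

A decision tree effectively searches for such thresholds automatically, which is why it serves as the contrast to hand-written rules here.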
|