Commit 821895c

committed
Do not leak information
1 parent 659f63d commit 821895c

1 file changed

+20
-19
lines changed

notebooks/01 Feature Extraction and Selection.ipynb

+20-19
@@ -157,7 +157,18 @@
 "source": [
 "Using the hypothesis tests implemented in `tsfresh` (see [here](https://tsfresh.readthedocs.io/en/latest/text/feature_filtering.html) for more information) it is now possible to select only the relevant features out of this large dataset.\n",
 "\n",
-"`tsfresh` will do a hypothesis test for each of the features to check, if it is relevant for your given target."
+"`tsfresh` will do a hypothesis test for each of the features to check, if it is relevant for your given target.\n",
+"\n",
+"To not leak information between the train and the test set, we will only perform the selection on the train set"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"X_full_train, X_full_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)"
 ]
 },
 {
@@ -166,7 +177,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"X_filtered = select_features(X, y)"
+"X_filtered_train = select_features(X_full_train, y_train)"
 ]
 },
 {
@@ -177,7 +188,7 @@
 },
 "outputs": [],
 "source": [
-"X_filtered.head()"
+"X_filtered_train.head()"
 ]
 },
 {
@@ -186,7 +197,7 @@
 "source": [
 "<div class=\"alert alert-info\">\n",
 "\n",
-"Currently, 669 non-NaN features survive the feature selection given this target.\n",
+"Currently, 423 non-NaN features survive the feature selection given this target.\n",
 "Again, this number will vary depending on your data, your target and the `tsfresh` version.\n",
 " \n",
 "</div>"
@@ -214,8 +225,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"X_full_train, X_full_test, y_train, y_test = train_test_split(X, y, test_size=.4)\n",
-"X_filtered_train, X_filtered_test = X_full_train[X_filtered.columns], X_full_test[X_filtered.columns]"
+"X_filtered_train, X_filtered_test = X_full_train[X_filtered_train.columns], X_full_test[X_filtered_train.columns]"
 ]
 },
 {
@@ -246,7 +256,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Compared to using all features (`classifier_full`), using only the relevant features (`classifier_filtered`) achieves better classification performance with less data."
+"Compared to using all features (`classifier_full`), using only the relevant features (`classifier_filtered`) achieves similar or better classification performance with much less data."
 ]
 },
 {
@@ -284,17 +294,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"X_filtered_2 = extract_relevant_features(df, y, column_id='id', column_sort='time',\n",
-" default_fc_parameters=extraction_settings)"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"(X_filtered.columns == X_filtered_2.columns).all()"
+"extract_relevant_features(df, y, column_id='id', column_sort='time',\n",
+" default_fc_parameters=extraction_settings)"
 ]
 }
 ],
@@ -314,7 +315,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.2"
+"version": "3.10.13"
 }
 },
 "nbformat": 4,
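
The pattern this commit moves the notebook to (split first, select features on the train rows only, then reuse the selected column names for the test rows) can be illustrated with a small standard-library sketch. This is not the notebook's code and does not call `tsfresh`: `select_columns` is a hypothetical stand-in that uses a plain Pearson-correlation threshold instead of `tsfresh`'s hypothesis tests, and the `signal`/`noise` columns, the threshold, and the toy data are all assumptions made for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_columns(rows, target, threshold=0.5):
    """Hypothetical stand-in for a relevance test: keep columns with |r| >= threshold."""
    return [
        name
        for name in rows[0]
        if abs(pearson([row[name] for row in rows], target)) >= threshold
    ]


# Toy data: one column that tracks the label and one that is pure noise.
rng = random.Random(42)
y = [i % 2 for i in range(200)]
X = [{"signal": label + rng.gauss(0, 0.3), "noise": rng.gauss(0, 1)} for label in y]

# Split first (60% train, 40% test), before any feature selection happens.
indices = list(range(len(X)))
rng.shuffle(indices)
cut = len(indices) * 6 // 10
train_idx, test_idx = indices[:cut], indices[cut:]
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

# Selection sees only the train rows, so the test labels cannot influence it.
selected = select_columns(X_train, y_train)

# The same column list is then applied to both splits.
X_filtered_train = [{name: row[name] for name in selected} for row in X_train]
X_filtered_test = [{name: row[name] for name in selected} for row in X_test]
```

The design point is the ordering: because `selected` is computed from `X_train` and `y_train` alone, any score later measured on `X_filtered_test` is not influenced by a selection step that had already seen the test labels.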
