|
157 | 157 | "source": [
|
158 | 158 | "Using the hypothesis tests implemented in `tsfresh` (see [here](https://tsfresh.readthedocs.io/en/latest/text/feature_filtering.html) for more information) it is now possible to select only the relevant features out of this large dataset.\n",
|
159 | 159 | "\n",
|
160 |
| - "`tsfresh` will do a hypothesis test for each of the features to check, if it is relevant for your given target." |
| 160 | + "`tsfresh` runs a hypothesis test on each feature to check whether it is relevant for the given target.\n", |
| 161 | + "\n", |
| 162 | + "To avoid leaking information between the train and the test set, we perform the selection on the train set only." |
| 163 | + ] |
| 164 | + }, |
| 165 | + { |
| 166 | + "cell_type": "code", |
| 167 | + "execution_count": null, |
| 168 | + "metadata": {}, |
| 169 | + "outputs": [], |
| 170 | + "source": [ |
| 171 | + "X_full_train, X_full_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)" |
161 | 172 | ]
|
162 | 173 | },
|
163 | 174 | {
|
|
166 | 177 | "metadata": {},
|
167 | 178 | "outputs": [],
|
168 | 179 | "source": [
|
169 |
| - "X_filtered = select_features(X, y)" |
| 180 | + "X_filtered_train = select_features(X_full_train, y_train)" |
170 | 181 | ]
|
171 | 182 | },
|
172 | 183 | {
|
|
177 | 188 | },
|
178 | 189 | "outputs": [],
|
179 | 190 | "source": [
|
180 |
| - "X_filtered.head()" |
| 191 | + "X_filtered_train.head()" |
181 | 192 | ]
|
182 | 193 | },
|
183 | 194 | {
|
|
186 | 197 | "source": [
|
187 | 198 | "<div class=\"alert alert-info\">\n",
|
188 | 199 | "\n",
|
189 |
| - "Currently, 669 non-NaN features survive the feature selection given this target.\n", |
| 200 | + "Currently, 423 non-NaN features survive the feature selection given this target.\n", |
190 | 201 | "Again, this number will vary depending on your data, your target and the `tsfresh` version.\n",
|
191 | 202 | " \n",
|
192 | 203 | "</div>"
|
|
214 | 225 | "metadata": {},
|
215 | 226 | "outputs": [],
|
216 | 227 | "source": [
|
217 |
| - "X_full_train, X_full_test, y_train, y_test = train_test_split(X, y, test_size=.4)\n", |
218 |
| - "X_filtered_train, X_filtered_test = X_full_train[X_filtered.columns], X_full_test[X_filtered.columns]" |
| 228 | + "X_filtered_train, X_filtered_test = X_full_train[X_filtered_train.columns], X_full_test[X_filtered_train.columns]" |
219 | 229 | ]
|
220 | 230 | },
|
221 | 231 | {
|
|
246 | 256 | "cell_type": "markdown",
|
247 | 257 | "metadata": {},
|
248 | 258 | "source": [
|
249 |
| - "Compared to using all features (`classifier_full`), using only the relevant features (`classifier_filtered`) achieves better classification performance with less data." |
| 259 | + "Compared to using all features (`classifier_full`), using only the relevant features (`classifier_filtered`) achieves similar or better classification performance with far fewer features." |
250 | 260 | ]
|
251 | 261 | },
|
252 | 262 | {
|
|
284 | 294 | "metadata": {},
|
285 | 295 | "outputs": [],
|
286 | 296 | "source": [
|
287 |
| - "X_filtered_2 = extract_relevant_features(df, y, column_id='id', column_sort='time',\n", |
288 |
| - " default_fc_parameters=extraction_settings)" |
289 |
| - ] |
290 |
| - }, |
291 |
| - { |
292 |
| - "cell_type": "code", |
293 |
| - "execution_count": null, |
294 |
| - "metadata": {}, |
295 |
| - "outputs": [], |
296 |
| - "source": [ |
297 |
| - "(X_filtered.columns == X_filtered_2.columns).all()" |
| 297 | + "extract_relevant_features(df, y, column_id='id', column_sort='time',\n", |
| 298 | + " default_fc_parameters=extraction_settings)" |
298 | 299 | ]
|
299 | 300 | }
|
300 | 301 | ],
|
|
314 | 315 | "name": "python",
|
315 | 316 | "nbconvert_exporter": "python",
|
316 | 317 | "pygments_lexer": "ipython3",
|
317 |
| - "version": "3.8.2" |
| 318 | + "version": "3.10.13" |
318 | 319 | }
|
319 | 320 | },
|
320 | 321 | "nbformat": 4,
|
|