fix(automl): fix data leakage in classification holdout validation #1418
base: main
Conversation
- Stop duplicating X_first in both train/val sets
- Implement dynamic class balancing
- Preserve original dataset size
@microsoft-github-policy-service agree
Thank you @commint-tian for the PR. Please check my comments below.
if len(first) < len(y_train_all) / 2:
    # Get X_rest and y_rest with drop, sparse matrix can't apply np.delete
    X_rest = (
        np.delete(X_train_all, first, axis=0)
        if isinstance(X_train_all, np.ndarray)
        else X_train_all.drop(first.tolist())
        if data_is_df
        else X_train_all[rest]
    )
    y_rest = (
        np.delete(y_train_all, first, axis=0)
        if isinstance(y_train_all, np.ndarray)
        else y_train_all.drop(first.tolist())
        if data_is_df
        else y_train_all[rest]
    )
else:
    X_rest = (
        iloc_pandas_on_spark(X_train_all, rest)
        if is_spark_dataframe
        else X_train_all.iloc[rest]
        if data_is_df
        else X_train_all[rest]
    )
    y_rest = (
        iloc_pandas_on_spark(y_train_all, rest)
        if is_spark_dataframe
        else y_train_all.iloc[rest]
        if data_is_df
        else y_train_all[rest]
    )
Why remove the second way of getting X_rest?
Oh, I made a mistake about it. The second part should be kept here.
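The branch above picks a row-removal strategy per container type (`np.delete` for arrays, label-based `drop` for DataFrames, positional indexing otherwise). A minimal runnable sketch of the same dispatch; `drop_rows` is an illustrative helper, not FLAML code:

```python
import numpy as np
import pandas as pd

def drop_rows(data, idx):
    # np.delete for ndarrays, label-based .drop for DataFrames;
    # sparse matrices would instead be indexed with the complementary
    # `rest` positions, since np.delete does not apply to them.
    if isinstance(data, np.ndarray):
        return np.delete(data, idx, axis=0)
    return data.drop(index=list(idx))

X_arr = np.arange(12).reshape(4, 3)
X_df = pd.DataFrame(X_arr)  # default integer index 0..3
kept_arr = drop_rows(X_arr, [0, 2])
kept_df = drop_rows(X_df, [0, 2])
```

Both calls remove the first and third rows, leaving rows 1 and 3 in each container.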
train_labels = np.unique(y_train)
val_labels = np.unique(y_val)
np.unique doesn't work for psSeries or psDataFrame. Check an example here.
It can be modified like this:

if isinstance(y_train, (ps.Series, ps.DataFrame)):
    train_labels = y_train.unique() if isinstance(y_train, ps.Series) else y_train.iloc[:, 0].unique()
else:
    train_labels = np.unique(y_train)
if isinstance(y_val, (ps.Series, ps.DataFrame)):
    val_labels = y_val.unique() if isinstance(y_val, ps.Series) else y_val.iloc[:, 0].unique()
else:
    val_labels = np.unique(y_val)
Try reusing existing functions.
This is a pretty big project, and I'm not very familiar with it yet. I'm not sure if there are already existing functions that I can use.
Check an example here. You can reuse the function len_labels.
Yes, I noticed this function after seeing your correction. I hadn't paid attention to it before, thank you!
This is the modified code:

train_labels = len_labels(y_train)
val_labels = len_labels(y_val)
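As a self-contained sketch of the type dispatch discussed above, here plain pandas objects stand in for pandas-on-spark (which requires a Spark session), and `unique_labels` is a hypothetical helper, not FLAML's API; in FLAML itself, the reviewer recommends reusing the existing `len_labels` utility instead:

```python
import numpy as np
import pandas as pd

def unique_labels(y):
    # Dispatch on container type, because np.unique does not work on
    # pandas-on-spark Series/DataFrames; pandas plays that role here.
    if isinstance(y, pd.DataFrame):
        return y.iloc[:, 0].unique()
    if isinstance(y, pd.Series):
        return y.unique()
    return np.unique(y)

train_labels = unique_labels(pd.Series([0, 1, 1, 2]))
val_labels = unique_labels(np.array([0, 2, 2]))
```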
mask = (y_rest == label)
X_train = concat(X_rest[mask], X_train)
y_train = concat(y_rest[mask], y_train)
X_rest[mask] may not work for dataframe.
Perhaps this could work:
mask = (y_rest == label)
filtered_X_rest = X_rest.filter(mask)
filtered_y_rest = y_rest.filter(mask)
X_train = X_train.union(filtered_X_rest)
y_train = y_train.union(filtered_y_rest)
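The pitfall the reviewer points at can be reproduced with plain pandas: indexing with a boolean Series aligns by index labels, so `X_rest[mask]` can fail when `X_rest` and `y_rest` carry different indexes after earlier slicing. A hedged sketch of a positional, index-safe selection (toy data, not FLAML code):

```python
import numpy as np
import pandas as pd

# After earlier row selection, X_rest and y_rest can end up with
# different indexes, so X_rest[mask] (label-aligned) would raise.
X_rest = pd.DataFrame({"f": [10, 20, 30, 40]}, index=[4, 5, 6, 7])
y_rest = pd.Series([0, 1, 0, 1])  # default index 0..3

label = 1
mask = (y_rest == label)

# Selecting with the raw boolean array is positional and index-safe:
X_sel = X_rest.loc[mask.to_numpy()]
y_sel = y_rest[mask]
```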
if missing_in_val:
    X_val = concat(X_first, X_val)
    y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
# The training set supplements the missing categories with the remaining data.
if missing_in_train:
What if missing_in_val only has 1 value missed?
Indeed, the code can be optimized.

if missing_in_val:
    if len(label_set) == 1:
        X_val = concat(X_first.iloc[[0]], X_val)
        y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
    else:
        X_val = concat(X_first, X_val)
        y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
a better way is to fill the missing labels in both train and val with first. For those not missing in either train or val, put them in train.
Indeed, I've learned from this.
if missing_in_val:
    if data_is_df:
        X_val = concat([X_first, X_val])
        y_val = concat([label_set, y_val])
    else:
        X_val = np.concatenate([X_first, X_val], axis=0)
        y_val = np.concatenate([label_set, y_val])
if missing_in_train:
    if data_is_df:
        X_train = concat([X_first, X_train])
        y_train = concat([label_set, y_train])
    else:
        X_train = np.concatenate([X_first, X_train], axis=0)
        y_train = np.concatenate([label_set, y_train])
common_labels = set(train_labels) & set(val_labels)
only_train_labels = set(train_labels) - set(val_labels)
only_val_labels = set(val_labels) - set(train_labels)
# Move labels that are only in val into train
for label in only_val_labels:
    # Remove these labels from val
    mask = y_val != label
    X_val = X_val[mask]
    y_val = y_val[mask]
    # Add these labels to train
    if data_is_df:
        X_train = concat([X_train, X_first.loc[X_first[label_col] == label]])
        y_train = concat([y_train, label_set.loc[label_set[label_col] == label]])
    else:
        X_train = np.concatenate([X_train, X_first[y_first == label]], axis=0)
        y_train = np.concatenate([y_train, label_set[y_first == label]])
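The core of the strategy above can be sketched in a few lines of NumPy: any label present only in the validation set is moved into the training set. All names here are illustrative toy data, not FLAML internals:

```python
import numpy as np

# Toy split where label 2 exists only in val.
X_val = np.array([[1.0], [2.0], [3.0]])
y_val = np.array([0, 1, 2])
X_train = np.array([[4.0], [5.0]])
y_train = np.array([0, 1])

only_val_labels = set(np.unique(y_val)) - set(np.unique(y_train))
for label in only_val_labels:
    move = y_val == label
    # Append the val-only rows to train, then drop them from val.
    X_train = np.concatenate([X_train, X_val[move]], axis=0)
    y_train = np.concatenate([y_train, y_val[move]])
    X_val, y_val = X_val[~move], y_val[~move]
```

After the loop, every label seen in val is also seen in train, and no row exists in both splits.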
Hi @commint-tian, did you get a chance to revise the PR? Thanks.
I've finished revising the code, please check it again.
Hi @commint-tian, have you pushed your changes to GitHub? I don't see them yet.
Why are these changes needed?
In generic_task.py, the unconditional merging of X_first into both the training and validation sets leads to data duplication, causing the total row count len(X_train) + len(X_val) to exceed the original dataset size. More critically, this introduces a data leakage issue: since part of the validation data (X_first) also appears in the training set, the evaluation results may become unreliable. To resolve this, we can dynamically check for missing classes using missing_in_train and missing_in_val to identify gaps in the training/validation sets, then selectively supplement data: missing validation classes are filled from X_first (ensuring validation set class coverage takes priority), and missing training classes are filled from X_rest (avoiding reuse of X_first).
).Related issue number
issue #1390