fix(automl): fix data leakage in classification holdout validation #1418
base: main
Conversation
- Stop duplicating X_first in both train/val sets
- Implement dynamic class balancing
- Preserve original dataset size
@microsoft-github-policy-service agree
Thank you @commint-tian for the PR. Please check my comments below.
if len(first) < len(y_train_all) / 2:
    # Get X_rest and y_rest with drop, sparse matrix can't apply np.delete
    X_rest = (
        np.delete(X_train_all, first, axis=0)
        if isinstance(X_train_all, np.ndarray)
        else X_train_all.drop(first.tolist())
        if data_is_df
        else X_train_all[rest]
    )
    y_rest = (
        np.delete(y_train_all, first, axis=0)
        if isinstance(y_train_all, np.ndarray)
        else y_train_all.drop(first.tolist())
        if data_is_df
        else y_train_all[rest]
    )
else:
    X_rest = (
        iloc_pandas_on_spark(X_train_all, rest)
        if is_spark_dataframe
        else X_train_all.iloc[rest]
        if data_is_df
        else X_train_all[rest]
    )
    y_rest = (
        iloc_pandas_on_spark(y_train_all, rest)
        if is_spark_dataframe
        else y_train_all.iloc[rest]
        if data_is_df
        else y_train_all[rest]
    )
Why remove the second way of getting X_rest?
Oh, I made a mistake about it. The second part should be kept here.
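The branch above picks a row-removal strategy per container type (`np.delete` for arrays, label-based `drop` for DataFrames, positional indexing otherwise). A minimal runnable sketch of the same dispatch; `drop_rows` is an illustrative helper, not FLAML code:

```python
import numpy as np
import pandas as pd

def drop_rows(data, idx):
    # np.delete for ndarrays, label-based .drop for DataFrames;
    # sparse matrices would instead be indexed with the complementary
    # `rest` positions, since np.delete does not apply to them.
    if isinstance(data, np.ndarray):
        return np.delete(data, idx, axis=0)
    return data.drop(index=list(idx))

X_arr = np.arange(12).reshape(4, 3)
X_df = pd.DataFrame(X_arr)  # default integer index 0..3
kept_arr = drop_rows(X_arr, [0, 2])
kept_df = drop_rows(X_df, [0, 2])
```

Both calls remove the first and third rows, leaving rows 1 and 3 in each container.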
train_labels = np.unique(y_train)
val_labels = np.unique(y_val)
np.unique doesn't work for psSeries or psDataFrame. Check an example here.
It can be modified like this:

if isinstance(y_train, (ps.Series, ps.DataFrame)):
    train_labels = y_train.unique() if isinstance(y_train, ps.Series) else y_train.iloc[:, 0].unique()
else:
    train_labels = np.unique(y_train)
if isinstance(y_val, (ps.Series, ps.DataFrame)):
    val_labels = y_val.unique() if isinstance(y_val, ps.Series) else y_val.iloc[:, 0].unique()
else:
    val_labels = np.unique(y_val)
Try reusing existing functions.
This is a pretty big project, and I'm not very familiar with it yet. I'm not sure if there are already existing functions that I can use.
Check an example here. You can reuse the function len_labels.
Yes, I noticed this function after seeing your correction. I hadn't paid attention to it before, thank you!
This is the modified code:

train_labels = len_labels(y_train)
val_labels = len_labels(y_val)
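As a self-contained sketch of the type dispatch discussed above, here plain pandas objects stand in for pandas-on-spark (which requires a Spark session), and `unique_labels` is a hypothetical helper, not FLAML's API; in FLAML itself, the reviewer recommends reusing the existing `len_labels` utility instead:

```python
import numpy as np
import pandas as pd

def unique_labels(y):
    # Dispatch on container type, because np.unique does not work on
    # pandas-on-spark Series/DataFrames; pandas plays that role here.
    if isinstance(y, pd.DataFrame):
        return y.iloc[:, 0].unique()
    if isinstance(y, pd.Series):
        return y.unique()
    return np.unique(y)

train_labels = unique_labels(pd.Series([0, 1, 1, 2]))
val_labels = unique_labels(np.array([0, 2, 2]))
```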
mask = (y_rest == label)
X_train = concat(X_rest[mask], X_train)
y_train = concat(y_rest[mask], y_train)
X_rest[mask] may not work for dataframe.
Perhaps this could work:
mask = (y_rest == label)
filtered_X_rest = X_rest.filter(mask)
filtered_y_rest = y_rest.filter(mask)
X_train = X_train.union(filtered_X_rest)
y_train = y_train.union(filtered_y_rest)
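The pitfall the reviewer points at can be reproduced with plain pandas: indexing with a boolean Series aligns by index labels, so `X_rest[mask]` can fail when `X_rest` and `y_rest` carry different indexes after earlier slicing. A hedged sketch of a positional, index-safe selection (toy data, not FLAML code):

```python
import numpy as np
import pandas as pd

# After earlier row selection, X_rest and y_rest can end up with
# different indexes, so X_rest[mask] (label-aligned) would raise.
X_rest = pd.DataFrame({"f": [10, 20, 30, 40]}, index=[4, 5, 6, 7])
y_rest = pd.Series([0, 1, 0, 1])  # default index 0..3

label = 1
mask = (y_rest == label)

# Selecting with the raw boolean array is positional and index-safe:
X_sel = X_rest.loc[mask.to_numpy()]
y_sel = y_rest[mask]
```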
if missing_in_val:
    X_val = concat(X_first, X_val)
    y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
# The training set supplements the missing categories with the remaining data.
if missing_in_train:
What if missing_in_val only has 1 value missed?
Indeed, the code can be optimized.

if missing_in_val:
    if len(label_set) == 1:
        X_val = concat(X_first.iloc[[0]], X_val)
        y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
    else:
        X_val = concat(X_first, X_val)
        y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
a better way is to fill the missing labels in both train and val with first. For those not missing in either train or val, put them in train.
Indeed, I've learned from this.
if missing_in_val:
    if data_is_df:
        X_val = concat([X_first, X_val])
        y_val = concat([label_set, y_val])
    else:
        X_val = np.concatenate([X_first, X_val], axis=0)
        y_val = np.concatenate([label_set, y_val])
if missing_in_train:
    if data_is_df:
        X_train = concat([X_first, X_train])
        y_train = concat([label_set, y_train])
    else:
        X_train = np.concatenate([X_first, X_train], axis=0)
        y_train = np.concatenate([label_set, y_train])
common_labels = set(train_labels) & set(val_labels)
only_train_labels = set(train_labels) - set(val_labels)
only_val_labels = set(val_labels) - set(train_labels)
# Move labels that are only in val into train
for label in only_val_labels:
    # Remove these labels from val
    mask = y_val != label
    X_val = X_val[mask]
    y_val = y_val[mask]
    # Add these labels to train
    if data_is_df:
        X_train = concat([X_train, X_first.loc[X_first[label_col] == label]])
        y_train = concat([y_train, label_set.loc[label_set[label_col] == label]])
    else:
        X_train = np.concatenate([X_train, X_first[y_first == label]], axis=0)
        y_train = np.concatenate([y_train, label_set[y_first == label]])
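The core of the strategy above can be sketched in a few lines of NumPy: any label present only in the validation set is moved into the training set. All names here are illustrative toy data, not FLAML internals:

```python
import numpy as np

# Toy split where label 2 exists only in val.
X_val = np.array([[1.0], [2.0], [3.0]])
y_val = np.array([0, 1, 2])
X_train = np.array([[4.0], [5.0]])
y_train = np.array([0, 1])

only_val_labels = set(np.unique(y_val)) - set(np.unique(y_train))
for label in only_val_labels:
    move = y_val == label
    # Append the val-only rows to train, then drop them from val.
    X_train = np.concatenate([X_train, X_val[move]], axis=0)
    y_train = np.concatenate([y_train, y_val[move]])
    X_val, y_val = X_val[~move], y_val[~move]
```

After the loop, every label seen in val is also seen in train, and no row exists in both splits.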
Hi @commint-tian, did you get a chance to revise the PR? Thanks.
I've finished revising the code, please check it again.
Hi @commint-tian, have you pushed your changes to GitHub? I don't see them yet.
Why are these changes needed?
In generic_task.py, the unconditional merging of X_first into both the training and validation sets leads to data duplication, causing the total row count len(X_train) + len(X_val) to exceed the original dataset size. More critically, this introduces a data leakage issue: since part of the validation data (X_first) also appears in the training set, the evaluation results may become unreliable. To resolve this, we can dynamically check for missing classes using missing_in_train and missing_in_val to identify gaps in the training/validation sets, then selectively supplement data: missing validation classes are filled from X_first (ensuring validation set class coverage takes priority), and missing training classes are filled from X_rest (avoiding reuse of X_first).
).Related issue number
issue #1390