
Conversation

@commint-tian commint-tian commented Apr 8, 2025

  • Stop duplicating X_first in both train/val sets
  • Implement dynamic class balancing
  • Preserve original dataset size

Why are these changes needed?

In generic_task.py, the unconditional merging of X_first into both the training and validation sets leads to data duplication, causing the total row count len(X_train) + len(X_val) to exceed the original dataset size. More critically, this introduces a data leakage issue: since part of the validation data (X_first) also appears in the training set, the evaluation results may become unreliable.

To resolve this, we can dynamically check for missing classes using missing_in_train and missing_in_val to identify gaps in the training/validation sets, then selectively supplement data:

  • Validation set only: Add X_first (ensuring validation set class coverage takes priority).
  • Training set: Fill missing classes by sampling from the remaining data X_rest (avoiding reuse of X_first).
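
The supplement logic above can be sketched end-to-end (a minimal illustration with plain NumPy arrays; the function name and the convention that X_first/y_first hold one representative row per class are assumptions for this sketch, not the actual generic_task.py code):

```python
import numpy as np

def split_without_leakage(X_first, y_first, X_rest, y_rest, X_train, y_train, X_val, y_val):
    """Supplement train/val with missing classes without duplicating X_first.

    Hypothetical helper: X_first/y_first hold one representative row per
    class, X_rest/y_rest the remaining data.
    """
    label_set = np.unique(y_first)
    missing_in_train = set(label_set.tolist()) - set(np.unique(y_train).tolist())
    missing_in_val = set(label_set.tolist()) - set(np.unique(y_val).tolist())

    if missing_in_val:
        # Validation-set class coverage takes priority: it receives X_first.
        X_val = np.concatenate([X_first, X_val], axis=0)
        y_val = np.concatenate([y_first, y_val])
    for label in missing_in_train:
        # Fill train from X_rest instead of reusing X_first (no leakage).
        mask = y_rest == label
        X_train = np.concatenate([X_rest[mask][:1], X_train], axis=0)
        y_train = np.concatenate([y_rest[mask][:1], y_train])
    return X_train, y_train, X_val, y_val
```

Because train rows come only from X_rest, no row of X_first can appear in both splits.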

Related issue number

issue #1390

@commint-tian (Author)

@microsoft-github-policy-service agree

@thinkall (Collaborator) left a comment

Thank you @commint-tian for the PR. Please check below my comments.

Comment on lines -517 to -547
if len(first) < len(y_train_all) / 2:
    # Get X_rest and y_rest with drop, sparse matrix can't apply np.delete
    X_rest = (
        np.delete(X_train_all, first, axis=0)
        if isinstance(X_train_all, np.ndarray)
        else X_train_all.drop(first.tolist())
        if data_is_df
        else X_train_all[rest]
    )
    y_rest = (
        np.delete(y_train_all, first, axis=0)
        if isinstance(y_train_all, np.ndarray)
        else y_train_all.drop(first.tolist())
        if data_is_df
        else y_train_all[rest]
    )
else:
    X_rest = (
        iloc_pandas_on_spark(X_train_all, rest)
        if is_spark_dataframe
        else X_train_all.iloc[rest]
        if data_is_df
        else X_train_all[rest]
    )
    y_rest = (
        iloc_pandas_on_spark(y_train_all, rest)
        if is_spark_dataframe
        else y_train_all.iloc[rest]
        if data_is_df
        else y_train_all[rest]
    )
Collaborator:

Why remove the second way of getting X_rest?

Author:

Oh, I made a mistake. The second part should be kept here.

Comment on lines +536 to +537
train_labels = np.unique(y_train)
val_labels = np.unique(y_val)
Collaborator:

np.unique doesn't work for psSeries or psDataFrame. Check an example here.

Author:

It can be modified like this:

if isinstance(y_train, (ps.Series, ps.DataFrame)):
    train_labels = y_train.unique() if isinstance(y_train, ps.Series) else y_train.iloc[:, 0].unique()
else:
    train_labels = np.unique(y_train)

if isinstance(y_val, (ps.Series, ps.DataFrame)):
    val_labels = y_val.unique() if isinstance(y_val, ps.Series) else y_val.iloc[:, 0].unique()
else:
    val_labels = np.unique(y_val)

Collaborator:

Try reusing existing functions.

Author:

This is a pretty big project and I'm not very familiar with it yet, so I'm not sure whether there are existing functions I can reuse.

@thinkall (Collaborator) commented Apr 9, 2025

Check an example here.

You can reuse the function len_labels.
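
For readers following along, a len_labels-style helper might look roughly like this (a sketch only; the exact signature and the pandas-on-spark handling in FLAML may differ, and a real version would special-case psSeries/psDataFrame, where np.unique does not apply):

```python
import numpy as np

def len_labels(y, return_labels=False):
    """Return the number of unique labels in y, optionally the labels too.

    Sketch of a len_labels-style helper; FLAML's actual implementation
    additionally handles pandas-on-spark inputs.
    """
    labels = np.unique(y)
    if return_labels:
        return len(labels), labels
    return len(labels)
```

Usage: `n_classes, labels = len_labels(y_train, return_labels=True)`.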

Author:

Yes, I noticed this function after seeing your correction; I hadn't paid attention to it before. Thank you!

Author:

This is the modified code:
train_labels = len_labels(y_train)
val_labels = len_labels(y_val)

Comment on lines +548 to +550
mask = (y_rest == label)
X_train = concat(X_rest[mask], X_train)
y_train = concat(y_rest[mask], y_train)
Collaborator:

X_rest[mask] may not work for dataframe.
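
A toy illustration of the pitfall (plain pandas, not FLAML code): a boolean-Series indexer is aligned on its index, so a mask whose index does not match X_rest errors out instead of selecting positionally, while a plain NumPy mask works for ndarrays and DataFrames alike:

```python
import pandas as pd

# X_rest keeps its original row labels after a drop/subset; a mask built
# elsewhere can easily carry a different index.
X_rest = pd.DataFrame({"f": [1.0, 2.0, 3.0]}, index=[10, 11, 12])
mask = pd.Series([True, False, True], index=[0, 1, 2])  # misaligned index

try:
    selected = X_rest[mask]  # raises: unalignable boolean Series indexer
except Exception:
    selected = X_rest[mask.to_numpy()]  # positional mask works regardless
```

Pandas-on-spark has further restrictions on boolean indexing, which is why the reviewer flags the plain `X_rest[mask]` form.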

Author:

Perhaps this could work:
mask = (y_rest == label)
filtered_X_rest = X_rest.filter(mask)
filtered_y_rest = y_rest.filter(mask)

X_train = X_train.union(filtered_X_rest)
y_train = y_train.union(filtered_y_rest)

Comment on lines +542 to +546
if missing_in_val:
    X_val = concat(X_first, X_val)
    y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
# The training set supplements the missing categories with the remaining data.
if missing_in_train:
Collaborator:

What if missing_in_val has only one missing label?

Author:

Indeed, the code can be optimized.

if missing_in_val:
    if len(label_set) == 1:
        X_val = concat(X_first.iloc[[0]], X_val)
        y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
    else:
        X_val = concat(X_first, X_val)
        y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val])
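
A tighter variant for the NumPy branch would add only the X_first rows whose label is actually missing from the validation set, instead of all of X_first (a sketch; it assumes y_first is the per-row label array aligned with X_first, which is not shown in the diff):

```python
import numpy as np

def fill_val_missing(X_first, y_first, X_val, y_val):
    """Add to val only the representative rows for labels val lacks.

    Hypothetical helper; y_first is assumed row-aligned with X_first.
    """
    missing = np.setdiff1d(np.unique(y_first), np.unique(y_val))
    pick = np.isin(y_first, missing)  # rows of X_first with a missing label
    if pick.any():
        X_val = np.concatenate([X_first[pick], X_val], axis=0)
        y_val = np.concatenate([y_first[pick], y_val])
    return X_val, y_val
```

This answers the "only one value missed" case naturally: exactly one row is added per missing label, no more.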

Collaborator:

A better way is to fill the missing labels in both train and val with first. For the labels that are not missing in either train or val, put them in train.

Author:

Indeed, I've learned from this.

@commint-tian (Author) commented Apr 14, 2025

if missing_in_val:
    if data_is_df:
        X_val = concat([X_first, X_val])
        y_val = concat([label_set, y_val])
    else:
        X_val = np.concatenate([X_first, X_val], axis=0)
        y_val = np.concatenate([label_set, y_val])

if missing_in_train:
    if data_is_df:
        X_train = concat([X_first, X_train])
        y_train = concat([label_set, y_train])
    else:
        X_train = np.concatenate([X_first, X_train], axis=0)
        y_train = np.concatenate([label_set, y_train])

common_labels = set(train_labels) & set(val_labels)
only_train_labels = set(train_labels) - set(val_labels)
only_val_labels = set(val_labels) - set(train_labels)

# Move labels that appear only in val into train
for label in only_val_labels:
    # Remove these labels from val
    mask = y_val != label
    X_val = X_val[mask]
    y_val = y_val[mask]
    # Add these labels to train
    if data_is_df:
        X_train = concat([X_train, X_first.loc[X_first[label_col] == label]])
        y_train = concat([y_train, label_set.loc[label_set[label_col] == label]])
    else:
        X_train = np.concatenate([X_train, X_first[y_first == label]], axis=0)
        y_train = np.concatenate([y_train, label_set[y_first == label]])
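
The only_val_labels branch above can be exercised on toy data (NumPy case only; X_first/y_first hold one representative row per class, as assumed throughout):

```python
import numpy as np

# One representative row per class
X_first = np.array([[0.0], [1.0], [2.0]])
y_first = np.array([0, 1, 2])
label_set = y_first

# Label 2 appears only in val
X_train, y_train = np.array([[5.0], [6.0]]), np.array([0, 1])
X_val, y_val = np.array([[7.0], [8.0]]), np.array([1, 2])

only_val_labels = set(np.unique(y_val)) - set(np.unique(y_train))
for label in sorted(only_val_labels):
    # Remove the label's rows from val, then add its X_first row to train
    mask = y_val != label
    X_val, y_val = X_val[mask], y_val[mask]
    X_train = np.concatenate([X_train, X_first[y_first == label]], axis=0)
    y_train = np.concatenate([y_train, label_set[y_first == label]])
```

After the loop, every label present anywhere is present in train, matching the reviewer's suggested strategy.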

@thinkall (Collaborator)

Hi @commint-tian , do you get a chance to revise the PR? Thanks.

@commint-tian (Author)

> Hi @commint-tian , do you get a chance to revise the PR? Thanks.

I've finished revising the code, please check it again.

@thinkall (Collaborator)

> Hi @commint-tian , do you get a chance to revise the PR? Thanks.
>
> I've finished revising the code, please check it again.

Hi @commint-tian , have you pushed your changes to github? I don't see it yet.
