Poor results on new unseen test data using UMAP #1165

muriloasouza commented Nov 27, 2024

I have a time-series multiclass classification problem with 12 classes (balanced). I have been trying to use the UMAP low-dimensional embedding as an input to my model and got fairly good results on the validation set. But the model can't generalize well to new unseen data (the test set).

My model is overfitting, and I am using early stopping to mitigate that.

Here are the accuracy and loss curves during training (in green) and validation (in red):

[Figure: accuracy curves]

[Figure: loss curves]

You can see that the model reaches almost 0.9 accuracy before overfitting, and it is saved at that point to be used later for predictions. The problem is, on the test set I only get 0.67 accuracy, far lower than I was expecting.

This is the code I am using; consider my dataframe df with shape (41364, 53):

# Import libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.layers import Concatenate
from keras.layers import Dense
from keras.layers import Input
from keras.layers import LSTM
from keras.models import Model

# Load data and split inputs from outputs

df = pd.read_csv('file.csv')
x = df.iloc[:, :-1].values
y = df['FLAG'].values

# Train/test split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Split LSTM and MLP inputs

x_train_lstm = x_train[:, 0:48]
x_train_static = x_train[:, 48:]
x_test_lstm = x_test[:, 0:48]
x_test_static = x_test[:, 48:]

# UMAP (supervised: fit_transform is given the training labels)

from umap import UMAP
umap_emb = UMAP(n_neighbors=15, n_components=8, random_state=42)
x_train_umap = umap_emb.fit_transform(x_train_lstm, y_train)

# Persist the fitted embedder for later use on new data
from joblib import dump
dump(umap_emb, 'UMAP_emb.sav')

# Reshape inputs to be used in Keras Model

x_train_lstm = x_train_lstm.reshape(x_train_lstm.shape[0], x_train_lstm.shape[1], 1)
x_train_static = np.concatenate((x_train_static, x_train_umap), axis=1)

# Keras Model

lstm_input = Input(shape=(x_train_lstm.shape[1], 1), name='LSTM_Input_Layer')
static_input = Input(shape=(x_train_static.shape[1], ), name='Static_Input_Layer')
lstm_layer_1 = LSTM(units=128, activation='tanh', return_sequences=False, name='1_LSTM_Layer')(lstm_input)
static_layer_1 = Dense(units=64, activation='relu', name='1_Static_Layer')(static_input)
static_layer_2 = Dense(units=128, activation='relu', name='2_Static_Layer')(static_layer_1)
concatenar = Concatenate(axis=1, name='Concatenate')([lstm_layer_1, static_layer_2])
dense_1 = Dense(units=4*len(np.unique(y_train)), activation='relu', name='1_Dense_Layer')(concatenar)
dense_2 = Dense(units=2*len(np.unique(y_train)), activation='relu', name='2_Dense_Layer')(dense_1)
saida = Dense(units=len(np.unique(y_train)), activation='softmax', name='Output_Layer')(dense_2)
model = Model(inputs=[lstm_input, static_input], outputs=[saida])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

# Checkpoint and Earlystop

cp = ModelCheckpoint(
    filepath='model.hdf5',
    monitor='val_loss',
    verbose=1,
    save_best_only=True,
    mode='min')
earlystop = EarlyStopping(monitor='val_loss',
                          min_delta=1e-4,
                          patience=50,
                          verbose=1,
                          mode='min',
                          restore_best_weights=True)

# Fit on train data

val_size = 0.2
history = model.fit([x_train_lstm, x_train_static],
                    y_train,
                    batch_size=512,
                    epochs=300,
                    callbacks=[cp, earlystop],
                    validation_split=val_size)

# Test set (poor results!!)

x_test_umap = umap_emb.transform(x_test_lstm)
x_test_lstm = x_test_lstm.reshape(x_test_lstm.shape[0], x_test_lstm.shape[1], 1)  
x_test_static = np.concatenate((x_test_static, x_test_umap), axis=1)
y_pred = model.predict([x_test_lstm, x_test_static])
print(classification_report(y_test, np.argmax(y_pred, axis=1)))

Here is the classification report on test set:

              precision    recall  f1-score   support

           0       0.23      0.23      0.23       724
           1       0.35      0.32      0.34       716
           2       0.98      0.90      0.94       692
           3       0.60      0.54      0.57       665
           4       0.33      0.32      0.33       682
           5       0.86      0.97      0.91       695
           6       0.53      0.80      0.64       703
           7       0.98      0.98      0.98       696
           8       0.75      0.84      0.79       659
           9       0.97      1.00      0.98       674
          10       0.88      0.64      0.74       686
          11       0.72      0.56      0.63       681

    accuracy                           0.67      8273
   macro avg       0.68      0.68      0.67      8273
weighted avg       0.68      0.67      0.67      8273

Is this the correct procedure for applying UMAP to new unseen data, or am I doing something wrong? Why is my validation score so much higher than the test score? Any ideas what I should be looking at to correct this?

One last thing to note: if I don't use the UMAP embedding as part of x_train_static, the validation and test results match (lower, but the same, as I would expect under "normal" conditions), like this:

[Figure: accuracy curves without the UMAP features]

And here is the classification report on the test set without the UMAP embedding as part of my inputs:

              precision    recall  f1-score   support

           0       0.45      0.49      0.47       724
           1       0.57      0.50      0.53       716
           2       0.99      0.99      0.99       692
           3       0.70      0.85      0.76       665
           4       0.57      0.47      0.51       682
           5       0.94      0.97      0.95       695
           6       0.89      0.85      0.87       703
           7       0.97      0.99      0.98       696
           8       0.89      0.87      0.88       659
           9       1.00      1.00      1.00       674
          10       0.88      0.86      0.87       686
          11       0.79      0.81      0.80       681

    accuracy                           0.80      8273
   macro avg       0.80      0.80      0.80      8273
weighted avg       0.80      0.80      0.80      8273

Both sets got around 0.8 accuracy.

zsxkib commented Dec 18, 2024

This makes sense: plain UMAP's transform() for new, unseen points is only an approximation, and because your fit is supervised (it is given y_train), the training embedding encodes label information that transform() cannot reproduce for unseen points. You want to use Parametric UMAP, which learns an explicit mapping that can be applied consistently to new data; see https://umap-learn.readthedocs.io/en/latest/transform_landmarked_pumap.html
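
For reference, here is a minimal sketch of what that swap might look like using the variable names from your code (assuming umap-learn is installed with its parametric extras, e.g. pip install umap-learn[parametric_umap]; the parameter values are carried over from your snippet, and whether to keep the supervised fit is a separate choice):

# Minimal sketch: replace UMAP with ParametricUMAP so that transform()
# applies a learned neural-network encoder instead of an approximate
# out-of-sample placement.
from umap.parametric_umap import ParametricUMAP, load_ParametricUMAP

# Same hyperparameters as the original snippet (illustrative, not tuned)
pumap = ParametricUMAP(n_neighbors=15, n_components=8)
x_train_umap = pumap.fit_transform(x_train_lstm)

# Persist the fitted embedder (this saves the underlying Keras encoder)...
pumap.save('pumap_model')

# ...and later reload it to embed new, unseen data with the same mapping
pumap = load_ParametricUMAP('pumap_model')
x_test_umap = pumap.transform(x_test_lstm)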
