Poor results on new unseen test data using UMAP #1165

muriloasouza commented Nov 27, 2024

I have a time-series multiclass classification problem with 12 classes (balanced). I have been trying to use the UMAP low-dimensional embedding as an input to my model and got fairly good results on the validation set. But the model can't generalize well to new unseen data (the test set).

My model is overfitting, and I am using early stopping to mitigate that.

Here are the accuracy and loss curves during training (in green) and validation (in red):

[Figure: accuracy curves]

[Figure: loss curves]

You can see that the model reaches almost 0.9 accuracy before overfitting, and it is saved at that point to be used later for predictions. The problem is, on the test set I only get 0.67 accuracy, far lower than I was expecting.

This is the code I am using; consider my dataframe df with shape (41364, 53):

# Import libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.layers import Concatenate
from keras.layers import Dense
from keras.layers import Input
from keras.layers import LSTM
from keras.models import Model

# Load data and split inputs from outputs

df = pd.read_csv('file.csv')
x = df.iloc[:, :-1].values
y = df['FLAG'].values

# Train/test split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Split LSTM and MLP inputs

x_train_lstm = x_train[:, 0:48]
x_train_static = x_train[:, 48:]
x_test_lstm = x_test[:, 0:48]
x_test_static = x_test[:, 48:]

# UMAP (supervised: fit_transform is given the training labels)

from umap import UMAP
umap_emb = UMAP(n_neighbors=15, n_components=8, random_state=42)
x_train_umap = umap_emb.fit_transform(x_train_lstm, y_train)

# Persist the fitted embedder for later use on new data
from joblib import dump
dump(umap_emb, 'UMAP_emb.sav')

# Reshape inputs to be used in Keras Model

x_train_lstm = x_train_lstm.reshape(x_train_lstm.shape[0], x_train_lstm.shape[1], 1)
x_train_static = np.concatenate((x_train_static, x_train_umap), axis=1)

# Keras Model

lstm_input = Input(shape=(x_train_lstm.shape[1], 1), name='LSTM_Input_Layer')
static_input = Input(shape=(x_train_static.shape[1], ), name='Static_Input_Layer')
lstm_layer_1 = LSTM(units=128, activation='tanh', return_sequences=False, name='1_LSTM_Layer')(lstm_input)
static_layer_1 = Dense(units=64, activation='relu', name='1_Static_Layer')(static_input)
static_layer_2 = Dense(units=128, activation='relu', name='2_Static_Layer')(static_layer_1)
concatenar = Concatenate(axis=1, name='Concatenate')([lstm_layer_1, static_layer_2])
dense_1 = Dense(units=4*len(np.unique(y_train)), activation='relu', name='1_Dense_Layer')(concatenar)
dense_2 = Dense(units=2*len(np.unique(y_train)), activation='relu', name='2_Dense_Layer')(dense_1)
saida = Dense(units=len(np.unique(y_train)), activation='softmax', name='Output_Layer')(dense_2)
model = Model(inputs=[lstm_input, static_input], outputs=[saida])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

# Checkpoint and Earlystop

cp = ModelCheckpoint(
    filepath='model.hdf5',
    monitor='val_loss',
    verbose=1,
    save_best_only=True,
    mode='min')
earlystop = EarlyStopping(monitor='val_loss',
                          min_delta=1e-4,
                          patience=50,
                          verbose=1,
                          mode='min',
                          restore_best_weights=True)

# Fit on train data

val_size = 0.2
history = model.fit([x_train_lstm, x_train_static],
                    y_train,
                    batch_size=512,
                    epochs=300,
                    callbacks=[cp, earlystop],
                    validation_split=val_size)

# Test set (poor results!!)

x_test_umap = umap_emb.transform(x_test_lstm)
x_test_lstm = x_test_lstm.reshape(x_test_lstm.shape[0], x_test_lstm.shape[1], 1)  
x_test_static = np.concatenate((x_test_static, x_test_umap), axis=1)
y_pred = model.predict([x_test_lstm, x_test_static])
print(classification_report(y_test, np.argmax(y_pred, axis=1)))

Here is the classification report on test set:

              precision    recall  f1-score   support

           0       0.23      0.23      0.23       724
           1       0.35      0.32      0.34       716
           2       0.98      0.90      0.94       692
           3       0.60      0.54      0.57       665
           4       0.33      0.32      0.33       682
           5       0.86      0.97      0.91       695
           6       0.53      0.80      0.64       703
           7       0.98      0.98      0.98       696
           8       0.75      0.84      0.79       659
           9       0.97      1.00      0.98       674
          10       0.88      0.64      0.74       686
          11       0.72      0.56      0.63       681

    accuracy                           0.67      8273
   macro avg       0.68      0.68      0.67      8273
weighted avg       0.68      0.67      0.67      8273

Is this the correct procedure for applying UMAP to new unseen data, or am I doing something wrong? Why is my validation score so much higher than the test score? Any ideas what I should be looking at to correct this?

One last thing to note: if I don't use the UMAP embedding as part of x_train_static, the validation and test results match (lower, but the same, as I would expect under "normal" conditions), like this:

[Figure: accuracy curves without the UMAP features]

And here is the classification report on the test set without the UMAP embedding as part of my inputs:

              precision    recall  f1-score   support

           0       0.45      0.49      0.47       724
           1       0.57      0.50      0.53       716
           2       0.99      0.99      0.99       692
           3       0.70      0.85      0.76       665
           4       0.57      0.47      0.51       682
           5       0.94      0.97      0.95       695
           6       0.89      0.85      0.87       703
           7       0.97      0.99      0.98       696
           8       0.89      0.87      0.88       659
           9       1.00      1.00      1.00       674
          10       0.88      0.86      0.87       686
          11       0.79      0.81      0.80       681

    accuracy                           0.80      8273
   macro avg       0.80      0.80      0.80      8273
weighted avg       0.80      0.80      0.80      8273

Both sets got around 0.8 accuracy.

zsxkib commented Dec 18, 2024

This makes sense: plain UMAP's transform() for new, unseen points is only an approximation, and because your fit is supervised (it is given y_train), the training embedding encodes label information that transform() cannot reproduce for unseen points. You want to use Parametric UMAP, which learns an explicit mapping that can be applied consistently to new data; see https://umap-learn.readthedocs.io/en/latest/transform_landmarked_pumap.html
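
For reference, here is a minimal sketch of what that swap might look like using the variable names from your code (assuming umap-learn is installed with its parametric extras, e.g. pip install umap-learn[parametric_umap]; the parameter values are carried over from your snippet, and whether to keep the supervised fit is a separate choice):

# Minimal sketch: replace UMAP with ParametricUMAP so that transform()
# applies a learned neural-network encoder instead of an approximate
# out-of-sample placement.
from umap.parametric_umap import ParametricUMAP, load_ParametricUMAP

# Same hyperparameters as the original snippet (illustrative, not tuned)
pumap = ParametricUMAP(n_neighbors=15, n_components=8)
x_train_umap = pumap.fit_transform(x_train_lstm)

# Persist the fitted embedder (this saves the underlying Keras encoder)...
pumap.save('pumap_model')

# ...and later reload it to embed new, unseen data with the same mapping
pumap = load_ParametricUMAP('pumap_model')
x_test_umap = pumap.transform(x_test_lstm)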
