I have a time-series multiclass classification problem with 12 balanced classes. I have been trying to use a lower-dimensional UMAP embedding as input to my model and got fairly good results on the validation set. However, the model doesn't generalize well to new, unseen data (the test set).
My model is overfitting, and I am using early stopping to prevent that.
Here are some curves of accuracy and loss during training (in green) and validation (in red):
You can see that my model almost reaches 0.9 accuracy before overfitting, and it is saved at that point to be used later for predictions. The thing is, on the test set I only get 0.67 accuracy, far lower than I was expecting.
This is the code I am using; consider my dataframe df with shape (41364, 53):
```python
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.layers import Concatenate
from keras.layers import Dense
from keras.layers import Input
from keras.layers import LSTM
from keras.models import Model

# Load data and split inputs from outputs
df = pd.read_csv('file.csv')
x = df.iloc[:, :-1].values
y = df['FLAG'].values

# Train/test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Split LSTM and MLP inputs
x_train_lstm = x_train[:, 0:48]
x_train_static = x_train[:, 48:]
x_test_lstm = x_test[:, 0:48]
x_test_static = x_test[:, 48:]

# UMAP
from umap import UMAP
umap_emb = UMAP(n_neighbors=15, n_components=8, random_state=42)
x_train_umap = umap_emb.fit_transform(x_train_lstm, y_train)
from joblib import dump
dump(umap_emb, 'UMAP_emb.sav')

# Reshape inputs to be used in Keras Model
x_train_lstm = x_train_lstm.reshape(x_train_lstm.shape[0], x_train_lstm.shape[1], 1)
x_train_static = np.concatenate((x_train_static, x_train_umap), axis=1)

# Keras Model
lstm_input = Input(shape=(x_train_lstm.shape[1], 1), name='LSTM_Input_Layer')
static_input = Input(shape=(x_train_static.shape[1], ), name='Static_Input_Layer')
lstm_layer_1 = LSTM(units=128, activation='tanh', return_sequences=False, name='1_LSTM_Layer')(lstm_input)
static_layer_1 = Dense(units=64, activation='relu', name='1_Static_Layer')(static_input)
static_layer_2 = Dense(units=128, activation='relu', name='2_Static_Layer')(static_layer_1)
concatenar = Concatenate(axis=1, name='Concatenate')([lstm_layer_1, static_layer_2])
dense_1 = Dense(units=4*len(np.unique(y_train)), activation='relu', name='1_Dense_Layer')(concatenar)
dense_2 = Dense(units=2*len(np.unique(y_train)), activation='relu', name='2_Dense_Layer')(dense_1)
saida = Dense(units=len(np.unique(y_train)), activation='softmax', name='Output_Layer')(dense_2)
model = Model(inputs=[lstm_input, static_input], outputs=[saida])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

# Checkpoint and Earlystop
cp = ModelCheckpoint(
    filepath='model.hdf5',
    monitor='val_loss',
    verbose=1,
    save_best_only=True,
    mode='min')
earlystop = EarlyStopping(monitor='val_loss',
                          min_delta=1e-4,
                          patience=50,
                          verbose=1,
                          mode='min',
                          restore_best_weights=True)

# Fit on train data
val_size = 0.2
history = model.fit([x_train_lstm, x_train_static],
                    y_train,
                    batch_size=512,
                    epochs=300,
                    callbacks=[cp, earlystop],
                    validation_split=val_size)

# Test set (poor results!!)
x_test_umap = umap_emb.transform(x_test_lstm)
x_test_lstm = x_test_lstm.reshape(x_test_lstm.shape[0], x_test_lstm.shape[1], 1)
x_test_static = np.concatenate((x_test_static, x_test_umap), axis=1)
y_pred = model.predict([x_test_lstm, x_test_static])
print(classification_report(y_test, np.argmax(y_pred, axis=1)))
```
Is this the correct procedure to apply UMAP to new unseen data? Or am I doing something wrong? Why is my validation score much higher than the test one? Any ideas what I should be looking for to correct that?
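One thing that may be worth checking (an assumption on my part, not a confirmed diagnosis): `umap_emb.fit_transform(x_train_lstm, y_train)` fits UMAP in supervised mode on every training row, while Keras `validation_split` later holds out the last fraction of those same rows. The validation embeddings would then have been built with access to their own labels, whereas the test embeddings come from `transform` without labels. Below is a minimal sketch of an explicit hold-out made before fitting the embedding. The helper names `keras_style_split` and `embed_without_leakage` are hypothetical, and the reducer is any object with `fit_transform(x, y)` and `transform(x)`, such as a `UMAP` instance.

```python
import math


def keras_style_split(n_samples, validation_split):
    """Return (train_slice, val_slice) taking the last fraction of rows,
    unshuffled, as validation, the way Keras `validation_split` does."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return slice(0, split_at), slice(split_at, n_samples)


def embed_without_leakage(reducer, x_fit, y_fit, *x_others):
    """Fit a supervised reducer on the fitting rows only, then embed the
    remaining sets with `transform`, which never sees their labels."""
    emb_fit = reducer.fit_transform(x_fit, y_fit)
    return (emb_fit, *[reducer.transform(x) for x in x_others])


# Hypothetical usage with the variables from the code above (not executed here):
#   tr, va = keras_style_split(len(x_train_lstm), 0.2)
#   emb_tr, emb_va, emb_te = embed_without_leakage(
#       UMAP(n_neighbors=15, n_components=8, random_state=42),
#       x_train_lstm[tr], y_train[tr], x_train_lstm[va], x_test_lstm)
```

With a split like this, the validation arrays would be passed to `model.fit` through `validation_data=(...)` instead of `validation_split`, so that validation and test embeddings are produced the same way. If the validation accuracy then drops toward the test accuracy, that would support the leakage explanation.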
One last thing to note: if I don't use the UMAP embedding as part of x_train_static, the results on the validation and test sets are the same (lower, but the same, as I would expect under "normal" conditions), like this:
And here is the classification report on the test set without the UMAP embedding as part of my inputs:
Both sets got around 0.8 accuracy.