
Not benefiting from checkpointing #297

Open
mstfldmr opened this issue Sep 27, 2022 · 6 comments
Labels
component:losses Issues related to support additional metric learning techniques (e.g. loss) · component:model · type:bug Something isn't working

Comments


mstfldmr commented Sep 27, 2022

Hello,

I save checkpoints with:

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path + '/epoch-{epoch:02d}/',
    monitor='val_loss',
    save_freq='epoch',
    save_weights_only=False,
    save_best_only=False,
    mode='auto')

After loading the latest checkpoint and continuing training, I would expect the loss value to be around the loss value in the last checkpoint.

    model = tf.keras.models.load_model(
        model_path,
        custom_objects={"SimilarityModel": tfsim.models.SimilarityModel,
                        "MyOptimizer": tfa.optimizers.RectifiedAdam})

    model.load_index(model_path)

    model.fit(
        datasampler,
        callbacks=callbacks,
        epochs=args.epochs,
        initial_epoch=initial_epoch_number,
        steps_per_epoch=N_TRAIN_SAMPLES,
        verbose=2,
    )

However, the loss value does not continue from where it left off. It looks like training simply starts from scratch and does not benefit from the checkpoints.
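
One quick sanity check (a minimal sketch; it reuses model_path, datasampler, and the custom objects from the snippets above, and steps=10 is an arbitrary placeholder) is to look at the restored optimizer's iteration counter and re-evaluate the loss right after loading:

    # Reload the checkpoint and inspect the restored state. If the optimizer
    # state survived the save/load round trip, `iterations` should be non-zero
    # and the evaluated loss should be close to the last logged training loss.
    reloaded = tf.keras.models.load_model(
        model_path,
        custom_objects={"SimilarityModel": tfsim.models.SimilarityModel,
                        "MyOptimizer": tfa.optimizers.RectifiedAdam})

    print("optimizer iterations:", int(reloaded.optimizer.iterations.numpy()))
    print("loss right after reload:", reloaded.evaluate(datasampler, steps=10, verbose=0))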

mstfldmr changed the title from "Loss is too high after loading checkpoints and continuing training" to "Not benefiting from checkpointing" on Sep 27, 2022
owenvallis self-assigned this on Oct 24, 2022
owenvallis added the type:bug, component:model, and component:losses labels on Oct 24, 2022
@owenvallis (Collaborator)

Thanks for submitting the issue @mstfldmr. Do you have a simple example I can use to try to repro the issue? I can also try to repro this using our basic example, but it might be good to get closer to your current setup as well.


mstfldmr commented Nov 5, 2022

@owenvallis I'm sorry, I can't share the full code because it has some confidential pieces we developed. This was how I configured checkpointing:

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, monitor='loss', save_freq='epoch', save_weights_only=False, save_best_only=False)

and how I loaded a checkpoint back:

    resumed_model_from_checkpoints = tf.keras.models.load_model(f'{checkpoint_path}/{max_epoch_filename}')

@mstfldmr (Author)

@owenvallis could you reproduce it?

@owenvallis (Collaborator)

Hi @mstfldmr, sorry for the delay here. I'll try and get to this this week.


owenvallis commented Dec 2, 2022

Looking into this now. It also looks like there is a breaking change in 2.8 that removed Optimizer.get_weights() (see keras-team/tf-keras#442). That issue also mentions that SavedModel didn't properly save the weights for certain optimizers in the past (see tensorflow/tensorflow#44670).

Which optimizer were you using? Was it Adam?
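
In the meantime, if SavedModel really is dropping the optimizer slots, one possible workaround (a rough sketch only, not verified against TensorFlow Similarity; it reuses checkpoint_path from the snippets above) would be to track the model and optimizer explicitly with tf.train.Checkpoint and restore both before resuming fit:

    # Track model weights and optimizer slots explicitly, independent of the
    # SavedModel export, so the optimizer state can be restored on resume.
    ckpt = tf.train.Checkpoint(model=model, optimizer=model.optimizer)
    manager = tf.train.CheckpointManager(ckpt, directory=checkpoint_path, max_to_keep=3)

    # Save at the end of every epoch, e.g. from a callback:
    # tf.keras.callbacks.LambdaCallback(on_epoch_end=lambda epoch, logs: manager.save())
    manager.save()

    # When resuming: rebuild and compile the model exactly as before, then
    # restore the weights and optimizer slots and call fit with initial_epoch set.
    ckpt.restore(manager.latest_checkpoint).expect_partial()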


mstfldmr commented Dec 5, 2022

@owenvallis yes, it was tfa.optimizers.RectifiedAdam.
