
Retraining from scratch yields worse results #32

Open

nkcr opened this issue Oct 29, 2018 · 5 comments

Comments
@nkcr
Contributor

nkcr commented Oct 29, 2018

Hello,

As described in the paper (2.2 Deriving Architectures), I tried to re-train the best derived model from scratch, but surprisingly it gives worse results when I retrain it from scratch than if I keep the original (shared) weights.
I expected training the best model (dag) from scratch to be faster and to eventually reach a better perplexity, but that's not the case.

I do the following:

  1. Launch ENAS with the --load_path argument, which loads a previous run, and --mode test, which calls a custom test method inside the trainer class
  2. (In the test method) Reset the shared weights with self.shared.reset_parameters()
  3. Derive the best model (dag)
  4. Train this model from scratch, iterating over the train set for N epochs (as in the train_shared method); see the sketch after this list
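
For reference, a minimal sketch of what such a test routine could look like, written as a standalone function. Only reset_parameters, derive, and the train_shared-style loop come from the steps above; everything else (get_loss, train_data, the args fields, the optimizer settings) is an assumption about the Trainer class, not the repo's actual API.

```python
import torch

def retrain_best_dag_from_scratch(trainer, n_epochs):
    """Hypothetical sketch of the retrain-from-scratch test routine.

    The attributes used on `trainer` (shared, derive, get_loss, train_data,
    args) are assumptions about the repo's Trainer class, not its actual API.
    """
    # Step 2: reset the shared weights so training really starts from scratch
    trainer.shared.reset_parameters()

    # Step 3: derive the best model (dag) with the trained controller
    best_dag = trainer.derive()

    # Step 4: train only this dag for N epochs, mirroring train_shared
    optimizer = torch.optim.SGD(trainer.shared.parameters(),
                                lr=trainer.args.shared_lr)
    for epoch in range(n_epochs):
        for inputs, targets in trainer.train_data:
            loss = trainer.get_loss(inputs, targets, dags=[best_dag])
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(trainer.shared.parameters(),
                                           trainer.args.shared_grad_clip)
            optimizer.step()
```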

The following picture shows the loss and ppl during the "normal" training (first slope) and after resetting the shared weights (second slope). The second slope only trains the same best model (dag).

[Image: shared-loss-ppl4]

Does anyone have any idea why resetting the shared weights and re-training from scratch performs so badly?

@philtomson

philtomson commented Oct 30, 2018

Just to make sure I'm understanding: when you say first slope, you're referring to the part of the graph from 0 to 60.00K on the x axis, and the second slope starts at 80.00K? So you did the self.shared.reset_parameters() at 60.00K, correct?

I guess I'm not too surprised at this result. There's very little information in the paper about this retraining step; it's all in section 2.2 under Deriving Architectures: "We then take only the model with the highest reward to re-train from scratch" - AFAICT that's it, that's the whole description of the retraining step.

If you look at the TensorFlow ENAS implementation from the paper authors (https://github.com/melodyguan/enas) you'll see that there are two scripts: ptb_search.sh and ptb_final.sh. The latter is used to retrain the best found dag (and in fact they've hard-coded the best found dag to be exactly the one found in the paper). Comparing the two, I notice that several parameters differ: lstm_hidden_size is 720 in ptb_search.sh and 748 in ptb_final.sh, for example, and the learning-rate-related parameters are very different as well. Perhaps you could try retraining with the parameter values from ptb_final.sh?
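
For example, the search-time defaults could be patched with the final-training values before retraining. A rough sketch under stated assumptions: only the 720 vs. 748 hidden size comes from the scripts mentioned above, while the argument name shared_hid and the argparse-style args namespace are guesses about what the PyTorch port exposes.

```python
def apply_final_hparams(args):
    """Override search-time hyperparameters with ptb_final.sh values
    before retraining. Only the hidden size is quoted in this thread;
    the argument name 'shared_hid' is an assumption about the PyTorch port.
    """
    final_overrides = {
        'shared_hid': 748,  # lstm_hidden_size: 720 in ptb_search.sh, 748 in ptb_final.sh
        # learning-rate related settings from ptb_final.sh would go here too
    }
    for name, value in final_overrides.items():
        setattr(args, name, value)
    return args
```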

@nkcr
Contributor Author

nkcr commented Oct 31, 2018

Yes, correct. The second part starts at around 80k.

Interesting, I will try to use the same parameters and compare the results, thanks for the suggestion.

@nkcr
Contributor Author

nkcr commented Nov 6, 2018

#35 provides a way to train a single, given dag.

@philtomson

@nkcr were you able to get better results with the parameters from the TF code?

@nkcr
Contributor Author

nkcr commented Nov 19, 2018

No, not really. I didn't investigate much, but with a quick matching of the parameters from the TensorFlow implementation I got worse results.
