
Retraining from scratch yields worse results #32

Open

nkcr opened this issue Oct 29, 2018 · 5 comments

Comments
@nkcr
Contributor

nkcr commented Oct 29, 2018

Hello,

As described in the paper (2.2 Deriving Architectures), I tried to re-train the best derived model from scratch, but surprisingly it gives worse results when I retrain it from scratch than if I keep the original (shared) weights.
I expected training the best model (dag) from scratch to be faster and to eventually reach a better perplexity, but that's not the case.

I do the following:

  1. Launch ENAS with the --load_path argument, which loads a previous run, and --mode test, which calls a custom test method inside the trainer class
  2. (In the test method) Reset the shared weights with self.shared.reset_parameters()
  3. Derive the best model (dag)
  4. Train this model from scratch, iterating over the train set for N epochs (as in the train_shared method); see the sketch after this list
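
For reference, a minimal sketch of what such a test routine could look like, written as a standalone function. Only reset_parameters, derive, and the train_shared-style loop come from the steps above; everything else (get_loss, train_data, the args fields, the optimizer settings) is an assumption about the Trainer class, not the repo's actual API.

```python
import torch

def retrain_best_dag_from_scratch(trainer, n_epochs):
    """Hypothetical sketch of the retrain-from-scratch test routine.

    The attributes used on `trainer` (shared, derive, get_loss, train_data,
    args) are assumptions about the repo's Trainer class, not its actual API.
    """
    # Step 2: reset the shared weights so training really starts from scratch
    trainer.shared.reset_parameters()

    # Step 3: derive the best model (dag) with the trained controller
    best_dag = trainer.derive()

    # Step 4: train only this dag for N epochs, mirroring train_shared
    optimizer = torch.optim.SGD(trainer.shared.parameters(),
                                lr=trainer.args.shared_lr)
    for epoch in range(n_epochs):
        for inputs, targets in trainer.train_data:
            loss = trainer.get_loss(inputs, targets, dags=[best_dag])
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(trainer.shared.parameters(),
                                           trainer.args.shared_grad_clip)
            optimizer.step()
```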

The following picture shows the loss and ppl during the "normal" training (first slope) and after resetting the shared weights (second slope). The second slope only trains the same best model (dag).

[Image: shared-loss-ppl4]

Does anyone have any idea why resetting the shared weights and re-training from scratch performs so badly?

@philtomson

philtomson commented Oct 30, 2018

Just to make sure I'm understanding: when you say first slope, you're referring to the part of the graph from 0 to 60.00K on the x axis, and the second slope starts at 80.00K? So you did the self.shared.reset_parameters() at 60.00K, correct?

I guess I'm not too surprised at this result. There's very little information in the paper about this retraining step; it's all in section 2.2 under Deriving Architectures: "We then take only the model with the highest reward to re-train from scratch" - AFAICT that's it, that's the whole description of the retraining step.

If you look at the TensorFlow ENAS implementation from the paper authors (https://github.com/melodyguan/enas) you'll see that there are two scripts: ptb_search.sh and ptb_final.sh. The latter is used to retrain the best found dag (and in fact they've hard-coded the best found dag to be exactly the one found in the paper). Comparing the two, I notice that several parameters differ: lstm_hidden_size is 720 in ptb_search.sh and 748 in ptb_final.sh, for example, and the learning-rate-related parameters are very different as well. Perhaps you could try retraining with the parameter values from ptb_final.sh?
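
For example, the search-time defaults could be patched with the final-training values before retraining. A rough sketch under stated assumptions: only the 720 vs. 748 hidden size comes from the scripts mentioned above, while the argument name shared_hid and the argparse-style args namespace are guesses about what the PyTorch port exposes.

```python
def apply_final_hparams(args):
    """Override search-time hyperparameters with ptb_final.sh values
    before retraining. Only the hidden size is quoted in this thread;
    the argument name 'shared_hid' is an assumption about the PyTorch port.
    """
    final_overrides = {
        'shared_hid': 748,  # lstm_hidden_size: 720 in ptb_search.sh, 748 in ptb_final.sh
        # learning-rate related settings from ptb_final.sh would go here too
    }
    for name, value in final_overrides.items():
        setattr(args, name, value)
    return args
```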

@nkcr
Contributor Author

nkcr commented Oct 31, 2018

Yes, correct. The second part starts at around 80k.

Interesting, I will try to use the same parameters and compare the results, thanks for the suggestion.

@nkcr
Contributor Author

nkcr commented Nov 6, 2018

#35 provides a way to train a single, given dag.

@philtomson

@nkcr were you able to get better results with the parameters from the TF code?

@nkcr
Contributor Author

nkcr commented Nov 19, 2018

No, not really. I didn't investigate much, but with a quick matching of the parameters from the TensorFlow implementation I got worse results.
