
1603Transfer


Transfer Learning Experiments

We train our model on a task with a lot of data, then apply it to a task with a smaller dataset. The final goal is to train a universal model for sentence-based semantic reasoning...

Bigvocab Ubuntu Dialog RNN

Our new base model is ubuntu-rnn-5b6844b9cfacfe6a-01 (best val) of job 11288547.arien.ics.muni.cz.R_u_rnn80BV_EP100_d0s2p1dot inp_e_dropout=0 dropout=0 sdim=2 pdim=1 spad=80 ptscorer=B.dot_ptscorer:

{"Ddim": "2", "balance_class": "False", "batch_size": "192", "dropout": "0", "dropoutfix_inp": "0", "dropoutfix_rec": "0", "e_add_flags": "True", "embdim": "300", "embicase": "False", "embprune": "100", "epoch_fract": "0.25", "f_add_kw": "False", "fix_layers": "[]", "inp_e_dropout": "0", "inp_w_dropout": "0", "l2reg": "0.0001", "loss": "binary_crossentropy", "mlpsum": "sum", "nb_epoch": "16", "nb_runs": "4", "opt": "adam", "pact": "tanh", "pdim": "1", "prescoring": "None", "prescoring_input": "None", "prescoring_prune": "None", "project": "True", "ptscorer": "<function dot_ptscorer at 0xa0728c0>", "rnn": "<class 'keras.layers.recurrent.GRU'>", "rnnact": "tanh", "rnnbidi": "True", "rnnbidi_mode": "sum", "rnninit": "glorot_uniform", "rnnlevels": "1", "sdim": "2", "spad": "80", "vocabf": "data/anssel/ubuntu/v2-vocab.pickle"}
data/anssel/ubuntu/v2-valset.pickle MRR: 0.778320
data/anssel/ubuntu/v2-valset.pickle 2-R@1: 0.905521
data/anssel/ubuntu/v2-valset.pickle 10-R@1: 0.662321  10-R@2: 0.793661  10-R@5: 0.948773
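For reference, the ranking metrics reported here can be sketched in a few lines. This is illustrative only (the real numbers come from the repo's evaluation tooling), assuming the standard one-positive-among-n Ubuntu Dialog setup:

```python
def mrr(ranked_labels):
    """Mean reciprocal rank. Each query's labels are sorted by descending
    model score; exactly one label is 1 (the true next utterance)."""
    return sum(1.0 / (labels.index(1) + 1) for labels in ranked_labels) \
        / len(ranked_labels)

def recall_at_k(ranked_labels, k):
    """n-R@k: fraction of queries whose positive ranks within the top k
    of its n candidates. For 2-R@1 the candidate pool is just the
    positive plus a single sampled distractor."""
    return sum(1 in labels[:k] for labels in ranked_labels) \
        / float(len(ranked_labels))

# Toy check: two queries with 10 score-sorted candidates each.
q = [[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
print(mrr(q))             # (1/2 + 1/1) / 2 = 0.75
print(recall_at_k(q, 2))  # 10-R@2 = 1.0 here
```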

For hypev (embdim=50), we use ubuntu-rnn-37d0427cb1ad18c4-00 of job 11299592.arien.ics.muni.cz.R_u_rnn80BV_EP100E50_d0s2p1dot:

{"Ddim": "2", "balance_class": "False", "batch_size": "192", "dropout": "0", "dropoutfix_inp": "0", "dropoutfix_rec": "0", "e_add_flags": "True", "embdim": "50", "embicase": "False", "embprune": "100", "epoch_fract": "0.25", "f_add_kw": "False", "fix_layers": "[]", "inp_e_dropout": "0", "inp_w_dropout": "0", "l2reg": "0.0001", "loss": "binary_crossentropy", "mlpsum": "sum", "nb_epoch": "16", "nb_runs": "3", "opt": "adam", "pact": "tanh", "pdim": "1", "prescoring": "None", "prescoring_input": "None", "prescoring_prune": "None", "project": "True", "ptscorer": "<function dot_ptscorer at 0xaa330c8>", "rnn": "<class 'keras.layers.recurrent.GRU'>", "rnnact": "tanh", "rnnbidi": "True", "rnnbidi_mode": "sum", "rnninit": "glorot_uniform", "rnnlevels": "1", "sdim": "2", "spad": "80", "vocabf": "data/anssel/ubuntu/v2-vocab.pickle"}
data/anssel/ubuntu/v2-valset.pickle MRR: 0.753427
data/anssel/ubuntu/v2-valset.pickle 2-R@1: 0.890031
data/anssel/ubuntu/v2-valset.pickle 10-R@1: 0.629806  10-R@2: 0.766155  10-R@5: 0.933640

Ubuntu Dialog RNN on Anssel

Train RNN on Ubuntu Dialog, apply it in other contexts...

Our base model is 10648965.arien.ics.muni.cz.0rn8_tAs2_d0i0 - 80-token RNN of tdot sdim=2 pdim=1:

({"Ddim": "2", "batch_size": "128", "dropout": "0", "dropoutfix_inp": "0", "dropoutfix_rec": "0", "e_add_flags": "True", "embdim": "300", "inp_e_dropout": "0", "inp_w_dropout": "0", "l2reg": "0.0001", "loss": "binary_crossentropy", "mlpsum": "sum", "pact": "tanh", "pdim": "1", "project": "True", "ptscorer": "<function dot_ptscorer at 0xcf05a28>", "rnn": "<class 'keras.layers.recurrent.GRU'>", "rnnact": "tanh", "rnnbidi": "True", "rnnbidi_mode": "sum", "rnninit": "glorot_uniform", "rnnlevels": "1", "sdim": "2"})

The retraining on anssel looks like:

{"Ddim": "2", "balance_class": "True", "batch_size": "64", "dropout": "0.5", "dropoutfix_inp": "0", "dropoutfix_rec": "0", "e_add_flags": "True", "embdim": "300", "epoch_fract": "0.25", "fix_layers": "[]", "inp_e_dropout": "0.5", "inp_w_dropout": "0", "l2reg": "0.0001", "loss": "<function ranknet at 0xca41c80>", "mlpsum": "sum", "nb_epoch": "16", "pact": "linear", "pdim": "1", "project": "True", "ptscorer": "<function dot_ptscorer at 0xca41578>", "rnn": "<class 'keras.layers.recurrent.GRU'>", "rnnact": "tanh", "rnnbidi": "True", "rnnbidi_mode": "sum", "rnninit": "glorot_uniform", "rnnlevels": "1", "sdim": "2"}

(Therefore, classes are balanced and a different loss function is used.)
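The two deltas against the Ubuntu pre-training config are thus balance_class=True and the ranknet loss. A minimal numpy sketch of both ideas, assuming the textbook RankNet pairwise-logistic form (the repo's ranknet objective works on Keras tensors and may differ in detail):

```python
import numpy as np

def ranknet_pair_loss(s_pos, s_neg):
    """Pairwise logistic loss: push the true answer's score s_pos above
    a wrong answer's score s_neg for the same question."""
    return np.log1p(np.exp(-(s_pos - s_neg)))

def balanced_sample_weights(y):
    """balance_class=True in spirit: weight samples so both classes
    contribute equal total mass despite the rarity of positives."""
    y = np.asarray(y, dtype=float)
    w_pos = len(y) / (2.0 * max(y.sum(), 1.0))
    w_neg = len(y) / (2.0 * max((1 - y).sum(), 1.0))
    return np.where(y == 1, w_pos, w_neg)

print(ranknet_pair_loss(2.0, 0.5))            # pair already ordered -> small loss
print(balanced_sample_weights([1, 0, 0, 0]))  # [2.0, 0.667, 0.667, 0.667]
```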

For now, our target is yodaqa/curatedv2.

pact=linear

Baseline (AnsselRNN) trained on yodaqa only, with dropout 1/2 and pact=linear (we forgot to set pact=tanh here and in a few follow-up experiments) - TODO

16x uay10648965rn8_i12d12 (pact=tanh missing!) - 0.406567 (95% [0.396726, 0.416408]):

10720680.arien.ics.muni.cz.uay10648965rn8_i12d12 etc.
[0.434111, 0.425123, 0.440753, 0.402140, 0.405344, 0.385531, 0.400927, 0.419800, 0.385345, 0.393832, 0.422197, 0.429121, 0.394412, 0.387759, 0.394547, 0.384130, ]

This is a significant improvement! Note that the typical behavior is an initially very low val MRR, which then recovers quickly.

16x uay10648965rn8_i23d23 (pact=tanh missing!) - 0.375708 (95% [0.357788, 0.393628]):

10720664.arien.ics.muni.cz.uay10648965rn8_i23d23 etc.
[0.391461, 0.380260, 0.366748, 0.322265, 0.418886, 0.324551, 0.356732, 0.367108, 0.364346, 0.356275, 0.402762, 0.410457, 0.454514, 0.375668, 0.379674, 0.339624, ]

16x uay10648965rn8_i23d23_bce (pact=tanh missing!) (loss binary crossentropy, i.e. without changing the loss compared to Ubuntu) - 0.354747 (95% [0.323420, 0.386074]):

10720697.arien.ics.muni.cz.uay10648965rn8_i23d23_bce etc.
[0.428772, 0.376948, 0.424435, 0.372013, 0.348488, 0.351694, 0.343715, 0.279636, 0.408026, 0.175946, 0.328287, 0.355549, 0.366263, 0.348897, 0.406161, 0.361124, ]

(The sharp initial dip in val MRR still occurs.)

12x uay10648965rn8_i23d23_ef8 (pact=tanh missing!) (epoch_fract=1/8, based on observations of overfitting behavior) - 0.364901 (95% [0.341884, 0.387918]):

10720738.arien.ics.muni.cz.uay10648965rn8_i23d23_ef8 etc.
[0.438845, 0.401883, 0.397337, 0.334052, 0.354876, 0.326908, 0.360647, 0.369588, 0.375161, 0.384821, 0.329601, 0.305097, ]

pact=tanh

Models below have been retrained with pact=tanh as it should be.

Baseline (AnsselRNN) trained on yodaqa only, with dropout 0 ay_1rnnd0_s2p1tdot - 0.332909 (95% [0.317992, 0.347826]).

Baseline (AnsselRNN) trained on yodaqa only, with dropout 1/2 ay_1rnn_i12d12_s2p1tdot - 0.351717 (95% [0.343524, 0.359910]).

8x (TODO 8 more, then update paper) Transfer learning baseline (trained on yodaqa only), dropout 4/5 like normal config ay_1rnn_i45d45_s2p1tdot - 0.350665 (95% [0.336821, 0.364508]):

10783017.arien.ics.muni.cz.ay_1rnn_i45d45_s2p1tdot etc.
[0.348711, 0.317360, 0.363682, 0.342676, 0.358955, 0.377087, 0.354354, 0.342493, ]

16x 10811065.arien.ics.muni.cz.fay_1rnn_i45d45_s2p1tdot:

| Model | trainAllMRR | devMRR | testMAP | testMRR | settings |
|-------|-------------|--------|---------|---------|----------|
| rnn | 0.416619 | 0.362691 | 0.212844 | 0.291209 | inp_e_dropout=4/5 dropout=4/5 pact='tanh' ptscorer=B.dot_ptscorer sdim=2 pdim=1 |
| | ±0.034087 | ±0.009861 | ±0.003946 | ±0.008295 | |

12x uay10648965rn8_i12d12t - 0.423714 (95% [0.401674, 0.445754]):

10736960.arien.ics.muni.cz.uay10648965rn8_i12d12t etc.
[0.400878, 0.384704, 0.443747, 0.452339, 0.404232, 0.412879, 0.432941, 0.432817, 0.343167, 0.469059, 0.445414, 0.462392, ]

(this also solves the sharp initial dip in val mrr).

16x uay10648965rn8_i23d23t - 0.393816 (95% [0.373901, 0.413730]):

10737066.arien.ics.muni.cz.uay10648965rn8_i23d23t etc.
[0.394896, 0.359615, 0.426518, 0.408441, 0.368428, 0.431601, 0.353829, 0.298714, 0.406918, 0.399931, 0.385726, 0.402983, 0.408956, 0.375196, 0.407691, 0.471606, ]

16x uay10648965rn8_i0d0t - 0.471582 (95% [0.454885, 0.488279]):

10776891.arien.ics.muni.cz.uay10648965rn8_i0d0t etc.
[0.465015, 0.448980, 0.479760, 0.478839, 0.424187, 0.513608, 0.553039, 0.454582, 0.443496, 0.429073, 0.468391, 0.457913, 0.480083, 0.506280, 0.467149, 0.474921, ]

4x (TODO 16x) uay10648965rn8_i0d0t_rmsprop - 0.493405 (95% [0.470097, 0.516714]):

10785252.arien.ics.muni.cz.uay10648965rn8_i0d0t_rmsprop etc.
[0.495191, 0.487384, 0.475401, 0.515646, ]

Learned: Transfer learning dramatically improves the model; we are probably training a really powerful RNN! But very importantly, the trained weights either are or aren't conditioned to deal with dropout: retraining must keep the same dropout setting the model was originally trained with.
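In config terms, the lesson reads: when setting up the retraining config, inherit dropout and inp_e_dropout from the pre-training config rather than from the target task's defaults. A trivial sketch (keys follow the dumps above):

```python
pretrain_conf = {'dropout': 0, 'inp_e_dropout': 0, 'sdim': 2, 'pdim': 1}
retrain_conf = {'balance_class': True, 'loss': 'ranknet'}
for key in ('dropout', 'inp_e_dropout'):
    # the pre-trained weights are conditioned on these settings
    retrain_conf[key] = pretrain_conf[key]
print(retrain_conf)
```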

What if we disable retraining of some layers? It's not helpful... (a sketch of the freezing mechanism follows the list below)

1x "fix_layers": "['emb']" - 0.518032

1x "fix_layers": "['emb', 'rnnf', 'rnnb']" - 0.455826

1x "fix_layers": "['emb', 'proj']" - 0.494696

1x "fix_layers": "['rnnf', 'rnnb']" - 0.471188

wang: baseline MRR 0.842155

8x uaw10648965rn8_i0d0t_rmsprop - 0.872871 (95% [0.868261, 0.877481]):

10785261.arien.ics.muni.cz.uaw10648965rn8_i0d0t_rmsprop etc.
[0.878205, 0.873077, 0.869231, 0.870147, 0.866667, 0.866667, 0.875641, 0.883333, ]

Reproducing with transfer.py

NOTE: 10649016 in these job names really means 10648965 - we use the rnn--23fa2eff7cda310d weights, not a51 weights.

All jobs below include adapt_ubuntu=True implicitly.
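For orientation, the whole procedure in miniature: a self-contained Keras toy (stand-in data, dimensions and file name; not the actual transfer.py) that pre-trains on a big stand-in "ubuntu" set, saves weights, restores them into an identically built model, and retrains on a small stand-in target set:

```python
import numpy as np
from keras.layers import Dense
from keras.models import Sequential

def build():
    m = Sequential([Dense(8, input_dim=4, activation='tanh'),
                    Dense(1, activation='sigmoid')])
    m.compile(loss='binary_crossentropy', optimizer='rmsprop')
    return m

source, target = build(), build()
Xs, ys = np.random.rand(256, 4), np.random.randint(0, 2, 256)  # big source set
Xt, yt = np.random.rand(32, 4), np.random.randint(0, 2, 32)    # small target set
source.fit(Xs, ys, verbose=0)
source.save_weights('pretrained.h5', overwrite=True)  # hypothetical weights file
target.load_weights('pretrained.h5')   # architectures must match exactly
target.fit(Xt, yt, verbose=0)          # the actual "retraining" step
```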

Baseline (also having spad=80):

{"Ddim": "2", "balance_class": "True", "batch_size": "64", "dropout": "0", "dropoutfix_inp": "0", "dropoutfix_rec": "0", "e_add_flags": "True", "embdim": "300", "epoch_fract": "0.25", "fix_layers": "[]", "inp_e_dropout": "0", "inp_w_dropout": "0", "l2reg": "0.0001", "loss": "<function ranknet at 0x9ff6230>", "mlpsum": "sum", "nb_epoch": "16", "pact": "tanh", "pdim": "1", "project": "True", "ptscorer": "<function dot_ptscorer at 0x9feeaa0>", "rnn": "<class 'keras.layers.recurrent.GRU'>", "rnnact": "tanh", "rnnbidi": "True", "rnnbidi_mode": "sum", "rnninit": "glorot_uniform", "rnnlevels": "1", "sdim": "2"}

3x uay10649016rnnd0_bal_bs64 batch_size=64 - 0.415885 (95% [0.355578, 0.476192]):

10844362.arien.ics.muni.cz.uay10649016rnnd0_bal_bs64 etc.
[0.446976, 0.387728, 0.412951, ]

4x uay10649016rnnd0_bal balance_class=True - 0.443608 (95% [0.410259, 0.476958]):

10844363.arien.ics.muni.cz.uay10649016rnnd0_bal etc.
[0.414324, 0.439570, 0.473074, 0.447466, ]

4x uay10649016rnnd0 - 0.431194 (95% [0.389786, 0.472601]):

10844364.arien.ics.muni.cz.uay10649016rnnd0 etc.
[0.442428, 0.455617, 0.439347, 0.387382, ]

4x uay10649016rnn_i12d12 - 0.390339 (95% [0.372871, 0.407807]):

10844365.arien.ics.muni.cz.uay10649016rnn_i12d12 etc.
[0.407950, 0.381887, 0.380351, 0.391167, ]

8x R8_uay10649016rnnd0_bal - (pre-training 0.394852) 0.454314 (95% [0.443093, 0.465535]):

10846044.arien.ics.muni.cz.R8_uay10649016rnnd0_bal etc.
[0.437152, 0.449015, 0.445558, 0.477646, 0.448559, 0.471816, 0.461064, 0.443704, ]

8x R8_uay10649016rnnd0_bal_adaptF - (pre-training 0.422342) 0.450973 (95% [0.437645, 0.464302]):

10846045.arien.ics.muni.cz.R8_uay10649016rnnd0_bal_adaptF etc.
[0.438729, 0.450851, 0.423810, 0.463053, 0.477556, 0.453520, 0.462072, 0.438196, ]

16x R_uay10649016rnnd0_bal_rmsprop - 0.493167 (95% [0.477519, 0.508814]):

10854143.arien.ics.muni.cz.R_uay10649016rnnd0_bal_rmsprop etc.
[0.545419, 0.497820, 0.443712, 0.492631, 0.486939, 0.479826, 0.468843, 0.518997, 0.519491, 0.436954, 0.503086, 0.538933, 0.494675, 0.486779, 0.510493, 0.466071, ]

8x R8_uay10649016rn80d0_bal - (pre-training 0.400210) 0.469884 (95% [0.460301, 0.479466]):

10846046.arien.ics.muni.cz.R8_uay10649016rn80d0_bal etc.
[0.468906, 0.478233, 0.451514, 0.474760, 0.488235, 0.471804, 0.453362, 0.472254, ]

Aside from rmsprop (which should do a better job of not destabilizing the weights too much), it seems that a lot of the effect might have come just from changing spad=60 to spad=80; let's see:

16x R_ay_2rnn_i45d45_p1dot (also THE DEFINITE NON-TRANSFER BASELINE) - 0.343247 (95% [0.331021, 0.355473]):

10854177.arien.ics.muni.cz.R_ay_2rnn_i45d45_p1dot etc.
[0.331217, 0.380730, 0.352043, 0.347724, 0.352377, 0.308311, 0.328785, 0.374485, 0.331558, 0.376365, 0.320055, 0.337803, 0.312199, 0.376987, 0.324900, 0.336408, ]

8x R_ay_2rn80_i45d45_p1dot - 0.330107 (95% [0.319037, 0.341176]):

10854945.arien.ics.muni.cz.R_ay_2rn80_i45d45_p1dot etc.
[0.305120, 0.351810, 0.345520, 0.329639, 0.332048, 0.323997, 0.326613, 0.326106, ]

Not confirmed. 80-token spad is beneficial only in transfer learning.
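For context, spad is the fixed token length every sentence is padded or truncated to before entering the RNN; a sketch with the stock Keras helper (the repo's loader may handle this differently):

```python
from keras.preprocessing.sequence import pad_sequences

tokens = [[5, 12, 7], list(range(100))]        # a short and a long sentence
print(pad_sequences(tokens, maxlen=80).shape)  # (2, 80), i.e. spad=80
```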

Is rmsprop generally better?

8x R_ay_2rnn_i45d45_p1dot_rmsprop - 0.346303 (95% [0.331153, 0.361454]):

10866178.arien.ics.muni.cz.R_ay_2rnn_i45d45_p1dot_rmsprop etc.
[0.391169, 0.335013, 0.347051, 0.345706, 0.343806, 0.345807, 0.332839, 0.329036, ]

8x R_ay_2rn80_i45d45_p1dot_rmsprop - 0.343307 (95% [0.333689, 0.352925]):

10866179.arien.ics.muni.cz.R_ay_2rn80_i45d45_p1dot_rmsprop etc.
[0.333199, 0.355873, 0.344276, 0.338909, 0.348613, 0.344083, 0.321559, 0.359942, ]

Not confirmed either; it seems Adam is fine outside of transfer learning.

Final combination - rmsprop + rn80!

16x R_uay10649016rn80d0_bal_rmsprop - 0.511896 (95% [0.503424, 0.520367]):

10866180.arien.ics.muni.cz.R_uay10649016rn80d0_bal_rmsprop etc.
[0.505906, 0.513924, 0.492741, 0.533375, 0.518594, 0.515935, 0.512592, 0.465411, 0.525206, 0.517539, 0.519326, 0.520393, 0.509710, 0.501880, 0.533287, 0.504509, ]

8x quick check - fixing rnnf, rnnb R_uay10649016rn80d0_bal_rmsprop_Frnn - 0.453139 (95% [0.428919, 0.477359]):

10873241.arien.ics.muni.cz.R_uay10649016rn80d0_bal_rmsprop_Frnn etc.
[0.397404, 0.507999, 0.441072, 0.445672, 0.466000, 0.448610, 0.467485, 0.450872, ]

8x fixing emb R_uay10649016rn80d0_bal_rmsprop_Femb - 0.488446 (95% [0.471109, 0.505783]):

10874990.arien.ics.muni.cz.R_uay10649016rn80d0_bal_rmsprop_Femb etc.
[0.451154, 0.496743, 0.510462, 0.508151, 0.485284, 0.468028, 0.511508, 0.476238, ]

Final eval - curatedv2

For final evaluation, we will consider:

  • R_ay_2rnn_i45d45_p1dot
  • R8_uay10649016rnnd0_bal_rmsprop
  • R_uay10649016rn80d0_bal_rmsprop

It's an open question whether we should muddy the waters with rn80 for publication.

| Model | trainAllMRR | devMRR | testMAP | testMRR | settings |
|-------|-------------|--------|---------|---------|----------|
| rnn | 0.389533 | 0.343247 | 0.207937 | 0.281507 | inp_e_dropout=4/5 dropout=4/5 ptscorer=B.dot_ptscorer pdim=1 (R_ay_2rnn_i45d45_p1dot) |
| | ±0.036806 | ±0.012226 | ±0.005489 | ±0.009162 | |
| rnn | 0.409872 | 0.330107 | 0.214150 | 0.286707 | inp_e_dropout=4/5 dropout=4/5 ptscorer=B.dot_ptscorer pdim=1 (E_ay_2rn80_i45d45_p1dot) |
| | ±0.055389 | ±0.011069 | ±0.006655 | ±0.012274 | |
| rnn | 0.600532 | 0.493167 | 0.300700 | 0.463808 | vocabt='ubuntu' inp_e_dropout=0 dropout=0 ptscorer=B.dot_ptscorer pdim=1 balance_class=True adapt_ubuntu=True (R8_uay10649016rnnd0_bal_rmsprop) |
| | ±0.045585 | ±0.015647 | ±0.007871 | ±0.011789 | |
| rnn | 0.619427 | 0.511896 | 0.310194 | 0.473334 | vocabt='ubuntu' inp_e_dropout=0 dropout=0 ptscorer=B.dot_ptscorer pdim=1 balance_class=True adapt_ubuntu=True (R_uay10649016rn80d0_bal_rmsprop) |
| | ±0.022799 | ±0.008472 | ±0.004359 | ±0.007336 | |

Final eval - large2470

8x baseline R_al_2rnn_p1dot - 0.334123 (95% [0.323487, 0.344758]):

10880633.arien.ics.muni.cz.R_al_2rnn_p1dot etc.
[0.327379, 0.346812, 0.348506, 0.353200, 0.320986, 0.316812, 0.331236, 0.328051, ]

8x spad=80 baseline R_al_2rn80_p1dot - 0.330066 (95% [0.319459, 0.340673]):

10875832.arien.ics.muni.cz.R_al_2rn80_p1dot etc.
[0.358279, 0.314384, 0.333608, 0.324758, 0.319437, 0.325398, 0.326847, 0.337815, ]

TODO rnn instead of rn80?

16x E_ual10649016rn80d0_bal_rmsprop (pre-training MRR 0.358823) - 0.517763 (95% [0.510039, 0.525488]):

10882856.arien.ics.muni.cz.E_ual10649016rn80d0_bal_rmsprop etc.
[0.517961, 0.514413, 0.510016, 0.543145, 0.529109, 0.500883, 0.503938, 0.528978, 0.541020, 0.513588, 0.490448, 0.526633, 0.523861, 0.499127, 0.528698, 0.512395, ]

Wow, much better than anything trained directly...

| Model | trainAllMRR | devMRR | testMAP | testMRR | settings |
|-------|-------------|--------|---------|---------|----------|
| rnn | 0.623209 | 0.517763 | 0.359331 | 0.539284 | vocabt='ubuntu' inp_e_dropout=0 dropout=0 ptscorer=B.dot_ptscorer pdim=1 balance_class=True adapt_ubuntu=True opt='rmsprop' (E_ual10649016rn80d0_bal_rmsprop) |
| | ±0.014351 | ±0.007724 | ±0.003003 | ±0.005755 | |

Final eval - wang

16x baseline R_aw_rnn_p1dot - 0.721649 (95% [0.713574, 0.729723]):

10875824.arien.ics.muni.cz.R_aw_rnn_p1dot etc.
[0.707072, 0.716432, 0.732280, 0.705007, 0.704594, 0.731748, 0.698169, 0.750606, 0.731919, 0.732654, 0.722460, 0.725043, 0.694525, 0.732902, 0.731994, 0.728971, ]

16x spad=80 baseline R_aw_rn80_p1dot - 0.725796 (95% [0.716423, 0.735169]):

10875822.arien.ics.muni.cz.R_aw_rn80_p1dot etc.
[0.717863, 0.746007, 0.731490, 0.701957, 0.723101, 0.728483, 0.746683, 0.683786, 0.717452, 0.714565, 0.704367, 0.736625, 0.740824, 0.743188, 0.732873, 0.743471, ]

16x R_uaw10649016rn80d0_bal_rmsprop - 0.872205 (95% [0.867770, 0.876640]):

10873213.arien.ics.muni.cz.R_uaw10649016rn80d0_bal_rmsprop etc.
[0.856410, 0.884249, 0.869231, 0.879487, 0.879487, 0.885897, 0.861538, 0.861538, 0.871795, 0.870513, 0.864103, 0.869231, 0.869744, 0.876923, 0.880769, 0.874359, ]

| Model | trainAllMRR | devMRR | testMAP | testMRR | settings |
|-------|-------------|--------|---------|---------|----------|
| rnn | 0.895331 | 0.872205 | 0.731038 | 0.814410 | vocabt='ubuntu' inp_e_dropout=0 dropout=0 ptscorer=B.dot_ptscorer pdim=1 balance_class=True adapt_ubuntu=True opt='rmsprop' |
| | ±0.006360 | ±0.004435 | ±0.007483 | ±0.008340 | |

16x fixing rnnf, rnnb R_uaw10649016rn80d0_bal_rmsprop_Frnn - 0.858575 (95% [0.855310, 0.861839]):

10873243.arien.ics.muni.cz.R_uaw10649016rn80d0_bal_rmsprop_Frnn etc.
[0.865385, 0.858974, 0.851282, 0.856410, 0.857692, 0.856923, 0.861538, 0.864103, 0.853846, 0.858333, 0.840256, 0.861172, 0.861538, 0.865385, 0.860256, 0.864103, ]

Ubuntu Dialog attn1511 on Anssel

cdim=1/2

Our base model is 10649016.arien.ics.muni.cz.a51d0_sCdot_c12 - 160(?)-token attn1511 no-dropout cdim=1/2 ptscorer=B.dot_ptscorer sdim=1/2:

    RunID: attn1511--56ec61ba4b2fffb5  ({"Ddim": "2", "adim": "0.5", "attn_mode": "sum", "batch_size": "192", "cdim": "0.5", "cfiltlen": "3", "cnnact": "tanh", "cnninit": "glorot_uniform", "dropout": "0", "e_add_flags": "True", "embdim": "300", "focus_act": "softmax", "inp_e_dropout": "0", "l2reg": "0.0001", "loss": "binary_crossentropy", "mlpsum": "sum", "pool_layer": "<class 'keras.layers.convolutional.MaxPooling1D'>", "project": "True", "ptscorer": "<function dot_ptscorer at 0xd9fa050>", "rnn": "<class 'keras.layers.recurrent.GRU'>", "rnnact": "tanh", "rnnbidi": "True", "rnnbidi_mode": "sum", "rnninit": "glorot_uniform", "rnnlevels": "1", "sdim": "0.5"})

The retraining on anssel looks like:

{"Ddim": "2", "adim": "0.5", "attn_mode": "sum", "balance_class": "True", "batch_size": "64", "cdim": "0.5", "cfiltlen": "3", "cnnact": "tanh", "cnninit": "glorot_uniform", "dropout": "0", "dropoutfix_inp": "0", "dropoutfix_rec": "0", "e_add_flags": "True", "embdim": "300", "epoch_fract": "0.25", "fix_layers": "[]", "focus_act": "softmax", "inp_e_dropout": "0", "inp_w_dropout": "0", "l2reg": "0.0001", "loss": "<function ranknet at 0xcc0b230>", "mlpsum": "sum", "nb_epoch": "16", "pool_layer": "<class 'keras.layers.convolutional.MaxPooling1D'>", "project": "True", "ptscorer": "<function dot_ptscorer at 0xcc04aa0>", "rnn": "<class 'keras.layers.recurrent.GRU'>", "rnnact": "tanh", "rnnbidi": "True", "rnnbidi_mode": "sum", "rnninit": "glorot_uniform", "rnnlevels": "1", "sdim": "0.5"}

(Therefore, classes are balanced and a different loss function is used.)

For now, our target is yodaqa/curatedv2.

Basically always, the best epoch is the first one (i.e. after just 1/4 of the dataset, since epoch_fract=0.25 makes each epoch cover a quarter of the training set!).

Baseline (without pretraining) - TODO

2x inp_e_dropout=0 dropout=0 cdim=1/2 ptscorer=B.dot_ptscorer sdim=1/2 (attn1511--2045af5185f9a68a, attn1511--7d0d91aacf3fd15c) - [0.445205, 0.39]

1x inp_e_dropout=0 dropout=0 cdim=1/2 ptscorer=B.dot_ptscorer sdim=1/2 "fix_layers=['emb']" (attn1511-f6db6998c0117eb) - 0.409512

1x above + "fix_layers=['emb', 'rnnf', 'rnnb']" (attn1511--2eaab74e1f83ef7c) - 0.443038

1x above + rmsprop attn1511-db7b6f8014ed1e - 0.503849 (in fifth epoch! less overfit?)

8x uay10649016a51d0_rmsprop - 0.503346 (95% [0.486315, 0.520376]):

10785304.arien.ics.muni.cz.uay10649016a51d0_rmsprop etc.
[0.524046, 0.488576, 0.514498, 0.511442, 0.526407, 0.513901, 0.481409, 0.466486, ]

wang untrained MRR 0.768242

1x above (inp_e_dropout=0 dropout=0 cdim=1/2 ptscorer=B.dot_ptscorer sdim=1/2 "opt='rmsprop'") on wang (attn1511-4033621cc636bab1) - 0.822308

1x above + epoch_fract=1/8 (attn1511-3389ef1cd1a79ca4) - 0.842308

1x above (w/ epoch_fract=1/8 ), (attn1511--82ea9b99d6401ac) - 0.834872

1x above (w/ epoch_fract=1/8 ), (attn1511--204bfdeed994d170, attn1511--204bfdeed994d170) - 0.833077, 0.851795

1x above (w/ epoch_fract=1/8 ), (attn1511-39d39c15dc15bda2) - 0.835641

1x above (w/ epoch_fract=1/8 ), adam (attn1511--309a8fcfd0b885cb) - 0.872308

8x uaw10649016a51d0_rmsprop - 0.855185 (95% [0.846702, 0.863668]):

10785296.arien.ics.muni.cz.uaw10649016a51d0_rmsprop etc.
[0.857326, 0.857326, 0.842821, 0.870256, 0.854487, 0.867051, 0.854121, 0.838095, ]

cdim=1

A different base model 10649015.arien.ics.muni.cz.a51d0_sCdot (attn1511--156abb54ad7724db) is wider with inp_e_dropout=0 dropout=0 ptscorer=B.dot_ptscorer sdim=1/2 cdim=1:

{"Ddim": "2", "adim": "0.5", "attn_mode": "sum", "batch_size": "192", "cdim": "1", "cfiltlen": "3", "cnnact": "tanh", "cnninit": "glorot_uniform", "dropout": "0", "e_add_flags": "True", "embdim": "300", "focus_act": "softmax", "inp_e_dropout": "0", "l2reg": "0.0001", "loss": "binary_crossentropy", "mlpsum": "sum", "pool_layer": "<class 'keras.layers.convolutional.MaxPooling1D'>", "project": "True", "ptscorer": "<function dot_ptscorer at 0xcba4050>", "rnn": "<class 'keras.layers.recurrent.GRU'>", "rnnact": "tanh", "rnnbidi": "True", "rnnbidi_mode": "sum", "rnninit": "glorot_uniform", "rnnlevels": "1", "sdim": "0.5"}

Baseline (without pretraining) - TODO (not wip)

4x Retrained, no dropout uay1078460_2a51d0_s12c1 - 0.479510 (95% [0.444088, 0.514932]):

10792075.arien.ics.muni.cz.uay1078460_2a51d0_s12c1 etc.
[0.490090, 0.482423, 0.443009, 0.502519, ]

4x "fix_layers=['emb', 'rnnf', 'rnnb']" uay1078460_2a51d0_s12c1_Fembrnn - 0.461159 (95% [0.407030, 0.515288]):

10792077.arien.ics.muni.cz.uay1078460_2a51d0_s12c1_Fembrnn etc.
[0.469608, 0.469617, 0.499304, 0.406107, ]

above + rmsprop - 0.503849

cdim=3, MLP

Yet another base model, which should be closer to good performance on ay - 10791244.arien.ics.muni.cz.u2a51d0_s12c3a1D1 (attn1511--6d309aea043d0d0a):

{"Ddim": "1", "adim": "1", "attn_mode": "sum", "batch_size": "192", "cdim": "3", "cfiltlen": "3", "cnnact": "tanh", "cnninit": "glorot_uniform", "dropout": "0", "dropoutfix_inp": "0", "dropoutfix_rec": "0", "e_add_flags": "True", "embdim": "300", "epoch_fract": "0.25", "focus_act": "softmax", "inp_e_dropout": "0", "inp_w_dropout": "0", "l2reg": "0.0001", "loss": "binary_crossentropy", "mlpsum": "sum", "nb_epoch": "16", "pool_layer": "<class 'keras.layers.convolutional.MaxPooling1D'>", "project": "True", "ptscorer": "<function mlp_ptscorer at 0x8e6c488>", "rnn": "<class 'keras.layers.recurrent.GRU'>", "rnnact": "tanh", "rnnbidi": "True", "rnnbidi_mode": "sum", "rnninit": "glorot_uniform", "rnnlevels": "1", "sdim": "0.5"}

4x baseline (without pretraining) ay_2a51d0_s12c3a1D1 - 0.473913 (95% [0.446784, 0.501043]):

10792018.arien.ics.muni.cz.ay_2a51d0_s12c3a1D1 etc.
[0.502280, 0.472003, 0.462177, 0.459193, ]

4x retrained uay10791244_2a51d0_s12c3a1D1 - 0.453239 (95% [0.424928, 0.481551]):

10811029.arien.ics.muni.cz.uay10791244_2a51d0_s12c3a1D1 etc.
[0.477960, 0.427921, 0.450896, 0.456180, ]

Ubuntu Dialog RNN on Para

Baselines

8x R8_pm_rnnd0_p1dot - 0.825014 (95% [0.823182, 0.826846]):

10846052.arien.ics.muni.cz.R8_pm_rnnd0_p1dot etc.
[0.822115, 0.828331, 0.825511, 0.823245, 0.824519, 0.827918, 0.822542, 0.825930, ]

8x R8_pm_rnnd0_p1dot_bal - 0.825015 (95% [0.821727, 0.828304]):

10846053.arien.ics.muni.cz.R8_pm_rnnd0_p1dot_bal etc.
[0.827503, 0.822249, 0.826667, 0.826245, 0.825980, 0.831902, 0.820950, 0.818627, ]

8x R_pm_rnn_i12d12_p1dot - 0.825791 (95% [0.824419, 0.827162]):

10868745.arien.ics.muni.cz.R_pm_rnn_i12d12_p1dot etc.
[0.825359, 0.823952, 0.824940, 0.827086, 0.824940, 0.829327, 0.826347, 0.824373, ]

8x R8_pm_rnn_p1dot - 0.820158 (95% [0.817793, 0.822523]):

10846054.arien.ics.muni.cz.R8_pm_rnn_p1dot etc.
[0.823810, 0.817967, 0.817967, 0.817967, 0.817967, 0.823810, 0.817967, 0.823810, ]

Transfer

NOTE: 10649016 in these job names really means 10648965 - we use the rnn--23fa2eff7cda310d weights, not a51 weights.

Pre-training Transfer Evaluation:

    data/para/msr/msr-para-train.tsv Accuracy: raw 0.673098 (y=0 0.000000, y=1 1.000000), bal 0.500000; F-Score: 0.804613
    data/para/msr/msr-para-val.tsv Accuracy: raw 0.692000 (y=0 0.000000, y=1 1.000000), bal 0.500000; F-Score: 0.817967
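The "raw" vs "bal" numbers above follow from the untuned model predicting y=1 for everything: raw accuracy then equals the positive-class rate, the per-class accuracies are 0.0 and 1.0, and balanced accuracy is their mean. A minimal sketch reproducing the msr-para-val figures:

```python
def eval_binary(y_true, y_pred):
    """Raw accuracy, per-class-balanced accuracy, and F-score."""
    n = float(len(y_true))
    raw = sum(t == p for t, p in zip(y_true, y_pred)) / n
    acc0 = sum(t == p for t, p in zip(y_true, y_pred) if t == 0) / float(y_true.count(0))
    acc1 = sum(t == p for t, p in zip(y_true, y_pred) if t == 1) / float(y_true.count(1))
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    prec = tp / float(sum(y_pred)) if sum(y_pred) else 0.0
    rec = tp / float(sum(y_true)) if sum(y_true) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return raw, (acc0 + acc1) / 2.0, f1

y_true = [1] * 692 + [0] * 308      # ~0.692 positive rate, as on msr-para-val
y_pred = [1] * 1000                 # degenerate all-positive predictions
print(eval_binary(y_true, y_pred))  # raw 0.692, bal 0.5, F-score ~0.818
```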

7x R8_upm10649016rnnd0_bal - 0.807646 (95% [0.794320, 0.820972]):

10846086.arien.ics.muni.cz.R8_upm10649016rnnd0_bal etc.
[0.823529, 0.821138, 0.787795, 0.814815, 0.792666, 0.819974, 0.793605, ]

6x R8_upm10649016rnnd0_bal_adaptF - 0.824848 (95% [0.821708, 0.827989]):

10846085.arien.ics.muni.cz.R8_upm10649016rnnd0_bal_adaptF etc.
[0.823810, 0.824791, 0.822249, 0.823245, 0.823671, 0.831325, ]

14x R8_upm10649016rnnd0 - 0.815770 (95% [0.808625, 0.822915]):

10846087.arien.ics.muni.cz.R8_upm10649016rnnd0 etc.
[0.816000, 0.820996, 0.823681, 0.815217, 0.820513, 0.811111, 0.831633, 0.825316, 0.821138, 0.791304, 0.814714, 0.824447, 0.819718, 0.784993, ]

32x Rf_upm10649016rnnd0_adaptF - 0.823404 (95% [0.822550, 0.824258]):

10854137.arien.ics.muni.cz.Rf_upm10649016rnnd0_adaptF etc.
[0.822830, 0.822830, 0.825359, 0.818074, 0.825776, 0.821853, 0.826506, 0.822830, 0.823671, 0.825776, 0.820878, 0.821853, 0.827503, 0.824242, 0.826087, 0.822830, 0.819277, 0.823810, 0.823810, 0.824940, 0.824373, 0.823389, 0.827751, 0.826087, 0.821818, 0.825776, 0.820452, 0.821429, 0.821853, 0.819048, 0.822830, 0.823389, ]

16x Rf_upm10649016rnnd0_adaptF_rmsprop (BEST) - 0.828242 (95% [0.826436, 0.830047]):

10866217.arien.ics.muni.cz.Rf_upm10649016rnnd0_adaptF_rmsprop etc.
[0.832328, 0.827751, 0.827751, 0.820878, 0.824791, 0.830732, 0.823389, 0.835351, 0.829916, 0.829736, 0.826762, 0.830732, 0.826347, 0.828331, 0.827338, 0.829736, ]

TODO rn80

Final eval:

  • R_pm_rnn_i12d12_p1dot
  • Rf_upm10649016rnnd0_adaptF_rmsprop

| Model | trainAcc | trainF1 | valAcc | valF1 | testAcc | testF1 | settings |
|-------|----------|---------|--------|-------|---------|--------|----------|
| rnn | 0.740772 | 0.838468 | 0.708000 | 0.824969 | 0.676594 | 0.803391 | dropout=1/2 inp_e_dropout=1/2 pdim=1 ptscorer=B.dot_ptscorer (R_pm_rnn_i12d12_p1dot) |
| | ±0.007241 | ±0.003786 | ±0.001846 | ±0.000927 | ±0.000873 | ±0.000479 | |
| rnn | 0.714014 | 0.824742 | 0.713375 | 0.828242 | 0.679783 | 0.805176 | vocabt='ubuntu' pdim=1 ptscorer=B.dot_ptscorer dropout=0 inp_e_dropout=0 adapt_ubuntu=False opt='rmsprop' (upm10649016rnnd0_adaptF_rmsprop) |
| | ±0.011274 | ±0.005751 | ±0.003706 | ±0.001805 | ±0.002587 | ±0.001112 | |

We can conclude that a very tiny effect exists.

Ubuntu Dialog RNN on STS

NOTE: 10649016 in these job names really means 10648965 - we use the rnn--23fa2eff7cda310d weights, not a51 weights.

Non-transfer baselines

Interestingly, they seem to be a lot better than the default configurations. Should we also rebenchmark the whole STS without dropout?

8x R_si_rnn_p1 - 0.684407 (95% [0.663003, 0.705811]):

10854139.arien.ics.muni.cz.R_si_rnn_p1 etc.
[0.713723, 0.709019, 0.649515, 0.678928, 0.707875, 0.693157, 0.640828, 0.682211, ]

16x R_si_rnn_i12d12_p1 - 0.773660 (95% [0.768490, 0.778829]):

10866213.arien.ics.muni.cz.R_si_rnn_i12d12_p1 etc.
[0.764883, 0.785849, 0.775949, 0.782981, 0.783536, 0.766112, 0.779866, 0.780997, 0.772369, 0.770506, 0.771188, 0.753303, 0.785453, 0.761458, 0.782449, 0.761655, ]

8x R_si_rnnd0_p1 - 0.765223 (95% [0.760652, 0.769794]):

10854140.arien.ics.muni.cz.R_si_rnnd0_p1 etc.
[0.771554, 0.762041, 0.760762, 0.772671, 0.760071, 0.757881, 0.765865, 0.770939, ]

Transfer learning

12x (problem: training often gets stuck at NaN) R8_usi10649016rnnd0 (pre-training 0.539569) - 0.789525 (95% [0.778667, 0.800384]):

10846084.arien.ics.muni.cz.R8_usi10649016rnnd0 etc.
[0.803276, 0.761930, 0.777618, 0.797924, 0.749926, 0.783393, 0.802999, 0.795240, 0.799502, 0.805458, 0.796659, 0.800379, ]

16x R8_usi10649016rnnd0_adaptF - 0.792281 (95% [0.782798, 0.801764]):

10846083.arien.ics.muni.cz.R8_usi10649016rnnd0_adaptF etc.
[0.810088, 0.766893, 0.802535, 0.815865, 0.799513, 0.796788, 0.777599, 0.766096, 0.804764, 0.767805, 0.808219, 0.784989, 0.776701, 0.774213, 0.807502, 0.816926, ]

Promising but not that dramatic.

12x R8_usi10649016rnnd0_rmsprop - 0.754339 (95% [0.745182, 0.763495]):

10866215.arien.ics.muni.cz.R8_usi10649016rnnd0_rmsprop etc.
[0.773434, 0.744246, 0.739576, 0.757936, 0.741220, 0.737015, 0.769254, 0.761268, 0.744715, 0.783167, 0.756925, 0.743308, ]

Interesting, RMSprop seems to work well for anssel but not for STS?!

TODO: Try rn80.

Final comparison:

  • R_si_rnn_i12d12_p1
  • R8_usi10649016rnnd0_adaptF

| Model | train | val | test | settings |
|-------|-------|-----|------|----------|
| rnn | 0.879834 | 0.773660 | 0.768794 | pdim=1 nb_runs=8 (si_rnn_i12d12_p1) |
| | ±0.012633 | ±0.005170 | ±0.003780 | |
| rnn | 0.946294 | 0.792281 | 0.799129 | vocabt='ubuntu' pdim=1 ptscorer=B.dot_ptscorer dropout=0 inp_e_dropout=0 adapt_ubuntu=False (R8_usi10649016rnnd0_adaptF) |
| | ±0.018979 | ±0.009483 | ±0.009060 | |

SNLI Transfer

| Model | trainAllMRR | devMRR | testMAP | testMRR | settings |
|-------|-------------|--------|---------|---------|----------|
| rnn + BM25 | 0.496494 | 0.456415 | 0.276863 | 0.486928 | |
| | ±0.015167 | ±0.007189 | ±0.003576 | ±0.008479 | |

9x R_SalR0rnn_preBM25P20_rmspropef1bal - 0.438719 (95% [0.427916, 0.449522]):

11242344.arien.ics.muni.cz.R_SalR0rnn_preBM25P20_rmspropef1bal etc.  
[0.454512, 0.439386, 0.425402, 0.433293, 0.457624, 0.457358, 0.427631, 0.416604, 0.436659, ]
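The preBM25P20 part of the job name suggests BM25 prescoring with the candidate list pruned to the top 20 (cf. the prescoring / prescoring_prune options in the config dumps above). A minimal Okapi BM25 sketch with textbook k1/b parameters; the repo's actual prescoring hook is separate:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score tokenized candidate docs against a tokenized query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / float(N)
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [['how', 'to', 'install', 'an', 'ubuntu', 'package'],
        ['the', 'weather', 'is', 'nice', 'today']]
print(bm25_scores(['install', 'package'], docs))  # prune to top 20 in practice
```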