How to reproduce the results of the paper? #9

Open
pfllo opened this issue Jul 20, 2017 · 9 comments

pfllo commented Jul 20, 2017

Hi, thanks for the great code.

I tried running your code on the ATIS data from https://github.com/yvchen/JointSLU/tree/master/data and got an accuracy of 96.75 and an F1 score of 94.42 after training for 8400 steps. (I replaced the digits in the text with digit*n, where n is the length of the digit sequence.)
However, there is still a gap between this result and the published results.

My questions are:
1. Is this result reasonable for the published code?
2. What else should I do to reproduce the published results, besides implementing the tag dependency? Are there any important tricks?
3. Should I use the default hyper-parameters in the code, or a different set?

Thanks a lot.

@HadoopIt (Owner)

Hi @pfllo, thanks for trying out the code!

The sample code here demonstrates how we introduced attention to the tagging task. As mentioned in "generate_embedding_RNN_output.py", the published code does not include the modeling of slot label dependencies. The benefit of modeling label dependencies depends on the task and dataset. On ATIS, connecting emitted slot labels back to the RNN state will improve the F1 score by a small margin. Alternatively, one may add a CRF layer on top of the RNN outputs for sequence-level optimization.
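For anyone who wants to experiment with the CRF option, here is a minimal sketch of what a CRF layer on top of the RNN outputs could look like with the TensorFlow 1.x tf.contrib.crf API. This is not part of the published code; the function name, tensor names, and shapes are assumptions.

```python
# Hypothetical sketch, not part of this repo: a CRF layer on top of the
# (bidirectional) RNN outputs for sequence-level slot-tag optimization,
# using the TensorFlow 1.x tf.contrib.crf API.
import tensorflow as tf

def crf_tagging_loss(rnn_outputs, slot_labels, sequence_lengths, num_slot_tags):
    """rnn_outputs: [batch, max_len, hidden]; slot_labels: [batch, max_len] int32."""
    # Project RNN outputs to per-token tag scores (unary potentials).
    unary_scores = tf.layers.dense(rnn_outputs, num_slot_tags)
    # crf_log_likelihood also learns a [num_tags, num_tags] transition matrix,
    # which is what captures the slot label dependencies.
    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        unary_scores, slot_labels, sequence_lengths)
    loss = tf.reduce_mean(-log_likelihood)
    # At test time, decode the best tag sequence with Viterbi.
    decoded_tags, _ = tf.contrib.crf.crf_decode(
        unary_scores, transition_params, sequence_lengths)
    return loss, decoded_tags
```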

Training the model for more epochs is also likely to improve the test F1 score and classification accuracy:
...
global step 16500 step-time 0.19. Training perplexity 1.01
Test accuracy: 97.87 874/893
Test f1-score: 95.71
global step 16800 step-time 0.17. Training perplexity 1.01
Test accuracy: 98.10 876/893
Test f1-score: 95.87
global step 17100 step-time 0.20. Training perplexity 1.01
Test accuracy: 97.98 875/893
Test f1-score: 95.57
...

Moreover, we also tried using pre-trained word embeddings. On ATIS, the improvement from pre-trained word embeddings is limited, but we did see improvements on some other datasets.
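If anyone wants to try that, here is a minimal sketch of loading GloVe-style vectors into the embedding matrix. The file format, vocabulary mapping, and function name are assumptions for illustration, not part of the published code.

```python
# Hypothetical sketch, not from this repo: build an embedding matrix from a
# GloVe-style text file, falling back to small random vectors for OOV words.
import numpy as np

def build_embedding_matrix(vocab, glove_path, dim=100):
    """vocab: dict mapping word -> row index of the embedding matrix."""
    matrix = np.random.uniform(-0.1, 0.1, (len(vocab), dim)).astype(np.float32)
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in vocab and len(vec) == dim:
                matrix[vocab[word]] = np.asarray(vec, dtype=np.float32)
    return matrix  # use it to initialize (or assign to) the embedding variable
```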

Hope that the above clarification helps.

@MansteinLiliang

Hi, I'm trying out this great code. As you say above, modeling the slot label dependencies benefits the F1 score, but the code gives me an intent classification accuracy of about 96.75, which is lower than the one reported in your paper...
Did you mean that adding the slot label dependencies would raise the classification accuracy to 98+?
Will you open-source the code that models the label dependencies?

HadoopIt commented Aug 4, 2017

Thanks for the interest in our work. I believe the intent classification accuracy is not likely to be improved much by modeling the label dependencies. The results posted above were obtained by directly running the published code once, and the classification accuracy achieved is around 98.10%, compared to the best value (98.21%) reported in the paper for "Attention BiRNN".

The label dependency modeling code we used needs a fair amount of cleaning and formatting, and we don't really have enough bandwidth to handle it at the moment. The implementation we have follows the TensorFlow seq2seq.py decoder example. Thanks.
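For readers who want to try the label-dependency idea themselves, here is a rough sketch of the seq2seq.py-style loop_function that feeds the previously emitted slot label back into the decoder. This is an illustration of the general pattern only, not the authors' implementation; the projection and embedding variables are placeholders.

```python
# Hypothetical sketch, not the authors' code: feed the previously emitted slot
# label back into the decoder via a seq2seq.py-style loop_function
# (as used by rnn_decoder / attention_decoder in TF 1.x).
import tensorflow as tf

def make_label_feedback_loop(label_embedding, output_projection):
    """label_embedding: [num_slot_tags, emb_dim] variable.
       output_projection: (W, b) mapping a decoder output to tag logits."""
    def loop_function(prev, _i):
        W, b = output_projection
        logits = tf.matmul(prev, W) + b          # scores over slot tags
        prev_label = tf.argmax(logits, axis=1)   # greedy choice of previous label
        # The embedding of the emitted label becomes (part of) the next decoder
        # input, which is how label dependencies are fed back into the RNN state.
        return tf.nn.embedding_lookup(label_embedding, prev_label)
    return loop_function
```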

@MansteinLiliang

I think you either do some additional tricks in the data preprocessing step, or you train on the whole training set without leaving aside a validation set (since I don't see any Eval results above). Your code contains a validation step, and I get reasonable results on the slot tagging task, but I always fail to reproduce the result for the intent classification task. The results I got from running the code are as follows (the highest test accuracy is 96.98):
...
global step 16300 step-time 0.12. Training perplexity 1.01
Eval accuracy: 97.60 488/500
Eval f1-score: 96.83
Test accuracy: 96.53 862/893
Test f1-score: 95.48
global step 16400 step-time 0.12. Training perplexity 1.01
Eval accuracy: 97.40 487/500
Eval f1-score: 96.75
Test accuracy: 96.98 866/893
...
What do you think of it?

HadoopIt commented Aug 5, 2017

We used cross-validation for hyper-parameter tuning. For the final model evaluation, we used the full training set (4978 training samples) from the original ATIS train/test split. During data preprocessing, we replaced all digits with "DIGIT"; see, for example, https://github.com/HadoopIt/rnn-nlu/blob/master/data/ATIS_samples/valid/valid.seq.in
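For reference, here is a minimal sketch of that normalization step, assuming each digit character is replaced by one DIGIT token (which would match the digit*n scheme mentioned earlier in the thread). The authors' actual preprocessing script is not published here.

```python
# Illustrative sketch of the digit normalization described above, assuming a
# per-character replacement; not the authors' actual preprocessing code.
import re

def normalize_digits(utterance):
    # e.g. "flight 838 from denver" -> "flight DIGITDIGITDIGIT from denver"
    return re.sub(r"\d", "DIGIT", utterance)
```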

@MansteinLiliang

Thanks for your reply, but it seems that some utterances may have more than one intent (like flight+flight_fare). According to the paper you referred to ("Joint Semantic Utterance Classification and Slot Filling with Recursive Neural Networks"), results were evaluated on 17 intents. In your code, however, you seem to also classify sequences whose intent label is a combination of multiple unit intents, which may hurt classification performance. How do you deal with this?

HadoopIt commented Aug 6, 2017

This is a good point. We simply used the first intent label as the true label during data preprocessing, for both model training and testing. In total, 15 of the 893 utterances in the test set have multiple intents. We processed the ATIS corpus from the original .tsv format, not from Vivian's https://github.com/yvchen/JointSLU/tree/master/data, but they are the same ATIS corpus.
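A minimal sketch of that label-selection step, assuming multi-intent labels are joined with "+" as in the flight+flight_fare example above (illustrative only, not the authors' preprocessing code):

```python
# Illustrative sketch: keep only the first intent when an utterance carries
# multiple intents joined by "+". Not the authors' actual preprocessing code.
def first_intent(intent_label):
    # e.g. "flight+flight_fare" -> "flight"
    return intent_label.split("+")[0]
```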

@bringtree

@HadoopIt Hello, we are trying to reproduce the results of the paper and have a couple of questions.
1. After tuning the hyper-parameters, do you train the final model directly on all of the training data?
2. During training, what indicator should be used to decide when to stop training so that the model does not overfit?

@cherryyue

@HadoopIt Hi, for the final model evaluation you used the full training set (4978 samples). Can you tell me how you obtained your final result? Did you train the model once, run the evaluation many times, and choose the best result, or did you use the average of the results after the model converged?
Thanks!
