I am using the script roberta_base.sh to train and test the model on the PubMed summarization task. Training runs successfully for 5000 steps, but the run fails at evaluation time. Below is part of the error output.
I0416 18:16:41.567906 139788890330944 error_handling.py:115] evaluation_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0416 18:16:41.568143 139788890330944 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
File "bigbird/summarization/run_summarization.py", line 534, in <module>
app.run(main)
...
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2268, in create_tpu_hostcall
'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:0", shape=(), dtype=float32, device=/job:worker/task:0/device:CPU:0)
I am not very familiar with the code or with this error. I searched online but did not find much help, so I hope you can help. Below is the script I ran to reproduce the error:
I am also facing a similar issue on my custom dataset. Evaluation works if use_tpu is set to false and the code is run on GPU or CPU, but it takes much longer. Any thoughts on how to resolve this?
Hi @prathameshk, may I ask how you fine-tuned the model on your custom dataset? I tried replacing data_dir with the path that contains my tfrecords, but I got this error:
(0) Invalid argument: Feature: document (data type: string) is required but could not be found.
[[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]]
[[Mean/_19475]]
Update:
I solved this problem by replacing the field names in name_to_features with the actual field names stored in the tfrecord file.
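For anyone hitting the same "is required but could not be found" error, here is a minimal sketch of the check behind that fix: compare the names the parser asks for with the names actually stored in the records. Only "document" is taken from the error message above; "summary", "article", "abstract" and the helper diff_feature_names are hypothetical and used for illustration only.

```python
def diff_feature_names(expected, actual):
    """Compare the names the parser expects with the names stored in a record.

    Returns (missing, unused):
      missing = expected by the parser but absent from the record
      unused  = present in the record but never parsed
    """
    expected, actual = set(expected), set(actual)
    return sorted(expected - actual), sorted(actual - expected)


# Names requested by the parsing spec. "document" comes from the error
# message; "summary" is an assumed second field.
expected_names = ["document", "summary"]

# Hypothetical names stored in a custom tfrecord file.
record_names = ["article", "abstract"]

missing, unused = diff_feature_names(expected_names, record_names)
print("required but not found:", missing)
print("present but never parsed:", unused)
```

If TensorFlow is available, the stored names of a single serialized record can be listed with tf.train.Example.FromString(raw_bytes).features.feature.keys(). Once the two lists differ, either rename the keys in name_to_features to match the file, or rewrite the tfrecords with the expected names.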