Error in PubMed evaluation using run_summarization.py #15

Open
Amit-GH opened this issue Apr 16, 2021 · 3 comments

Amit-GH commented Apr 16, 2021

I am using the script roberta_base.sh to train and test the model on the PubMed summarization task. I am able to train the model successfully for multiple steps (5000), but it fails at evaluation time. Below is part of the error output.

I0416 18:16:41.567906 139788890330944 error_handling.py:115] evaluation_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0416 18:16:41.568143 139788890330944 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
  File "bigbird/summarization/run_summarization.py", line 534, in <module>
    app.run(main)
...
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2268, in create_tpu_hostcall
    'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:0", shape=(), dtype=float32, device=/job:worker/task:0/device:CPU:0)

I am not too familiar with the code or with this error, and searching online didn't turn up much. Hope you can help. Below is the script I ran to reproduce the error:

python3 bigbird/summarization/run_summarization.py \
  --data_dir="tfds://scientific_papers/pubmed" \
  --output_dir=gs://bigbird-replication-bucket/summarization/pubmed \
  --attention_type=block_sparse \
  --couple_encoder_decoder=True \
  --max_encoder_length=3072 \
  --max_decoder_length=256 \
  --num_attention_heads=12 \
  --num_hidden_layers=12 \
  --hidden_size=768 \
  --intermediate_size=3072 \
  --block_size=64 \
  --train_batch_size=2 \
  --eval_batch_size=4 \
  --num_train_steps=1000 \
  --do_train=True \
  --do_eval=True \
  --use_tpu=True \
  --tpu_name=bigbird \
  --tpu_zone=us-central1-b \
  --gcp_project=bigbird-replication \
  --num_tpu_cores=8 \
  --save_checkpoints_steps=1000 \
  --init_checkpoint=gs://bigbird-transformer/pretrain/bigbr_base/model.ckpt-0
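
For context, the failure is the TPUEstimator constraint stated in the message itself: every tensor passed out through eval_metrics or host_call must keep a leading batch dimension, so a scalar such as the eval loss has to be expanded before it is outfed. A minimal sketch of that pattern (the scalar loss here is a hypothetical stand-in for whatever model_fn actually produces):

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

def metric_fn(loss):
    # Inside metric_fn the batch dimension is restored, so the tensor
    # can be reduced back to a scalar metric.
    return {"eval_loss": tf.metrics.mean(loss)}

total_loss = tf.constant(1.23)  # hypothetical scalar loss from model_fn

# The usual workaround: give the scalar a batch dimension of 1 before
# handing it to TPUEstimatorSpec(eval_metrics=...).
eval_metrics = (metric_fn, [tf.expand_dims(total_loss, 0)])
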
@prathameshk

I am also facing a similar issue on my custom dataset. Evaluation works if use_tpu is set to False and the code runs on a GPU or CPU, but it takes much longer. Any thoughts on how to resolve this?


gymbeijing commented Aug 5, 2021

> I am also facing a similar issue on my custom dataset. Evaluation works if use_tpu is set to False and the code runs on a GPU or CPU, but it takes much longer. Any thoughts on how to resolve this?

Hi @prathameshk, can I ask how you fine-tuned the model on your custom dataset? I was thinking of replacing data_dir with the path containing my tfrecords, but I got this error:

(0) Invalid argument: Feature: document (data type: string) is required but could not be found.
          [[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]
          [[MultiDeviceIteratorGetNextFromShard]]
          [[RemoteCall]] 
         [[IteratorGetNext]]
          [[Mean/_19475]]

Update:
I solved this problem by replacing the name_to_features fields with the actual fields in my tfrecord file.
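
To illustrate that fix (the field names below are hypothetical stand-ins for whatever your records actually contain): the keys in name_to_features have to match the keys the tfrecord file was written with, e.g.:

import tensorflow as tf

# These keys must match the keys used when the tfrecord was written;
# "article" and "abstract" are hypothetical examples.
name_to_features = {
    "article": tf.io.FixedLenFeature([], tf.string),
    "abstract": tf.io.FixedLenFeature([], tf.string),
}

def _decode_record(record):
    example = tf.io.parse_single_example(record, name_to_features)
    # Map the custom keys back to the names the input pipeline expects
    # ("document" is the key named in the error above).
    return {"document": example["article"], "summary": example["abstract"]}
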


Amit-GH commented Aug 9, 2021

If you haven't already, check out the Hugging Face implementation of BigBird. It can be easier to use and to integrate into your project.
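
For reference, a minimal sketch of that route using the BigBird-Pegasus PubMed checkpoint from the Hugging Face Hub (illustrative only; the generation lengths mirror the script above):

from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

model_name = "google/bigbird-pegasus-large-pubmed"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BigBirdPegasusForConditionalGeneration.from_pretrained(model_name)

text = "..."  # a PubMed article to summarize
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=3072)
summary_ids = model.generate(**inputs, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
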
