Error in PubMed evaluation using run_summarization.py #15

Open
Amit-GH opened this issue Apr 16, 2021 · 3 comments

Amit-GH commented Apr 16, 2021

I am using the script roberta_base.sh to train and test the model on the PubMed summarization task. I am able to train the model successfully for multiple steps (5000), but it fails at evaluation time. Below is part of the error output.

I0416 18:16:41.567906 139788890330944 error_handling.py:115] evaluation_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0416 18:16:41.568143 139788890330944 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
  File "bigbird/summarization/run_summarization.py", line 534, in <module>
    app.run(main)
...
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2268, in create_tpu_hostcall
    'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:0", shape=(), dtype=float32, device=/job:worker/task:0/device:CPU:0)

I am not too familiar with the code or with this error, and searching online didn't turn up much. Hope you can help. Below is the script I ran to reproduce the error:

python3 bigbird/summarization/run_summarization.py \
  --data_dir="tfds://scientific_papers/pubmed" \
  --output_dir=gs://bigbird-replication-bucket/summarization/pubmed \
  --attention_type=block_sparse \
  --couple_encoder_decoder=True \
  --max_encoder_length=3072 \
  --max_decoder_length=256 \
  --num_attention_heads=12 \
  --num_hidden_layers=12 \
  --hidden_size=768 \
  --intermediate_size=3072 \
  --block_size=64 \
  --train_batch_size=2 \
  --eval_batch_size=4 \
  --num_train_steps=1000 \
  --do_train=True \
  --do_eval=True \
  --use_tpu=True \
  --tpu_name=bigbird \
  --tpu_zone=us-central1-b \
  --gcp_project=bigbird-replication \
  --num_tpu_cores=8 \
  --save_checkpoints_steps=1000 \
  --init_checkpoint=gs://bigbird-transformer/pretrain/bigbr_base/model.ckpt-0
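
For context, the failure is the TPUEstimator constraint stated in the message itself: every tensor passed out through eval_metrics or host_call must keep a leading batch dimension, so a scalar such as the eval loss has to be expanded before it is outfed. A minimal sketch of that pattern (the scalar loss here is a hypothetical stand-in for whatever model_fn actually produces):

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

def metric_fn(loss):
    # Inside metric_fn the batch dimension is restored, so the tensor
    # can be reduced back to a scalar metric.
    return {"eval_loss": tf.metrics.mean(loss)}

total_loss = tf.constant(1.23)  # hypothetical scalar loss from model_fn

# The usual workaround: give the scalar a batch dimension of 1 before
# handing it to TPUEstimatorSpec(eval_metrics=...).
eval_metrics = (metric_fn, [tf.expand_dims(total_loss, 0)])
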
@prathameshk

I am also facing a similar issue on my custom dataset. Evaluation works if use_tpu is set to False and the code runs on a GPU or CPU, but it takes much longer. Any thoughts on how to resolve this?


gymbeijing commented Aug 5, 2021

> I am also facing a similar issue on my custom dataset. Evaluation works if use_tpu is set to False and the code runs on a GPU or CPU, but it takes much longer. Any thoughts on how to resolve this?

Hi @prathameshk, can I ask how you fine-tuned the model on your custom dataset? I was thinking of replacing data_dir with the path containing my tfrecords, but I got this error:

(0) Invalid argument: Feature: document (data type: string) is required but could not be found.
          [[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]
          [[MultiDeviceIteratorGetNextFromShard]]
          [[RemoteCall]] 
         [[IteratorGetNext]]
          [[Mean/_19475]]

Update:
I solved this problem by replacing the name_to_features fields with the actual fields in my tfrecord file.
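
To illustrate that fix (the field names below are hypothetical stand-ins for whatever your records actually contain): the keys in name_to_features have to match the keys the tfrecord file was written with, e.g.:

import tensorflow as tf

# These keys must match the keys used when the tfrecord was written;
# "article" and "abstract" are hypothetical examples.
name_to_features = {
    "article": tf.io.FixedLenFeature([], tf.string),
    "abstract": tf.io.FixedLenFeature([], tf.string),
}

def _decode_record(record):
    example = tf.io.parse_single_example(record, name_to_features)
    # Map the custom keys back to the names the input pipeline expects
    # ("document" is the key named in the error above).
    return {"document": example["article"], "summary": example["abstract"]}
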


Amit-GH commented Aug 9, 2021

If you haven't already, check out the Hugging Face implementation of BigBird. It can be easier to use and to integrate into your project.
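
For reference, a minimal sketch of that route using the BigBird-Pegasus PubMed checkpoint from the Hugging Face Hub (illustrative only; the generation lengths mirror the script above):

from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

model_name = "google/bigbird-pegasus-large-pubmed"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BigBirdPegasusForConditionalGeneration.from_pretrained(model_name)

text = "..."  # a PubMed article to summarize
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=3072)
summary_ids = model.generate(**inputs, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
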
