Hello,
I am using the GPT2 models available on HF and running into a few issues. Firstly, there seems to be a problem with the tokenizer. Trying to calculate perplexity using the evaluate module, as follows:
from evaluate import load
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=["Hola, como estas?"], model_id="PlanTL-GOB-ES/gpt2-base-bne", device="cpu")
Gives the following error:
...
File "/ikerlariak/aormazabal024/PhD/Poetry-Generation/demo/poetry-env-traganarru/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
This seems to be related to the special tokens <pad>, <s>, </s> and <unk> not being properly set (even though the evaluate module relies on them), as the only special token added in the tokenizer is <|endoftext|>. One can manually fix this for the local snapshot:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-base-bne")
tokenizer.pad_token = '<pad>'
tokenizer.bos_token = '</s>'
tokenizer.eos_token = '</s>'
tokenizer.unk_token = '<unk>'
tokenizer.save_pretrained('[snapshot-path]')
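For reference, a quick way to confirm this is the cause is to compare the special token ids against the model's embedding size; any token left unset or mapped to an id at or above the number of embedding rows will trigger the IndexError above. This is only a sketch of the check I have in mind, not something from the model card:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "PlanTL-GOB-ES/gpt2-base-bne"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Number of rows in the input embedding matrix; any token id >= this
# value makes torch.embedding fail with "index out of range in self".
vocab_size = model.get_input_embeddings().num_embeddings
print("embedding rows:", vocab_size)

for name in ("pad_token", "bos_token", "eos_token", "unk_token"):
    token = getattr(tokenizer, name)
    token_id = getattr(tokenizer, name + "_id")
    print(f"{name}: {token!r} -> id {token_id}")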
However, even after fixing this, I am getting quite high perplexities compared to the 10-13 reported in the paper, for every sentence I try (assuming per-word perplexity is what is reported). Is it possible something went wrong when converting from fairseq to HF, and are the original fairseq models available somewhere to compare against? Or maybe I am making a mistake when calculating the perplexity: was any tokenization applied to the text apart from BPE (e.g. replacing newlines with a special token, which is fairly standard in fairseq)?
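In case it helps narrow things down, this is roughly how I am cross-checking the numbers with a plain transformers loop instead of the evaluate module. It is just a sketch on a single example sentence, and note that it gives per-BPE-token perplexity, which will differ from a per-word figure:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "PlanTL-GOB-ES/gpt2-base-bne"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "Hola, como estas?"
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels=input_ids returns the mean cross-entropy over the
    # shifted tokens; exponentiating it gives the per-token perplexity.
    loss = model(input_ids, labels=input_ids).loss

print("perplexity:", torch.exp(loss).item())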