I noticed that when we tokenize, we set `add_special_tokens` to True here: https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retrieval.py#L72

which adds a [CLS] token to the beginning of the document tokens. But when we embed the chunks with BERT, we also add a [CLS] token to the beginning of each chunk: https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retrieval.py#L240

So some chunks (the first chunk of every document) will have two [CLS] tokens at the beginning. I think the solution here is just to turn off `add_special_tokens` when going from text -> chunks? Is that correct?

@mitchellgordon95 Hey Mitchell, yes indeed, you spotted a problem I knew about but did not address. However, my take is that multiple [CLS] tokens shouldn't harm things too much (I could be totally wrong about that, though).

Yes, you are correct that `add_special_tokens` controls the addition of [CLS] and [SEP] (which I use as the end-of-string/document token).
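The mismatch is easy to see with a toy sketch. Note this is a hypothetical minimal tokenizer and chunker for illustration, not the actual code in retrieval.py; 101 and 102 are BERT's real ids for [CLS] and [SEP]:

```python
CLS, SEP = 101, 102  # BERT's special token ids

def tokenize(ids, add_special_tokens=True):
    # Mimics the tokenizer: optionally wraps the ids in [CLS] ... [SEP]
    return [CLS] + ids + [SEP] if add_special_tokens else list(ids)

def chunk(ids, size):
    # Split a document's token ids into fixed-size chunks
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def embed_input(chunk_ids):
    # The embedding step prepends its own [CLS] (as at retrieval.py#L240)
    return [CLS] + chunk_ids

doc = [7, 8, 9, 10]  # stand-in token ids for one document

# Current path: special tokens added at doc tokenization AND at embedding,
# so the first chunk of the doc starts with two [CLS] tokens.
first = embed_input(chunk(tokenize(doc, add_special_tokens=True), 3)[0])
print(first)  # [101, 101, 7, 8] — double [CLS]

# Proposed fix: add_special_tokens=False when going text -> chunks,
# leaving the embedding step as the sole source of [CLS].
fixed = embed_input(chunk(tokenize(doc, add_special_tokens=False), 3)[0])
print(fixed)  # [101, 7, 8, 9] — single [CLS]
```

The same fix would drop the trailing [SEP] from the last chunk of each document, which matters if [SEP] is being relied on as the end-of-document marker.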