Fix start/end index calculations for Unicode characters (#1171)
Summary:
Pull Request resolved: #1171

The existing GPT2BPETokenizer incorrectly calculates the start and end indices of Unicode characters. For multi-byte characters, the byte decoder must additionally be applied to the decoded bytes to recover the original token that was encoded; without it, the reported indices are computed over the byte-level stand-in characters rather than the original text.

Reviewed By: chenyangyu1988

Differential Revision: D18697646

fbshipit-source-id: 8f4d32a1caa40d8d06e7be31dfd4a6846692531a
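To illustrate the failure mode, here is a minimal sketch, not the PyText implementation itself. It uses GPT-2's well-known reversible byte-to-unicode mapping (from openai/gpt-2's encoder), and a hypothetical helper `token_char_length` that applies the byte decoder before measuring a token's length. For a multi-byte character such as "€", the naive length of the BPE token string overcounts, which is exactly the kind of drift that corrupts start/end offsets.

```python
# Sketch only: GPT-2 BPE maps raw UTF-8 bytes to printable unicode
# stand-ins before tokenizing, so len(token) counts stand-in characters,
# not characters of the original text. Offsets must be computed after
# mapping tokens back through the byte decoder.


def bytes_to_unicode():
    """GPT-2's reversible byte <-> unicode mapping (as in openai/gpt-2)."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_ENCODER = bytes_to_unicode()
BYTE_DECODER = {v: k for k, v in BYTE_ENCODER.items()}


def token_char_length(bpe_token: str) -> int:
    """Hypothetical helper: length of a BPE token in original characters."""
    # Map each stand-in character back to its raw byte, then decode the
    # byte sequence as UTF-8 to measure the length in real characters.
    raw = bytes(BYTE_DECODER[c] for c in bpe_token)
    return len(raw.decode("utf-8", errors="replace"))


# "€" is one character but three UTF-8 bytes, so its byte-level BPE
# representation is three stand-in characters.
euro_bpe = "".join(BYTE_ENCODER[b] for b in "€".encode("utf-8"))
assert len(euro_bpe) == 3                 # naive length: wrong for offsets
assert token_char_length(euro_bpe) == 1   # decoded length: correct
```

With per-token lengths computed this way, accumulating them over the token sequence yields start/end indices that line up with the original string even when it contains multi-byte characters.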