
Optimize statistical unigram tokenizer decode_forward #63

Open
wants to merge 1 commit into master

Conversation

@aria42 commented Dec 30, 2021

I was testing the SentencePieceModel tokenizer on longer text (web articles of a few thousand characters) and noticed that tokenization was taking a long time. Looking at the code for the decode_forward pass, it considers candidate spans (char_start, char_end) of arbitrary length, even though the vocabulary has some maximum-length element. Constraining decoding to spans of at most that length yields identical results, since no longer substring can appear in the vocabulary, and it has a dramatic impact when tokenizing longer pieces of text. This PR computes a max_vocab_codeunit_len for the SentencePieceModel, caching the longest code-unit length of any vocabulary element, and uses that field to truncate the search during decoding.
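To make the change concrete, here is a minimal sketch of a bounded forward pass in the same spirit. Everything except max_vocab_codeunit_len (the Dict vocabulary, the function name, the omission of unknown-piece fallback) is an illustrative assumption, not the package's actual implementation:

# Viterbi-style forward pass that only considers spans no longer than
# the longest vocabulary piece. Unknown-piece handling is omitted for brevity.
function decode_forward_bounded(vocab::Dict{String,Float64}, text::String,
                                max_vocab_codeunit_len::Int)
    # Valid character boundaries (code-unit indices), plus one past the end
    bounds = [collect(eachindex(text)); ncodeunits(text) + 1]
    m = length(bounds)
    best = fill(-Inf, m)   # best[j]: best log-prob covering text up to bounds[j]
    back = ones(Int, m)    # back[j]: boundary index where the last piece starts
    best[1] = 0.0
    for j in 2:m
        for i in (j - 1):-1:1
            # The key change: stop once the span exceeds the longest piece,
            # since no longer substring can be in the vocabulary.
            bounds[j] - bounds[i] > max_vocab_codeunit_len && break
            piece = text[bounds[i]:prevind(text, bounds[j])]
            score = best[i] + get(vocab, piece, -Inf)
            if score > best[j]
                best[j], back[j] = score, i
            end
        end
    end
    # Walk the backpointers to recover the segmentation
    pieces = String[]
    j = m
    while j > 1
        i = back[j]
        pushfirst!(pieces, text[bounds[i]:prevind(text, bounds[j])])
        j = i
    end
    return pieces
end

Without the bound, the inner loop enumerates all O(n²) spans; with a small cached bound it does O(n · max_vocab_codeunit_len) work, which is effectively linear in the text length.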

The benchmark below highlights the performance gap. There is existing test coverage for this code and those tests still pass, but I'm happy to add more if there's something else worth testing in the implementation.

using HTTP
using WordTokenizers

# Load the pretrained ALBERT v1 SentencePiece model
spm = load(ALBERT_V1)

# Download Hamlet text and truncate to roughly the first 5k bytes
# (thisind keeps the cut on a valid character boundary)
long_text = String(HTTP.get("https://dlg.usg.edu/record/dlg_zlgb_gb5027/fulltext.text").body)
max_len = 5000
long_text = long_text[begin:thisind(long_text, max_len)]

@time spm(long_text)

Before this PR: 7.098319 seconds (195.33 k allocations: 11.746 MiB, 1.41% compilation time)
With this PR: 0.016252 seconds (8.23 k allocations: 1.026 MiB)
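For reference, the cached bound itself is cheap to compute once at model load time. A sketch, assuming the pieces are the keys of a vocab dictionary (the real SentencePieceModel field layout may differ):

# Longest vocabulary piece in code units, matching how Julia indexes Strings
max_vocab_codeunit_len = maximum(ncodeunits, keys(vocab))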

@aria42 (Author) commented Jan 7, 2022

Thanks @aviks! I'm assuming this means I can merge the PR, or do I need another approver?

@aria42 (Author) commented Jan 24, 2022

@aviks I think you'll need to merge this, since I can't do it myself.
