
[PoC][Spec Decoding] Stop speculating when confidence is low #5914

Draft
wants to merge 1 commit into main
Conversation

njhill
Member

@njhill njhill commented Jun 27, 2024

Just a quick experiment, this gives an additional ~3% speedup in a greedy bs=1 example.

This just "pads" out the proposed tokens up to num_predict_tokens with the final predicted token. It would be good if we could change the framework to allow the proposer model to return a variable number of tokens.

I'm not sure how this will work in the non-greedy case; we'd still probably want to evaluate the cutoff based on a greedily sampled token.

@cadedaniel @LiuXiaoxuanPKU I guess this is related to DSD
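The cutoff-and-pad idea described above can be sketched roughly as follows. This is a minimal illustration only; `propose_with_cutoff`, its arguments, and the cutoff value are hypothetical stand-ins, not vLLM's actual proposer API:

```python
# Hypothetical sketch of confidence-based early stopping for draft-token
# proposal: stop speculating once the proposer's confidence in a token
# drops below a cutoff, then pad the proposal back to the requested
# length with the final proposed token so downstream scoring still sees
# a fixed-size proposal.

def propose_with_cutoff(step_probs, step_tokens, num_predict_tokens, cutoff=0.5):
    """step_probs[i] is the proposer's probability of step_tokens[i]."""
    proposed = []
    for prob, tok in zip(step_probs, step_tokens):
        proposed.append(tok)
        if prob < cutoff:  # low confidence: stop speculating here
            break
    # "Pad" out the proposal to num_predict_tokens with the last token.
    while len(proposed) < num_predict_tokens:
        proposed.append(proposed[-1])
    return proposed
```

The padding keeps the proposal shape fixed, which is why the PR notes it would be nicer if the framework allowed the proposer to return a variable number of tokens instead.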

@cadedaniel
Collaborator

I am a little hesitant to merge bs=1 optimizations like this until we have a picture of how QPS vs ITL looks. In particular, this one will trade off QPS to achieve a larger ITL reduction (a fine tradeoff if the user wants it).

Can we put it behind a flag?

@njhill
Member Author

njhill commented Jun 27, 2024

@cadedaniel yes this is just exploration right now, have no intention of merging until further testing/hardening and understanding the implications.

I'm not sure I follow the QPS vs ITL tradeoff point though; I'll give it some thought. I think all this optimization really does is increase the acceptance rate (per proposed token). It would be less effective for bs > 1 because I'm only aborting when the probs for all sequences are below the cutoff value (which will happen much less frequently), but it shouldn't be detrimental in theory.
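The batched abort condition described here can be sketched as below. This is an illustrative interpretation of the comment, not the PR's actual code; the function name and cutoff are hypothetical:

```python
# Hypothetical sketch of the bs > 1 abort condition: speculation is cut
# short only at the first step where EVERY sequence in the batch falls
# below the confidence cutoff, which happens far less often than for a
# single sequence.

def find_batch_cutoff_step(batch_step_probs, cutoff=0.5):
    """batch_step_probs[step][seq] is the proposer's probability for that
    sequence's token at that step. Returns the number of proposal steps
    to keep (the low-confidence step's tokens are still included)."""
    for step, probs in enumerate(batch_step_probs):
        if all(p < cutoff for p in probs):
            return step + 1  # all sequences low: stop after this step
    return len(batch_step_probs)  # never aborted: keep every step
```

With larger batches the `all(...)` condition rarely triggers, which matches the comment that the optimization should be less effective, but not harmful, for bs > 1.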

@DarkLight1337
Member

DarkLight1337 commented Jun 28, 2024

To speed up the CI queue for #5905, I've cancelled the distributed tests for the latest CI run in this PR since they won't pass anyway until #5905 has been merged. Please merge main into your branch after that happens so that the CI can pass once again.
