
[PoC][Spec Decoding] Stop speculating when confidence is low #5914

Draft
wants to merge 1 commit into main
Conversation

njhill
Member

@njhill njhill commented Jun 27, 2024

Just a quick experiment, this gives an additional ~3% speedup in a greedy bs=1 example.

This just "pads" out the proposed tokens up to num_predict_tokens with the final predicted token. It would be good if we could change the framework to allow the proposer model to return a variable number of tokens.

I'm not sure how this will work in the non-greedy case; we'd still probably want to evaluate the cutoff based on a greedily sampled token.

@cadedaniel @LiuXiaoxuanPKU I guess this is related to DSD
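The cutoff-and-pad idea described above can be sketched roughly as follows. This is a minimal illustration only; `propose_with_cutoff`, its arguments, and the cutoff value are hypothetical stand-ins, not vLLM's actual proposer API:

```python
# Hypothetical sketch of confidence-based early stopping for draft-token
# proposal: stop speculating once the proposer's confidence in a token
# drops below a cutoff, then pad the proposal back to the requested
# length with the final proposed token so downstream scoring still sees
# a fixed-size proposal.

def propose_with_cutoff(step_probs, step_tokens, num_predict_tokens, cutoff=0.5):
    """step_probs[i] is the proposer's probability of step_tokens[i]."""
    proposed = []
    for prob, tok in zip(step_probs, step_tokens):
        proposed.append(tok)
        if prob < cutoff:  # low confidence: stop speculating here
            break
    # "Pad" out the proposal to num_predict_tokens with the last token.
    while len(proposed) < num_predict_tokens:
        proposed.append(proposed[-1])
    return proposed
```

The padding keeps the proposal shape fixed, which is why the PR notes it would be nicer if the framework allowed the proposer to return a variable number of tokens instead.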

@cadedaniel
Collaborator

I am a little hesitant to merge bs=1 optimizations like this until we have a picture of how QPS vs ITL looks. In particular, this one will trade off QPS to achieve a larger ITL reduction (a fine tradeoff if the user wants it).

Can we put it behind a flag?

@njhill
Member Author

njhill commented Jun 27, 2024

@cadedaniel yes this is just exploration right now, have no intention of merging until further testing/hardening and understanding the implications.

I'm not sure I follow the QPS vs ITL tradeoff point though; I'll give it some thought. I think all this optimization really does is increase the acceptance rate (per proposed token). It would be less effective for bs > 1 because I'm only aborting when the probs for all sequences are below the cutoff value (which will happen much less frequently), but it shouldn't be detrimental in theory.
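The batched abort condition described here can be sketched as below. This is an illustrative interpretation of the comment, not the PR's actual code; the function name and cutoff are hypothetical:

```python
# Hypothetical sketch of the bs > 1 abort condition: speculation is cut
# short only at the first step where EVERY sequence in the batch falls
# below the confidence cutoff, which happens far less often than for a
# single sequence.

def find_batch_cutoff_step(batch_step_probs, cutoff=0.5):
    """batch_step_probs[step][seq] is the proposer's probability for that
    sequence's token at that step. Returns the number of proposal steps
    to keep (the low-confidence step's tokens are still included)."""
    for step, probs in enumerate(batch_step_probs):
        if all(p < cutoff for p in probs):
            return step + 1  # all sequences low: stop after this step
    return len(batch_step_probs)  # never aborted: keep every step
```

With larger batches the `all(...)` condition rarely triggers, which matches the comment that the optimization should be less effective, but not harmful, for bs > 1.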

@DarkLight1337
Member

DarkLight1337 commented Jun 28, 2024

To speed up the CI queue for #5905, I've cancelled the distributed tests for the latest CI run in this PR since they won't pass anyway until #5905 has been merged. Please merge main into your branch after that happens so that the CI can pass once again.
