
llama-server : implement universal assisted decoding #12635

Open

wants to merge 2 commits into master

Conversation

@g2mt commented Mar 28, 2025

This pull request implements universal assisted decoding in llama-server. This is a method for performing speculative decoding with a draft model whose tokenizer is incompatible with the main model's: the draft's generated tokens are decoded to text and re-encoded with the main model's tokenizer before verification.

It currently works, but some improvements can be made:

  • Token healing could fix any weirdness that occurs when the draft model generates tokens that don't end on a word boundary (it is unclear how much this affects performance).
  • The translation process could be cached to improve sampling time; however, this might require substantial refactoring.

@jukofyork (Contributor) commented:

This looks really interesting! It's surprising how much crossover there is between many models' tokenisers.
