I have been playing with tabbyAPI and its support for draft models (speculative decoding). In short, the performance benefit is very noticeable, and it makes me wonder what the technique could mean for inference on the CPU, or even a mixed CPU/NPU/GPU setup.
Intuitively (and I may well be wrong about this), I think it could make 14B models practically usable for the GPU poor. A rough sketch of the idea is below.
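For anyone unfamiliar with why a draft model helps, here is a minimal sketch of the greedy variant of speculative decoding in Python. The names (`speculative_decode`, `draft_next`, `target_verify`) are hypothetical placeholders I made up for illustration, not tabbyAPI's or llama.cpp's API; in a real backend both callables would be forward passes of the small draft model and the large target model.

```python
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],
    target_verify: Callable[[List[int], List[int]], List[int]],
    k: int = 4,
    max_new_tokens: int = 64,
) -> List[int]:
    """Greedy speculative decoding loop (illustrative sketch).

    draft_next(ctx)          -> the draft model's greedy next token for `ctx`.
    target_verify(ctx, prop) -> the target model's greedy token at each of the
                                len(prop) + 1 positions after `ctx`, computed in
                                ONE batched forward pass.

    The draft cheaply proposes k tokens; the target checks them all at once.
    Output matches greedy decoding with the target alone, but accepted tokens
    cost only a fraction of a target forward pass each.
    """
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1. Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            nxt = draft_next(ctx)
            proposal.append(nxt)
            ctx.append(nxt)

        # 2. Target scores all k proposed positions (plus one bonus) at once.
        target_tokens = target_verify(tokens, proposal)  # length k + 1

        # 3. Accept the longest agreeing prefix; correct at the first mismatch.
        for i, drafted in enumerate(proposal):
            if drafted == target_tokens[i]:
                tokens.append(drafted)
            else:
                tokens.append(target_tokens[i])  # target's token wins here
                produced += i + 1
                break
        else:
            # Every draft token was accepted: keep the target's bonus token too.
            tokens.append(target_tokens[k])
            produced += k + 1
    return tokens


if __name__ == "__main__":
    # Toy stand-ins: both "models" emit (last token + 1) % 100, so the draft
    # always agrees with the target and every proposal is accepted.
    toy_draft = lambda ctx: (ctx[-1] + 1) % 100

    def toy_target(ctx: List[int], prop: List[int]) -> List[int]:
        seq = ctx + prop
        return [(seq[len(ctx) - 1 + i] + 1) % 100 for i in range(len(prop) + 1)]

    print(speculative_decode([1, 2, 3], toy_draft, toy_target, k=4, max_new_tokens=8))
```

The point, as I understand it, is that the expensive target model runs only once per batch of k proposed tokens, so whenever the draft guesses well the token throughput goes up without changing the output. That seems especially interesting for CPU-bound setups, where the target's forward pass dominates the per-token latency.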
See some ballpark numbers without speculative decoding here: