I made an OpenAI API compatible server: https://github.com/bjj/exllamav2-openai-server . As the first caveat in the README says, I have no idea what I'm doing. I have a few questions.
(for context, if you want to look at the main loop, it's here: https://github.com/bjj/exllamav2-openai-server/blob/master/server.py#L248 )
I'm currently using one cache per request, which is initialized when it joins the batch (based on `examples/multiple_caches.py`). There are some things that make me think I'm supposed to be sharing a cache (comments in the code, the existence of the `batch_size` parameter to the cache). Should I be trying to use that? Can it be made to work with continuous batching? I don't really understand the relationship between the cache and the actual forward/sample/decode loop (e.g. why the cache knows `current_seq_len`).
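For reference, the decode half of my loop is essentially the pattern from `examples/multiple_caches.py`: stack the last token of every active sequence into one batch and pass the matching per-request caches to `model.forward` as a list. Roughly (heavily simplified, greedy argmax instead of real sampling, the names and dicts are illustrative, `model` is assumed loaded elsewhere):

```python
import torch

# caches:    request_id -> ExLlamaV2Cache (one per active request)
# sequences: request_id -> (1, seq_len) tensor of token ids generated so far

def decode_step(model, caches, sequences):
    ids = list(sequences.keys())
    # Stack the last token of each active sequence into a single batch...
    inputs = torch.cat([sequences[i][:, -1:] for i in ids], dim = 0)
    # ...and pass the per-request caches as a list, multiple_caches.py style.
    logits = model.forward(inputs, [caches[i] for i in ids]).float().cpu()
    for row, i in enumerate(ids):
        token = torch.argmax(logits[row, -1, :]).view(1, 1)  # greedy, for brevity
        sequences[i] = torch.cat([sequences[i], token], dim = 1)
```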
Along those lines, one big serial step is the `preprocess_only` part of initializing a new stream. Is there a smarter way to do that? I assume this is directly related to how I am (mis)using caches.
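Concretely, the serial step I mean is the per-request prompt prefill: one `preprocess_only` forward over the whole prompt before the request can join the batched step above, run one request at a time. Simplified sketch (same illustrative `caches`/`sequences` dicts as above; assumes the prompt has at least two tokens):

```python
from exllamav2 import ExLlamaV2Cache

def admit_request(model, caches, sequences, request_id, prompt_ids):
    # Fresh cache for this request (batch_size defaults to 1).
    cache = ExLlamaV2Cache(model)
    # Prefill all but the last prompt token; this only populates the cache
    # (no logits yet), and it runs serially per new request.
    model.forward(prompt_ids[:, :-1], cache, preprocess_only = True)
    caches[request_id] = cache
    sequences[request_id] = prompt_ids
```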
I originally did `sample` and `tokenizer.decode` before I realized what a minefield streaming decode was, and then I contrived to use `ExLlamaV2StreamingGenerator` even though it is not designed for that (see `patch_gen_single_token`). Would you be interested in a refactor that splits the "smart streaming sample+decode" part out from the `model.forward` part?
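To spell out the minefield part: my first version decoded each sampled token on its own, which mangles words split across tokens and multi-byte characters; the obvious fix is to decode the whole sequence and emit only the new suffix, and my understanding is that `ExLlamaV2StreamingGenerator` already does a smarter version of that. Roughly (illustrative, not exactly my code):

```python
def stream_naive(tokenizer, new_token_ids):
    # What I did first: decode each sampled token in isolation. This garbles
    # words that span several tokens, leading spaces, and multi-byte characters.
    return tokenizer.decode(new_token_ids)

def stream_diff(tokenizer, sequence_ids, emitted_so_far):
    # Safer idea: decode the whole sequence so far and emit only the new tail.
    # (As I understand it, ExLlamaV2StreamingGenerator does something smarter,
    # including holding back text that might be the start of a stop string.)
    text = tokenizer.decode(sequence_ids[0])
    chunk = text[len(emitted_so_far):]
    return chunk, text
```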
I am assuming `config.max_batch_size` is the max number of sequences I can stack in `model.forward`, while `ExLlamaV2Cache(batch_size=)` only applies to sharing caches. Is that correct?
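To make the question concrete, this is the mental model I'm asking about (hypothetical sketch; the path is a placeholder and the comments state my assumptions, which may well be wrong):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"   # placeholder
config.prepare()
config.max_batch_size = 8             # my assumption: cap on how many rows I can
                                      # stack into a single model.forward call

model = ExLlamaV2(config)
model.load()

# My assumption: batch_size here only means one cache that holds the state of
# several sequences at once (a shared cache), not the forward batching limit above.
shared_cache = ExLlamaV2Cache(model, batch_size = 8)

# What I actually do today: one batch_size = 1 cache per request, passed to
# model.forward as a list (as in examples/multiple_caches.py).
per_request_caches = [ExLlamaV2Cache(model) for _ in range(8)]
```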
Thanks for any comments!