I made an OpenAI API compatible server: https://github.com/bjj/exllamav2-openai-server . As the first caveat in the README says, I have no idea what I'm doing. I have a few questions.
(for context, if you want to look at the main loop, it's here: https://github.com/bjj/exllamav2-openai-server/blob/master/server.py#L248 )
I'm currently using one cache per request, which is initialized when it joins the batch (based on `examples/multiple_caches.py`). There are some things that make me think I'm supposed to be sharing a cache (comments in the code, the existence of the `batch_size` parameter to the cache). Should I be trying to use that? Can it be made to work with continuous batching? I don't really understand the relationship between the cache and the actual forward/sample/decode loop (e.g. why the cache knows `current_seq_len`).
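For reference, the decode half of my loop is essentially the pattern from `examples/multiple_caches.py`: stack the last token of every active sequence into one batch and pass the matching per-request caches to `model.forward` as a list. Roughly (heavily simplified, greedy argmax instead of real sampling, the names and dicts are illustrative, `model` is assumed loaded elsewhere):

```python
import torch

# caches:    request_id -> ExLlamaV2Cache (one per active request)
# sequences: request_id -> (1, seq_len) tensor of token ids generated so far

def decode_step(model, caches, sequences):
    ids = list(sequences.keys())
    # Stack the last token of each active sequence into a single batch...
    inputs = torch.cat([sequences[i][:, -1:] for i in ids], dim = 0)
    # ...and pass the per-request caches as a list, multiple_caches.py style.
    logits = model.forward(inputs, [caches[i] for i in ids]).float().cpu()
    for row, i in enumerate(ids):
        token = torch.argmax(logits[row, -1, :]).view(1, 1)  # greedy, for brevity
        sequences[i] = torch.cat([sequences[i], token], dim = 1)
```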
Along those lines, one big serial step is the `preprocess_only` part of initializing a new stream. Is there a smarter way to do that? I assume this is directly related to how I am (mis)using caches.
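Concretely, the serial step I mean is the per-request prompt prefill: one `preprocess_only` forward over the whole prompt before the request can join the batched step above, run one request at a time. Simplified sketch (same illustrative `caches`/`sequences` dicts as above; assumes the prompt has at least two tokens):

```python
from exllamav2 import ExLlamaV2Cache

def admit_request(model, caches, sequences, request_id, prompt_ids):
    # Fresh cache for this request (batch_size defaults to 1).
    cache = ExLlamaV2Cache(model)
    # Prefill all but the last prompt token; this only populates the cache
    # (no logits yet), and it runs serially per new request.
    model.forward(prompt_ids[:, :-1], cache, preprocess_only = True)
    caches[request_id] = cache
    sequences[request_id] = prompt_ids
```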
I originally did `sample` and `tokenizer.decode` before I realized what a minefield streaming decode was, and then I contrived to use `ExLlamaV2StreamingGenerator` even though it is not designed for that (see `patch_gen_single_token`). Would you be interested in a refactor that splits the "smart streaming sample+decode" part out from the `model.forward` part?
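To spell out the minefield part: my first version decoded each sampled token on its own, which mangles words split across tokens and multi-byte characters; the obvious fix is to decode the whole sequence and emit only the new suffix, and my understanding is that `ExLlamaV2StreamingGenerator` already does a smarter version of that. Roughly (illustrative, not exactly my code):

```python
def stream_naive(tokenizer, new_token_ids):
    # What I did first: decode each sampled token in isolation. This garbles
    # words that span several tokens, leading spaces, and multi-byte characters.
    return tokenizer.decode(new_token_ids)

def stream_diff(tokenizer, sequence_ids, emitted_so_far):
    # Safer idea: decode the whole sequence so far and emit only the new tail.
    # (As I understand it, ExLlamaV2StreamingGenerator does something smarter,
    # including holding back text that might be the start of a stop string.)
    text = tokenizer.decode(sequence_ids[0])
    chunk = text[len(emitted_so_far):]
    return chunk, text
```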
I am assuming `config.max_batch_size` is the max number of sequences I can stack in `model.forward`, while `ExLlamaV2Cache(batch_size=)` only applies to sharing caches. Is that correct?
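To make the question concrete, this is the mental model I'm asking about (hypothetical sketch; the path is a placeholder and the comments state my assumptions, which may well be wrong):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"   # placeholder
config.prepare()
config.max_batch_size = 8             # my assumption: cap on how many rows I can
                                      # stack into a single model.forward call

model = ExLlamaV2(config)
model.load()

# My assumption: batch_size here only means one cache that holds the state of
# several sequences at once (a shared cache), not the forward batching limit above.
shared_cache = ExLlamaV2Cache(model, batch_size = 8)

# What I actually do today: one batch_size = 1 cache per request, passed to
# model.forward as a list (as in examples/multiple_caches.py).
per_request_caches = [ExLlamaV2Cache(model) for _ in range(8)]
```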
Thanks for any comments!