Drop-in support for libraries expecting a HF Transformer/Tokenizer? #66
Hi all, First off, big kudos. It was easy to have this running locally! I'm interfacing with a library that expects objects that present as a Huggingface model or tokenizer - i.e., Is it feasible to have some wrapper type that could apply that interface on top of the ExLlama and ExLlamaTokenizer classes? Or would that be extremely difficult, since the HF types are so large now? Or is there some easy way to do this already, and I'm just missing it? Thanks in advance for any advice :)
Replies: 1 comment
Basically, no, there's no easy way to do that. It would be about as involved as using a GGML model in Transformers, because there's very little of the original HF structure left. ExLlama relies on controlling the datatype and stride of the hidden state throughout the forward pass, for instance. That alone is a major obstacle unless you want to slow it down by constantly reshaping and converting tensors. And it wouldn't be possible to just replace any module with an equivalent, compatible module, like what AutoGPTQ and GPTQ-for-LLaMa rely on. PEFT wouldn't work because it relies on hooks, and most of ExLlama's operations are fused in one way or another.

You can pretty trivially make a wrapper that kind of makes it look like a And of course it would still be a Llama model, not an
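To make the "trivial wrapper" idea concrete, here is a rough sketch of the surface-level kind of adapter that is feasible. Everything in it is an assumption for illustration: `TokenizerAdapter`, `Encoding`, and `ToyBackend` are hypothetical names, and the backend is assumed to expose plain `encode`/`decode` methods, which is not a statement about ExLlamaTokenizer's actual API. It only imitates the call shape a HF-style caller tends to use (`tok(text)["input_ids"]`, `tok.decode(ids)`); it does nothing about module replacement, hooks, or the rest of what a library might expect from a real HF object.

```python
from dataclasses import dataclass
from typing import List, Protocol


class BackendTokenizer(Protocol):
    """Hypothetical minimal backend interface (assumed, not ExLlama's real API)."""

    def encode(self, text: str) -> List[int]: ...
    def decode(self, ids: List[int]) -> str: ...


@dataclass
class Encoding:
    """Small result object readable by attribute or by key, HF-style."""

    input_ids: List[int]
    attention_mask: List[int]

    def __getitem__(self, key: str) -> List[int]:
        return getattr(self, key)


class TokenizerAdapter:
    """Wraps a backend so callers can use tok(text) and tok.decode(ids)."""

    def __init__(self, backend: BackendTokenizer) -> None:
        self._backend = backend

    def __call__(self, text: str) -> Encoding:
        ids = self._backend.encode(text)
        # No padding in this sketch, so every position is attended to.
        return Encoding(input_ids=ids, attention_mask=[1] * len(ids))

    def decode(self, ids: List[int]) -> str:
        return self._backend.decode(list(ids))


class ToyBackend:
    """Stand-in backend for the example: one token per UTF-8 byte."""

    def encode(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: List[int]) -> str:
        return bytes(ids).decode("utf-8")


tok = TokenizerAdapter(ToyBackend())
enc = tok("hi")
round_trip = tok.decode(enc["input_ids"])
```

Whether something this thin is enough depends entirely on what the consuming library touches. If it only tokenizes and decodes, an adapter along these lines might do; if it inspects modules, registers hooks, or swaps layers, it won't.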