Drop-in support for libraries expecting a HF Transformer/Tokenizer? #66

andysalerno · 2023-06-18T00:33:19Z

Hi all,

First off, big kudos. It was easy to have this running locally!

I'm interfacing with a library that expects objects that present as a Hugging Face model or tokenizer, e.g. AutoModelForCausalLM or LlamaTokenizer. Today I'm using qwopqwop200/GPTQ-for-LLaMa to get a quantized model object that works everywhere a HF model is expected. But I'm interested in switching to exllama to benefit from the perf improvements.

Is it feasible to have some wrapper type that could apply that interface on top of the ExLlama and ExLlamaTokenizer classes? Or would that be extremely difficult, since the HF types are so large now?

Or is there some easy way to do this already, and I'm just missing it? Thanks in advance for any advice :)

turboderp (Maintainer) · 2023-06-18T11:40:36Z

Basically, no, there's no easy way to do that. It would be about as involved as using a GGML model in Transformers, because there's very little of the original HF structure left. ExLlama relies on controlling the datatype and stride of the hidden state throughout the forward pass, for instance. That alone is a major obstacle unless you want to slow it down by constantly reshaping and converting tensors. And it wouldn't be possible to just replace any module with an equivalent, compatible module, which is the approach AutoGPTQ and GPTQ-for-LLaMa rely on. PEFT wouldn't work because it relies on hooks, and most of ExLlama's operations are fused in one way or another.

You can pretty trivially make a wrapper that kind of makes it look like a LlamaForCausalLM module that takes input IDs and outputs logits. This is more or less how KoboldAI uses it. You could also replace ExLlamaTokenizer with a HF tokenizer fairly easily since it's more or less equivalent. But you would have to forgo any use of gradients, and even in eval mode you'd have to discard many of the arguments going into the forward pass, such as position IDs and attention masks. It would also fail to produce some of the outputs other parts of the ecosystem might want to rely on, like attention weights. Not to mention you'd need some acrobatics to deal with the cache, which incorporates state that you can't easily add to a past_key_values list of tensor pairs.
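To make the shape of such a wrapper concrete, here is a minimal sketch in plain Python. It is not ExLlama's or KoboldAI's actual code: `LogitsOnlyWrapper`, `CausalLMOutput` and `dummy_backend` are hypothetical names, and the backend is a stand-in function that maps token IDs to logits so the example runs without ExLlama or PyTorch installed. It only illustrates the trade-offs described above: unsupported arguments are dropped, attention weights cannot be returned, and the cache problem is sidestepped by refusing `past_key_values`.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Logits = List[List[float]]  # one row of vocabulary scores per input position


@dataclass
class CausalLMOutput:
    """Hypothetical stand-in for the output object an HF-style caller reads."""
    logits: Logits
    attentions: None = None  # a fused backend has no attention weights to expose


class LogitsOnlyWrapper:
    """Makes an IDs-to-logits backend look roughly like a causal LM module."""

    def __init__(self, backend_forward: Callable[[Sequence[int]], Logits]):
        # backend_forward is an assumption: any callable taking token IDs
        # and returning per-position logits (inference only, no gradients).
        self.backend_forward = backend_forward
        self.ignored_args: List[str] = []

    def __call__(self, input_ids, attention_mask=None, position_ids=None,
                 past_key_values=None, output_attentions=False, **kwargs):
        if output_attentions:
            raise NotImplementedError("attention weights are not available")
        if past_key_values is not None:
            # The backend's cache does not fit a list of tensor pairs,
            # so this sketch refuses it instead of attempting a conversion.
            raise NotImplementedError("external past_key_values not supported")
        # Record which arguments were silently dropped on this call.
        dropped = [name for name, value in (("attention_mask", attention_mask),
                                            ("position_ids", position_ids))
                   if value is not None]
        self.ignored_args = dropped + sorted(kwargs)
        return CausalLMOutput(logits=self.backend_forward(input_ids))


def dummy_backend(input_ids, vocab_size=4):
    """Toy backend: puts all the score on (token_id + 1) % vocab_size."""
    return [[1.0 if v == (t + 1) % vocab_size else 0.0 for v in range(vocab_size)]
            for t in input_ids]


model = LogitsOnlyWrapper(dummy_backend)
out = model([0, 1, 2], attention_mask=[1, 1, 1])
last_row = out.logits[-1]
next_token = max(range(len(last_row)), key=last_row.__getitem__)  # greedy pick
print(next_token, model.ignored_args)
```

A caller that only reads `.logits` would work with this, while anything that depends on masks, position IDs, attention outputs or a standard key/value cache would not.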

And of course it would still be a Llama model, not an AutoModelForCausalLM that conveniently turns into any architecture the framework supports. That's really the strength of Transformers, but you also pay dearly for it in performance. Ultimately, ExLlama is fast because it isn't a PyTorch/Transformers module.
