Replies: 1 comment 3 replies
-
Made some progress: I had to convert the model using the llama.cpp conversion script (roughly as sketched below).
However, I then got this error:
Can I update some parameter in my |
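For reference, the conversion step mentioned above looks roughly like this. This is a minimal sketch, assuming llama.cpp's convert.py and quantize tools; the model names and paths are placeholders, not the exact commands from this thread:

```sh
# Sketch only: convert a Hugging Face model directory to GGUF, then quantize.
# Model names and paths are placeholders.
python convert.py ./models/OpenHermes-2.5-Mistral-7B \
  --outfile ./models/openhermes-2.5-f16.gguf --outtype f16

# Optional: quantize the f16 GGUF (the quantize binary is built with llama.cpp).
./quantize ./models/openhermes-2.5-f16.gguf ./models/openhermes-2.5-q4_k_m.gguf Q4_K_M
```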
-
I have fine-tuned OpenHermes 2.5 using the axolotl library with the QLoRA method, and the run produced a .bin adapter file. Now I am trying to run the server with the LoRA adapter loaded, but it errors out with the following message:
llama_apply_lora_from_file_internal: unsupported file version.
Here is the log portion that shows up on the screen:
Here is the command line I am using:
How can I host the model with the adapter using the server? Does the adapter need to be in GGUF as well? This link has an example where the main model is GGUF and the adapter is a .bin file, so I tried that.
Do I need to convert the adapter .bin file after the fine-tuning step before running it with the llama.cpp server? I do not want to merge the adapter into the base model right away; I would like to test it first and possibly keep multiple adapters in separate folders along with the base model (see the sketch below).
Thanks for any pointers.
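For reference, here is roughly what I think the adapter workflow would have to look like if a conversion is required. This is only a sketch based on my reading of the llama.cpp repo; the convert-lora-to-ggml.py script name, the --lora/--lora-base flags, and all paths are assumptions on my part, not a verified recipe:

```sh
# Sketch only: convert a PEFT LoRA adapter to the ggml LoRA format, then
# load it into the server alongside the GGUF base model. Paths are placeholders.

# 1. Convert the adapter directory (adapter_model.bin + adapter_config.json);
#    this is expected to write ggml-adapter-model.bin into the same directory.
python convert-lora-to-ggml.py ./lora-out

# 2. Start the server with the base model and the converted adapter.
#    --lora-base points at an unquantized (f16) base, which is usually
#    suggested when applying a LoRA on top of a quantized model.
./server -m ./models/openhermes-2.5-q4_k_m.gguf \
  --lora ./lora-out/ggml-adapter-model.bin \
  --lora-base ./models/openhermes-2.5-f16.gguf
```

If this is on the right track, I could keep one converted ggml-adapter-model.bin per adapter folder and just point --lora at whichever one I want to test.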