Not able to load VLM #141

Open
Samjith888 opened this issue Nov 12, 2024 · 10 comments

@Samjith888 commented Nov 12, 2024

Thanks for adding VLM support to textgrad.

This doc describes how to use textgrad to do automatic prompt optimization for gpt-4o.

I would like to try non-gpt-4o models such as the Qwen2 VLM / Llama 3.2 9B VLM and generate the prompt automatically from a base prompt. I found this script, which you added recently as part of the VLM integration.

import io
from PIL import Image
import textgrad as tg

# differently from the past tutorials, we now need a multimodal LLM call instead of a standard one!
from textgrad.autograd import MultimodalLLMCall
from textgrad.loss import ImageQALoss
tg.set_backward_engine("meta-llama/Meta-Llama-3-8B-Instruct")
import httpx

image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image_data = httpx.get(image_url).content

image_variable = tg.Variable(image_data, role_description="image to answer a question about", requires_grad=False)

question_variable = tg.Variable("What do you see in this image?", role_description="question", requires_grad=False)
response = MultimodalLLMCall("meta-llama/Meta-Llama-3-8B-Instruct")([image_variable, question_variable])
response

Error:
[screenshot of the error]

@vinid (Collaborator) commented Nov 12, 2024

Can you give me more details? Not sure how to reproduce.

You can also use the more recent engines based on litellm by installing from source.

@Samjith888 (Author)

> Can you give me more details? Not sure how to reproduce.
>
> You can also use the more recent engines based on litellm by installing from source.

Updated. @vinid

@vinid (Collaborator) commented Nov 12, 2024

Unfortunately, set_backward_engine does not directly support vLLM models.

If you want to use vLLM, you need to import the ChatVLLM interface:

import os
import logging
import pytest
from typing import Union, List

from textgrad import Variable, BlackboxLLM, TextLoss
from textgrad.optimizer import TextualGradientDescent
from textgrad.engine.vllm import ChatVLLM

vllm_engine = ChatVLLM(model_string="meta-llama/Meta-Llama-3-8B-Instruct")

def test_simple_forward_pass_engine():
    text = Variable("Hello", role_description="A variable")
    engine = BlackboxLLM(engine=vllm_engine)
    response = engine(text)

    assert response

def test_primitives():
    """
    Test the basic functionality of the Variable class.
    """
    x = Variable("A sntence with a typo", role_description="The input sentence", requires_grad=True)
    system_prompt = Variable("Evaluate the correctness of this sentence", role_description="The system prompt")
    loss = TextLoss(system_prompt, engine=vllm_engine)
    optimizer = TextualGradientDescent(parameters=[x], engine=vllm_engine)

    l = loss(x)
    l.backward(vllm_engine)
    optimizer.step()

    assert x.value == "A sentence with a typo"
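
If the rest of your pipeline needs a backward engine, one possible workaround is to pass the engine object rather than a model-name string. This assumes your installed textgrad's set_backward_engine accepts EngineLM instances as well as strings, so verify against your version:

# Assumption: set_backward_engine accepts an EngineLM instance (not only a
# model-name string) in your installed textgrad version.
import textgrad as tg

tg.set_backward_engine(vllm_engine, override=True)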

@Samjith888 (Author)

Thanks for the quick response.

I'm getting a memory error. I ran this on an A100 machine.

import os
import logging
import pytest
from typing import Union, List

from textgrad import Variable, BlackboxLLM, TextLoss
from textgrad.optimizer import TextualGradientDescent
from textgrad.engine.vllm import ChatVLLM

vllm_engine = ChatVLLM(model_string="meta-llama/Llama-3.2-11B-Vision-Instruct")

def test_simple_forward_pass_engine():
    text = Variable("Hello", role_description="A variable")
    engine = BlackboxLLM(engine=vllm_engine)
    response = engine(text)

    assert response

def test_primitives():
    """
    Test the basic functionality of the Variable class.
    """
    x = Variable("A sntence with a typo", role_description="The input sentence", requires_grad=True)
    system_prompt = Variable("Evaluate the correctness of this sentence", role_description="The system prompt")
    loss = TextLoss(system_prompt, engine=vllm_engine)
    optimizer = TextualGradientDescent(parameters=[x], engine=vllm_engine)

    l = loss(x)
    l.backward(vllm_engine)
    optimizer.step()

    assert x.value == "A sentence with a typo"

if __name__ == "__main__":
    out = test_primitives()
    print(out)

Error:
[screenshot of the error]

In the tutorial, the image is loaded as shown below. Since you mentioned that set_backward_engine is not supported for vLLM, how do I load the image and run the VLM model on it?

import textgrad as tg
from textgrad.autograd import MultimodalLLMCall
import httpx

image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image_data = httpx.get(image_url).content

image_variable = tg.Variable(image_data, role_description="image to answer a question about", requires_grad=False)

question_variable = tg.Variable("What do you see in this image?", role_description="question", requires_grad=False)
response = MultimodalLLMCall("meta-llama/Meta-Llama-3-8B-Instruct")([image_variable, question_variable])
response

@vinid (Collaborator) commented Nov 13, 2024

The memory error is likely due to vLLM; can you send the stack trace of the actual call?

@Samjith888 (Author)

WARNING 11-13 11:05:56 arg_utils.py:967] The model has a long context length (131072). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value.
INFO 11-13 11:05:56 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='meta-llama/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.2-11B-Vision-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False, multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 11-13 11:05:57 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 11-13 11:05:57 selector.py:115] Using XFormers backend.
/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 11-13 11:05:58 model_runner.py:1056] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
INFO 11-13 11:05:59 selector.py:115] Using XFormers backend.
INFO 11-13 11:05:59 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:29<01:57, 29.32s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:56<01:23, 27.97s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [01:04<00:38, 19.08s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [01:24<00:19, 19.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [01:46<00:00, 20.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [01:46<00:00, 21.33s/it]

INFO 11-13 11:07:47 model_runner.py:1067] Loading model weights took 19.9073 GB
INFO 11-13 11:07:47 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/home/samjith/autoprompt_textgrad.py", line 36, in
[rank0]: vllm_engine = ChatVLLM(model_string="meta-llama/Llama-3.2-11B-Vision-Instruct")
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/textgrad/engine/vllm.py", line 28, in init
[rank0]: self.client = LLM(self.model_string)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 177, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 573, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 348, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 483, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
[rank0]: return self.driver_worker.determine_num_available_blocks()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/worker/enc_dec_model_runner.py", line 359, in profile_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/worker/enc_dec_model_runner.py", line 203, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/model_executor/models/mllama.py", line 1254, in forward
[rank0]: cross_attention_states = self.get_cross_attention_states(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/model_executor/models/mllama.py", line 1143, in get_cross_attention_states
[rank0]: cross_attention_states = self.vision_model(pixel_values,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/model_executor/models/mllama.py", line 546, in forward
[rank0]: hidden_state = self.apply_class_embedding(hidden_state)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/home/samjith/anaconda3/envs/textgrad/lib/python3.12/site-packages/vllm/model_executor/models/mllama.py", line 515, in apply_class_embedding
[rank0]: hidden_state = torch.cat([class_embedding, hidden_state], dim=1)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 0 has a total capacity of 39.39 GiB of which 3.77 GiB is free. Including non-PyTorch memory, this process has 35.61 GiB memory in use. Of the allocated memory 34.95 GiB is allocated by PyTorch, and 176.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@vinid
Copy link
Collaborator

vinid commented Nov 13, 2024

This is sus:

WARNING 11-13 11:05:56 arg_utils.py:967] The model has a long context length (131072)

Does vLLM work without textgrad? Like, what happens if you call it directly?
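
As a minimal standalone sanity check (independent of textgrad; the memory-related arguments below are illustrative starting points for a single 40 GB A100, not values taken from this thread):

# Standalone vLLM smoke test, no textgrad involved.
# max_model_len / max_num_seqs are illustrative guesses to reduce memory
# pressure during profiling; tune them for your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    max_model_len=8192,   # cap the 131072-token default context
    max_num_seqs=8,       # smaller profiling batch for the multimodal runner
)

params = SamplingParams(temperature=0, max_tokens=64)
outputs = llm.generate(["Describe what an ant looks like."], params)
print(outputs[0].outputs[0].text)

If this text-only call also runs out of memory, the problem is in vLLM's configuration rather than in textgrad.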

@Samjith888 (Author)

I tried it with Hugging Face code, and it works fine there.

@aabbhishekksr

@vinid I had this error with vLLM for Llama 3.2 Vision. Later I was able to run it via vLLM with a different script.
Can you share how you are using vLLM here?

@vinid (Collaborator) commented Dec 1, 2024

I don't have access to a vLLM instance, so what I would suggest is to either:

  1. Look at what is missing from this engine:
try:
    from vllm import LLM, SamplingParams
except ImportError:
    raise ImportError(
        "If you'd like to use VLLM models, please install the vllm package by running `pip install vllm` or `pip install textgrad[vllm]."
    )

import os
import platformdirs
from .base import EngineLM, CachedEngine


class ChatVLLM(EngineLM, CachedEngine):
    # Default system prompt for VLLM models
    DEFAULT_SYSTEM_PROMPT = ""

    def __init__(
        self,
        model_string="meta-llama/Meta-Llama-3-8B-Instruct",
        system_prompt=DEFAULT_SYSTEM_PROMPT,
        **llm_config,
    ):
        root = platformdirs.user_cache_dir("textgrad")
        cache_path = os.path.join(root, f"cache_vllm_{model_string}.db")
        super().__init__(cache_path=cache_path)

        self.model_string = model_string
        self.system_prompt = system_prompt
        self.client = LLM(self.model_string, **llm_config)
        self.tokenizer = self.client.get_tokenizer()

    def generate(
        self, prompt, system_prompt=None, temperature=0, max_tokens=2000, top_p=0.99
    ):
        sys_prompt_arg = system_prompt if system_prompt else self.system_prompt
        cache_or_none = self._check_cache(sys_prompt_arg + prompt)
        if cache_or_none is not None:
            return cache_or_none

        # The chat template ignores the system prompt;
        conversation = []
        if sys_prompt_arg:
            conversation = [{"role": "system", "content": sys_prompt_arg}]

        conversation += [{"role": "user", "content": prompt}]
        chat_str = self.tokenizer.apply_chat_template(conversation, tokenize=False)

        sampling_params = SamplingParams(
            temperature=temperature, max_tokens=max_tokens, top_p=top_p, n=1
        )

        response = self.client.generate([chat_str], sampling_params)
        response = response[0].outputs[0].text

        self._save_cache(sys_prompt_arg + prompt, response)

        return response

    def __call__(self, prompt, **kwargs):
        return self.generate(prompt, **kwargs)

You might be able to copy-paste code from your script into this engine class and then use it to run vLLM.
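
For example, if your installed ChatVLLM matches the snippet above (i.e., it forwards **llm_config to vllm.LLM; the traceback earlier in this thread suggests the released version may not), you could pass memory-related arguments through it. The concrete values here are illustrative, not tested:

# Assumes ChatVLLM forwards **llm_config to vllm.LLM as in the snippet above;
# the values below are illustrative guesses for a 40 GB A100.
from textgrad.engine.vllm import ChatVLLM

vllm_engine = ChatVLLM(
    model_string="meta-llama/Llama-3.2-11B-Vision-Instruct",
    max_model_len=8192,           # avoid profiling the full 131072-token context
    gpu_memory_utilization=0.90,  # leave some headroom on the GPU
)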

  2. You should be able to use the new litellm engines; see how they work here (a rough sketch follows below).

Let me know if any of these options work!
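
A rough sketch of option 2 (the module path, class name, and constructor arguments are assumptions about the experimental litellm engines installed from source, so double-check them against the textgrad repository):

# Assumed API: textgrad's experimental litellm-based engine, available when
# installing from source. Import path and arguments may differ in your checkout.
import textgrad as tg
from textgrad.engine_experimental.litellm import LiteLLMEngine

engine = LiteLLMEngine(model_string="gpt-4o", is_multimodal=True)  # any litellm-supported model string
tg.set_backward_engine(engine, override=True)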
