Describe the issue
Hello,
I am encountering an unexpected performance issue while using the MInference library with the Qwen/Qwen2-7B-Instruct model. I have followed the example provided in run_hf.py and made minimal changes to adapt it to the Qwen model. Here is the modified code snippet:
import time

from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

prompt = "Hello, my name is"
model_name = "Qwen/Qwen2-7B-Instruct"

# Load the model and tokenizer on the GPU.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="cuda",
)

# Apply the MInference patch to the model.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

# Time tokenization plus generation, then compute tokens/s.
start_time = time.time()
batch_inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**batch_inputs, max_length=10)
end_time = time.time()

elapsed_time = end_time - start_time
tokens_per_second = len(outputs[0]) / elapsed_time
print(f"Tokens per second with MInference: {tokens_per_second}")
With MInference enabled, the measured rate is approximately 10 tokens/s. When I disable the MInference patch (commenting out the line model = minference_patch(model)), the rate increases to about 26 tokens/s. I would like to understand why there is such a significant performance drop when using MInference, and whether this points to a mistake in my usage or a potential bug in the library. My measurement method is sketched again below in a tightened form, in case the methodology itself is part of the problem.
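For reference, here is a minimal sketch of a tighter measurement (it reuses the model and tokenizer objects from above; max_new_tokens=64 is an arbitrary value I chose for illustration). It times only the generate() call and counts only newly generated tokens, since len(outputs[0]) in my snippet also includes the prompt tokens and the timed window includes tokenization:

import time

# Tokenize outside the timed window and record the prompt length.
batch_inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
prompt_len = batch_inputs["input_ids"].shape[1]

# Time only the generation step.
start_time = time.time()
outputs = model.generate(**batch_inputs, max_new_tokens=64)
elapsed_time = time.time() - start_time

# Count only tokens produced beyond the prompt.
new_tokens = outputs[0].shape[0] - prompt_len
print(f"Generated tokens per second: {new_tokens / elapsed_time:.2f}")

With max_length=10 and a short prompt, only a handful of new tokens are generated, so the absolute numbers in my original run are noisy; however, the relative gap between the patched and unpatched model is what concerns me.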
I would appreciate any guidance or insights you can provide to help me resolve this issue.
Thank you for your time and assistance.