I came across the following statement in your paper:
"When serving LLaMA-3-8B on a single A100 machine, the model would keep users waiting for 6 minutes to finish the pre-filling stage given a prompt of 300K tokens, and this number increases to 30 minutes for a prompt of 1M tokens."
However, I am also running on a single A100 (80GB) and using Hugging Face's implementation of LLaMA in SDPA mode. With a 50k token context, the pre-fill time is around 2.5 seconds, but when using 100k tokens, I run into an "Out of Memory" issue.
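For reference, my timing setup is roughly the following (a minimal sketch, not my exact script; the model id and the repeated-text prompt are placeholders):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; a long-context variant is needed beyond the base 8K window
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # SDPA mode, as mentioned above
    device_map="cuda",
)

# Build a ~50K-token dummy prompt by repeating a short passage.
prompt = "The quick brown fox jumps over the lazy dog. " * 5000
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    # Generating a single token makes the wall time roughly equal to the pre-fill cost.
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
torch.cuda.synchronize()
print(f"{inputs['input_ids'].shape[1]} prompt tokens, pre-fill time: {time.time() - start:.2f}s")
```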
Could you clarify why there is such a significant discrepancy between your results and mine? Is there something I might be missing or misunderstanding?
Thanks for your help!
First, I apologize for the error in the Introduction section of our paper. The sentence should read: "3 minutes to finish the pre-filling stage given a prompt of 300K tokens," not 6 minutes. You can also verify this in Figure 1(b). We will update the arXiv and NeurIPS versions ASAP. Thank you for pointing this out!
The TTFT (time to first token) should be around 7.5 seconds. This result can also be cross-verified with the vLLM implementation.
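As a rough cross-check (a minimal sketch; the model id, max_model_len, and the dummy prompt below are illustrative, not our exact benchmark script):

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder; use a long-context variant for 100K+ prompts
    max_model_len=131072,
    gpu_memory_utilization=0.95,
)

# Dummy long prompt; replace with your actual context.
prompt = "The quick brown fox jumps over the lazy dog. " * 10000

# One output token, so the measured wall time approximates pre-fill / TTFT.
params = SamplingParams(max_tokens=1)

start = time.time()
llm.generate([prompt], params)
print(f"Approximate TTFT: {time.time() - start:.2f}s")
```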
Lastly, the original HF implementation does not support very long context windows; the optimization steps we performed are detailed in Appendix C.3. You can use --attn_type minference_with_dense with our optimized implementation, or leverage vLLM, to achieve longer context windows.
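On the HF side, the patching step looks roughly like this (a minimal sketch; the model id is a placeholder, and the exact MInference constructor arguments may differ from what is shown, so please check the repo README):

```python
import torch
from transformers import AutoModelForCausalLM
from minference import MInference

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

# attn_type "minference_with_dense" is the option mentioned above: dense attention
# served through our optimized implementation, which supports much longer prompts
# than the stock HF attention path on a single A100.
minference_patch = MInference(attn_type="minference_with_dense", model_name=model_name)
model = minference_patch(model)

# From here, model.generate(...) can be used as usual with long prompts.
```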
Thanks again for raising these points, and please let me know if you have further questions!