OpenAI GPT OSS 20B Model Problems #7186
Replies: 1 comment
-
What instruction template do I need to use? It selects "none" by default, and I'm using Chat-Instruct mode; it sometimes gets stuck in a response loop, printing a lot of "...?"-style output.
I'm loading the original model (i.e., FP16, not GGUF) with the transformers loader.
It's also very slow, even though the model only has a 12.8GB disk footprint and I have a 4060 and a 5060 (16GB VRAM x2).
Do I have something misconfigured? I'm using a 10,10 GPU split with 100GB of CPU offload; I can't allocate more to the GPUs or loading fails with an out-of-memory error.
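For reference, here is a minimal sketch of loading the model directly with transformers, letting the tokenizer's bundled chat template build the prompt (so no instruction template has to be picked by hand) and capping per-device memory rather than relying on a fixed 10,10 split. The `openai/gpt-oss-20b` repo id, the memory caps, and the example prompt are assumptions, not settings taken from this thread:

```python
# Minimal sketch, assuming the openai/gpt-oss-20b repo id and these memory caps.
# apply_chat_template uses the prompt format shipped with the tokenizer, so the
# webui's "instruction template" choice doesn't have to be guessed by hand.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumed Hugging Face repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # cap each 16GB card below its limit and allow CPU offload for the rest
    max_memory={0: "14GiB", 1: "14GiB", "cpu": "100GiB"},
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(input_ids, max_new_tokens=64)
print(tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```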
-
Either I have some major misconfiguration of transformers, or it's just painfully slow. With the setup I described above, my token rate is 0.2 tokens/second. By contrast, I can run the 6-bit extended quant of the 120B gpt-oss model (GGUF) at 6 tokens/second with llama.cpp, 30x faster, even though that model's size on disk is 60GB (versus the 12.8GB footprint of the 20B gpt-oss model in FP16). The 20B gpt-oss FP16 GGUF (run with llama.cpp) has roughly the same disk footprint as the plain 20B FP16 gpt-oss model (both from Hugging Face), but on my rig it runs at 50 tokens/second, 250x the speed of transformers. Strange.
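As a point of comparison, here is a minimal sketch of how raw transformers throughput could be timed outside the webui. The repo id, dtype, prompt, and generation settings are assumptions, not the exact setup measured above:

```python
# Minimal timing sketch, assuming the openai/gpt-oss-20b repo id and BF16 weights.
# device_map="auto" lets accelerate spread layers across both GPUs and the CPU.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumed Hugging Face repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # unquantized weights, as in the FP16 comparison above
    device_map="auto",
)

inputs = tok("Explain mixture-of-experts in one paragraph.", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start
new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```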