OpenAI GPT OSS 20B Model Problems #7186
Replies: 1 comment
-
What instruction template do I need to use? It selects "none" by default, and I'm using Chat-Instruct mode; it sometimes gets stuck in a response loop, printing a lot of "...?"-style output.
I'm loading the original model (i.e., FP16, not GGUF) with the transformers loader.
It's also very slow, even though the model only has a 12.8GB disk footprint and I have a 4060 and a 5060 (16GB VRAM x2).
Do I have something misconfigured? I'm using a 10,10 GPU split with 100GB of CPU offload; I can't allocate more to the GPUs or loading fails with an out-of-memory error.
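For reference, here is a minimal sketch of loading the model directly with transformers, letting the tokenizer's bundled chat template build the prompt (so no instruction template has to be picked by hand) and capping per-device memory rather than relying on a fixed 10,10 split. The `openai/gpt-oss-20b` repo id, the memory caps, and the example prompt are assumptions, not settings taken from this thread:

```python
# Minimal sketch, assuming the openai/gpt-oss-20b repo id and these memory caps.
# apply_chat_template uses the prompt format shipped with the tokenizer, so the
# webui's "instruction template" choice doesn't have to be guessed by hand.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumed Hugging Face repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # cap each 16GB card below its limit and allow CPU offload for the rest
    max_memory={0: "14GiB", 1: "14GiB", "cpu": "100GiB"},
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(input_ids, max_new_tokens=64)
print(tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```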
-
Either I have some major misconfiguration of transformers, or it's just painfully slow. With the setup I described above, my token rate is 0.2 tokens/second. By contrast, I can run the 6-bit extended quant of the 120B gpt-oss model (GGUF) at 6 tokens/second with llama.cpp, 30x faster, even though that model's size on disk is 60GB (versus the 12.8GB footprint of the 20B gpt-oss model in FP16). The 20B gpt-oss FP16 GGUF (run with llama.cpp) has roughly the same disk footprint as the plain 20B FP16 gpt-oss model (both from Hugging Face), but on my rig it runs at 50 tokens/second, 250x the speed of transformers. Strange.
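As a point of comparison, here is a minimal sketch of how raw transformers throughput could be timed outside the webui. The repo id, dtype, prompt, and generation settings are assumptions, not the exact setup measured above:

```python
# Minimal timing sketch, assuming the openai/gpt-oss-20b repo id and BF16 weights.
# device_map="auto" lets accelerate spread layers across both GPUs and the CPU.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumed Hugging Face repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # unquantized weights, as in the FP16 comparison above
    device_map="auto",
)

inputs = tok("Explain mixture-of-experts in one paragraph.", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start
new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```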