Replies: 25 comments
-
We are seeing ca. 30% speedup at int4 vs. fp16, but nowhere near the benchmarks listed in the readme. I notice that @turboderp is using a 12900K, which has phenomenal single-threaded performance. So it appears that we are still CPU-bound for most people, which I find perplexing.
-
@pineking: The inference speed is, at least theoretically, 3-4x faster than FP16 once you're bandwidth-limited, since all that ends up mattering is how fast your GPU can read through every parameter of the model once per token. De-quantizing the weights on the fly is cheap compared to the memory access and should pipeline just fine, with the CUDA cores easily doing all the required computation on one batch of weights while waiting for the next batch to load from VRAM. I should be getting a 3090-Ti tomorrow that has somewhat slower (and fewer) cores than the 4090 but the same memory bandwidth, so I should be able to confirm that it performs about as well as the 4090. I've already confirmed that a 3070-Ti with half the bandwidth of the 4090 also gets about half the performance.

@qeternity: It's really difficult to profile a CPU bottleneck that I can't see because my CPU is too fast. :) I might try underclocking it at some point, or see if I can't force it to only use E-cores somehow. You just can't meaningfully measure the time it takes a PyTorch operation to complete, since everything actually runs in the CUDA queue. It's also kind of unpredictable the way PyTorch manages resources under the hood. All the automation is very convenient, and the overhead becomes negligible when you're just offloading large batches of computation to a GPU (the typical scenario during training, which appears to be the intended use case for PyTorch), but when going token by token many of the operations end up being quite small, so the per-operation overhead from using a complex framework in an interpreted language becomes significant.

I try to work around that by bypassing PyTorch as much as I can. But this creates another problem of having to manually tune everything, and that's a lot of work, made much harder by the fact that I don't have 20 different hardware configurations to test on. The approach I'm going for is to make it as fast as possible on my setup while keeping a catalog of all the methods I've experimented with. Then at some point (or gradually) I can add some of those as options that may work better on systems with slower CPUs, slower system RAM, older GPUs, power-limited GPUs, who knows. Keeping it all in the codebase makes it unmanageable, though. There are already a lot of permutations to validate, and the number doubles with every option added.

I'm still curious what you mean by "nowhere near", though. Exactly what speeds are you getting, and what hardware setup is producing those poor results? What specific models?
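(For intuition, a rough back-of-envelope sketch of that bandwidth limit; the model size, bits per weight, and bandwidth figure below are illustrative assumptions, not measurements:)

```python
# Bandwidth-bound token rate, back of the envelope. Every weight has to be
# read from VRAM once per token, so the ceiling is roughly bandwidth / model size.
model_params = 7e9          # assume a 7B-parameter model
bytes_per_weight = 0.5      # ~4 bits per weight when int4-quantized
bandwidth_bytes_s = 1008e9  # e.g. advertised 4090 memory bandwidth, ~1008 GB/s

bytes_per_token = model_params * bytes_per_weight
tokens_per_s = bandwidth_bytes_s / bytes_per_token
print(f"~{tokens_per_s:.0f} tokens/s upper bound")  # ~288 t/s, ignoring all overhead
```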
-
Hi @turboderp - amazing work you've done! Didn't mean to imply otherwise. Thanks very much! My observation re: the CPU bottleneck is not directed at your work; it's something I see in most transformer implementations. I haven't done any work to assess why this is the case, but your comments make sense to me. With a 3090 Ti and an Epyc 7282 (very slow), we are getting ca. 34 t/s on 7b models and 26 t/s on 13b models.
-
There must be something else going on, then. That's a little less than half the single-core performance of the 12900K, but you're seeing much less than half the performance. The 3090-Ti should be comparable to the 4090 (I'll know for sure tomorrow when it arrives), so there has to be something else slowing it down.

Is anything else using the GPU at the same time? I've noticed that it's very sensitive to that, and even a tiny little bit of animation going on in some other window can have a big impact, presumably because it relies heavily on caching. Come to think of it, that might be a reason for going back to using SMEM... I'll experiment some more. :)
-
I've just tried it on a 3060 and I'm getting the same perf as a 3090. FWIW this is on the Tensor Dock marketplace, just for R&D purposes.
-
Been playing around with this a bit more. I can get 45-47 t/s with a 7b model and 36-37 t/s on a 13b model on a 4090, using the latest drivers in an Ubuntu container w/ an AMD 3990X.
-
@qeternity Could you elaborate on the hardware and software setup? The 3990X is quite slow single-threaded, but not that slow; I wouldn't expect less than half the performance of a 12900K. How is it containerized? What's the host system?
-
I can't be sure about the hypervisor hardware, as this is a cloud GPU. I would be more than happy to fund your access to GPU time if you're interested. When I did some profiling last week, it seemed that most CPU time was spent shuffling between CPU and GPU in
-
CPU profiling is a little tricky with this. I've run into the same thing when profiling, and it's caused by the fact that

There is one other place where a bit of data is moved across, because the embedding table resides in system RAM. You can move it to VRAM as a test by replacing

Then again, it could be that this virtualized environment just has really, really poor bandwidth between system RAM and VRAM. I guess you would measure it (crudely) with a script like this:
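(A minimal sketch of such a crude host-to-device copy test, assuming PyTorch; the tensor shape and iteration count are arbitrary placeholders, not the exact script being referred to:)

```python
import time
import torch

# Time repeated copies of a pinned CPU tensor into VRAM and report GB/s.
shape = (4096, 4096)
x_cpu = torch.randn(shape, dtype=torch.float16, pin_memory=True)
x_gpu = torch.empty(shape, dtype=torch.float16, device="cuda")

torch.cuda.synchronize()
start = time.time()
iters = 200
for _ in range(iters):
    x_gpu.copy_(x_cpu, non_blocking=True)
torch.cuda.synchronize()
elapsed = time.time() - start

bytes_copied = x_cpu.numel() * x_cpu.element_size() * iters
print(f"{bytes_copied / elapsed / 1e9:.2f} GB/s")
```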
I'm getting 4-8 GB/s on the 4090 (PCIe 4.0 x16) and 2-3 GB/s for the 3090 (PCIe 3.0 x4). It oddly depends a lot on the shape of the tensor (!) and is far from the theoretical bus speed, but given that the data copied is on the order of 100 kB per token, I really doubt it's a bottleneck in any case.

I don't really have time to get into optimizing on a cloud instance right now. But I'm rewriting the CUDA backend at the moment, with a bunch of switchable options for CUDA kernels etc., and more code moved to C++ where the performance is more predictable and it will be easier to profile. So there are improvements coming, don't worry. And I will find that CPU bottleneck if it kills me, because really the CPU shouldn't matter at all here.
-
Many thanks for all your hard work. Running the above, I am actually getting 9-9.5 GB/s on a 3090 (PCIe 4.0 x16).

Here is the cProfile of the benchmark segments in

You can see
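(For reference, a hypothetical sketch of wrapping a benchmark loop in cProfile and printing the hottest entries; the generate_tokens() function here is just a placeholder workload, not the actual benchmark code:)

```python
import cProfile
import pstats

def generate_tokens():
    # Placeholder for the actual token-by-token benchmark loop.
    sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
generate_tokens()
profiler.disable()

# Sort by cumulative time to see where the CPU time actually goes.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```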
-
And with the PyTorch profiler:
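(A hypothetical sketch of producing such a table with torch.profiler; the matmul loop is a stand-in workload, not the model's actual forward pass:)

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Stand-in workload: a chain of small fp16 matmuls on the GPU.
x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        x = torch.nn.functional.normalize(x) @ w
    torch.cuda.synchronize()

# Print the ops with the most GPU time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```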
-
Another data point: a Lambda Labs A100 40GB with an AMD EPYC 7J13 does 40 t/s on 7b. Interestingly, the GPU util is still hovering around 50%, which is higher than I would have expected and suggests a GPU-bound ceiling of about 80 t/s. FWIW I have been testing with TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g.
-
As a point of reference for the benchmark proposed earlier:

I am not sure about the PCIe generation on my second system; it should be 3.0, but that joke of a motherboard has the worst wiring possible for all lanes.
-
7b 4bit on a 4090 + 3900X is now up to 75 t/s! GPU usage is hovering around 50%, so this could genuinely just be down to the 12900K's single-threaded performance.

EDIT: 13b 4bit is doing 62 t/s with GPU util at ca. 70%.
-
@qeternity: It seems likely that it's still CPU-bound somewhere. You're getting a little under half my performance on 7B, then 2/3 on 13B, which would spend longer in each CUDA kernel. And with higher GPU utilization too, there's no other way to interpret that.

As for what to do about it, though... Well. So far relying less on PyTorch has been working out, so I can keep doing that. Also, I think the overhead from kernel launches is starting to become a bottleneck, so I'm looking into tail-launching with CUDA graphs, fusing kernels where it makes sense, and batching operations like cudaMemset (of all things). Multiple streams might also help in some places. There's a way forward, at least, for addressing the CPU bottleneck. Python aside, a function like
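(As an illustration of the CUDA-graph idea, a minimal sketch of capturing and replaying a chain of small kernels with torch.cuda.CUDAGraph; the toy matmul chain is just a stand-in for the real per-token work, not exllama's actual forward pass:)

```python
import torch

# Toy stand-in for a chain of small per-token kernels.
x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
weights = [torch.randn(4096, 4096, device="cuda", dtype=torch.float16) for _ in range(8)]

def forward(inp):
    h = inp
    for w in weights:
        h = (h @ w) * (1.0 / 64.0)   # scale to keep fp16 values in range
    return h

static_in = x.clone()

# Warm up on a side stream before capture, as the CUDA graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        forward(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole chain once; replaying it afterwards costs a single launch
# instead of one Python-driven launch per matmul.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = forward(static_in)

static_in.copy_(torch.randn_like(static_in))  # update the input in place
graph.replay()                                # re-run the captured kernels
torch.cuda.synchronize()
print(static_out.shape)
```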
-
@turboderp I think these results are fantastic, fwiw. Aside from binning off PyTorch entirely, or being able to JIT the Torch graph (which isn't currently performant due to the custom CUDA kernels, afaict), there's probably not a whole lot of juice left to squeeze.

Edited to add: I say this because, being CPU-bound on a 4090, I am able to hit 50% GPU util, which happens to be ca. 50% of your 4090 results. I suspect that your 4090 utilization is at or near 100% thanks to the 12900K.
-
I'm pretty sure there's quite a bit more to squeeze, though. Thing is, you shouldn't be CPU-bound for this, because the CPU isn't doing anything during inference. It's pretty much just launching kernels one after another. At 5 µs per launch it adds up, apparently, but fusing kernels or tail-launching graphs could reduce it by a significant amount.

Also, I'm hitting bottlenecks in the stupidest places. For instance, it takes an (implicit) kernel launch just to write a single float value of zero to global memory, merely to initialize the accumulator for the norm of a single row. Initializing a cache of 10,000 zeroes would take the same time, so that's a straightforward optimization.

And while I'm getting close to 100% GPU utilization, that doesn't mean it's using the GPU optimally. One thing I haven't really gotten to yet is optimizing for the L1 cache and SMEM. Even the L2 cache is worth looking at, since some of the matrices in 65B are about four times as big as the L2 cache on the 4090.

Then of course there's the fact that generating text produces a very small hidden state (16 kB for the 65B model), which means there isn't that much to synchronize between GPUs if you attempted to use both of them at once, at least for matmuls. Self-attention would be more difficult to split, but still, something like a 50-75% speedup sounds realistic for two identical GPUs.
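(To illustrate the multi-GPU point, a hypothetical sketch of a column-split matmul across two devices; the shapes are arbitrary stand-ins, and this is not how exllama actually splits the model:)

```python
import torch

# Column-split matmul across two GPUs: each device holds half of the weight
# columns, the small hidden state is sent to both, and only the two half-size
# outputs have to be gathered afterwards. Shapes are illustrative only.
hidden = torch.randn(1, 8192, dtype=torch.float16, device="cuda:0")

w0 = torch.randn(8192, 4096, dtype=torch.float16, device="cuda:0")  # left half of the columns
w1 = torch.randn(8192, 4096, dtype=torch.float16, device="cuda:1")  # right half of the columns

out0 = hidden @ w0                                  # computed on GPU 0
out1 = hidden.to("cuda:1", non_blocking=True) @ w1  # computed on GPU 1

# Gather GPU 1's half back onto GPU 0 and concatenate. Per token, only the
# ~16 kB hidden state and the two output halves ever cross between devices.
out = torch.cat([out0, out1.to("cuda:0")], dim=-1)
print(out.shape)  # torch.Size([1, 8192])
```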
-
@turboderp Do you see more or less equal speeds on the 4090 and 3090 Ti with a 7B? The table in the README shows up to 160 t/s. I tested a 3090 on an EPYC 7302 system and got to 80 t/s. I'm wondering if the CPU is the bottleneck or the GPU.
-
It's not quite equal yet. There might be some optimization to do on the kernels, but I am getting 127 t/s on the 3090. The EPYC is very slow, though, less than half the single-threaded performance of the 12900K, so that's probably what you're running into. Despite the fact that the CPU "isn't doing anything" during inference, Python is still really slow, and then Torch's underlying C++ libraries add a little overhead as well. You can see what's happening in this trace:

The "CUDA API" row shows kernel launches. The first 8 launches (from "rms_norm_...") are launched one after another in the C++ extension. The launches after that are from PyTorch, and the difference is pretty obvious. It's not that there's any processing happening; it's just Python being slow compared to compiled C++ code. Here is a complete forward pass for a single token:

During the big red "cudaMemcpyAsync" at the end of the pass, the CPU is just waiting in a busy-loop for the CUDA queue to finish so the logits are ready to copy to system RAM. It should ideally be much longer, though. The fact that it's only some 30% of the whole forward pass means that if my CPU were 30% slower, the CPU would become the bottleneck. That was yesterday, though. Here's where I'm at with the latest version that I pushed a few minutes ago:

The CPU finishes queuing up all the CUDA operations much faster. There's a lot more that could be done, but it involves some headache-inducing strided matmuls that I'm not keen on tackling right now. After that there's also graphs, which can cut the kernel launch overhead to about a third, but I'm hoping this is fast enough for now so I can focus on sampling or something.
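(A hypothetical sketch of generating such a timeline: mark regions with NVTX ranges so they appear as named spans when the script is run under Nsight Systems; the layer loop is a stand-in workload, not the actual model code:)

```python
import torch

# Mark regions of the forward pass with NVTX ranges so they show up as named
# spans in an Nsight Systems timeline. Run under the profiler with e.g.:
#   nsys profile -o trace python this_script.py
x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

torch.cuda.nvtx.range_push("forward_pass")
for i in range(32):                            # stand-in for 32 decoder layers
    torch.cuda.nvtx.range_push(f"layer_{i}")
    x = torch.nn.functional.normalize(x) @ w   # stand-in for rms_norm + matmul
    torch.cuda.nvtx.range_pop()
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
```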
-
Yeah, I tested your latest version and it got me from 80 t/s to 110 t/s on 7B, great work!
-
Which commit did you test? Is it commit 7805e2b?
-
What's the name of the tool in the screenshot?
-
Tested with 7805e2b and
-
@pineking: It's NVIDIA Nsight Systems. It's free as in beer, and the EULA is what you'd expect from NVIDIA, but it is pretty awesome for debugging and profiling CUDA code. They also have Nsight Compute, which is more of a kernel profiler.
-
@pineking Yes, that commit. On ab81db1 I had 80 t/s on 7B with an EPYC 7302 and RTX 3090; on 7805e2b I get 110 t/s.
-
Has anyone compared the inference speed of the 4-bit quantized model with the original FP16 model?
Is it faster than the original FP16 model?