Performance issue on WSL2? #30
Comments
python3 run_inference.py -m Llama3-8B-1.58-100B-tokens-TQ2_0.gguf -p "Explain to me the second Newton's law" -n 20
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 2649544873
Explain to me the second Newton's law of motion in one sentence.
llama_perf_sampler_print: sampling time = 378.61 ms / 210 runs ( 1.80 ms per token, 554.67 tokens per second)
As far as I can see, your clang version is too old (clang 14 vs. clang 18). The README states clang>=18 as a requirement, but your log output says clang 14. It would be cool if the install/build process checked this requirement and issued an error.
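A check like the one suggested here could look roughly like the sketch below. It is not part of the project's build scripts: the names `parse_clang_major` and `check_clang` are made up for illustration, and the sketch assumes that `clang --version` prints a line containing `clang version <major>.<minor>...`, which is what stock clang builds do. The minimum of 18 comes from the README requirement quoted above.

```python
import re
import shutil
import subprocess
from typing import Optional

# Minimum major version, taken from the README requirement (clang>=18).
MIN_CLANG_MAJOR = 18


def parse_clang_major(version_output: str) -> Optional[int]:
    """Extract the major version from `clang --version` output, or None."""
    match = re.search(r"clang version (\d+)", version_output)
    return int(match.group(1)) if match else None


def check_clang(min_major: int = MIN_CLANG_MAJOR, executable: str = "clang") -> int:
    """Raise RuntimeError if clang is missing or older than min_major."""
    path = shutil.which(executable)
    if path is None:
        raise RuntimeError(f"{executable} not found on PATH; clang>={min_major} is required")

    # Ask the compiler for its version banner.
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    major = parse_clang_major(result.stdout)
    if major is None:
        raise RuntimeError(f"could not parse a version from: {result.stdout!r}")
    if major < min_major:
        raise RuntimeError(
            f"found clang {major} at {path}, but clang>={min_major} is required"
        )
    return major
```

Calling `check_clang()` at the start of a setup script would stop the build early with a readable message instead of producing a slow binary.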
I have BitNet running on native Windows 10 and via WSL. Sample: Windows 10: WSL 2.0:
Wow! Thanks a lot for pinpointing the root of the issue!
Recompiled with clang-18 and got much better performance:
python3 run_inference.py -m Llama3-8B-1.58-100B-tokens-TQ2_0.gguf -p "What is the answer to life, universe and eve
system_info: n_threads = 2 (n_threads_batch = 2) / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 1939064987
What is the answer to life, universe and everything?
llama_perf_sampler_print: sampling time = 53.11 ms / 414 runs ( 0.13 ms per token, 7795.73 tokens per second)
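To compare the two logs, the per-token figures can be recomputed from the `llama_perf_sampler_print` lines. The sketch below is only an illustration: `parse_sampling` is a made-up helper, and the regular expression assumes the line layout shown in the logs in this thread. Note that sampling time is only one component of a run, and the two runs used different prompts and thread counts (8 vs. 2), so this is not a controlled benchmark.

```python
import re

# Matches e.g. "sampling time = 378.61 ms / 210 runs" (layout assumed from the logs above).
SAMPLING_RE = re.compile(r"sampling time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*runs")


def parse_sampling(line: str) -> dict:
    """Return total ms, run count, ms per token and tokens per second for one log line."""
    match = SAMPLING_RE.search(line)
    if match is None:
        raise ValueError(f"no sampling timing found in: {line!r}")
    total_ms = float(match.group(1))
    runs = int(match.group(2))
    return {
        "total_ms": total_ms,
        "runs": runs,
        "ms_per_token": total_ms / runs,
        "tokens_per_second": runs / total_ms * 1000.0,
    }


# The two timing lines quoted in this thread (clang 14 build, then clang 18 build).
clang14_line = ("llama_perf_sampler_print: sampling time = 378.61 ms / 210 runs "
                "( 1.80 ms per token, 554.67 tokens per second)")
clang18_line = ("llama_perf_sampler_print: sampling time = 53.11 ms / 414 runs "
                "( 0.13 ms per token, 7795.73 tokens per second)")

before = parse_sampling(clang14_line)
after = parse_sampling(clang18_line)

# Ratio of per-token sampling time between the two builds.
speedup = before["ms_per_token"] / after["ms_per_token"]
print(f"sampling speedup: {speedup:.1f}x")
```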
I got it running, but its performance on my WSL2 setup (i7-8565U 1.80GHz CPU, 16GB RAM) is nearly unusable: 3 to 5 seconds per word at 40% CPU load. Are there any compiler optimizations missing?