more perfo with llamafile tinyblas on x86_64. #10714

Draft: Djip007 wants to merge 4 commits into master from perfo/tinyblas

Conversation

@Djip007 (Contributor) commented Dec 8, 2024

ikawrakow/ik_llama.cpp#71 has a good idea.

I figured out how to add it to the llamafile/tinyblas sgemm (plus a little more), and it works great:

  • AMD Ryzen 9 7940HS (zen4)

Mistral-Nemo-Instruct-2407.BF16.gguf +kv@bf16

Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B BF16 pp1 2.52 2.51 1.00
llama 13B BF16 pp2 5.00 4.74 0.95
llama 13B BF16 pp3 7.44 7.23 0.97
llama 13B BF16 pp4 9.75 9.91 1.02
llama 13B BF16 pp5 11.95 12.37 1.04
llama 13B BF16 pp6 13.92 14.78 1.06
llama 13B BF16 pp7 15.64 17.09 1.09
llama 13B BF16 pp8 17.24 19.41 1.13
llama 13B BF16 pp9 18.35 21.63 1.18
llama 13B BF16 pp10 19.47 24.02 1.23
llama 13B BF16 pp11 20.48 26.30 1.28
llama 13B BF16 pp12 21.04 28.43 1.35
llama 13B BF16 pp13 21.49 29.41 1.37
llama 13B BF16 pp14 23.10 31.56 1.37
llama 13B BF16 pp15 23.65 33.51 1.42
llama 13B BF16 pp16 23.99 35.87 1.50
llama 13B BF16 pp30 24.19 51.09 2.11
llama 13B BF16 pp32 24.56 51.04 2.08
llama 13B BF16 pp64 25.66 57.23 2.23
llama 13B BF16 pp65 24.57 57.33 2.33
llama 13B BF16 pp120 25.73 65.51 2.55
llama 13B BF16 pp128 25.77 65.81 2.55
llama 13B BF16 pp130 26.86 66.31 2.47
llama 13B BF16 pp240 27.76 70.40 2.54
llama 13B BF16 pp255 26.03 70.32 2.70
llama 13B BF16 pp256 26.04 70.34 2.70
llama 13B BF16 pp510 25.73 68.26 2.65
llama 13B BF16 pp512 25.74 67.97 2.64
llama 13B BF16 pp1024 25.27 66.76 2.64
llama 13B BF16 pp1025 25.04 64.84 2.59
llama 13B BF16 pp2048 24.63 63.96 2.60
llama 13B BF16 tg128 2.52 2.52 1.00

Mistral-Nemo-Instruct-2407.FP16.gguf +kv@fp16

Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B F16 pp1 2.50 2.50 1.00
llama 13B F16 pp2 4.94 4.81 0.97
llama 13B F16 pp3 7.41 7.19 0.97
llama 13B F16 pp4 9.81 9.82 1.00
llama 13B F16 pp5 12.23 12.25 1.00
llama 13B F16 pp6 7.92 14.58 1.84
llama 13B F16 pp7 9.19 16.52 1.80
llama 13B F16 pp8 10.47 18.76 1.79
llama 13B F16 pp9 11.76 20.86 1.77
llama 13B F16 pp10 22.58 22.80 1.01
llama 13B F16 pp11 14.19 24.90 1.75
llama 13B F16 pp12 15.41 26.70 1.73
llama 13B F16 pp13 16.66 26.99 1.62
llama 13B F16 pp14 17.88 28.74 1.61
llama 13B F16 pp15 31.97 29.67 0.93
llama 13B F16 pp16 19.79 31.23 1.58
llama 13B F16 pp30 38.31 36.86 0.96
llama 13B F16 pp32 29.11 36.60 1.26
llama 13B F16 pp64 32.15 38.95 1.21
llama 13B F16 pp65 38.72 38.91 1.00
llama 13B F16 pp120 39.14 40.36 1.03
llama 13B F16 pp128 35.44 40.19 1.13
llama 13B F16 pp130 39.49 40.24 1.02
llama 13B F16 pp240 36.90 40.76 1.10
llama 13B F16 pp255 35.87 40.66 1.13
llama 13B F16 pp256 33.51 40.43 1.21
llama 13B F16 pp510 27.96 40.09 1.43
llama 13B F16 pp512 27.41 40.08 1.46
llama 13B F16 pp1024 27.27 39.03 1.43
llama 13B F16 pp1025 25.91 38.50 1.49
llama 13B F16 pp2048 26.75 37.95 1.42
llama 13B F16 tg128 2.50 2.51 1.00

  • AMD Ryzen 9 5950X 16-Core Processor (znver3, AVX2)

Mistral-Nemo-Instruct-2407.BF16.gguf +kv@bf16

Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B BF16 pp1 2.21 2.21 1.00
llama 13B BF16 pp2 4.36 4.31 0.99
llama 13B BF16 pp3 6.44 6.44 1.00
llama 13B BF16 pp4 8.42 8.58 1.02
llama 13B BF16 pp5 10.29 10.71 1.04
llama 13B BF16 pp6 12.00 12.78 1.07
llama 13B BF16 pp7 13.53 14.86 1.10
llama 13B BF16 pp8 14.72 16.92 1.15
llama 13B BF16 pp9 15.61 18.93 1.21
llama 13B BF16 pp10 16.30 20.92 1.28
llama 13B BF16 pp11 16.93 23.01 1.36
llama 13B BF16 pp12 17.35 24.89 1.43
llama 13B BF16 pp13 17.69 26.94 1.52
llama 13B BF16 pp14 17.95 28.78 1.60
llama 13B BF16 pp15 18.21 30.64 1.68
llama 13B BF16 pp16 18.37 32.45 1.77
llama 13B BF16 pp30 19.20 42.87 2.23
llama 13B BF16 pp32 19.36 43.14 2.23
llama 13B BF16 pp64 19.85 45.05 2.27
llama 13B BF16 pp65 19.46 44.94 2.31
llama 13B BF16 pp120 19.98 46.27 2.32
llama 13B BF16 pp128 20.14 46.11 2.29
llama 13B BF16 pp130 19.97 45.93 2.30
llama 13B BF16 pp240 20.23 46.50 2.30
llama 13B BF16 pp255 20.24 46.54 2.30
llama 13B BF16 pp256 20.19 46.40 2.30
llama 13B BF16 pp510 20.09 46.01 2.29
llama 13B BF16 pp512 20.17 45.81 2.27
llama 13B BF16 pp1024 19.94 45.05 2.26
llama 13B BF16 pp1025 19.74 44.18 2.24
llama 13B BF16 pp2048 19.48 43.68 2.24
llama 13B BF16 tg128 2.21 2.21 1.00

Mistral-Nemo-Instruct-2407.FP16.gguf +kv@fp16

Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B F16 pp1 2.19 2.19 1.00
llama 13B F16 pp2 4.30 4.28 1.00
llama 13B F16 pp3 6.46 6.41 0.99
llama 13B F16 pp4 4.84 8.53 1.76
llama 13B F16 pp5 6.01 10.64 1.77
llama 13B F16 pp6 12.90 12.71 0.99
llama 13B F16 pp7 8.50 14.81 1.74
llama 13B F16 pp8 9.64 16.88 1.75
llama 13B F16 pp9 19.25 18.90 0.98
llama 13B F16 pp10 12.25 20.88 1.70
llama 13B F16 pp11 13.39 22.94 1.71
llama 13B F16 pp12 25.40 24.89 0.98
llama 13B F16 pp13 15.89 26.87 1.69
llama 13B F16 pp14 17.02 28.74 1.69
llama 13B F16 pp15 30.82 30.66 0.99
llama 13B F16 pp16 19.45 32.55 1.67
llama 13B F16 pp30 34.23 54.23 1.58
llama 13B F16 pp32 27.34 55.37 2.03
llama 13B F16 pp64 30.66 58.46 1.91
llama 13B F16 pp65 31.03 57.47 1.85
llama 13B F16 pp120 35.31 58.01 1.64
llama 13B F16 pp128 33.09 57.76 1.75
llama 13B F16 pp130 33.05 58.28 1.76
llama 13B F16 pp240 35.19 58.66 1.67
llama 13B F16 pp255 35.24 58.62 1.66
llama 13B F16 pp256 33.98 58.57 1.72
llama 13B F16 pp510 33.76 57.94 1.72
llama 13B F16 pp512 33.34 57.51 1.72
llama 13B F16 pp1024 33.03 56.41 1.71
llama 13B F16 pp1025 32.59 53.96 1.66
llama 13B F16 pp2048 32.08 54.93 1.71
llama 13B F16 tg128 2.18 2.19 1.00

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Dec 8, 2024
@Djip007 Djip007 marked this pull request as draft December 8, 2024 03:25
@Djip007 Djip007 force-pushed the perfo/tinyblas branch 5 times, most recently from f7c5a68 to b1c72b9 Compare December 9, 2024 22:34
@Djip007 (Contributor, Author) commented Dec 10, 2024

Some perplexity results with the new code (KL divergence vs. a master BF16 base, on zen3 and zen4):

#> zen3:
./build/bin/./llama-perplexity -ctk bf16 -ctv bf16 --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.BF16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9452 ±    0.5268       0.00070 ±    0.00062       0.00003 ±    0.00000     0.186 ±  0.018 %    99.608 ±  0.392 %
   2       5.4448 ±    0.6044       0.00187 ±    0.00151       0.00004 ±    0.00000     0.183 ±  0.013 %    99.804 ±  0.196 %
   3       4.6848 ±    0.4030       0.00093 ±    0.00104       0.00006 ±    0.00000     0.255 ±  0.030 %    99.869 ±  0.131 %
   4       5.0051 ±    0.3673       0.00039 ±    0.00080       0.00005 ±    0.00000     0.243 ±  0.024 %    99.902 ±  0.098 %
   5       5.2917 ±    0.3433       0.00004 ±    0.00067       0.00006 ±    0.00000     0.243 ±  0.020 %    99.922 ±  0.078 %
   6       5.8289 ±    0.3542       0.00000 ±    0.00057       0.00006 ±    0.00000     0.233 ±  0.017 %    99.869 ±  0.092 %
   7       6.2242 ±    0.3544       0.00025 ±    0.00054       0.00006 ±    0.00000     0.228 ±  0.015 %    99.832 ±  0.097 %
   8       6.4312 ±    0.3454       0.00041 ±    0.00048       0.00006 ±    0.00000     0.229 ±  0.014 %    99.755 ±  0.110 %
   9       6.8865 ±    0.3580       0.00036 ±    0.00043       0.00006 ±    0.00000     0.227 ±  0.013 %    99.739 ±  0.107 %
  10       7.2362 ±    0.3590       0.00026 ±    0.00039       0.00006 ±    0.00000     0.224 ±  0.012 %    99.765 ±  0.096 %
  11       7.2572 ±    0.3420       0.00018 ±    0.00036       0.00005 ±    0.00000     0.218 ±  0.011 %    99.750 ±  0.094 %
  12       7.2827 ±    0.3297       0.00015 ±    0.00033       0.00007 ±    0.00001     0.230 ±  0.011 %    99.673 ±  0.103 %
  13       7.4379 ±    0.3228       0.00011 ±    0.00031       0.00007 ±    0.00001     0.226 ±  0.011 %    99.608 ±  0.109 %
  14       7.3367 ±    0.3061       0.00016 ±    0.00030       0.00007 ±    0.00001     0.224 ±  0.010 %    99.636 ±  0.101 %
  15       7.1258 ±    0.2859       0.00012 ±    0.00028       0.00007 ±    0.00001     0.222 ±  0.010 %    99.634 ±  0.098 %
  16       7.1695 ±    0.2792       0.00012 ±    0.00026       0.00007 ±    0.00001     0.223 ±  0.009 %    99.657 ±  0.092 %
  17       6.8048 ±    0.2538       0.00008 ±    0.00025       0.00007 ±    0.00001     0.223 ±  0.009 %    99.677 ±  0.086 %
  18       6.8631 ±    0.2517       0.00016 ±    0.00024       0.00007 ±    0.00001     0.221 ±  0.008 %    99.651 ±  0.087 %
  19       6.9983 ±    0.2515       0.00016 ±    0.00023       0.00007 ±    0.00001     0.220 ±  0.008 %    99.670 ±  0.082 %
  20       6.7969 ±    0.2383       0.00013 ±    0.00022       0.00007 ±    0.00001     0.229 ±  0.008 %    99.667 ±  0.081 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.BF16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9430 ±    0.5263       0.00016 ±    0.00050       0.00003 ±    0.00000     0.159 ±  0.014 %    100.000 ±  0.000 %
   2       5.4442 ±    0.6042       0.00177 ±    0.00153       0.00003 ±    0.00000     0.155 ±  0.009 %    99.804 ±  0.196 %
   3       4.6852 ±    0.4029       0.00103 ±    0.00104       0.00003 ±    0.00000     0.179 ±  0.010 %    99.869 ±  0.131 %
   4       5.0069 ±    0.3673       0.00074 ±    0.00079       0.00003 ±    0.00000     0.178 ±  0.009 %    99.902 ±  0.098 %
   5       5.2939 ±    0.3435       0.00046 ±    0.00064       0.00003 ±    0.00000     0.171 ±  0.008 %    99.922 ±  0.078 %
   6       5.8312 ±    0.3544       0.00039 ±    0.00053       0.00003 ±    0.00000     0.165 ±  0.007 %    99.935 ±  0.065 %
   7       6.2260 ±    0.3545       0.00055 ±    0.00050       0.00003 ±    0.00000     0.165 ±  0.006 %    99.888 ±  0.079 %
   8       6.4317 ±    0.3454       0.00047 ±    0.00044       0.00003 ±    0.00000     0.169 ±  0.007 %    99.853 ±  0.085 %
   9       6.8872 ±    0.3580       0.00047 ±    0.00039       0.00003 ±    0.00000     0.168 ±  0.006 %    99.826 ±  0.087 %
  10       7.2376 ±    0.3590       0.00045 ±    0.00036       0.00003 ±    0.00000     0.167 ±  0.006 %    99.804 ±  0.088 %
  11       7.2592 ±    0.3421       0.00045 ±    0.00033       0.00003 ±    0.00000     0.166 ±  0.005 %    99.822 ±  0.080 %
  12       7.2841 ±    0.3298       0.00033 ±    0.00031       0.00003 ±    0.00000     0.172 ±  0.005 %    99.837 ±  0.073 %
  13       7.4398 ±    0.3229       0.00036 ±    0.00029       0.00003 ±    0.00000     0.171 ±  0.005 %    99.849 ±  0.067 %
  14       7.3379 ±    0.3062       0.00033 ±    0.00027       0.00003 ±    0.00000     0.168 ±  0.005 %    99.860 ±  0.063 %
  15       7.1275 ±    0.2859       0.00035 ±    0.00025       0.00003 ±    0.00000     0.167 ±  0.005 %    99.843 ±  0.064 %
  16       7.1714 ±    0.2793       0.00039 ±    0.00024       0.00003 ±    0.00000     0.171 ±  0.005 %    99.828 ±  0.065 %
  17       6.8067 ±    0.2539       0.00036 ±    0.00023       0.00003 ±    0.00000     0.169 ±  0.004 %    99.839 ±  0.061 %
  18       6.8643 ±    0.2518       0.00033 ±    0.00022       0.00003 ±    0.00000     0.168 ±  0.004 %    99.804 ±  0.065 %
  19       6.9991 ±    0.2515       0.00027 ±    0.00021       0.00003 ±    0.00000     0.166 ±  0.004 %    99.814 ±  0.062 %
  20       6.7977 ±    0.2383       0.00026 ±    0.00020       0.00003 ±    0.00000     0.168 ±  0.004 %    99.824 ±  0.059 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.F16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9425 ±    0.5261       0.00002 ±    0.00006       0.00000 ±    0.00000     0.022 ±  0.002 %    100.000 ±  0.000 %
   2       5.4427 ±    0.6040       0.00148 ±    0.00151       0.00000 ±    0.00000     0.023 ±  0.002 %    100.000 ±  0.000 %
   3       4.6851 ±    0.4029       0.00100 ±    0.00101       0.00000 ±    0.00000     0.029 ±  0.003 %    100.000 ±  0.000 %
   4       5.0068 ±    0.3674       0.00073 ±    0.00075       0.00000 ±    0.00000     0.029 ±  0.002 %    100.000 ±  0.000 %
   5       5.2945 ±    0.3436       0.00057 ±    0.00060       0.00000 ±    0.00000     0.028 ±  0.002 %    100.000 ±  0.000 %
   6       5.8317 ±    0.3545       0.00048 ±    0.00050       0.00000 ±    0.00000     0.027 ±  0.002 %    100.000 ±  0.000 %
   7       6.2264 ±    0.3545       0.00061 ±    0.00047       0.00000 ±    0.00000     0.027 ±  0.001 %    100.000 ±  0.000 %
   8       6.4321 ±    0.3454       0.00054 ±    0.00041       0.00000 ±    0.00000     0.027 ±  0.001 %    100.000 ±  0.000 %
   9       6.8873 ±    0.3580       0.00049 ±    0.00036       0.00000 ±    0.00000     0.027 ±  0.001 %    100.000 ±  0.000 %
  10       7.2376 ±    0.3591       0.00045 ±    0.00033       0.00000 ±    0.00000     0.027 ±  0.001 %    99.961 ±  0.039 %
  11       7.2589 ±    0.3421       0.00041 ±    0.00030       0.00000 ±    0.00000     0.026 ±  0.001 %    99.964 ±  0.036 %
  12       7.2846 ±    0.3299       0.00040 ±    0.00027       0.00000 ±    0.00000     0.027 ±  0.001 %    99.967 ±  0.033 %
  13       7.4399 ±    0.3229       0.00038 ±    0.00025       0.00000 ±    0.00000     0.026 ±  0.001 %    99.970 ±  0.030 %
  14       7.3381 ±    0.3062       0.00035 ±    0.00023       0.00000 ±    0.00000     0.026 ±  0.001 %    99.972 ±  0.028 %
  15       7.1273 ±    0.2860       0.00033 ±    0.00022       0.00000 ±    0.00000     0.026 ±  0.001 %    99.974 ±  0.026 %
  16       7.1709 ±    0.2793       0.00031 ±    0.00021       0.00000 ±    0.00000     0.026 ±  0.001 %    99.975 ±  0.025 %
  17       6.8063 ±    0.2539       0.00030 ±    0.00019       0.00000 ±    0.00000     0.027 ±  0.001 %    99.977 ±  0.023 %
  18       6.8639 ±    0.2518       0.00028 ±    0.00018       0.00000 ±    0.00000     0.027 ±  0.001 %    99.956 ±  0.031 %
  19       6.9991 ±    0.2515       0.00027 ±    0.00017       0.00000 ±    0.00000     0.027 ±  0.001 %    99.959 ±  0.029 %
  20       6.7978 ±    0.2384       0.00026 ±    0.00016       0.00000 ±    0.00000     0.027 ±  0.001 %    99.961 ±  0.028 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.Q8_0.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9483 ±    0.5259       0.00149 ±    0.00252       0.00062 ±    0.00005     0.837 ±  0.073 %    99.216 ±  0.554 %
   2       5.4523 ±    0.6051       0.00324 ±    0.00270       0.00075 ±    0.00006     0.820 ±  0.050 %    99.020 ±  0.437 %
   3       4.6906 ±    0.4032       0.00217 ±    0.00211       0.00088 ±    0.00005     0.945 ±  0.056 %    98.824 ±  0.390 %
   4       5.0117 ±    0.3678       0.00170 ±    0.00173       0.00085 ±    0.00004     0.935 ±  0.047 %    98.824 ±  0.338 %
   5       5.3001 ±    0.3441       0.00161 ±    0.00147       0.00085 ±    0.00004     0.928 ±  0.040 %    98.745 ±  0.312 %
   6       5.8416 ±    0.3554       0.00217 ±    0.00130       0.00085 ±    0.00003     0.902 ±  0.036 %    98.889 ±  0.268 %
   7       6.2372 ±    0.3555       0.00234 ±    0.00124       0.00086 ±    0.00003     0.932 ±  0.040 %    98.824 ±  0.255 %
   8       6.4396 ±    0.3462       0.00171 ±    0.00114       0.00087 ±    0.00003     0.923 ±  0.036 %    98.578 ±  0.262 %
   9       6.8928 ±    0.3586       0.00129 ±    0.00106       0.00089 ±    0.00003     0.927 ±  0.036 %    98.562 ±  0.249 %
  10       7.2420 ±    0.3596       0.00106 ±    0.00099       0.00088 ±    0.00002     0.905 ±  0.033 %    98.549 ±  0.237 %
  11       7.2616 ±    0.3424       0.00079 ±    0.00092       0.00085 ±    0.00002     0.887 ±  0.031 %    98.610 ±  0.221 %
  12       7.2866 ±    0.3302       0.00068 ±    0.00087       0.00086 ±    0.00002     0.888 ±  0.029 %    98.529 ±  0.218 %
  13       7.4416 ±    0.3232       0.00061 ±    0.00084       0.00087 ±    0.00002     0.878 ±  0.027 %    98.431 ±  0.216 %
  14       7.3400 ±    0.3065       0.00061 ±    0.00080       0.00087 ±    0.00002     0.869 ±  0.026 %    98.403 ±  0.210 %
  15       7.1295 ±    0.2862       0.00063 ±    0.00076       0.00086 ±    0.00002     0.863 ±  0.025 %    98.431 ±  0.201 %
  16       7.1739 ±    0.2795       0.00074 ±    0.00074       0.00088 ±    0.00002     0.879 ±  0.024 %    98.407 ±  0.196 %
  17       6.8092 ±    0.2541       0.00074 ±    0.00071       0.00086 ±    0.00002     0.873 ±  0.024 %    98.431 ±  0.189 %
  18       6.8646 ±    0.2519       0.00038 ±    0.00069       0.00085 ±    0.00002     0.869 ±  0.023 %    98.453 ±  0.182 %
  19       6.9997 ±    0.2516       0.00035 ±    0.00067       0.00085 ±    0.00002     0.865 ±  0.022 %    98.431 ±  0.179 %
  20       6.7990 ±    0.2385       0.00044 ±    0.00065       0.00085 ±    0.00002     0.877 ±  0.021 %    98.451 ±  0.173 %

#> zen4:
./build/bin/./llama-perplexity -ctk bf16 -ctv bf16 --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.BF16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9443 ±    0.5267       0.00048 ±    0.00053       0.00004 ±    0.00001     0.169 ±  0.019 %    99.608 ±  0.392 %
   2       5.4419 ±    0.6039       0.00133 ±    0.00153       0.00005 ±    0.00001     0.167 ±  0.012 %    99.412 ±  0.339 %
   3       4.6835 ±    0.4027       0.00066 ±    0.00105       0.00006 ±    0.00001     0.236 ±  0.021 %    99.608 ±  0.226 %
   4       5.0057 ±    0.3672       0.00051 ±    0.00080       0.00005 ±    0.00000     0.231 ±  0.017 %    99.608 ±  0.196 %
   5       5.2931 ±    0.3434       0.00030 ±    0.00065       0.00005 ±    0.00000     0.220 ±  0.014 %    99.686 ±  0.157 %
   6       5.8307 ±    0.3543       0.00030 ±    0.00055       0.00005 ±    0.00000     0.216 ±  0.012 %    99.739 ±  0.131 %
   7       6.2255 ±    0.3544       0.00047 ±    0.00052       0.00005 ±    0.00000     0.210 ±  0.011 %    99.664 ±  0.137 %
   8       6.4316 ±    0.3454       0.00047 ±    0.00046       0.00005 ±    0.00000     0.218 ±  0.010 %    99.657 ±  0.130 %
   9       6.8874 ±    0.3580       0.00050 ±    0.00041       0.00005 ±    0.00000     0.213 ±  0.010 %    99.608 ±  0.130 %
  10       7.2365 ±    0.3589       0.00030 ±    0.00038       0.00005 ±    0.00000     0.209 ±  0.009 %    99.569 ±  0.130 %
  11       7.2584 ±    0.3420       0.00034 ±    0.00035       0.00005 ±    0.00000     0.205 ±  0.008 %    99.572 ±  0.123 %
  12       7.2835 ±    0.3298       0.00025 ±    0.00032       0.00005 ±    0.00000     0.211 ±  0.009 %    99.542 ±  0.122 %
  13       7.4389 ±    0.3228       0.00025 ±    0.00030       0.00005 ±    0.00000     0.209 ±  0.009 %    99.578 ±  0.113 %
  14       7.3370 ±    0.3061       0.00021 ±    0.00029       0.00005 ±    0.00000     0.208 ±  0.008 %    99.524 ±  0.115 %
  15       7.1270 ±    0.2859       0.00028 ±    0.00027       0.00005 ±    0.00000     0.209 ±  0.008 %    99.529 ±  0.111 %
  16       7.1706 ±    0.2792       0.00027 ±    0.00026       0.00005 ±    0.00000     0.216 ±  0.007 %    99.510 ±  0.109 %
  17       6.8060 ±    0.2538       0.00026 ±    0.00025       0.00006 ±    0.00000     0.215 ±  0.007 %    99.539 ±  0.103 %
  18       6.8641 ±    0.2517       0.00030 ±    0.00024       0.00006 ±    0.00000     0.213 ±  0.007 %    99.521 ±  0.102 %
  19       6.9992 ±    0.2515       0.00028 ±    0.00023       0.00006 ±    0.00000     0.214 ±  0.007 %    99.546 ±  0.097 %
  20       6.7980 ±    0.2384       0.00030 ±    0.00022       0.00006 ±    0.00000     0.217 ±  0.007 %    99.569 ±  0.092 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.BF16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9433 ±    0.5262       0.00024 ±    0.00045       0.00003 ±    0.00000     0.166 ±  0.028 %    99.216 ±  0.554 %
   2       5.4430 ±    0.6041       0.00154 ±    0.00158       0.00003 ±    0.00000     0.167 ±  0.016 %    99.216 ±  0.391 %
   3       4.6865 ±    0.4030       0.00130 ±    0.00108       0.00004 ±    0.00001     0.205 ±  0.024 %    99.477 ±  0.261 %
   4       5.0071 ±    0.3673       0.00079 ±    0.00082       0.00003 ±    0.00000     0.193 ±  0.019 %    99.510 ±  0.219 %
   5       5.2951 ±    0.3436       0.00068 ±    0.00066       0.00003 ±    0.00000     0.185 ±  0.016 %    99.608 ±  0.175 %
   6       5.8319 ±    0.3544       0.00051 ±    0.00056       0.00003 ±    0.00000     0.178 ±  0.014 %    99.608 ±  0.160 %
   7       6.2266 ±    0.3545       0.00064 ±    0.00051       0.00003 ±    0.00000     0.174 ±  0.012 %    99.608 ±  0.148 %
   8       6.4322 ±    0.3455       0.00056 ±    0.00045       0.00003 ±    0.00000     0.177 ±  0.011 %    99.657 ±  0.130 %
   9       6.8885 ±    0.3581       0.00066 ±    0.00041       0.00003 ±    0.00000     0.175 ±  0.010 %    99.608 ±  0.130 %
  10       7.2386 ±    0.3592       0.00059 ±    0.00037       0.00003 ±    0.00000     0.172 ±  0.009 %    99.647 ±  0.117 %
  11       7.2603 ±    0.3422       0.00060 ±    0.00034       0.00003 ±    0.00000     0.170 ±  0.009 %    99.679 ±  0.107 %
  12       7.2852 ±    0.3299       0.00049 ±    0.00031       0.00003 ±    0.00000     0.173 ±  0.008 %    99.673 ±  0.103 %
  13       7.4408 ±    0.3230       0.00049 ±    0.00029       0.00003 ±    0.00000     0.172 ±  0.008 %    99.698 ±  0.095 %
  14       7.3386 ±    0.3063       0.00041 ±    0.00028       0.00003 ±    0.00000     0.170 ±  0.007 %    99.692 ±  0.093 %
  15       7.1278 ±    0.2860       0.00040 ±    0.00026       0.00003 ±    0.00000     0.167 ±  0.007 %    99.686 ±  0.090 %
  16       7.1714 ±    0.2793       0.00038 ±    0.00024       0.00003 ±    0.00000     0.166 ±  0.006 %    99.706 ±  0.085 %
  17       6.8064 ±    0.2539       0.00032 ±    0.00023       0.00003 ±    0.00000     0.164 ±  0.006 %    99.723 ±  0.080 %
  18       6.8641 ±    0.2518       0.00031 ±    0.00022       0.00003 ±    0.00000     0.162 ±  0.006 %    99.695 ±  0.081 %
  19       6.9994 ±    0.2515       0.00032 ±    0.00021       0.00003 ±    0.00000     0.161 ±  0.006 %    99.711 ±  0.077 %
  20       6.7979 ±    0.2384       0.00028 ±    0.00020       0.00003 ±    0.00000     0.165 ±  0.005 %    99.706 ±  0.076 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.F16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9432 ±    0.5262       0.00021 ±    0.00007       0.00000 ±    0.00000     0.023 ±  0.003 %    100.000 ±  0.000 %
   2       5.4435 ±    0.6041       0.00163 ±    0.00150       0.00000 ±    0.00000     0.025 ±  0.002 %    100.000 ±  0.000 %
   3       4.6856 ±    0.4029       0.00111 ±    0.00100       0.00000 ±    0.00000     0.030 ±  0.002 %    100.000 ±  0.000 %
   4       5.0072 ±    0.3674       0.00081 ±    0.00075       0.00000 ±    0.00000     0.029 ±  0.002 %    100.000 ±  0.000 %
   5       5.2951 ±    0.3437       0.00067 ±    0.00060       0.00000 ±    0.00000     0.030 ±  0.002 %    100.000 ±  0.000 %
   6       5.8323 ±    0.3545       0.00057 ±    0.00050       0.00000 ±    0.00000     0.029 ±  0.002 %    100.000 ±  0.000 %
   7       6.2269 ±    0.3546       0.00069 ±    0.00047       0.00000 ±    0.00000     0.028 ±  0.001 %    100.000 ±  0.000 %
   8       6.4324 ±    0.3455       0.00059 ±    0.00041       0.00000 ±    0.00000     0.028 ±  0.001 %    100.000 ±  0.000 %
   9       6.8876 ±    0.3581       0.00053 ±    0.00036       0.00000 ±    0.00000     0.027 ±  0.001 %    100.000 ±  0.000 %
  10       7.2379 ±    0.3591       0.00049 ±    0.00033       0.00000 ±    0.00000     0.027 ±  0.001 %    99.961 ±  0.039 %
  11       7.2590 ±    0.3421       0.00043 ±    0.00030       0.00000 ±    0.00000     0.027 ±  0.001 %    99.964 ±  0.036 %
  12       7.2847 ±    0.3299       0.00041 ±    0.00027       0.00000 ±    0.00000     0.028 ±  0.001 %    99.935 ±  0.046 %
  13       7.4400 ±    0.3229       0.00039 ±    0.00025       0.00000 ±    0.00000     0.028 ±  0.001 %    99.940 ±  0.043 %
  14       7.3381 ±    0.3062       0.00035 ±    0.00023       0.00000 ±    0.00000     0.028 ±  0.001 %    99.888 ±  0.056 %
  15       7.1273 ±    0.2860       0.00032 ±    0.00022       0.00000 ±    0.00000     0.028 ±  0.001 %    99.895 ±  0.052 %
  16       7.1708 ±    0.2793       0.00030 ±    0.00021       0.00000 ±    0.00000     0.030 ±  0.001 %    99.902 ±  0.049 %
  17       6.8062 ±    0.2539       0.00029 ±    0.00019       0.00000 ±    0.00000     0.030 ±  0.001 %    99.908 ±  0.046 %
  18       6.8639 ±    0.2518       0.00027 ±    0.00018       0.00000 ±    0.00000     0.030 ±  0.001 %    99.891 ±  0.049 %
  19       6.9991 ±    0.2515       0.00027 ±    0.00017       0.00000 ±    0.00000     0.030 ±  0.001 %    99.897 ±  0.046 %
  20       6.7978 ±    0.2384       0.00027 ±    0.00016       0.00000 ±    0.00000     0.030 ±  0.001 %    99.882 ±  0.048 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.Q8_0.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9525 ±    0.5268       0.00257 ±    0.00200       0.00062 ±    0.00006     0.780 ±  0.079 %    98.824 ±  0.677 %
   2       5.4551 ±    0.6050       0.00375 ±    0.00244       0.00074 ±    0.00005     0.790 ±  0.052 %    99.020 ±  0.437 %
   3       4.6895 ±    0.4028       0.00195 ±    0.00192       0.00094 ±    0.00007     0.974 ±  0.066 %    98.824 ±  0.390 %
   4       5.0104 ±    0.3675       0.00144 ±    0.00155       0.00089 ±    0.00005     0.958 ±  0.054 %    98.922 ±  0.324 %
   5       5.2958 ±    0.3435       0.00082 ±    0.00133       0.00089 ±    0.00004     0.924 ±  0.046 %    98.902 ±  0.292 %
   6       5.8327 ±    0.3544       0.00064 ±    0.00118       0.00087 ±    0.00004     0.897 ±  0.040 %    98.824 ±  0.276 %
   7       6.2283 ±    0.3547       0.00092 ±    0.00114       0.00088 ±    0.00003     0.888 ±  0.036 %    98.768 ±  0.261 %
   8       6.4327 ±    0.3453       0.00063 ±    0.00107       0.00088 ±    0.00003     0.877 ±  0.033 %    98.529 ±  0.267 %
   9       6.8834 ±    0.3577      -0.00008 ±    0.00100       0.00088 ±    0.00003     0.869 ±  0.030 %    98.562 ±  0.249 %
  10       7.2338 ±    0.3588      -0.00007 ±    0.00094       0.00087 ±    0.00002     0.859 ±  0.028 %    98.431 ±  0.246 %
  11       7.2576 ±    0.3420       0.00023 ±    0.00088       0.00086 ±    0.00002     0.843 ±  0.026 %    98.503 ±  0.229 %
  12       7.2853 ±    0.3300       0.00051 ±    0.00084       0.00088 ±    0.00002     0.851 ±  0.025 %    98.529 ±  0.218 %
  13       7.4384 ±    0.3229       0.00018 ±    0.00080       0.00087 ±    0.00002     0.841 ±  0.024 %    98.522 ±  0.210 %
  14       7.3364 ±    0.3062       0.00011 ±    0.00076       0.00087 ±    0.00002     0.838 ±  0.023 %    98.543 ±  0.201 %
  15       7.1259 ±    0.2859       0.00013 ±    0.00073       0.00086 ±    0.00002     0.833 ±  0.022 %    98.614 ±  0.189 %
  16       7.1687 ±    0.2792       0.00001 ±    0.00070       0.00086 ±    0.00002     0.841 ±  0.021 %    98.603 ±  0.184 %
  17       6.8051 ±    0.2538       0.00013 ±    0.00067       0.00085 ±    0.00002     0.838 ±  0.020 %    98.570 ±  0.180 %
  18       6.8615 ±    0.2516      -0.00007 ±    0.00066       0.00084 ±    0.00002     0.839 ±  0.020 %    98.540 ±  0.177 %
  19       6.9961 ±    0.2514      -0.00016 ±    0.00064       0.00084 ±    0.00002     0.838 ±  0.019 %    98.514 ±  0.174 %
  20       6.7965 ±    0.2384       0.00008 ±    0.00062       0.00085 ±    0.00002     0.855 ±  0.020 %    98.529 ±  0.169 %

@Djip007 (Contributor, Author) commented Dec 10, 2024

This looks good to me.
@ggerganov @slaren, what do you think about the 3 failed CI checks?

@Djip007 Djip007 marked this pull request as ready for review December 10, 2024 00:52
@slaren (Collaborator) commented Dec 10, 2024

@ggerganov @slaren, what do you think about the 3 failed CI checks?

Not sure, try merging the current master to see if it is some issue in the server that has already been fixed.

@Djip007 (Contributor, Author) commented Dec 11, 2024

--------------------------- Captured stdout teardown ---------------------------
Stopping server with pid=4213
=========================== short test summary info ============================
FAILED unit/test_completion.py::test_consistent_result_same_seed[2] - AssertionError: assert ' making. Eve...hen, they saw' == ' making. Eve...ining and dan'
  
     making. Everyone are very hungry.
  - One day, it is time to go to the park with his mom. They had a quiet window. They were shining and dan
  + One day, it is time to go to the park with his mom. They had a talking eye to rest. But then, they saw
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
======================== 1 failed, 32 passed in 22.85s =========================

Looks like a small diff in the result.

unit/test_completion.py::test_consistent_result_same_seed[1] PASSED      [ 34%]
unit/test_completion.py::test_consistent_result_same_seed[2] FAILED      [ 35%]
@pytest.mark.parametrize("n_slots", [1, 2])
def test_consistent_result_same_seed(n_slots: int):
    global server
    server.n_slots = n_slots

what is n_slots?

I have to check some elements in my code tomorrow...

@slaren (Collaborator) commented Dec 11, 2024

what is n_slots?

I am not sure what the effect of increasing the number of slots is for this test. I suspect that this error might indicate there is a buffer overflow somewhere, and random data beyond the tensor buffer may be causing it to generate different sequences despite using the same seed.

@Djip007 (Contributor, Author) commented Dec 11, 2024

I suspect that this error might indicate there is a buffer overflow somewhere, and random data beyond the tensor buffer may be causing it to generate different sequences despite using the same seed.

That's what I was thinking last night, but it was too late. I have a little idea, but I was too tired to check/correct it.

@Djip007 Djip007 marked this pull request as draft December 11, 2024 18:58
@ggerganov (Owner) commented:

The failing test seems to be using 2 slots. With 2 slots, the KV cache buffer is shared among the two generations. Initially, the buffer is empty:

..................................................................

Then the first request is processed by slot 0 and thus the beginning of the buffer is occupied:

000000000000000000000000000000000000..............................

The second request is processed on slot 1, so the old data remains in the buffer:

0000000000000000000000000000000000001111111111111111111111111111..

Because we compute the attention on the entire buffer by masking out the cross-sequence values, it is actually possible to get different results between the 2 generations. This happens due to summing floating-point values across the length of the KV buffer. In the next example, even though the data in the buffer is the same, it can lead to different numerical results during the V*QK matrix multiplication, simply because the data occupies different cells and the SIMD groups would produce different results:

000000000000000000000000000000000000..1111111111111111111111111111

I'm thinking that maybe there isn't a bug in the implementation in this PR, and it's a side-effect of the unified KV cache. Probably this test for n_slots > 1 should be disabled for now.
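
To make that concrete, here is a standalone toy example (not ggml code; simd_like_sum and the buffer layouts are made up purely for illustration). Float addition is not associative, so the same unmasked values can reduce to slightly different sums when they sit in different cells of the shared buffer and therefore land in different SIMD accumulation groups:

#include <cstdio>

// Emulate a 4-lane SIMD reduction over an 8-cell buffer:
// vertical accumulation into 4 lanes, then a horizontal reduce.
static float simd_like_sum(const float *buf, int n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < 4; ++j)
            acc[j] += buf[i + j];
    return ((acc[0] + acc[1]) + acc[2]) + acc[3];
}

int main() {
    // The same 4 unmasked values; masked cells contribute 0.0f.
    // Placed at the start of the buffer...
    float bufA[8] = {1e8f, 1.0f, 1.0f, -1e8f, 0.0f, 0.0f, 0.0f, 0.0f};
    // ...or shifted by two cells, as when another slot's data already
    // occupies the beginning of the unified KV buffer.
    float bufB[8] = {0.0f, 0.0f, 1e8f, 1.0f, 1.0f, -1e8f, 0.0f, 0.0f};

    printf("layout A: %g\n", simd_like_sum(bufA, 8)); // prints 0
    printf("layout B: %g\n", simd_like_sum(bufB, 8)); // prints 1
    return 0;
}

Same data, same mask, different cell positions: the lanes group the terms differently and the rounded sums differ.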

@Djip007 (Contributor, Author) commented Dec 11, 2024

@ggerganov
Great! Thanks for these explanations, very clear. It's also very nice to learn how this works.

On the other hand, while going over my code step by step, I found a small number of cases (2 to ~5?) where I do too much computation and write a wrong value out of bounds (possibly overwriting correct data I had just computed...).

So I corrected that. It remains to be seen whether the test passes.
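
For illustration only, here is a generic sketch of that class of bug, not the actual tinyblas kernel (store_tile_clamped and the sizes are hypothetical): a fixed-width micro-kernel that always stores a full tile writes past the edge of C when the dimension is not a multiple of the tile width, clobbering results another tile already produced; clamping the store to the matrix edge avoids it.

#include <cstdio>

// Store a 4-wide tile of results into row i of C, but never past column n.
static void store_tile_clamped(float *C, int ldc, int i, int j0, int n,
                               const float tile[4]) {
    int jmax = (j0 + 4 < n) ? j0 + 4 : n;   // clamp instead of always storing 4
    for (int j = j0; j < jmax; ++j)
        C[i * ldc + j] = tile[j - j0];
}

int main() {
    const int n = 6, ldc = 6;
    float C[6] = {0};
    float tile[4] = {1, 2, 3, 4};
    // The last tile starts at column 4, but only columns 4..5 exist:
    // without the clamp the kernel would write columns 6..7 out of bounds.
    store_tile_clamped(C, ldc, /*i=*/0, /*j0=*/4, n, tile);
    for (int j = 0; j < n; ++j) printf("%g ", C[j]);  // 0 0 0 0 1 2
    printf("\n");
    return 0;
}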

@Djip007 (Contributor, Author) commented Dec 11, 2024

Well, that wasn't enough, but at least I fixed a small bug.

I'm doing another perplexity pass to make sure my latest correction is OK.

@slaren (Collaborator) left a comment:

I have been running random tests with test-backend-ops and I haven't seen any failure, so I am fairly confident that this is correct. Let's just disable the server test for 2 slots.

@Djip007 (Contributor, Author) commented Dec 11, 2024

I have been running random tests with test-backend-ops and I haven't seen any failure, so I am fairly confident that this is correct. Let's just disable the server test for 2 slots.

Not sure how to do it:

# replace 
# @pytest.mark.parametrize("n_slots", [1, 2])
# with that?
@pytest.mark.parametrize("n_slots", [1])
def test_consistent_result_same_seed(n_slots: int):
    global server
    server.n_slots = n_slots
    server.start()
    last_res = None
    for _ in range(4):
        res = server.make_request("POST", "/completion", data={
            "prompt": "I believe the meaning of life is",
            "seed": 42,
            "temperature": 1.0,
            "cache_prompt": False,  # TODO: remove this once test_cache_vs_nocache_prompt is fixed
        })
        if last_res is not None:
            assert res.body["content"] == last_res.body["content"]
        last_res = res

@github-actions github-actions bot added examples python python script changes server labels Dec 11, 2024
@ggerganov (Owner) commented:

A different test is failing now. Add:

--- a/examples/server/tests/unit/test_completion.py
+++ b/examples/server/tests/unit/test_completion.py
@@ -116,6 +116,7 @@ def test_different_result_different_seed(n_slots: int):
 def test_consistent_result_different_batch_size(n_batch: int, temperature: float):
     global server
     server.n_batch = n_batch
+    server.n_slots = 1
     server.start()
     last_res = None
     for _ in range(4):

@slaren (Collaborator) commented Dec 11, 2024

On my system (intel 13900k) I see better performance with BF16, but worse with F16 in some cases:

Model Test t/s master t/s perfo/tinyblas Speedup
llama 7B BF16 pp32 33.94 42.27 1.25
llama 7B BF16 pp64 34.82 43.27 1.24
llama 7B BF16 pp128 33.22 43.69 1.32
llama 7B BF16 tg32 6.75 6.41 0.95
llama 7B F16 pp32 41.45 28.85 0.70
llama 7B F16 pp64 42.88 26.34 0.61
llama 7B F16 pp128 43.69 29.75 0.68
llama 7B F16 tg32 6.82 6.47 0.95

With different numbers of threads:

Model Threads Test t/s master t/s perfo/tinyblas Speedup
llama 7B F16 8 pp64 51.98 59.14 1.14
llama 7B F16 16 pp64 35.20 28.97 0.82
llama 7B F16 24 pp64 63.18 43.40 0.69
llama 7B F16 32 pp64 75.45 54.18 0.72

@Djip007 (Contributor, Author) commented Dec 11, 2024

On my system (intel 13900k)

Is it AVX-512 or AVX2?

@slaren (Collaborator) commented Dec 11, 2024

AVX2

@Djip007 (Contributor, Author) commented Dec 11, 2024

This is the "bad" Intel CPU, the one that disables AVX-512 because of the efficiency cores:

Number of Performance-cores: 8
Number of Efficient-cores: 16

It looks faster on the performance cores and slower on the others...

Can you bench with this:

#elif (defined(__AVX__) || defined(__AVX2__)) && defined(__F16C__)
// do not convert B to FP16
        if (Btype == GGML_TYPE_F32) {
            tinyBLAS<8, __m256, __m256, ggml_fp16_t, float, float> tb{ k,
                (const ggml_fp16_t *)A, lda,
                (const float *)B, ldb,
                (float *)C, ldc,
                ith, nth};
            return tb.matmul(m, n);
        }

and maybe with BF16 too...

@slaren (Collaborator) commented Dec 11, 2024

Not sure where to change that, can you show me the diff of this change (change it locally and run git diff)?

tb.matmul(m, n);
return true;
if (Btype == GGML_TYPE_F16) {
tinyBLAS<8, __m256, __m256, ggml_fp16_t, ggml_fp16_t, float> tb{ k,
@Djip007 (Contributor, Author) commented Dec 11, 2024

#elif (defined(__AVX__) || defined(__AVX2__)) && defined(__F16C__)
// do not convert B to FP16, which is what was done before...
        if (Btype == GGML_TYPE_F32) {
            tinyBLAS<8, __m256, __m256, ggml_fp16_t, float, float> tb{ k,
                (const ggml_fp16_t *)A, lda,
                (const float *)B, ldb,
                (float *)C, ldc,
                ith, nth};
            return tb.matmul(m, n);
        }

@Djip007 (Contributor, Author) commented:

I'll try the same on my zen3...

@slaren (Collaborator) commented:

This is definitely faster, but still slower than master in some cases.

Model Threads Test t/s master t/s perfo/tinyblas Speedup
llama 7B F16 8 pp64 52.68 61.85 1.17
llama 7B F16 16 pp64 42.89 37.64 0.88
llama 7B F16 24 pp64 63.02 56.05 0.89
llama 7B F16 32 pp64 77.93 68.80 0.88

@Djip007 (Contributor, Author) commented:

With these heterogeneous CPUs, the dispatch would need to be refined. For the moment, the work is distributed evenly across the cores, so there is a "chance" that we end up waiting for the E-cores...

@slaren (Collaborator) commented:

My guess is that the tinyblas implementation doesn't play very well with the E-cores or multi-threading. The ggml implementation has dynamic chunking so that the faster threads will get more work, but I don't think this is implemented in tinyblas.
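
For reference, a rough sketch of the dynamic-chunking idea, not the actual ggml scheduler (ChunkDispatcher and the chunk counts here are made up): threads pull chunk indices from a shared atomic counter, so a faster P-core naturally ends up processing more chunks than a slower E-core instead of every thread getting a fixed 1/nth slice.

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Shared chunk counter: threads pull work until it runs out,
// so a faster thread simply fetches more chunks.
struct ChunkDispatcher {
    std::atomic<int64_t> next{0};
    int64_t total;
    explicit ChunkDispatcher(int64_t n) : total(n) {}
    int64_t fetch() {  // next chunk index, or -1 when all work is taken
        int64_t c = next.fetch_add(1, std::memory_order_relaxed);
        return c < total ? c : -1;
    }
};

int main() {
    const int nthreads = 4;
    ChunkDispatcher disp(64);                 // e.g. 64 bands of output rows
    std::vector<int> done(nthreads, 0);       // chunks processed per thread

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            for (int64_t c; (c = disp.fetch()) != -1; )
                ++done[t];                    // "process" chunk c here
        });
    for (auto &th : pool) th.join();

    for (int t = 0; t < nthreads; ++t)
        printf("thread %d processed %d chunks\n", t, done[t]);
    return 0;
}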

@Djip007 (Contributor, Author) commented Dec 15, 2024

My RAM is at 3600 MT/s vs your 2133 MT/s; that may explain the difference in TG...
The CPU is the same.

I have 4x32 GB configured as interleaved.
Do you have 4x16 GB? With what configuration?

g++ --version
g++ (GCC) 14.2.1 20240912 (Red Hat 14.2.1-3)

OS: Fedora 41 vs Ubuntu 11.4

  • Mistral-7B: we see the same behavior with it, so it's not due to the model.
Model Test t/s master t/s PR Speedup PC
llama 7B F16 pp120 55.95 84.83 1.52 djip007
llama 7B F16 pp120 70.64 44.11 0.62 ggerganov

For now I don't understand what's happening...

Something with the memory config?

Memory Device
	Array Handle: 0x0032
	Error Information Handle: 0x003F
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMM_B1
	Bank Locator: BANK 2
	Type: DDR4
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 3600 MT/s
	Manufacturer: CRUCIAL
	Serial Number: -----------
	Asset Tag: Not Specified
	Part Number: BL32G36C16U4B.M16FB1
	Rank: 2
	Configured Memory Speed: 3600 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: Unknown
	Module Manufacturer ID: Bank 6, Hex 0x9B
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   32
  On-line CPU(s) list:    0-31
Vendor ID:                AuthenticAMD
  Model name:             AMD Ryzen 9 5950X 16-Core Processor
    CPU family:           25
    Model:                33
    Thread(s) per core:   2
    Core(s) per socket:   16
    Socket(s):            1
    Stepping:             0
    Frequency boost:      enabled
    CPU(s) scaling MHz:   65%
    CPU max MHz:          5084,0000
    CPU min MHz:          550,0000
    BogoMIPS:             6800,74
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdp
                          e1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma c
                          x16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowpr
                          efetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs i
                          bpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv
                          1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_loc
                          k nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku o
                          spke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Caches (sum of all):      
  L1d:                    512 KiB (16 instances)
  L1i:                    512 KiB (16 instances)
  L2:                     8 MiB (16 instances)
  L3:                     64 MiB (2 instances)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-31
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Vulnerable: Safe RET, no microcode
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected

memory addressing:

Handle 0x003B, DMI type 20, 35 bytes
Memory Device Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x01FFFFFFFFF
	Range Size: 128 GB
	Physical Device Handle: 0x003A
	Memory Array Mapped Address Handle: 0x0034
	Partition Row Position: Unknown
	Interleave Position: Unknown
	Interleaved Data Depth: Unknown
Handle 0x003E, DMI type 20, 35 bytes
Memory Device Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x01FFFFFFFFF
	Range Size: 128 GB
	Physical Device Handle: 0x003D
	Memory Array Mapped Address Handle: 0x0034
	Partition Row Position: Unknown
	Interleave Position: Unknown
	Interleaved Data Depth: Unknown
Handle 0x0041, DMI type 20, 35 bytes
Memory Device Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x01FFFFFFFFF
	Range Size: 128 GB
	Physical Device Handle: 0x0040
	Memory Array Mapped Address Handle: 0x0034
	Partition Row Position: Unknown
	Interleave Position: Unknown
	Interleaved Data Depth: Unknown
Handle 0x0044, DMI type 20, 35 bytes
Memory Device Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x01FFFFFFFFF
	Range Size: 128 GB
	Physical Device Handle: 0x0043
	Memory Array Mapped Address Handle: 0x0034
	Partition Row Position: Unknown
	Interleave Position: Unknown
	Interleaved Data Depth: Unknown

You have Corsair Vengeance RGB RS 16 GB DDR4-3200 CL16 RAM, and it is clocked at 2133 MHz?

@ggerganov (Owner) commented:

My RAM is at 3600 MT/s vs your 2133 MT/s; that may explain the difference in TG...

Yes, probably that's it.

I have 4x32 GB configured as interleaved.
Do you have 4x16 GB? With what configuration?

Yes, these are 4x16 GB. I don't know if they are interleaved or how to check. Here is the full dmidecode if it helps:

$ sudo dmidecode -t memory
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.

Handle 0x000A, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: None
	Maximum Capacity: 128 GB
	Error Information Handle: 0x0009
	Number Of Devices: 4

Handle 0x0012, DMI type 17, 92 bytes
Memory Device
	Array Handle: 0x000A
	Error Information Handle: 0x0011
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 16 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMM 0
	Bank Locator: P0 CHANNEL A
	Type: DDR4
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 2133 MT/s
	Manufacturer: Unknown
	Serial Number: 00000000
	Asset Tag: Not Specified
	Part Number: CMG16GX4M1E3200C16
	Rank: 1
	Configured Memory Speed: 2133 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: Unknown
	Module Manufacturer ID: Bank 3, Hex 0x9E
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 16 GB
	Cache Size: None
	Logical Size: None

Handle 0x0015, DMI type 17, 92 bytes
Memory Device
	Array Handle: 0x000A
	Error Information Handle: 0x0014
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 16 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMM 1
	Bank Locator: P0 CHANNEL A
	Type: DDR4
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 2133 MT/s
	Manufacturer: Unknown
	Serial Number: 00000000
	Asset Tag: Not Specified
	Part Number: CMG16GX4M1E3200C16
	Rank: 1
	Configured Memory Speed: 2133 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: Unknown
	Module Manufacturer ID: Bank 3, Hex 0x9E
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 16 GB
	Cache Size: None
	Logical Size: None

Handle 0x0018, DMI type 17, 92 bytes
Memory Device
	Array Handle: 0x000A
	Error Information Handle: 0x0017
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 16 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMM 0
	Bank Locator: P0 CHANNEL B
	Type: DDR4
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 2133 MT/s
	Manufacturer: Unknown
	Serial Number: 00000000
	Asset Tag: Not Specified
	Part Number: CMG16GX4M1E3200C16
	Rank: 1
	Configured Memory Speed: 2133 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: Unknown
	Module Manufacturer ID: Bank 3, Hex 0x9E
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 16 GB
	Cache Size: None
	Logical Size: None

Handle 0x001B, DMI type 17, 92 bytes
Memory Device
	Array Handle: 0x000A
	Error Information Handle: 0x001A
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 16 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMM 1
	Bank Locator: P0 CHANNEL B
	Type: DDR4
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 2133 MT/s
	Manufacturer: Unknown
	Serial Number: 00000000
	Asset Tag: Not Specified
	Part Number: CMG16GX4M1E3200C16
	Rank: 1
	Configured Memory Speed: 2133 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: Unknown
	Module Manufacturer ID: Bank 3, Hex 0x9E
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 16 GB
	Cache Size: None
	Logical Size: None

$ sudo dmidecode -t 20
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.

Handle 0x0013, DMI type 20, 35 bytes
Memory Device Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x00FFFFFFFFF
	Range Size: 64 GB
	Physical Device Handle: 0x0012
	Memory Array Mapped Address Handle: 0x000C
	Partition Row Position: Unknown
	Interleave Position: Unknown
	Interleaved Data Depth: Unknown

Handle 0x0016, DMI type 20, 35 bytes
Memory Device Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x00FFFFFFFFF
	Range Size: 64 GB
	Physical Device Handle: 0x0015
	Memory Array Mapped Address Handle: 0x000C
	Partition Row Position: Unknown
	Interleave Position: Unknown
	Interleaved Data Depth: Unknown

Handle 0x0019, DMI type 20, 35 bytes
Memory Device Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x00FFFFFFFFF
	Range Size: 64 GB
	Physical Device Handle: 0x0018
	Memory Array Mapped Address Handle: 0x000C
	Partition Row Position: Unknown
	Interleave Position: Unknown
	Interleaved Data Depth: Unknown

Handle 0x001C, DMI type 20, 35 bytes
Memory Device Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x00FFFFFFFFF
	Range Size: 64 GB
	Physical Device Handle: 0x001B
	Memory Array Mapped Address Handle: 0x000C
	Partition Row Position: Unknown
	Interleave Position: Unknown
	Interleaved Data Depth: Unknown

lscpu:

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   32
  On-line CPU(s) list:    0-31
Vendor ID:                AuthenticAMD
  Model name:             AMD Ryzen 9 5950X 16-Core Processor
    CPU family:           25
    Model:                33
    Thread(s) per core:   2
    Core(s) per socket:   16
    Socket(s):            1
    Stepping:             2
    Frequency boost:      enabled
    CPU max MHz:          5083,3979
    CPU min MHz:          2200,0000
    BogoMIPS:             6787.87
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr
                          8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm
                          _total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Caches (sum of all):      
  L1d:                    512 KiB (16 instances)
  L1i:                    512 KiB (16 instances)
  L2:                     8 MiB (16 instances)
  L3:                     64 MiB (2 instances)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-31
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Vulnerable: Safe RET, no microcode
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected

You have Corsair Vengeance RGB RS 16 GB DDR4-3200 CL16 RAM, and it is clocked at 2133 MHz?

Hm, it's possible. It's an old machine that I currently keep remotely, and I remember that at some point in the past I was adjusting some settings in the BIOS with the main goal of reducing CPU fan noise. It's entirely possible that I underclocked the memory by accident.

Anyway, I am OK with ignoring this data point for now since it is very likely a misconfiguration on my side. Next time I have access to the machine, I will check what the BIOS settings are. But no need to worry about this for now.

@slaren (Collaborator) commented:

Here are the tests you asked for.

Model Threads Test t/s master t/s perfo/tinyblas Speedup
llama 7B BF16 8 pp1 6.54 6.85 1.05
llama 7B BF16 8 pp2 11.15 14.50 1.30
llama 7B BF16 8 pp3 14.71 21.46 1.46
llama 7B BF16 8 pp4 17.29 25.63 1.48
llama 7B BF16 8 pp5 19.17 32.46 1.69
llama 7B BF16 8 pp6 20.78 37.58 1.81
llama 7B BF16 8 pp7 22.28 40.30 1.81
llama 7B BF16 8 pp8 23.07 44.56 1.93
llama 7B BF16 8 pp9 22.69 47.97 2.11
llama 7B BF16 8 pp10 22.85 50.19 2.20
llama 7B BF16 8 pp11 23.28 52.97 2.28
llama 7B BF16 8 pp12 23.64 55.29 2.34
llama 7B BF16 8 pp13 23.93 55.26 2.31
llama 7B BF16 8 pp14 17.59 55.09 3.13
llama 7B BF16 8 pp15 23.55 57.91 2.46
llama 7B BF16 8 pp16 24.62 57.64 2.34
llama 7B BF16 8 pp30 26.04 63.60 2.44
llama 7B BF16 8 pp31 25.55 63.00 2.47
llama 7B BF16 8 pp32 25.07 60.48 2.41
llama 7B BF16 8 pp64 25.38 63.79 2.51
llama 7B BF16 8 pp65 24.97 66.64 2.67
llama 7B BF16 8 pp66 25.25 68.75 2.72
llama 7B BF16 8 pp120 23.24 67.85 2.92
llama 7B BF16 8 pp128 25.64 66.69 2.60
llama 7B BF16 8 pp130 23.48 54.18 2.31
llama 7B BF16 8 pp240 24.38 68.85 2.82
llama 7B BF16 8 pp255 24.51 66.71 2.72
llama 7B BF16 8 pp256 23.19 66.83 2.88
llama 7B BF16 8 pp510 22.54 60.81 2.70
llama 7B BF16 8 pp512 23.94 60.61 2.53
llama 7B BF16 8 pp1023 23.68 59.02 2.49
llama 7B BF16 8 pp1024 23.78 59.15 2.49
llama 7B BF16 8 pp1025 23.73 58.89 2.48
llama 7B BF16 8 pp2048 23.29 58.20 2.50
llama 7B BF16 8 tg128 6.69 6.42 0.96
llama 7B BF16 16 pp1 6.89 6.90 1.00
llama 7B BF16 16 pp2 13.49 14.19 1.05
llama 7B BF16 16 pp3 18.20 20.99 1.15
llama 7B BF16 16 pp4 21.92 27.21 1.24
llama 7B BF16 16 pp5 24.63 34.20 1.39
llama 7B BF16 16 pp6 26.45 39.95 1.51
llama 7B BF16 16 pp7 26.87 45.12 1.68
llama 7B BF16 16 pp8 20.01 49.56 2.48
llama 7B BF16 16 pp9 28.62 54.41 1.90
llama 7B BF16 16 pp10 30.72 57.11 1.86
llama 7B BF16 16 pp11 31.19 62.05 1.99
llama 7B BF16 16 pp12 31.64 65.01 2.05
llama 7B BF16 16 pp13 33.39 66.48 1.99
llama 7B BF16 16 pp14 33.71 69.13 2.05
llama 7B BF16 16 pp15 34.12 71.64 2.10
llama 7B BF16 16 pp16 33.81 71.72 2.12
llama 7B BF16 16 pp30 34.43 81.93 2.38
llama 7B BF16 16 pp31 34.60 80.99 2.34
llama 7B BF16 16 pp32 34.67 81.60 2.35
llama 7B BF16 16 pp64 35.44 85.54 2.41
llama 7B BF16 16 pp65 34.26 85.67 2.50
llama 7B BF16 16 pp66 36.01 86.00 2.39
llama 7B BF16 16 pp120 36.28 86.63 2.39
llama 7B BF16 16 pp128 32.27 87.02 2.70
llama 7B BF16 16 pp130 35.74 86.62 2.42
llama 7B BF16 16 pp240 33.41 75.09 2.25
llama 7B BF16 16 pp255 33.79 86.10 2.55
llama 7B BF16 16 pp256 33.65 86.45 2.57
llama 7B BF16 16 pp510 32.94 82.24 2.50
llama 7B BF16 16 pp512 34.21 75.92 2.22
llama 7B BF16 16 pp1023 33.01 76.81 2.33
llama 7B BF16 16 pp1024 32.96 77.04 2.34
llama 7B BF16 16 pp1025 33.54 72.84 2.17
llama 7B BF16 16 pp2048 32.17 72.21 2.24
llama 7B BF16 16 tg128 6.64 6.49 0.98
llama 7B BF16 24 pp1 6.45 6.46 1.00
llama 7B BF16 24 pp2 13.87 14.12 1.02
llama 7B BF16 24 pp3 19.54 21.26 1.09
llama 7B BF16 24 pp4 23.84 27.43 1.15
llama 7B BF16 24 pp5 27.16 34.66 1.28
llama 7B BF16 24 pp6 29.39 40.81 1.39
llama 7B BF16 24 pp7 31.82 46.53 1.46
llama 7B BF16 24 pp8 32.90 52.14 1.58
llama 7B BF16 24 pp9 34.22 57.32 1.67
llama 7B BF16 24 pp10 34.64 61.83 1.78
llama 7B BF16 24 pp11 35.80 67.03 1.87
llama 7B BF16 24 pp12 35.61 70.34 1.98
llama 7B BF16 24 pp13 26.88 72.89 2.71
llama 7B BF16 24 pp14 36.83 76.42 2.08
llama 7B BF16 24 pp15 36.29 83.74 2.31
llama 7B BF16 24 pp16 37.80 84.55 2.24
llama 7B BF16 24 pp30 39.96 96.48 2.41
llama 7B BF16 24 pp31 40.38 97.68 2.42
llama 7B BF16 24 pp32 40.44 98.66 2.44
llama 7B BF16 24 pp64 42.22 98.08 2.32
llama 7B BF16 24 pp65 41.12 95.11 2.31
llama 7B BF16 24 pp66 33.80 95.15 2.82
llama 7B BF16 24 pp120 41.40 74.86 1.81
llama 7B BF16 24 pp128 41.64 97.67 2.35
llama 7B BF16 24 pp130 35.86 99.80 2.78
llama 7B BF16 24 pp240 41.78 96.56 2.31
llama 7B BF16 24 pp255 41.48 81.96 1.98
llama 7B BF16 24 pp256 38.66 98.73 2.55
llama 7B BF16 24 pp510 39.75 85.03 2.14
llama 7B BF16 24 pp512 39.44 94.90 2.41
llama 7B BF16 24 pp1023 39.18 88.19 2.25
llama 7B BF16 24 pp1024 38.70 87.89 2.27
llama 7B BF16 24 pp1025 38.99 85.98 2.20
llama 7B BF16 24 pp2048 37.26 83.63 2.24
llama 7B BF16 24 tg128 6.12 6.14 1.00
llama 7B BF16 32 pp1 6.61 6.49 0.98
llama 7B BF16 32 pp2 13.41 12.63 0.94
llama 7B BF16 32 pp3 16.67 19.67 1.18
llama 7B BF16 32 pp4 24.53 26.30 1.07
llama 7B BF16 32 pp5 25.11 30.75 1.22
llama 7B BF16 32 pp6 30.08 35.58 1.18
llama 7B BF16 32 pp7 32.79 44.60 1.36
llama 7B BF16 32 pp8 33.60 47.41 1.41
llama 7B BF16 32 pp9 35.68 49.91 1.40
llama 7B BF16 32 pp10 36.61 61.07 1.67
llama 7B BF16 32 pp11 34.87 60.74 1.74
llama 7B BF16 32 pp12 38.80 58.61 1.51
llama 7B BF16 32 pp13 38.92 67.56 1.74
llama 7B BF16 32 pp14 37.30 71.50 1.92
llama 7B BF16 32 pp15 38.23 52.93 1.38
llama 7B BF16 32 pp16 39.77 77.02 1.94
llama 7B BF16 32 pp30 42.88 93.16 2.17
llama 7B BF16 32 pp31 41.70 92.21 2.21
llama 7B BF16 32 pp32 42.97 96.90 2.26
llama 7B BF16 32 pp64 45.44 106.63 2.35
llama 7B BF16 32 pp65 44.98 105.98 2.36
llama 7B BF16 32 pp66 45.10 105.50 2.34
llama 7B BF16 32 pp120 46.50 104.57 2.25
llama 7B BF16 32 pp128 45.95 104.48 2.27
llama 7B BF16 32 pp130 46.34 101.40 2.19
llama 7B BF16 32 pp240 43.26 87.72 2.03
llama 7B BF16 32 pp255 46.97 109.11 2.32
llama 7B BF16 32 pp256 46.94 104.42 2.22
llama 7B BF16 32 pp510 44.52 92.37 2.07
llama 7B BF16 32 pp512 44.46 92.70 2.08
llama 7B BF16 32 pp1023 43.70 93.70 2.14
llama 7B BF16 32 pp1024 43.72 94.72 2.17
llama 7B BF16 32 pp1025 43.49 93.42 2.15
llama 7B BF16 32 pp2048 42.64 90.61 2.12
llama 7B BF16 32 tg128 5.84 5.89 1.01
Model Threads Test t/s master t/s perfo/tinyblas Speedup
llama 7B F16 8 pp1 6.38 7.03 1.10
llama 7B F16 8 pp2 12.49 14.03 1.12
llama 7B F16 8 pp3 19.06 20.65 1.08
llama 7B F16 8 pp4 12.97 25.21 1.94
llama 7B F16 8 pp5 18.14 31.46 1.73
llama 7B F16 8 pp6 28.90 36.29 1.26
llama 7B F16 8 pp7 20.51 39.36 1.92
llama 7B F16 8 pp8 27.07 43.70 1.61
llama 7B F16 8 pp9 39.64 46.82 1.18
llama 7B F16 8 pp10 30.42 49.03 1.61
llama 7B F16 8 pp11 23.51 51.16 2.18
llama 7B F16 8 pp12 44.81 53.55 1.20
llama 7B F16 8 pp13 35.45 53.97 1.52
llama 7B F16 8 pp14 37.40 55.91 1.50
llama 7B F16 8 pp15 47.63 57.52 1.21
llama 7B F16 8 pp16 38.27 56.69 1.48
llama 7B F16 8 pp30 55.66 62.84 1.13
llama 7B F16 8 pp31 46.17 61.79 1.34
llama 7B F16 8 pp32 46.22 62.78 1.36
llama 7B F16 8 pp64 51.43 65.10 1.27
llama 7B F16 8 pp65 51.57 65.04 1.26
llama 7B F16 8 pp66 55.29 50.84 0.92
llama 7B F16 8 pp120 56.77 65.83 1.16
llama 7B F16 8 pp128 55.29 65.64 1.19
llama 7B F16 8 pp130 54.22 66.17 1.22
llama 7B F16 8 pp240 51.06 58.46 1.14
llama 7B F16 8 pp255 56.89 66.11 1.16
llama 7B F16 8 pp256 56.60 58.38 1.03
llama 7B F16 8 pp510 49.56 59.74 1.21
llama 7B F16 8 pp512 48.86 59.75 1.22
llama 7B F16 8 pp1023 50.73 58.39 1.15
llama 7B F16 8 pp1024 46.55 58.51 1.26
llama 7B F16 8 pp1025 50.70 58.01 1.14
llama 7B F16 8 pp2048 48.65 56.76 1.17
llama 7B F16 8 tg128 6.57 6.84 1.04
llama 7B F16 16 pp1 7.36 7.04 0.96
llama 7B F16 16 pp2 11.80 13.98 1.18
llama 7B F16 16 pp3 16.18 20.90 1.29
llama 7B F16 16 pp4 14.60 26.77 1.83
llama 7B F16 16 pp5 17.95 32.82 1.83
llama 7B F16 16 pp6 29.38 38.84 1.32
llama 7B F16 16 pp7 24.25 42.92 1.77
llama 7B F16 16 pp8 26.31 47.11 1.79
llama 7B F16 16 pp9 36.26 51.87 1.43
llama 7B F16 16 pp10 30.21 53.56 1.77
llama 7B F16 16 pp11 31.95 57.11 1.79
llama 7B F16 16 pp12 26.38 60.18 2.28
llama 7B F16 16 pp13 31.82 60.02 1.89
llama 7B F16 16 pp14 35.95 63.33 1.76
llama 7B F16 16 pp15 41.48 65.03 1.57
llama 7B F16 16 pp16 36.87 65.43 1.77
llama 7B F16 16 pp30 46.23 74.36 1.61
llama 7B F16 16 pp31 43.34 73.61 1.70
llama 7B F16 16 pp32 44.02 74.51 1.69
llama 7B F16 16 pp64 44.73 77.52 1.73
llama 7B F16 16 pp65 43.67 77.59 1.78
llama 7B F16 16 pp66 36.00 77.94 2.17
llama 7B F16 16 pp120 46.27 78.65 1.70
llama 7B F16 16 pp128 44.67 64.10 1.43
llama 7B F16 16 pp130 38.55 75.47 1.96
llama 7B F16 16 pp240 47.56 79.61 1.67
llama 7B F16 16 pp255 43.95 67.90 1.54
llama 7B F16 16 pp256 43.84 78.00 1.78
llama 7B F16 16 pp510 47.37 67.67 1.43
llama 7B F16 16 pp512 47.73 69.11 1.45
llama 7B F16 16 pp1023 46.70 69.32 1.48
llama 7B F16 16 pp1024 46.85 69.70 1.49
llama 7B F16 16 pp1025 46.43 69.02 1.49
llama 7B F16 16 pp2048 41.15 64.47 1.57
llama 7B F16 16 tg128 6.77 6.79 1.00
llama 7B F16 24 pp1 6.72 6.70 1.00
llama 7B F16 24 pp2 13.08 13.80 1.06
llama 7B F16 24 pp3 19.12 19.80 1.04
llama 7B F16 24 pp4 17.18 25.52 1.49
llama 7B F16 24 pp5 20.68 31.70 1.53
llama 7B F16 24 pp6 33.86 25.86 0.76
llama 7B F16 24 pp7 20.49 42.41 2.07
llama 7B F16 24 pp8 32.02 49.30 1.54
llama 7B F16 24 pp9 45.35 54.32 1.20
llama 7B F16 24 pp10 38.55 58.23 1.51
llama 7B F16 24 pp11 41.01 62.25 1.52
llama 7B F16 24 pp12 54.11 65.30 1.21
llama 7B F16 24 pp13 45.66 66.65 1.46
llama 7B F16 24 pp14 48.00 68.97 1.44
llama 7B F16 24 pp15 58.72 74.53 1.27
llama 7B F16 24 pp16 50.72 74.72 1.47
llama 7B F16 24 pp30 65.48 84.78 1.29
llama 7B F16 24 pp31 60.81 84.96 1.40
llama 7B F16 24 pp32 61.80 85.63 1.39
llama 7B F16 24 pp64 65.16 85.38 1.31
llama 7B F16 24 pp65 64.48 84.86 1.32
llama 7B F16 24 pp66 66.96 84.38 1.26
llama 7B F16 24 pp120 55.89 85.90 1.54
llama 7B F16 24 pp128 65.76 68.28 1.04
llama 7B F16 24 pp130 65.48 85.78 1.31
llama 7B F16 24 pp240 59.03 87.17 1.48
llama 7B F16 24 pp255 66.37 84.71 1.28
llama 7B F16 24 pp256 65.55 86.56 1.32
llama 7B F16 24 pp510 64.20 74.53 1.16
llama 7B F16 24 pp512 63.80 82.39 1.29
llama 7B F16 24 pp1023 58.59 76.50 1.31
llama 7B F16 24 pp1024 58.57 73.12 1.25
llama 7B F16 24 pp1025 58.36 75.34 1.29
llama 7B F16 24 pp2048 53.68 71.46 1.33
llama 7B F16 24 tg128 6.31 6.50 1.03
llama 7B F16 32 pp1 6.95 6.52 0.94
llama 7B F16 32 pp2 10.93 13.09 1.20
llama 7B F16 32 pp3 20.45 16.78 0.82
llama 7B F16 32 pp4 18.21 25.07 1.38
llama 7B F16 32 pp5 21.85 31.60 1.45
llama 7B F16 32 pp6 37.12 33.09 0.89
llama 7B F16 32 pp7 29.10 43.65 1.50
llama 7B F16 32 pp8 35.45 49.34 1.39
llama 7B F16 32 pp9 45.19 46.98 1.04
llama 7B F16 32 pp10 43.03 58.52 1.36
llama 7B F16 32 pp11 42.45 55.39 1.30
llama 7B F16 32 pp12 61.48 64.59 1.05
llama 7B F16 32 pp13 47.92 67.93 1.42
llama 7B F16 32 pp14 53.72 63.25 1.18
llama 7B F16 32 pp15 62.70 74.37 1.19
llama 7B F16 32 pp16 58.18 65.05 1.12
llama 7B F16 32 pp30 81.05 86.72 1.07
llama 7B F16 32 pp31 70.93 85.87 1.21
llama 7B F16 32 pp32 69.70 86.73 1.24
llama 7B F16 32 pp64 75.62 85.96 1.14
llama 7B F16 32 pp65 80.69 88.52 1.10
llama 7B F16 32 pp66 86.93 88.50 1.02
llama 7B F16 32 pp120 88.89 90.04 1.01
llama 7B F16 32 pp128 85.38 72.80 0.85
llama 7B F16 32 pp130 82.46 88.17 1.07
llama 7B F16 32 pp240 73.02 90.63 1.24
llama 7B F16 32 pp255 88.03 90.93 1.03
llama 7B F16 32 pp256 84.65 78.67 0.93
llama 7B F16 32 pp510 78.80 87.88 1.12
llama 7B F16 32 pp512 73.98 88.27 1.19
llama 7B F16 32 pp1023 71.23 82.14 1.15
llama 7B F16 32 pp1024 74.38 82.08 1.10
llama 7B F16 32 pp1025 73.94 80.76 1.09
llama 7B F16 32 pp2048 65.88 75.73 1.15
llama 7B F16 32 tg128 5.92 5.98 1.01
Model Threads Test t/s master t/s perfo/tinyblas Speedup
llama 7B all F32 8 pp1 3.29 3.60 1.10
llama 7B all F32 8 pp2 6.29 7.10 1.13
llama 7B all F32 8 pp3 9.88 10.64 1.08
llama 7B all F32 8 pp4 7.28 13.62 1.87
llama 7B all F32 8 pp5 9.33 16.96 1.82
llama 7B all F32 8 pp6 18.77 20.40 1.09
llama 7B all F32 8 pp7 12.29 22.44 1.83
llama 7B all F32 8 pp8 14.00 17.29 1.24
llama 7B all F32 8 pp9 27.70 27.44 0.99
llama 7B all F32 8 pp10 12.74 28.43 2.23
llama 7B all F32 8 pp11 18.08 31.63 1.75
llama 7B all F32 8 pp12 34.07 36.68 1.08
llama 7B all F32 8 pp13 21.59 37.41 1.73
llama 7B all F32 8 pp14 23.54 41.72 1.77
llama 7B all F32 8 pp15 42.10 45.17 1.07
llama 7B all F32 8 pp16 26.58 45.03 1.69
llama 7B all F32 8 pp30 60.69 68.16 1.12
llama 7B all F32 8 pp31 42.30 66.59 1.57
llama 7B all F32 8 pp32 43.45 70.13 1.61
llama 7B all F32 8 pp64 60.06 81.96 1.36
llama 7B all F32 8 pp65 58.62 83.51 1.42
llama 7B all F32 8 pp66 52.86 80.40 1.52
llama 7B all F32 8 pp120 75.73 66.10 0.87
llama 7B all F32 8 pp128 71.88 79.70 1.11
llama 7B all F32 8 pp130 69.67 84.03 1.21
llama 7B all F32 8 pp240 57.52 76.86 1.34
llama 7B all F32 8 pp255 68.61 63.99 0.93
llama 7B all F32 8 pp256 57.10 76.73 1.34
llama 7B all F32 8 pp510 58.15 63.27 1.09
llama 7B all F32 8 pp512 57.39 65.34 1.14
llama 7B all F32 8 pp1023 56.31 61.57 1.09
llama 7B all F32 8 pp1024 56.42 63.13 1.12
llama 7B all F32 8 pp1025 55.91 64.92 1.16
llama 7B all F32 8 pp2048 54.29 64.00 1.18
llama 7B all F32 8 tg128 3.30 3.30 1.00
llama 7B all F32 16 pp1 3.55 3.57 1.01
llama 7B all F32 16 pp2 5.44 7.26 1.33
llama 7B all F32 16 pp3 8.37 10.87 1.30
llama 7B all F32 16 pp4 6.61 14.35 2.17
llama 7B all F32 16 pp5 7.75 17.88 2.31
llama 7B all F32 16 pp6 15.87 21.44 1.35
llama 7B all F32 16 pp7 11.32 24.51 2.17
llama 7B all F32 16 pp8 12.19 28.14 2.31
llama 7B all F32 16 pp9 22.32 31.75 1.42
llama 7B all F32 16 pp10 15.65 34.38 2.20
llama 7B all F32 16 pp11 12.30 37.65 3.06
llama 7B all F32 16 pp12 28.21 41.11 1.46
llama 7B all F32 16 pp13 19.83 43.12 2.17
llama 7B all F32 16 pp14 20.17 46.27 2.29
llama 7B all F32 16 pp15 33.49 49.29 1.47
llama 7B all F32 16 pp16 23.59 50.51 2.14
llama 7B all F32 16 pp30 48.91 80.46 1.65
llama 7B all F32 16 pp31 37.14 79.76 2.15
llama 7B all F32 16 pp32 36.78 82.64 2.25
llama 7B all F32 16 pp64 52.07 103.75 1.99
llama 7B all F32 16 pp65 39.03 76.48 1.96
llama 7B all F32 16 pp66 57.92 104.47 1.80
llama 7B all F32 16 pp120 63.20 112.86 1.79
llama 7B all F32 16 pp128 58.30 111.95 1.92
llama 7B all F32 16 pp130 57.93 112.51 1.94
llama 7B all F32 16 pp240 58.98 103.05 1.75
llama 7B all F32 16 pp255 62.83 84.99 1.35
llama 7B all F32 16 pp256 60.20 102.69 1.71
llama 7B all F32 16 pp510 58.07 82.11 1.41
llama 7B all F32 16 pp512 56.77 88.96 1.57
llama 7B all F32 16 pp1023 57.33 84.32 1.47
llama 7B all F32 16 pp1024 57.65 83.80 1.45
llama 7B all F32 16 pp1025 54.17 85.64 1.58
llama 7B all F32 16 pp2048 53.54 77.67 1.45
llama 7B all F32 16 tg128 3.38 3.32 0.98
llama 7B all F32 24 pp1 3.57 3.58 1.00
llama 7B all F32 24 pp2 4.44 7.18 1.62
llama 7B all F32 24 pp3 9.34 10.78 1.15
llama 7B all F32 24 pp4 7.36 14.26 1.94
llama 7B all F32 24 pp5 8.86 17.80 2.01
llama 7B all F32 24 pp6 18.73 21.35 1.14
llama 7B all F32 24 pp7 13.51 24.65 1.82
llama 7B all F32 24 pp8 14.72 28.18 1.91
llama 7B all F32 24 pp9 27.34 31.47 1.15
llama 7B all F32 24 pp10 19.02 34.47 1.81
llama 7B all F32 24 pp11 20.65 37.93 1.84
llama 7B all F32 24 pp12 36.08 41.20 1.14
llama 7B all F32 24 pp13 25.13 43.69 1.74
llama 7B all F32 24 pp14 25.49 47.22 1.85
llama 7B all F32 24 pp15 40.66 50.46 1.24
llama 7B all F32 24 pp16 28.48 52.54 1.84
llama 7B all F32 24 pp30 60.87 85.70 1.41
llama 7B all F32 24 pp31 46.29 85.91 1.86
llama 7B all F32 24 pp32 33.68 88.61 2.63
llama 7B all F32 24 pp64 63.22 115.66 1.83
llama 7B all F32 24 pp65 62.47 115.16 1.84
llama 7B all F32 24 pp66 72.74 116.25 1.60
llama 7B all F32 24 pp120 82.94 125.57 1.51
llama 7B all F32 24 pp128 71.71 125.38 1.75
llama 7B all F32 24 pp130 58.06 123.70 2.13
llama 7B all F32 24 pp240 71.65 115.68 1.61
llama 7B all F32 24 pp255 68.59 97.20 1.42
llama 7B all F32 24 pp256 67.18 114.58 1.71
llama 7B all F32 24 pp510 62.15 106.84 1.72
llama 7B all F32 24 pp512 60.77 107.10 1.76
llama 7B all F32 24 pp1023 59.90 94.97 1.59
llama 7B all F32 24 pp1024 59.98 96.93 1.62
llama 7B all F32 24 pp1025 59.06 94.01 1.59
llama 7B all F32 24 pp2048 59.30 91.46 1.54
llama 7B all F32 24 tg128 3.40 3.40 1.00
llama 7B all F32 32 pp1 3.44 3.03 0.88
llama 7B all F32 32 pp2 6.29 5.22 0.83
llama 7B all F32 32 pp3 9.26 10.28 1.11
llama 7B all F32 32 pp4 7.41 14.41 1.94
llama 7B all F32 32 pp5 8.90 16.51 1.85
llama 7B all F32 32 pp6 13.72 20.73 1.51
llama 7B all F32 32 pp7 12.95 25.04 1.93
llama 7B all F32 32 pp8 14.24 26.06 1.83
llama 7B all F32 32 pp9 26.56 29.83 1.12
llama 7B all F32 32 pp10 18.28 33.25 1.82
llama 7B all F32 32 pp11 19.46 33.97 1.75
llama 7B all F32 32 pp12 35.39 38.65 1.09
llama 7B all F32 32 pp13 23.80 42.63 1.79
llama 7B all F32 32 pp14 25.25 42.34 1.68
llama 7B all F32 32 pp15 44.43 43.85 0.99
llama 7B all F32 32 pp16 27.65 33.80 1.22
llama 7B all F32 32 pp30 62.07 76.51 1.23
llama 7B all F32 32 pp31 48.28 79.72 1.65
llama 7B all F32 32 pp32 50.50 83.07 1.64
llama 7B all F32 32 pp64 62.50 114.16 1.83
llama 7B all F32 32 pp65 72.63 119.77 1.65
llama 7B all F32 32 pp66 89.07 121.06 1.36
llama 7B all F32 32 pp120 91.80 131.97 1.44
llama 7B all F32 32 pp128 84.08 130.04 1.55
llama 7B all F32 32 pp130 84.99 128.60 1.51
llama 7B all F32 32 pp240 75.48 123.00 1.63
llama 7B all F32 32 pp255 82.80 125.00 1.51
llama 7B all F32 32 pp256 80.86 125.54 1.55
llama 7B all F32 32 pp510 79.07 103.85 1.31
llama 7B all F32 32 pp512 72.10 118.05 1.64
llama 7B all F32 32 pp1023 69.98 102.26 1.46
llama 7B all F32 32 pp1024 73.17 112.25 1.53
llama 7B all F32 32 pp1025 71.53 102.11 1.43
llama 7B all F32 32 pp2048 68.90 98.77 1.43
llama 7B all F32 32 tg128 3.13 3.16 1.01

Contributor Author

@slaren Thanks for the benchmark!
It looks good with the Intel CPU; FP32 is really good for PP...
With 24 threads it looks well balanced now.

@ggerganov: these differences are really strange; for now I don't see why. I'll try to run more tests later. It would be good to find out the cause, and if it is a configuration issue it would be good to be able to document it.
I tried disabling interleaving: tg is halved, but pp reaches almost the same peak.

I will try to make some changes so that pp doesn't slow down. 🤞
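
For reference, a rough back-of-the-envelope shows why tg tracks memory bandwidth while pp barely moves (the ~90 GB/s figure is an assumed ballpark, not a measurement from this thread): tg has to stream all of the weights once per token, so for a 7B model in BF16 (~14 GB of weights)

$$
t/s_{\text{tg}} \;\approx\; \frac{\text{memory bandwidth}}{\text{weight bytes}} \;\approx\; \frac{90\ \text{GB/s}}{14\ \text{GB}} \;\approx\; 6.4\ \text{t/s}
$$

which is roughly what the BF16 tg128 rows above show. Halving the effective bandwidth (e.g. by disabling interleaving) roughly halves tg, while pp reuses each weight across the whole batch and stays compute-bound.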

Owner

Even if you don't find a solution to this, I think it is fine to ignore the results from my Ryzen, because it is very likely that I have misconfigured something in the BIOS.

@Djip007 Djip007 force-pushed the perfo/tinyblas branch 3 times, most recently from b2dab60 to 30ae0d2 on December 14, 2024 at 21:06
- add bf16 support
- change dispatch strategy (thanks: ikawrakow/ik_llama.cpp#71)
- reduce memory bandwidth

simple tinyblas dispatch and more cache friendly
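
To make the "dispatch strategy" and "more cache friendly" points concrete, here is a rough sketch of the general idea (illustrative only: the tile sizes BM/BN, the layout assumptions and the function names below are made up, and this is not the actual tinyblas code):

```cpp
// Sketch of a cache-friendly tile dispatch for sgemm (NOT the actual tinyblas
// implementation). Idea: split the MxN output into BM x BN tiles sized so a
// tile's working set fits in cache, and hand whole tiles to threads, so every
// panel pulled from memory is reused across the full tile instead of being
// re-streamed per row -> less memory bandwidth per FLOP.
#include <algorithm>
#include <cstdint>

constexpr int64_t BM = 64;   // tile height (illustrative)
constexpr int64_t BN = 96;   // tile width  (illustrative)

// Layout assumptions for this sketch: A is MxK row-major (stride lda),
// B is NxK row-major (stride ldb), C is written as C[j*ldc + i].
// Naive scalar placeholder for the micro-kernel; the real code would use
// AVX2/AVX512 FMA register blocking here.
static void gemm_tile(const float *A, const float *B, float *C,
                      int64_t lda, int64_t ldb, int64_t ldc, int64_t K,
                      int64_t i0, int64_t i1, int64_t j0, int64_t j1) {
    for (int64_t j = j0; j < j1; ++j) {
        for (int64_t i = i0; i < i1; ++i) {
            float sum = 0.0f;
            for (int64_t k = 0; k < K; ++k) {
                sum += A[i*lda + k] * B[j*ldb + k];
            }
            C[j*ldc + i] = sum;
        }
    }
}

// Each thread (ith of nth) walks tiles in a static round-robin; work is split
// by whole tiles rather than rows, so a loaded panel of B stays hot in cache
// for the entire tile.
void gemm_dispatch(const float *A, const float *B, float *C,
                   int64_t M, int64_t N, int64_t K,
                   int64_t lda, int64_t ldb, int64_t ldc,
                   int ith, int nth) {
    const int64_t tiles_m = (M + BM - 1) / BM;
    const int64_t tiles_n = (N + BN - 1) / BN;
    const int64_t tiles   = tiles_m * tiles_n;

    for (int64_t t = ith; t < tiles; t += nth) {
        const int64_t ti = t % tiles_m;
        const int64_t tj = t / tiles_m;
        const int64_t i0 = ti*BM, i1 = std::min(i0 + BM, M);
        const int64_t j0 = tj*BN, j1 = std::min(j0 + BN, N);
        gemm_tile(A, B, C, lda, ldb, ldc, K, i0, i1, j0, j1);
    }
}
```

A scheme like this only pays off once N (the batch / prompt size) is large enough to fill tiles, which is consistent with the tables above: pp1..pp4 are roughly unchanged, while larger batches gain 2x and more.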