llamafile prompt eval, execution strategy when multiple user client queries at the same time #652

beartell · 2024-12-05T06:49:23Z

When we run similar commands in both llamafile and llama.cpp, as below:

./gemma-2-2b-it.Q6_K.llamafile --port 8082 --nobrowser --host 172.29.102.76 --mlock -fa -np 4 -b 16384
./llama-server -m gemma-2-2b-it-Q6_K.gguf --host 172.29.102.76 --port 8082 --mlock -c 128800 -np 4 -b 16384 -ub 128000 -fa

Both commands start an HTTP server in front of the llama.cpp layer. After this step, multiple users can access the well-known web UI and submit their prompts, with or without additional text. What is llamafile's strategy for queueing these requests when several users send them at the same time? There are terms like slot, task, etc. How can we calculate the practical multi-user limit of this llamafile executable on this hardware, for instance? Is it worthless to optimize at the thread level, the parallel execution level, etc., so that we should just leave the defaults?
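As a rough sketch of the kind of estimate being asked about: assuming the server splits the total context given by `-c` evenly across the `-np` slots and makes any request beyond the slot count wait for a free slot (an assumption about llama.cpp-style servers, not something confirmed here for llamafile), the per-slot context and the number of waiting requests can be computed as below. The function names and the figure of 10 simultaneous users are hypothetical.

```python
def per_slot_context(n_ctx: int, n_parallel: int) -> int:
    """Context tokens available to each slot, assuming an even split of n_ctx."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return n_ctx // n_parallel


def waiting_requests(concurrent_users: int, n_parallel: int) -> int:
    """Requests that have to wait, assuming one active request per slot."""
    return max(0, concurrent_users - n_parallel)


# Values taken from the llama-server command above: -c 128800 -np 4
n_ctx = 128800
n_parallel = 4

slot_ctx = per_slot_context(n_ctx, n_parallel)
queued = waiting_requests(10, n_parallel)  # 10 users is an illustrative figure

print(f"context per slot: {slot_ctx} tokens")
print(f"requests waiting with 10 simultaneous users: {queued}")
```

Under these assumptions, a prompt plus its generated tokens has to fit within the per-slot figure, not within the full `-c` value.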

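| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma2 2B Q6_K | 2.00 GiB | 2.61 B | CPU | 64 | 1 | pp512 | 363.60 ± 11.40 |
| gemma2 2B Q6_K | 2.00 GiB | 2.61 B | CPU | 64 | 1 | tg128 | 14.44 ± 0.49 |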
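To put these two numbers in per-request terms, here is a back-of-envelope sketch. It assumes the single-stream rates above (pp512 for prompt processing, tg128 for token generation) apply unchanged to one request running alone. With several slots sharing the same 64 threads, the real per-request rates will differ, so treat the result as a best-case figure for one user. The helper name and the 512/128 token counts are illustrative.

```python
PP_TOKENS_PER_S = 363.60  # prompt processing rate from the pp512 row
TG_TOKENS_PER_S = 14.44   # token generation rate from the tg128 row


def estimate_request_seconds(prompt_tokens: int, generated_tokens: int) -> float:
    """Best-case latency for one request running alone at the benchmarked rates."""
    prompt_time = prompt_tokens / PP_TOKENS_PER_S
    generation_time = generated_tokens / TG_TOKENS_PER_S
    return prompt_time + generation_time


# Same shape as the benchmark: 512 prompt tokens, 128 generated tokens
seconds = estimate_request_seconds(512, 128)
print(f"estimated time for one request: {seconds:.2f} s")
```

Generation dominates the total here, so the tg rate under concurrent load is the number that matters most when estimating how many users can be served with acceptable latency.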

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 45 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6430
CPU family: 6
Model: 143
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 4
Stepping: 8
BogoMIPS: 4200.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear flush_l1d arch_capabilities
Virtualization features:
Hypervisor vendor: VMware
Virtualization type: full
Caches (sum of all):
L1d: 3 MiB (64 instances)
L1i: 2 MiB (64 instances)
L2: 128 MiB (64 instances)
L3: 240 MiB (4 instances)
NUMA:
NUMA node(s): 8
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
NUMA node4 CPU(s): 32-39
NUMA node5 CPU(s): 40-47
NUMA node6 CPU(s): 48-55
NUMA node7 CPU(s): 56-63

               total        used        free      shared  buff/cache   available
Mem:          128384       10077       87951           8       33659      118306
Swap:           2047         278        1769
