System Info
TGI Image: ghcr.io/huggingface/neuronx-tgi:0.0.23
Platform:
- Platform: Linux-5.15.0-1031-aws-x86_64-with-glibc2.35
- Python version: 3.10.12
Python packages:
- `optimum-neuron` version: 0.0.23
- `neuron-sdk` version: 2.18.0
- `optimum` version: 1.20.0
- `transformers` version: 4.41.1
- `huggingface_hub` version: 0.23.2
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.965
- `neuronx-cc` version: 2.13.66.0+6dfecc895
- `neuronx-distributed` version: NA
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21
Neuron Driver:
aws-neuronx-collectives/now 2.20.22.0-c101c322e amd64 [installed,local]
aws-neuronx-dkms/now 2.16.7.0 amd64 [installed,local]
aws-neuronx-runtime-lib/now 2.20.22.0-1b3ca6425 amd64 [installed,local]
aws-neuronx-tools/now 2.17.1.0 amd64 [installed,local]
Who can help?
@dacorvo
Reproduction (minimal, reproducible, runnable)
I use optimum-cli to export the Llama3 model from the official repository with this command:
optimum-cli export neuron --model meta-llama/Meta-Llama-3-8B \
  --batch_size 1 \
  --sequence_length 8192 \
  --auto_cast_type fp16 \
  --num_cores 24 \
  /data/llama3_neuron/
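If it helps reproduce, the same export can also be done from the optimum-neuron Python API; a minimal sketch, assuming the same model and compilation parameters as the CLI command above:

# Equivalent export via the optimum-neuron Python API (sketch; mirrors
# the CLI parameters above).
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    export=True,           # compile the checkpoint for Neuron
    batch_size=1,
    sequence_length=8192,
    auto_cast_type="fp16",
    num_cores=24,
)
model.save_pretrained("/data/llama3_neuron/")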
I use the same image for the export as I use to run TGI: ghcr.io/huggingface/neuronx-tgi:0.0.23
Then I run the TGI container with this command:
docker run -p 8080:80 \
  --rm \
  -it \
  -v $(pwd)/data:/data \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  --device=/dev/neuron2 \
  --device=/dev/neuron3 \
  --device=/dev/neuron4 \
  --device=/dev/neuron5 \
  --device=/dev/neuron6 \
  --device=/dev/neuron7 \
  --device=/dev/neuron8 \
  --device=/dev/neuron9 \
  --device=/dev/neuron10 \
  --device=/dev/neuron11 \
  --privileged \
  -e HF_TOKEN=... \
  -e HF_AUTO_CAST_TYPE="fp16" \
  -e HF_NUM_CORES=24 \
  ghcr.io/huggingface/neuronx-tgi:latest \
  --model-id /data/llama3_neuron \
  --max-batch-size 1 \
  --max-input-length 3164 \
  --max-total-tokens 8192
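For reference, requests go against the standard TGI /generate endpoint; a minimal sketch of a single request (the prompt and max_new_tokens here are illustrative, not the exact values from my tests):

# Single request to the TGI /generate endpoint (sketch; prompt and
# parameters are placeholders).
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 256},
    },
    timeout=600,
)
print(resp.json()["generated_text"])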
I run simple load tests with Locust, emulating 10 users sending concurrent requests of variable length, and see that all Neuron cores are only loaded to around 60% and throughput is far from ideal (0.1-0.2 requests per second).
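A minimal sketch of the kind of Locust script used (prompt lengths, wait times, and max_new_tokens are illustrative):

# locustfile.py -- emulates users posting variable-length prompts
# to the TGI /generate endpoint (values here are illustrative).
import random

from locust import HttpUser, between, task


class TgiUser(HttpUser):
    wait_time = between(0.5, 2)  # small think time between requests

    @task
    def generate(self):
        # Vary the prompt length, staying under --max-input-length 3164
        prompt = "Tell me a story. " * random.randint(1, 150)
        self.client.post(
            "/generate",
            json={"inputs": prompt, "parameters": {"max_new_tokens": 512}},
        )

Run with 10 users, e.g.: locust -f locustfile.py --headless -u 10 -r 10 -H http://localhost:8080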
Expected behavior
Neuron cores are loaded close to 100%.