System Info
TGI Image: ghcr.io/huggingface/neuronx-tgi:0.0.23
Platform:
- Platform: Linux-5.15.0-1031-aws-x86_64-with-glibc2.35
- Python version: 3.10.12
Python packages:
- `optimum-neuron` version: 0.0.23
- `neuron-sdk` version: 2.18.0
- `optimum` version: 1.20.0
- `transformers` version: 4.41.1
- `huggingface_hub` version: 0.23.2
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.965
- `neuronx-cc` version: 2.13.66.0+6dfecc895
- `neuronx-distributed` version: NA
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21
Neuron Driver:
aws-neuronx-collectives/now 2.20.22.0-c101c322e amd64 [installed,local]
aws-neuronx-dkms/now 2.16.7.0 amd64 [installed,local]
aws-neuronx-runtime-lib/now 2.20.22.0-1b3ca6425 amd64 [installed,local]
aws-neuronx-tools/now 2.17.1.0 amd64 [installed,local]
Who can help?
@dacorvo
Reproduction (minimal, reproducible, runnable)
I use optimum-cli to export the Llama3 model from the official repository with this command:
optimum-cli export neuron --model meta-llama/Meta-Llama-3-8B \
  --batch_size 1 \
  --sequence_length 8192 \
  --auto_cast_type fp16 \
  --num_cores 24 \
  /data/llama3_neuron/
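If it helps reproduce, the same export can also be done from the optimum-neuron Python API; a minimal sketch, assuming the same model and compilation parameters as the CLI command above:

# Equivalent export via the optimum-neuron Python API (sketch; mirrors
# the CLI parameters above).
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    export=True,           # compile the checkpoint for Neuron
    batch_size=1,
    sequence_length=8192,
    auto_cast_type="fp16",
    num_cores=24,
)
model.save_pretrained("/data/llama3_neuron/")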
I use the same image for the export as I use to run TGI: ghcr.io/huggingface/neuronx-tgi:0.0.23
Then I run the TGI container with this command:
docker run -p 8080:80 \
  --rm \
  -it \
  -v $(pwd)/data:/data \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  --device=/dev/neuron2 \
  --device=/dev/neuron3 \
  --device=/dev/neuron4 \
  --device=/dev/neuron5 \
  --device=/dev/neuron6 \
  --device=/dev/neuron7 \
  --device=/dev/neuron8 \
  --device=/dev/neuron9 \
  --device=/dev/neuron10 \
  --device=/dev/neuron11 \
  --privileged \
  -e HF_TOKEN=... \
  -e HF_AUTO_CAST_TYPE="fp16" \
  -e HF_NUM_CORES=24 \
  ghcr.io/huggingface/neuronx-tgi:latest \
  --model-id /data/llama3_neuron \
  --max-batch-size 1 \
  --max-input-length 3164 \
  --max-total-tokens 8192
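For reference, requests go against the standard TGI /generate endpoint; a minimal sketch of a single request (the prompt and max_new_tokens here are illustrative, not the exact values from my tests):

# Single request to the TGI /generate endpoint (sketch; prompt and
# parameters are placeholders).
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 256},
    },
    timeout=600,
)
print(resp.json()["generated_text"])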
I run simple load tests with Locust, emulating 10 users sending concurrent requests of variable length, and see that all Neuron cores are only loaded to around 60% and throughput is far from ideal (0.1-0.2 requests per second).
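A minimal sketch of the kind of Locust script used (prompt lengths, wait times, and max_new_tokens are illustrative):

# locustfile.py -- emulates users posting variable-length prompts
# to the TGI /generate endpoint (values here are illustrative).
import random

from locust import HttpUser, between, task


class TgiUser(HttpUser):
    wait_time = between(0.5, 2)  # small think time between requests

    @task
    def generate(self):
        # Vary the prompt length, staying under --max-input-length 3164
        prompt = "Tell me a story. " * random.randint(1, 150)
        self.client.post(
            "/generate",
            json={"inputs": prompt, "parameters": {"max_new_tokens": 512}},
        )

Run with 10 users, e.g.: locust -f locustfile.py --headless -u 10 -r 10 -H http://localhost:8080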
Expected behavior
Neuron cores are loaded close to 100%.