@@ -49,6 +49,17 @@ It can be used to benchmark any text generation server that exposes an OpenAI-co
## Get started
+ ### Install
+
+ If you have [cargo](https://rustup.rs/) already installed:
+ ```bash
+ cargo install --git https://github.com/huggingface/inference-benchmarker/
+ ```
+
+ Or download the [latest released binary](https://github.com/huggingface/inference-benchmarker/releases/latest).
+
+ Or you can use the Docker image, as sketched below.
+
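A minimal sketch of the Docker route (image name, mount path, and environment variable are taken from the Docker instructions shown later in this diff; adjust the model and token to your setup):

```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your HF READ token>
# mount the current directory so benchmark results are saved locally
docker run --rm -it --net host \
    -v $(pwd):/opt/inference-benchmarker/results \
    -e "HF_TOKEN=$HF_TOKEN" \
    ghcr.io/huggingface/inference-benchmarker:latest \
    inference-benchmarker --tokenizer-name "$MODEL" --url http://localhost:8080 --profile chat
```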
### Run a benchmark
#### 1. Start an inference server
@@ -76,22 +87,12 @@ docker run --runtime nvidia --gpus all \
--model $MODEL
```
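Before launching a benchmark, it may be worth confirming the server is actually serving requests. A quick sanity check, assuming an OpenAI-compatible server on port 8080 as configured above (this is the standard OpenAI model-listing endpoint, not anything specific to this tool):

```bash
# list the models exposed by the OpenAI-compatible endpoint
curl http://localhost:8080/v1/models
```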
- #### 2. Run a benchmark using Docker image
+
+ #### 2. Run a benchmark
```shell
- MODEL=meta-llama/Llama-3.1-8B-Instruct
- HF_TOKEN=<your HF READ token>
- # run a benchmark to evaluate the performance of the model for chat use case
- # we mount results to the current directory
- $ docker run \
- --rm \
- -it \
- --net host \
- -v $(pwd):/opt/inference-benchmarker/results \
- -e "HF_TOKEN=$HF_TOKEN" \
- ghcr.io/huggingface/inference-benchmarker:latest \
- inference-benchmarker \
- --tokenizer-name "$MODEL" \
+ inference-benchmarker \
+ --tokenizer-name "meta-llama/Llama-3.1-8B-Instruct" \
--url http://localhost:8080 \
--profile chat
```
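If the tokenizer is gated on the Hugging Face Hub (as the Llama models are), the token that the Docker instructions passed into the container will likely still be needed in your environment when running the binary directly; a sketch, assuming the same HF_TOKEN variable:

```bash
# a read token from https://huggingface.co/settings/tokens, needed for gated models
export HF_TOKEN=<your HF READ token>
```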
@@ -132,16 +133,7 @@ Available modes:
Example running a benchmark at a fixed request rate:
```shell
- MODEL=meta-llama/Llama-3.1-8B-Instruct
- HF_TOKEN=<your HF READ token>
- $ docker run \
- --rm \
- -it \
- --net host \
- -v $(pwd):/opt/inference-benchmarker/results \
- -e "HF_TOKEN=$HF_TOKEN" \
- ghcr.io/huggingface/inference-benchmarker:latest \
- inference-benchmarker \
+ inference-benchmarker \
--tokenizer-name "meta-llama/Llama-3.1-8B-Instruct" \
--max-vus 800 \
--duration 120s \