Added explicit greedy and updated readme
Maxusmusti committed Jan 22, 2024
1 parent 40c851b commit 86e3b72
Showing 3 changed files with 11 additions and 1 deletion.
2 changes: 2 additions & 0 deletions language/llama2-70b/SUT.py
```diff
@@ -169,6 +169,7 @@ def query_api(self, input):
         'parameters': {
             'max_new_tokens': 1024,
             'min_new_tokens': 1,
+            'decoding_method': "GREEDY"
         },
     }
```

```diff
@@ -390,6 +391,7 @@ def stream_api(self, input, response_ids):
         'parameters': {
             'max_new_tokens': 1024,
             'min_new_tokens': 1,
+            'decoding_method': "GREEDY"
         },
     }
```

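The lines added above pin the decoding strategy explicitly, making generation deterministic rather than sampled. A minimal sketch of how such a request payload might be assembled (the helper function and prompt are illustrative; the parameter names and values mirror the diff):

```python
def build_generation_payload(text: str) -> dict:
    # Illustrative helper mirroring the 'parameters' block added in SUT.py.
    # "GREEDY" decoding always picks the highest-probability token, so
    # repeated runs over the same input produce identical outputs.
    return {
        "inputs": text,
        "parameters": {
            "max_new_tokens": 1024,
            "min_new_tokens": 1,
            "decoding_method": "GREEDY",
        },
    }

payload = build_generation_payload("What is MLPerf?")
```

A payload like this would then be serialized and sent to the serving endpoint by `query_api` / `stream_api`.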
8 changes: 8 additions & 0 deletions language/llama2-70b/api-endpoint-artifacts/README.md
Prerequisites:
- Apply `secret.yaml`, `sa.yaml`, `serving-runtime.yaml`, then finally `model.yaml`
- Create a benchmark pod using `benchmark.yaml`

In the pod, before any benchmark, first run `cd inference/language/llama2-70b`

For the full accuracy benchmark (offline), run in the pod:
```
python3 -u main.py --scenario Offline --model-path ${CHECKPOINT_PATH} --api-serv
```

For the performance benchmark (offline), run the same command with `--accuracy` removed.


For the performance benchmark (server), run in the pod:
```
python3 -u main.py --scenario Server --model-path ${CHECKPOINT_PATH} --api-server <INSERT SERVER STREAM API CALL ENDPOINT> --api-model-name Llama-2-70b-chat-hf-caikit --mlperf-conf mlperf.conf --user-conf user.conf --total-sample-count 24576 --dataset-path ${DATASET_PATH} --output-log-dir server-logs --dtype float32 --device cpu 2>&1 | tee server_performance_log.log
```
(Configure the target QPS in `user.conf`)


NOTE: Hyperparams are currently configured for 8xH100
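For scripting repeated runs, the Server command above can be assembled programmatically. A sketch, assuming placeholder paths and a hypothetical endpoint supplied by the caller (the flag names and fixed values are taken verbatim from the command above):

```python
import shlex


def server_perf_cmd(checkpoint_path: str, dataset_path: str, endpoint: str) -> list[str]:
    # Build the Server-scenario benchmark invocation as an argv list.
    # Flags mirror the README command; the path and endpoint arguments
    # are placeholders provided by the caller.
    return shlex.split(
        "python3 -u main.py --scenario Server"
        f" --model-path {checkpoint_path}"
        f" --api-server {endpoint}"
        " --api-model-name Llama-2-70b-chat-hf-caikit"
        " --mlperf-conf mlperf.conf --user-conf user.conf"
        " --total-sample-count 24576"
        f" --dataset-path {dataset_path}"
        " --output-log-dir server-logs --dtype float32 --device cpu"
    )


cmd = server_perf_cmd(
    "/models/llama-2-70b",          # placeholder for ${CHECKPOINT_PATH}
    "/data/open_orca.pkl",          # placeholder for ${DATASET_PATH}
    "https://host/api/v1/stream",   # placeholder stream endpoint
)
```

The resulting list could be passed to `subprocess.run(cmd)` inside the benchmark pod.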
2 changes: 1 addition & 1 deletion language/llama2-70b/api-endpoint-artifacts/benchmark.yaml
```diff
@@ -6,7 +6,7 @@ spec:
   restartPolicy: Never
   containers:
     - name: mlperf-env
-      image: quay.io/meyceoz/mlperf-inference:v3-base
+      image: quay.io/meyceoz/mlperf-inference:v3-greedy
       resources:
         requests:
           memory: 20000Mi
```
