OpenShift AI Caikit+TGIS MLPerf Inference Implementation for Llama2-70b #1

Open: wants to merge 75 commits into base: master

Changes from all 75 commits:

4d0e246
Add TEST01 for stable diffusion XL (#1574)
nvyihengz Jan 23, 2024
901ce67
Fixes for report generation and submission checker for models without…
arjunsuresh Jan 25, 2024
190413d
Add Llama2 checks and log additional values (#1578)
pgmpablo157321 Jan 25, 2024
27ef43a
🔄 synced local 'tools/submission/power/sources_checksums.json' with r…
mlcommons-bot Jan 26, 2024
9b8006f
Fix image list mismatch (#1579)
pgmpablo157321 Jan 26, 2024
180014a
#1558 update llama2 reference fp32 accuracy (#1583)
nvzhihanj Jan 26, 2024
523316e
🔄 synced local 'tools/submission/power/power_checker.py' with remote …
mlcommons-bot Jan 26, 2024
3ad8534
Update the main README.md for 4.0 (#1586)
arjunsuresh Jan 26, 2024
a04b1f5
Ignore trailing whitespace lines in spl.txt files (#1584)
psyhtest Jan 30, 2024
4bdf56f
Add stable diffusion and llama2 to the final spreadsheet (#1589)
arjunsuresh Feb 1, 2024
3a902e5
Add support to dump 10 compliance images during accuracy run for SDXL…
nvyihengz Feb 1, 2024
cc3daae
#1598: fix token and sample logging for Llama2 when accuracy_log_samp…
nvzhihanj Feb 1, 2024
473053f
Fix loadgen token metrics latency constrains (#1596)
pgmpablo157321 Feb 2, 2024
104d855
Add sample length check to test06 (#1603)
pgmpablo157321 Feb 6, 2024
357ccef
Enable equal issue mode for LLM benchmarks (#1610)
nvzhihanj Feb 7, 2024
44285d9
Set completed samples per second as llama metric (#1613)
pgmpablo157321 Feb 7, 2024
d45a66c
Add upper limit to tokens per sample (#1612)
pgmpablo157321 Feb 7, 2024
d7dba08
Remove loadgen warnings (#1608)
pgmpablo157321 Feb 7, 2024
b0777f0
Update README.md - remove unwanted lines in CM commands (#1601)
arjunsuresh Feb 7, 2024
3190d09
Typo fix in README.md (#1588)
arjunsuresh Feb 7, 2024
840435a
Update README.md with CM commands to download stable-diffusion, gptj …
arjunsuresh Feb 7, 2024
817dd96
Turn equal issue mode off for TEST06 (#1615)
nvzhihanj Feb 8, 2024
0ed5190
Fix submission checker and TEST06 for Llama2 (#1616)
nvzhihanj Feb 8, 2024
f06b920
Bugfix: equal-issue mode on offline causing accuracy run to fail (3D-…
nv-jinhosuh Feb 12, 2024
f9a643c
Add number of tokens to offline (#1623)
pgmpablo157321 Feb 12, 2024
486a629
Hotfix: DLRMv2 Audit Test01 fallback failure (#1626)
nv-jinhosuh Feb 15, 2024
de31ee2
Fix preprocess_submission script to copy the code path while inferrin…
arjunsuresh Feb 20, 2024
268bc9d
Add batching to SD reference
pgmpablo157321 Feb 13, 2024
5d0c221
Add Rclone-Cloudflare download instructions to README.md
nathanw-mlc Feb 21, 2024
dc94ae3
Add Rclone-Cloudflare download instructions to README.md
nathanw-mlc Feb 21, 2024
ab747c4
Minor wording edit to README.md
nathanw-mlc Feb 21, 2024
d037f22
Add Rclone-Cloudflare download instructions to README.md
nathanw-mlc Feb 21, 2024
147a91a
Add Rclone-GDrive download instructions to README.md
nathanw-mlc Feb 21, 2024
15d14c9
Add new and old instructions to README.md
nathanw-mlc Feb 21, 2024
46a35c2
Tweak language in README.md
nathanw-mlc Feb 21, 2024
c0bd844
Language tweak in README.md
nathanw-mlc Feb 21, 2024
8219069
Minor language tweak in README.md
nathanw-mlc Feb 21, 2024
e39003a
Fix typo in README.md
nathanw-mlc Feb 23, 2024
396d3f8
TGI support first pass
Maxusmusti Jan 10, 2024
84c9673
Added functional API querying
Maxusmusti Jan 11, 2024
dbab9f0
Changed batch size and str bug fix
Maxusmusti Jan 12, 2024
ffcbc0e
Added v1 offline artifacts
Maxusmusti Jan 15, 2024
86c594e
server scenario first pass
Maxusmusti Jan 16, 2024
8751a35
Functional server scenario
Maxusmusti Jan 17, 2024
6ff4090
Added concurrent request support for server scenario
Maxusmusti Jan 18, 2024
2225a45
Added explicit greedy and updated readme
Maxusmusti Jan 18, 2024
22eb574
Update image to include omitted mlperf conf
Maxusmusti Jan 22, 2024
7bb4c1b
Update for new image version
Maxusmusti Jan 22, 2024
e96c8a6
Updated model serving yamls
Maxusmusti Jan 22, 2024
84475d7
First pass: standalone TGIS, grpc, batching
Maxusmusti Jan 25, 2024
58e16ea
Streaming first pass
Maxusmusti Jan 25, 2024
f35c17e
Updated server impl
Maxusmusti Jan 25, 2024
132f725
Fully functional, updated README
Maxusmusti Jan 25, 2024
9f31f19
Update default client-side batches
Maxusmusti Feb 12, 2024
654dda5
v8 Update
Maxusmusti Feb 12, 2024
c7f699d
GPT-J first pass
Maxusmusti Feb 15, 2024
081024f
Offline functional, now testing server
Maxusmusti Feb 15, 2024
43afdea
v1 full implementation
Maxusmusti Feb 15, 2024
f8cc5ba
Update README for gpt-j
Maxusmusti Feb 16, 2024
8ae2bf4
First pass multi-endpoint
Maxusmusti Feb 16, 2024
0ad354e
Full multiple endpoint support
Maxusmusti Feb 19, 2024
6b117e0
Random gpt-j vllm experimental bits
Maxusmusti Feb 21, 2024
ebf0710
Change file names
Maxusmusti Feb 21, 2024
a357cf4
Added vllm server + multi-endpoint for gpt-j
Maxusmusti Feb 22, 2024
a505e83
Minor adjustments
Maxusmusti Feb 22, 2024
ef6b3db
Updated for exact values
Maxusmusti Feb 22, 2024
230d495
Update llama-2 with vllm
Maxusmusti Feb 27, 2024
48a4396
Fixed output cap bug
Maxusmusti Feb 27, 2024
57e241d
Fix llama server bug
Maxusmusti Feb 27, 2024
6ef5023
Added v10 image for llama
Maxusmusti Feb 27, 2024
3bc09fa
Updated token gen count for offline llama
Maxusmusti Feb 28, 2024
38e3aea
Updated READMEs
Maxusmusti Feb 28, 2024
84f1aac
Updated server first token inclusion in first query
Maxusmusti Feb 29, 2024
c7eeef9
Updated image in yaml
Maxusmusti Feb 29, 2024
8ab5998
Fix first token dtype
Maxusmusti Feb 29, 2024
18 changes: 17 additions & 1 deletion README.md
@@ -15,7 +15,23 @@ Please see the [MLPerf Inference benchmark paper](https://arxiv.org/abs/1911.025
```
## MLPerf Inference v4.0 (submission deadline February 23, 2024)

Code freeze coming soon...
An extra one-week extension is allowed only for llama2-70b submissions. For submissions, please use the master branch and any commit since the [4.0 seed release](https://github.com/mlcommons/inference/commit/8e36925bd36a503e39fcbbc488e9e46126f079ed), although it is best to use the latest commit. The v4.0 tag will be created from the master branch after the results are published.

For power submissions, please use [SPEC PTD 1.10](https://github.com/mlcommons/power/tree/main/inference_v1.0) (needs special access) and any commit of the power-dev repository after the [code freeze](https://github.com/mlcommons/power-dev/commit/4e026f43481f46ad57d2464d28924018444b0428).

| model | reference app | framework | dataset | category |
| ---- | ---- | ---- | ---- | ---- |
| resnet50-v1.5 | [vision/classification_and_detection](https://github.com/mlcommons/inference/tree/master/vision/classification_and_detection) | tensorflow, onnx, tvm, ncnn | imagenet2012 | edge,datacenter |
| retinanet 800x800 | [vision/classification_and_detection](https://github.com/mlcommons/inference/tree/master/vision/classification_and_detection) | pytorch, onnx | openimages resized to 800x800| edge,datacenter |
| bert | [language/bert](https://github.com/mlcommons/inference/tree/master/language/bert) | tensorflow, pytorch, onnx | squad-1.1 | edge,datacenter |
| dlrm-v2 | [recommendation/dlrm_v2](https://github.com/mlcommons/inference/tree/master/recommendation/dlrm_v2/pytorch) | pytorch | Multihot Criteo Terabyte | datacenter |
| 3d-unet | [vision/medical_imaging/3d-unet-kits19](https://github.com/mlcommons/inference/tree/master/vision/medical_imaging/3d-unet-kits19) | pytorch, tensorflow, onnx | KiTS19 | edge,datacenter |
| rnnt | [speech_recognition/rnnt](https://github.com/mlcommons/inference/tree/master/speech_recognition/rnnt) | pytorch | OpenSLR LibriSpeech Corpus | edge,datacenter |
| gpt-j | [language/gpt-j](https://github.com/mlcommons/inference/tree/master/language/gpt-j)| pytorch | CNN-Daily Mail | edge,datacenter |
| stable-diffusion-xl | [text_to_image](https://github.com/mlcommons/inference/tree/master/text_to_image) | pytorch | COCO 2014| edge,datacenter |
| llama2-70b | [language/llama2-70b](https://github.com/mlcommons/inference/tree/master/language/llama2-70b) | pytorch | OpenOrca | datacenter |

* The framework listed here is the one used by the reference implementation; submitters are free to use their own frameworks to run the benchmark.

## MLPerf Inference v3.1 (submission August 18, 2023)
Please use [v3.1 tag](https://github.com/mlcommons/inference/releases/tag/v3.1) (```git checkout v3.1```) if you would like to reproduce the v3.1 results.
9 changes: 9 additions & 0 deletions compliance/nvidia/TEST01/stable-diffusion-xl/audit.config
@@ -0,0 +1,9 @@
# The format of this config file is 'key = value'.
# The key has the format 'model.scenario.key'. Value is mostly int64_t.
# Model may be '*' as a wildcard. In that case the value applies to all models.
# All times are in milliseconds

# mode dictionary (0 = submission, 1 = accuracy, 2 = performance, 3 = find peak perf)
*.*.mode = 2
*.*.accuracy_log_rng_seed = 720381539243781796
*.*.accuracy_log_sampling_target = 128
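
The `model.scenario.key` convention with `*` wildcards can be illustrated with a small, hypothetical lookup helper. This is only a sketch of the rule described in the config comments; LoadGen's own parser is the authoritative implementation, and the precedence of specific entries over wildcards is an assumption for illustration.

```
# Hypothetical sketch of 'model.scenario.key' resolution: '*' acts as a wildcard,
# and a specific model/scenario entry is tried before the wildcard one.
def lookup(config, model, scenario, key):
    for m in (model, "*"):
        for s in (scenario, "*"):
            value = config.get((m, s, key))
            if value is not None:
                return value
    return None

# Entries mirroring the audit.config above.
config = {
    ("*", "*", "mode"): 2,  # 2 = performance mode
    ("*", "*", "accuracy_log_rng_seed"): 720381539243781796,
    ("*", "*", "accuracy_log_sampling_target"): 128,
}

print(lookup(config, "stable-diffusion-xl", "Offline", "accuracy_log_sampling_target"))  # 128
```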
5 changes: 4 additions & 1 deletion compliance/nvidia/TEST06/README.md
@@ -8,9 +8,10 @@ This repository provides the config files and scripts to run and verify TEST 06

## Introduction

The purpose of this test is to ensure the consistency of the output of the Llama2 model and avoid a potential EOS exploit. This test will make a performance run, with a limit of 100 samples and logging them into `mlperf_log_accuracy.json`. To achieve a passing result in this test, two criteria must be met:
The purpose of this test is to ensure the consistency of the Llama2 model's output and to avoid a potential EOS exploit. This test makes a performance run with a limit of 100 samples, logging them into `mlperf_log_accuracy.json`. To achieve a passing result in this test, three criteria must be met:
- In the case that the first token is reported independently (not applicable to the Offline scenario), it should match the first token of the model output for every query.
- For each query, the model output should end with zero or one EOS token (see the sketch below).
- The number of reported tokens should match the length of the model's output.
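
A minimal sketch of the EOS criterion follows. It is an illustration only, not the reference check in `run_verification.py`, and it assumes outputs are lists of integer token ids and that the Llama2 EOS token id is 2.

```
# Sketch of the EOS criterion: an output may end with zero or one EOS token.
# Assumptions: integer token ids, Llama2 EOS token id 2.
def ends_with_at_most_one_eos(output_tokens, eos_token_id=2):
    trailing_eos = 0
    for tok in reversed(output_tokens):
        if tok != eos_token_id:
            break
        trailing_eos += 1
    return trailing_eos <= 1

assert ends_with_at_most_one_eos([450, 4996, 2])         # single EOS: passes
assert not ends_with_at_most_one_eos([450, 4996, 2, 2])  # EOS padding: fails
```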

## Requisites

@@ -36,6 +37,7 @@ Expected output
```
First token check pass: True
EOS check pass: True
Sample length check pass: True
TEST06 verification complete
```

@@ -44,5 +46,6 @@ Or:
```
First token check pass: Skipped
EOS check pass: True
Sample length check pass: True
TEST06 verification complete
```
4 changes: 3 additions & 1 deletion compliance/nvidia/TEST06/audit.config
@@ -8,4 +8,6 @@
*.*.accuracy_log_rng_seed = 720381539243781796
*.*.accuracy_log_sampling_target = 100
*.*.min_query_count = 100
*.*.min_duration = 0
*.*.min_duration = 0
# Turn off equal issue mode for TEST06
*.*.sample_concatenate_permutation = 0
21 changes: 17 additions & 4 deletions compliance/nvidia/TEST06/run_verification.py
@@ -61,13 +61,21 @@ def first_token_check(acc_data, dtype):
for sample in acc_data:
data = np.frombuffer(bytes.fromhex(sample["data"]), dtype=dtype)
token_data = np.frombuffer(bytes.fromhex(sample["token_data"]), dtype=dtype)
print(token_data)
for t1, t2 in zip(data, token_data):
if t1 != t2:
return False

return True

def sample_len_check(acc_data, dtype):
for sample in acc_data:
data = np.frombuffer(bytes.fromhex(sample["data"]), dtype=dtype)
token_count = int(sample["token_count"])
if len(data) != token_count:
return False
return True


def main():
args = get_args()
accuracy_file = os.path.join(args.compliance_dir, "mlperf_log_accuracy.json")
@@ -90,6 +98,8 @@ def main():
print("Unexpected error occured while doing the first token check")
first_token_pass = False

sample_len_pass = sample_len_check(acc_data, DTYPE_MAP[args.dtype])

# Construct output based on the results of checks
output = ""
# Add first token check
@@ -101,14 +111,17 @@
# Add EOS check
output += f"EOS check pass: {eos_pass}\n"

if eos_pass and first_token_pass:
# Add sample length check
output += f"Sample length check pass: {sample_len_pass}\n"

if eos_pass and first_token_pass and sample_len_pass:
output += "TEST06 verification complete\n"
else:
output += "TEST06 verification failed\n"

# Output test output to console and folder
output_dir = args.output_dir
output_accuracy_dir = os.path.join(args.output_dir, "accuracy")
output_dir = os.path.join(args.output_dir, "TEST06")
output_accuracy_dir = os.path.join(output_dir, "accuracy")

if not os.path.isdir(output_dir):
os.makedirs(output_dir)
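To make the new check concrete, here is a small, self-contained usage sketch of `sample_len_check` with a synthetic accuracy-log entry. The `data` and `token_count` field names mirror the diff above; the token values and the int64 dtype are illustrative assumptions.

```
# Usage sketch of sample_len_check with a synthetic accuracy-log entry.
import numpy as np

def sample_len_check(acc_data, dtype):
    for sample in acc_data:
        data = np.frombuffer(bytes.fromhex(sample["data"]), dtype=dtype)
        if len(data) != int(sample["token_count"]):
            return False
    return True

tokens = np.array([450, 4996, 3974, 2], dtype=np.int64)
entry = {"data": tokens.tobytes().hex(), "token_count": "4"}
print(sample_len_check([entry], np.int64))                          # True
print(sample_len_check([{**entry, "token_count": "5"}], np.int64))  # False: count mismatch
```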
4 changes: 0 additions & 4 deletions language/bert/README.md
@@ -51,8 +51,6 @@ The below CM command will launch the SUT server

```
cm run script --tags=generate-run-cmds,inference --model=bert-99 --backend=pytorch \
--rerun --adr.mlperf-implementation.version=custom \
--adr.mlperf-implementation.tags=_repo.https://github.com/GATEOVerflow/inference \
--mode=performance --device=cuda --quiet --test_query_count=1000 --network=sut
```

@@ -61,8 +59,6 @@ Once the SUT server is launched, the below command can be run on the loadgen node

```
cm run script --tags=generate-run-cmds,inference --model=bert-99 --backend=pytorch --rerun \
--adr.mlperf-implementation.version=custom \
--adr.mlperf-implementation.tags=_repo.https://github.com/GATEOVerflow/inference \
--mode=performance --device=cuda --quiet --test_query_count=1000 \
--sut_servers,=http://localhost:8000 --network=lon
```
30 changes: 28 additions & 2 deletions language/gpt-j/README.md
@@ -68,11 +68,37 @@ pip install datasets
python prepare-calibration.py --calibration-list-file calibration-list.txt --output-dir </path/to/output-folder>
```
### Download GPT-J model
Please download the fine-tuned GPT-J checkpoint from [here](https://cloud.mlcommons.org/index.php/s/QAZ2oM94MkFtbQx) and extract it as model/. The download_gptj.py only downloads the default huggingface model which is not fine-tuned on CNN-Daily mail dataset.
Please download the fine-tuned GPT-J checkpoint using the instructions below. The `download_gptj.py` script only downloads the default Hugging Face model, which is not fine-tuned on the CNN-DailyMail dataset.

#### CM method

The following MLCommons CM commands can be used to programmatically download the model checkpoint.

```
pip install cmind
cm pull repo mlcommons@ck
cm run script --tags=get,ml-model,gptj,_pytorch,_rclone -j
```

#### Manual method

The above command automatically runs a set of Rclone commands to download the data from a Cloudflare R2 bucket. However, if you'd like to run the Rclone commands manually, you can do so as follows:

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
wget https://cloud.mlcommons.org/index.php/s/QAZ2oM94MkFtbQx/download --output-document checkpoint.zip
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the model checkpoint:

```
rclone copy mlc-inference:mlcommons-inference-wg-public/gpt-j ./model -P
```
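
If you want to sanity-check the download, a minimal sketch is shown below. It assumes the checkpoint files ended up in `./model` and that the `transformers` library is installed; it is not part of the official instructions.

```
# Optional sanity check: load the downloaded checkpoint and confirm it is GPT-J.
# Note: loading the full fp32 checkpoint needs roughly 24 GB of memory.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./model", torch_dtype="auto")
print(model.config.model_type)  # expected: "gptj"
print(model.num_parameters())   # roughly 6 billion parameters for GPT-J-6B
```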


### Running the Benchmark
Replace the model and dataset path arguments with your corresponding paths. To evaluate the ROUGE score after the run, include --accuracy as shown below. For a user-specific target QPS, please include user.conf.