OpenShift AI Caikit+TGIS MLPerf Inference Implementation for Llama2-70b #1

Open: wants to merge 75 commits into base: master

Changes from all 75 commits:

4d0e246
Add TEST01 for stable diffusion XL (#1574)
nvyihengz Jan 23, 2024
901ce67
Fixes for report generation and submission checker for models without…
arjunsuresh Jan 25, 2024
190413d
Add Llama2 checks and log additional values (#1578)
pgmpablo157321 Jan 25, 2024
27ef43a
🔄 synced local 'tools/submission/power/sources_checksums.json' with r…
mlcommons-bot Jan 26, 2024
9b8006f
Fix image list mismatch (#1579)
pgmpablo157321 Jan 26, 2024
180014a
#1558 update llama2 reference fp32 accuracy (#1583)
nvzhihanj Jan 26, 2024
523316e
🔄 synced local 'tools/submission/power/power_checker.py' with remote …
mlcommons-bot Jan 26, 2024
3ad8534
Update the main README.md for 4.0 (#1586)
arjunsuresh Jan 26, 2024
a04b1f5
Ignore trailing whitespace lines in spl.txt files (#1584)
psyhtest Jan 30, 2024
4bdf56f
Add stable diffusion and llama2 to the final spreadsheet (#1589)
arjunsuresh Feb 1, 2024
3a902e5
Add support to dump 10 compliance images during accuracy run for SDXL…
nvyihengz Feb 1, 2024
cc3daae
#1598: fix token and sample logging for Llama2 when accuracy_log_samp…
nvzhihanj Feb 1, 2024
473053f
Fix loadgen token metrics latency constrains (#1596)
pgmpablo157321 Feb 2, 2024
104d855
Add sample length check to test06 (#1603)
pgmpablo157321 Feb 6, 2024
357ccef
Enable equal issue mode for LLM benchmarks (#1610)
nvzhihanj Feb 7, 2024
44285d9
Set completed samples per second as llama metric (#1613)
pgmpablo157321 Feb 7, 2024
d45a66c
Add upper limit to tokens per sample (#1612)
pgmpablo157321 Feb 7, 2024
d7dba08
Remove loadgen warnings (#1608)
pgmpablo157321 Feb 7, 2024
b0777f0
Update README.md - remove unwanted lines in CM commands (#1601)
arjunsuresh Feb 7, 2024
3190d09
Typo fix in README.md (#1588)
arjunsuresh Feb 7, 2024
840435a
Update README.md with CM commands to download stable-diffusion, gptj …
arjunsuresh Feb 7, 2024
817dd96
Turn equal issue mode off for TEST06 (#1615)
nvzhihanj Feb 8, 2024
0ed5190
Fix submission checker and TEST06 for Llama2 (#1616)
nvzhihanj Feb 8, 2024
f06b920
Bugfix: equal-issue mode on offline causing accuracy run to fail (3D-…
nv-jinhosuh Feb 12, 2024
f9a643c
Add number of tokens to offline (#1623)
pgmpablo157321 Feb 12, 2024
486a629
Hotfix: DLRMv2 Audit Test01 fallback failure (#1626)
nv-jinhosuh Feb 15, 2024
de31ee2
Fix preprocess_submission script to copy the code path while inferrin…
arjunsuresh Feb 20, 2024
268bc9d
Add batching to SD reference
pgmpablo157321 Feb 13, 2024
5d0c221
Add Rclone-Cloudflare download instructions to README.md
nathanw-mlc Feb 21, 2024
dc94ae3
Add Rclone-Cloudflare download instructions to README.md
nathanw-mlc Feb 21, 2024
ab747c4
Minor wording edit to README.md
nathanw-mlc Feb 21, 2024
d037f22
Add Rclone-Cloudflare download instructions to README.md
nathanw-mlc Feb 21, 2024
147a91a
Add Rclone-GDrive download instructions to README.md
nathanw-mlc Feb 21, 2024
15d14c9
Add new and old instructions to README.md
nathanw-mlc Feb 21, 2024
46a35c2
Tweak language in README.md
nathanw-mlc Feb 21, 2024
c0bd844
Language tweak in README.md
nathanw-mlc Feb 21, 2024
8219069
Minor language tweak in README.md
nathanw-mlc Feb 21, 2024
e39003a
Fix typo in README.md
nathanw-mlc Feb 23, 2024
396d3f8
TGI support first pass
Maxusmusti Jan 10, 2024
84c9673
Added functional API querying
Maxusmusti Jan 11, 2024
dbab9f0
Changed batch size and str bug fix
Maxusmusti Jan 12, 2024
ffcbc0e
Added v1 offline artifacts
Maxusmusti Jan 15, 2024
86c594e
server scenario first pass
Maxusmusti Jan 16, 2024
8751a35
Functional server scenario
Maxusmusti Jan 17, 2024
6ff4090
Added concurrent request support for server scenario
Maxusmusti Jan 18, 2024
2225a45
Added explicit greedy and updated readme
Maxusmusti Jan 18, 2024
22eb574
Update image to include omitted mlperf conf
Maxusmusti Jan 22, 2024
7bb4c1b
Update for new image version
Maxusmusti Jan 22, 2024
e96c8a6
Updated model serving yamls
Maxusmusti Jan 22, 2024
84475d7
First pass: standalone TGIS, grpc, batching
Maxusmusti Jan 25, 2024
58e16ea
Streaming first pass
Maxusmusti Jan 25, 2024
f35c17e
Updated server impl
Maxusmusti Jan 25, 2024
132f725
Fully functional, updated README
Maxusmusti Jan 25, 2024
9f31f19
Update default client-side batches
Maxusmusti Feb 12, 2024
654dda5
v8 Update
Maxusmusti Feb 12, 2024
c7f699d
GPT-J first pass
Maxusmusti Feb 15, 2024
081024f
Offline functional, now testing server
Maxusmusti Feb 15, 2024
43afdea
v1 full implementation
Maxusmusti Feb 15, 2024
f8cc5ba
Update README for gpt-j
Maxusmusti Feb 16, 2024
8ae2bf4
First pass multi-endpoint
Maxusmusti Feb 16, 2024
0ad354e
Full multiple endpoint support
Maxusmusti Feb 19, 2024
6b117e0
Random gpt-j vllm experimental bits
Maxusmusti Feb 21, 2024
ebf0710
Change file names
Maxusmusti Feb 21, 2024
a357cf4
Added vllm server + multi-endpoint for gpt-j
Maxusmusti Feb 22, 2024
a505e83
Minor adjustments
Maxusmusti Feb 22, 2024
ef6b3db
Updated for exact values
Maxusmusti Feb 22, 2024
230d495
Update llama-2 with vllm
Maxusmusti Feb 27, 2024
48a4396
Fixed output cap bug
Maxusmusti Feb 27, 2024
57e241d
Fix llama server bug
Maxusmusti Feb 27, 2024
6ef5023
Added v10 image for llama
Maxusmusti Feb 27, 2024
3bc09fa
Updated token gen count for offline llama
Maxusmusti Feb 28, 2024
38e3aea
Updated READMEs
Maxusmusti Feb 28, 2024
84f1aac
Updated server first token inclusion in first query
Maxusmusti Feb 29, 2024
c7eeef9
Updated image in yaml
Maxusmusti Feb 29, 2024
8ab5998
Fix first token dtype
Maxusmusti Feb 29, 2024
18 changes: 17 additions & 1 deletion README.md
@@ -15,7 +15,23 @@ Please see the [MLPerf Inference benchmark paper](https://arxiv.org/abs/1911.025
```
## MLPerf Inference v4.0 (submission deadline February 23, 2024)

Code freeze coming soon...
An extra one-week extension is allowed only for llama2-70b submissions. For submissions, please use the master branch and any commit since the [4.0 seed release](https://github.com/mlcommons/inference/commit/8e36925bd36a503e39fcbbc488e9e46126f079ed), although it is best to use the latest commit. The v4.0 tag will be created from the master branch after the results are published.

For power submissions, please use [SPEC PTD 1.10](https://github.com/mlcommons/power/tree/main/inference_v1.0) (needs special access) and any commit of the power-dev repository after the [code freeze](https://github.com/mlcommons/power-dev/commit/4e026f43481f46ad57d2464d28924018444b0428).

| model | reference app | framework | dataset | category |
| ---- | ---- | ---- | ---- | ---- |
| resnet50-v1.5 | [vision/classification_and_detection](https://github.com/mlcommons/inference/tree/master/vision/classification_and_detection) | tensorflow, onnx, tvm, ncnn | imagenet2012 | edge,datacenter |
| retinanet 800x800 | [vision/classification_and_detection](https://github.com/mlcommons/inference/tree/master/vision/classification_and_detection) | pytorch, onnx | openimages resized to 800x800| edge,datacenter |
| bert | [language/bert](https://github.com/mlcommons/inference/tree/master/language/bert) | tensorflow, pytorch, onnx | squad-1.1 | edge,datacenter |
| dlrm-v2 | [recommendation/dlrm_v2](https://github.com/mlcommons/inference/tree/master/recommendation/dlrm_v2/pytorch) | pytorch | Multihot Criteo Terabyte | datacenter |
| 3d-unet | [vision/medical_imaging/3d-unet-kits19](https://github.com/mlcommons/inference/tree/master/vision/medical_imaging/3d-unet-kits19) | pytorch, tensorflow, onnx | KiTS19 | edge,datacenter |
| rnnt | [speech_recognition/rnnt](https://github.com/mlcommons/inference/tree/master/speech_recognition/rnnt) | pytorch | OpenSLR LibriSpeech Corpus | edge,datacenter |
| gpt-j | [language/gpt-j](https://github.com/mlcommons/inference/tree/master/language/gpt-j)| pytorch | CNN-Daily Mail | edge,datacenter |
| stable-diffusion-xl | [text_to_image](https://github.com/mlcommons/inference/tree/master/text_to_image) | pytorch | COCO 2014| edge,datacenter |
| llama2-70b | [language/llama2-70b](https://github.com/mlcommons/inference/tree/master/language/llama2-70b) | pytorch | OpenOrca | datacenter |

* The framework listed here is the one used by the reference implementation; submitters are free to use their own frameworks to run the benchmark.

## MLPerf Inference v3.1 (submission August 18, 2023)
Please use [v3.1 tag](https://github.com/mlcommons/inference/releases/tag/v3.1) (```git checkout v3.1```) if you would like to reproduce the v3.1 results.
9 changes: 9 additions & 0 deletions compliance/nvidia/TEST01/stable-diffusion-xl/audit.config
@@ -0,0 +1,9 @@
# The format of this config file is 'key = value'.
# The key has the format 'model.scenario.key'. Value is mostly int64_t.
# Model may be '*' as a wildcard. In that case the value applies to all models.
# All times are in milliseconds

# mode dictionary (0 = submission, 1 = accuracy, 2 = performance, 3 = find peak perf)
*.*.mode = 2
*.*.accuracy_log_rng_seed = 720381539243781796
*.*.accuracy_log_sampling_target = 128
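
The `model.scenario.key` convention with `*` wildcards can be illustrated with a small, hypothetical lookup helper. This is only a sketch of the rule described in the config comments; LoadGen's own parser is the authoritative implementation, and the precedence of specific entries over wildcards is an assumption for illustration.

```
# Hypothetical sketch of 'model.scenario.key' resolution: '*' acts as a wildcard,
# and a specific model/scenario entry is tried before the wildcard one.
def lookup(config, model, scenario, key):
    for m in (model, "*"):
        for s in (scenario, "*"):
            value = config.get((m, s, key))
            if value is not None:
                return value
    return None

# Entries mirroring the audit.config above.
config = {
    ("*", "*", "mode"): 2,  # 2 = performance mode
    ("*", "*", "accuracy_log_rng_seed"): 720381539243781796,
    ("*", "*", "accuracy_log_sampling_target"): 128,
}

print(lookup(config, "stable-diffusion-xl", "Offline", "accuracy_log_sampling_target"))  # 128
```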
5 changes: 4 additions & 1 deletion compliance/nvidia/TEST06/README.md
@@ -8,9 +8,10 @@ This repository provides the config files and scripts to run and verify TEST 06

## Introduction

The purpose of this test is to ensure the consistency of the output of the Llama2 model and avoid a potential EOS exploit. This test will make a performance run, with a limit of 100 samples and logging them into `mlperf_log_accuracy.json`. To achieve a passing result in this test, two criteria must be met:
The purpose of this test is to ensure the consistency of the Llama2 model's output and to avoid a potential EOS exploit. This test makes a performance run with a limit of 100 samples, logging them into `mlperf_log_accuracy.json`. To achieve a passing result in this test, three criteria must be met:
- In the case that the first token is reported independently (not applicable to the Offline scenario), it should match the first token of the model output for every query.
- For each query, the model output should end with zero or one EOS token (see the sketch below).
- The number of reported tokens should match the length of the model's output.
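
A minimal sketch of the EOS criterion follows. It is an illustration only, not the reference check in `run_verification.py`, and it assumes outputs are lists of integer token ids and that the Llama2 EOS token id is 2.

```
# Sketch of the EOS criterion: an output may end with zero or one EOS token.
# Assumptions: integer token ids, Llama2 EOS token id 2.
def ends_with_at_most_one_eos(output_tokens, eos_token_id=2):
    trailing_eos = 0
    for tok in reversed(output_tokens):
        if tok != eos_token_id:
            break
        trailing_eos += 1
    return trailing_eos <= 1

assert ends_with_at_most_one_eos([450, 4996, 2])         # single EOS: passes
assert not ends_with_at_most_one_eos([450, 4996, 2, 2])  # EOS padding: fails
```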

## Requisites

@@ -36,6 +37,7 @@ Expected output
```
First token check pass: True
EOS check pass: True
Sample length check pass: True
TEST06 verification complete
```

@@ -44,5 +46,6 @@ Or:
```
First token check pass: Skipped
EOS check pass: True
Sample length check pass: True
TEST06 verification complete
```
4 changes: 3 additions & 1 deletion compliance/nvidia/TEST06/audit.config
@@ -8,4 +8,6 @@
*.*.accuracy_log_rng_seed = 720381539243781796
*.*.accuracy_log_sampling_target = 100
*.*.min_query_count = 100
*.*.min_duration = 0
*.*.min_duration = 0
# Turn off equal issue mode for TEST06
*.*.sample_concatenate_permutation = 0
21 changes: 17 additions & 4 deletions compliance/nvidia/TEST06/run_verification.py
@@ -61,13 +61,21 @@ def first_token_check(acc_data, dtype):
for sample in acc_data:
data = np.frombuffer(bytes.fromhex(sample["data"]), dtype=dtype)
token_data = np.frombuffer(bytes.fromhex(sample["token_data"]), dtype=dtype)
print(token_data)
for t1, t2 in zip(data, token_data):
if t1 != t2:
return False

return True

def sample_len_check(acc_data, dtype):
for sample in acc_data:
data = np.frombuffer(bytes.fromhex(sample["data"]), dtype=dtype)
token_count = int(sample["token_count"])
if len(data) != token_count:
return False
return True


def main():
args = get_args()
accuracy_file = os.path.join(args.compliance_dir, "mlperf_log_accuracy.json")
@@ -90,6 +98,8 @@ def main():
print("Unexpected error occured while doing the first token check")
first_token_pass = False

sample_len_pass = sample_len_check(acc_data, DTYPE_MAP[args.dtype])

# Construct output based on the results of checks
output = ""
# Add first token check
@@ -101,14 +111,17 @@
# Add EOS check
output += f"EOS check pass: {eos_pass}\n"

if eos_pass and first_token_pass:
# Add sample length check
output += f"Sample length check pass: {sample_len_pass}\n"

if eos_pass and first_token_pass and sample_len_pass:
output += "TEST06 verification complete\n"
else:
output += "TEST06 verification failed\n"

# Output test output to console and folder
output_dir = args.output_dir
output_accuracy_dir = os.path.join(args.output_dir, "accuracy")
output_dir = os.path.join(args.output_dir, "TEST06")
output_accuracy_dir = os.path.join(output_dir, "accuracy")

if not os.path.isdir(output_dir):
os.makedirs(output_dir)
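To make the new check concrete, here is a small, self-contained usage sketch of `sample_len_check` with a synthetic accuracy-log entry. The `data` and `token_count` field names mirror the diff above; the token values and the int64 dtype are illustrative assumptions.

```
# Usage sketch of sample_len_check with a synthetic accuracy-log entry.
import numpy as np

def sample_len_check(acc_data, dtype):
    for sample in acc_data:
        data = np.frombuffer(bytes.fromhex(sample["data"]), dtype=dtype)
        if len(data) != int(sample["token_count"]):
            return False
    return True

tokens = np.array([450, 4996, 3974, 2], dtype=np.int64)
entry = {"data": tokens.tobytes().hex(), "token_count": "4"}
print(sample_len_check([entry], np.int64))                          # True
print(sample_len_check([{**entry, "token_count": "5"}], np.int64))  # False: count mismatch
```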
4 changes: 0 additions & 4 deletions language/bert/README.md
@@ -51,8 +51,6 @@ The below CM command will launch the SUT server

```
cm run script --tags=generate-run-cmds,inference --model=bert-99 --backend=pytorch \
--rerun --adr.mlperf-implementation.version=custom \
--adr.mlperf-implementation.tags=_repo.https://github.com/GATEOVerflow/inference \
--mode=performance --device=cuda --quiet --test_query_count=1000 --network=sut
```

@@ -61,8 +59,6 @@ Once the SUT server is launched, the below command can be run on the loadgen node

```
cm run script --tags=generate-run-cmds,inference --model=bert-99 --backend=pytorch --rerun \
--adr.mlperf-implementation.version=custom \
--adr.mlperf-implementation.tags=_repo.https://github.com/GATEOVerflow/inference \
--mode=performance --device=cuda --quiet --test_query_count=1000 \
--sut_servers,=http://localhost:8000 --network=lon
```
30 changes: 28 additions & 2 deletions language/gpt-j/README.md
@@ -68,11 +68,37 @@ pip install datasets
python prepare-calibration.py --calibration-list-file calibration-list.txt --output-dir </path/to/output-folder>
```
### Download GPT-J model
Please download the fine-tuned GPT-J checkpoint from [here](https://cloud.mlcommons.org/index.php/s/QAZ2oM94MkFtbQx) and extract it as model/. The download_gptj.py only downloads the default huggingface model which is not fine-tuned on CNN-Daily mail dataset.
Please download the fine-tuned GPT-J checkpoint using the instructions below. The `download_gptj.py` script only downloads the default Hugging Face model, which is not fine-tuned on the CNN-DailyMail dataset.

#### CM method

The following MLCommons CM commands can be used to programmatically download the model checkpoint.

```
pip install cmind
cm pull repo mlcommons@ck
cm run script --tags=get,ml-model,gptj,_pytorch,_rclone -j
```

#### Manual method

The above command automatically runs a set of Rclone commands to download the data from a Cloudflare R2 bucket. However, if you'd like to run the Rclone commands manually, you can do so as follows:

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
wget https://cloud.mlcommons.org/index.php/s/QAZ2oM94MkFtbQx/download --output-document checkpoint.zip
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the model checkpoint:

```
rclone copy mlc-inference:mlcommons-inference-wg-public/gpt-j ./model -P
```
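
If you want to sanity-check the download, a minimal sketch is shown below. It assumes the checkpoint files ended up in `./model` and that the `transformers` library is installed; it is not part of the official instructions.

```
# Optional sanity check: load the downloaded checkpoint and confirm it is GPT-J.
# Note: loading the full fp32 checkpoint needs roughly 24 GB of memory.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./model", torch_dtype="auto")
print(model.config.model_type)  # expected: "gptj"
print(model.num_parameters())   # roughly 6 billion parameters for GPT-J-6B
```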


### Running the Benchmark
Replace the model and dataset path arguments with your corresponding paths. To evaluate the ROUGE score after the run, include --accuracy as shown below. For a user-specific target QPS, please include user.conf.