Commit 4f5a62d

Add mlcommons storage dataset download instructions
Signed-off-by: ShriyaRishab <[email protected]>
1 parent: 142de8b

4 files changed, 42 insertions(+), 23 deletions(-)

README.md
Lines changed: 2 additions & 2 deletions

@@ -25,7 +25,7 @@ Each reference implementation provides the following:
 
 * Code that implements the model in at least one framework.
 * A Dockerfile which can be used to run the benchmark in a container.
-* A script which downloads the appropriate dataset.
+* Instructions to download the appropriate dataset.
 * A script which runs and times training the model.
 * Documentation on the dataset, model, and machine setup.
 

@@ -34,7 +34,7 @@ Each reference implementation provides the following:
 Follow instructions on the Readme of each benchmark. Generally, a benchmark can be run with the following steps:
 
 1. Setup docker & dependencies. There is a shared script (install_cuda_docker.sh) to do this. Some benchmarks will have additional setup, mentioned in their READMEs.
-2. Download the dataset using `./download_dataset.sh`. This should be run outside of docker, on your host machine. This should be run from the directory it is in (it may make assumptions about CWD).
+2. Download the dataset from [mlcommons-storage](https://training.mlcommons-storage.org/index.html). This should be done outside of docker, on your host machine.
 3. Optionally, run `verify_dataset.sh` to ensure the dataset was successfully downloaded.
 4. Build and run the docker image, the command to do this is included with each Benchmark.
 
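Editor's note: the new step 2 replaces a per-benchmark download script with a pointer to the mlcommons-storage index. For orientation, the sketch below is modeled on the MLCommons R2 Downloader command this commit adds to the DLRM README further down; other benchmarks use a different `.uri` metadata file, which must be looked up on https://training.mlcommons-storage.org/index.html.

```bash
# Sketch only: the R2 Downloader invocation added for DLRM in this commit.
# Run it from the directory where you want the dataset to land; swap in the
# .uri metadata file listed for your benchmark on the mlcommons-storage index.
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  https://training.mlcommons-storage.org/metadata/dlrmv2-preprocessed-criteo-click-logs.uri
```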

graph_neural_network/download_data.sh
Lines changed: 2 additions & 0 deletions

@@ -1,5 +1,7 @@
 #!/bin/bash
 
+# Deprecated: Use instructions in https://training.mlcommons-storage.org/index.html to download the dataset.
+
 DATA_DIR="./igbh/full/processed"
 
 # Capture MLCube parameter
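Editor's note: with this script deprecated, the IGBH data is fetched manually, so it helps to confirm the files end up where the script's `DATA_DIR` pointed. A minimal sanity-check sketch, assuming the data should still live under `./igbh/full/processed`:

```bash
#!/bin/bash
# Sketch: check that manually downloaded IGBH data sits where the deprecated
# download_data.sh expected it.
DATA_DIR="./igbh/full/processed"
if [ -d "$DATA_DIR" ] && [ -n "$(ls -A "$DATA_DIR" 2>/dev/null)" ]; then
  echo "IGBH data found in $DATA_DIR"
else
  echo "IGBH data missing; follow https://training.mlcommons-storage.org/index.html" >&2
  exit 1
fi
```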

graph_neural_network/download_igbh_full.sh
Lines changed: 2 additions & 0 deletions

@@ -1,5 +1,7 @@
 #!/bin/bash
 
+# Deprecated: Use instructions in https://training.mlcommons-storage.org/index.html to download the dataset.
+
 #https://github.com/IllinoisGraphBenchmark/IGB-Datasets/blob/main/igb/download_igbh600m.sh
 echo "IGBH600M download starting"
 cd ../../data/
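Editor's note: both GNN scripts record the deprecation only as a comment. Purely as an illustration (not something this commit adds), a run-time notice in the same spirit could sit near the top of either script:

```bash
# Illustration only: warn anyone who still runs the deprecated script and
# point them at the new download instructions.
echo "WARNING: this script is deprecated." >&2
echo "Download the dataset via https://training.mlcommons-storage.org/index.html instead." >&2
```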

recommendation_v2/torchrec_dlrm/README.MD
Lines changed: 36 additions & 21 deletions

@@ -138,32 +138,16 @@ A module which can be used for DLRM inference exists [here](https://github.com/p
 
 # Running the MLPerf DLRM v2 benchmark
 
-## Create the synthetic multi-hot dataset
-### Step 1: Download and uncompressing the [Criteo 1TB Click Logs dataset](https://ailab.criteo.com/ressources/criteo-1tb-click-logs-dataset-for-mlperf)
-
-### Step 2: Run the 1TB Criteo Preprocess script.
-Example usage:
+## Download the preprocessed Criteo multi-hot dataset
+You can download this data from the bucket using the [MLCommons R2 Downloader](https://github.com/mlcommons/r2-downloader). Navigate in the terminal to your desired download directory and run the following command to download the dataset. More information about the MLCommons R2 Downloader, including how to run it on Windows, can be found [here](https://training.mlcommons-storage.org).
 
-```
-bash ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh \
-./criteo_1tb/raw_input_dataset_dir \
-./criteo_1tb/temp_intermediate_files_dir \
-./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir
-```
+```bash
 
-The script requires 700GB of RAM and takes 1-2 days to run. MD5 checksums for the output dataset files are in md5sums_preprocessed_criteo_click_logs_dataset.txt.
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/dlrmv2-preprocessed-criteo-click-logs.uri
 
-### Step 3: Run the `materialize_synthetic_multihot_dataset.py` script
-```
-python materialize_synthetic_multihot_dataset.py \
---in_memory_binary_criteo_path $PREPROCESSED_CRITEO_1TB_CLICK_LOGS_DATASET_PATH \
---output_path $MATERIALIZED_DATASET_PATH \
---num_embeddings_per_feature 40000000,39060,17295,7424,20265,3,7122,1543,63,40000000,3067956,405282,10,2209,11938,155,4,976,14,40000000,40000000,40000000,590152,12973,108,36 \
---multi_hot_sizes 3,2,1,2,6,1,1,1,1,7,3,8,1,6,9,5,1,1,1,12,100,27,10,3,1,1 \
---multi_hot_distribution_type uniform
 ```
 
-### Step 4: Run the MLPerf DLRM v2 benchmark, which uses the materialized multi-hot dataset
+## Run the MLPerf DLRM v2 benchmark, which uses the materialized multi-hot dataset
 Example running 8 GPUs:
 ```
 export TOTAL_TRAINING_SAMPLES=4195197692 ;

@@ -214,3 +198,34 @@ torchx run -s local_cwd dist.ddp -j 1x8 --script dlrm_main.py -- \
 --learning_rate 0.005 \
 --multi_hot_distribution_type uniform \
 --multi_hot_sizes=3,2,1,2,6,1,1,1,1,7,3,8,1,6,9,5,1,1,1,12,100,27,10,3,1,1
+```
+
+# Appendix
+
+## Steps to create the synthetic multi-hot dataset
+
+This is not required if you follow the instructions above to download the preprocessed dataset from [mlcommons-storage](https://training.mlcommons-storage.org).
+
+### Step 1: Download and uncompress the [Criteo 1TB Click Logs dataset](https://ailab.criteo.com/ressources/criteo-1tb-click-logs-dataset-for-mlperf)
+
+### Step 2: Run the 1TB Criteo Preprocess script.
+Example usage:
+
+```
+bash ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh \
+./criteo_1tb/raw_input_dataset_dir \
+./criteo_1tb/temp_intermediate_files_dir \
+./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir
+```
+
+The script requires 700GB of RAM and takes 1-2 days to run. MD5 checksums for the output dataset files are in md5sums_preprocessed_criteo_click_logs_dataset.txt.
+
+### Step 3: Run the `materialize_synthetic_multihot_dataset.py` script
+```
+python materialize_synthetic_multihot_dataset.py \
+--in_memory_binary_criteo_path $PREPROCESSED_CRITEO_1TB_CLICK_LOGS_DATASET_PATH \
+--output_path $MATERIALIZED_DATASET_PATH \
+--num_embeddings_per_feature 40000000,39060,17295,7424,20265,3,7122,1543,63,40000000,3067956,405282,10,2209,11938,155,4,976,14,40000000,40000000,40000000,590152,12973,108,36 \
+--multi_hot_sizes 3,2,1,2,6,1,1,1,1,7,3,8,1,6,9,5,1,1,1,12,100,27,10,3,1,1 \
+--multi_hot_distribution_type uniform
+```
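Editor's note: the appendix says MD5 checksums for the preprocessed dataset files live in md5sums_preprocessed_criteo_click_logs_dataset.txt. Assuming that file uses the standard md5sum list format and the data files sit in the current directory (check both before relying on this), a quick integrity check could look like:

```bash
# Sketch: verify downloaded or freshly preprocessed Criteo files against the
# checksum list from the torchrec_dlrm directory. Run from the directory that
# contains the dataset files; adjust the path to the checksum file as needed.
md5sum -c md5sums_preprocessed_criteo_click_logs_dataset.txt
```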
