Commit 4f5a62d

Add mlcommons storage dataset download instructions
Signed-off-by: ShriyaRishab <[email protected]>
1 parent: 142de8b

4 files changed, 42 insertions(+), 23 deletions(-)

README.md
Lines changed: 2 additions & 2 deletions

@@ -25,7 +25,7 @@ Each reference implementation provides the following:
 
 * Code that implements the model in at least one framework.
 * A Dockerfile which can be used to run the benchmark in a container.
-* A script which downloads the appropriate dataset.
+* Instructions to download the appropriate dataset.
 * A script which runs and times training the model.
 * Documentation on the dataset, model, and machine setup.
 

@@ -34,7 +34,7 @@ Each reference implementation provides the following:
 Follow instructions on the Readme of each benchmark. Generally, a benchmark can be run with the following steps:
 
 1. Setup docker & dependencies. There is a shared script (install_cuda_docker.sh) to do this. Some benchmarks will have additional setup, mentioned in their READMEs.
-2. Download the dataset using `./download_dataset.sh`. This should be run outside of docker, on your host machine. This should be run from the directory it is in (it may make assumptions about CWD).
+2. Download the dataset from [mlcommons-storage](https://training.mlcommons-storage.org/index.html). This should be done outside of docker, on your host machine.
 3. Optionally, run `verify_dataset.sh` to ensure the dataset was successfully downloaded.
 4. Build and run the docker image, the command to do this is included with each Benchmark.
 
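Editor's note: the new step 2 replaces a per-benchmark download script with a pointer to the mlcommons-storage index. For orientation, the sketch below is modeled on the MLCommons R2 Downloader command this commit adds to the DLRM README further down; other benchmarks use a different `.uri` metadata file, which must be looked up on https://training.mlcommons-storage.org/index.html.

```bash
# Sketch only: the R2 Downloader invocation added for DLRM in this commit.
# Run it from the directory where you want the dataset to land; swap in the
# .uri metadata file listed for your benchmark on the mlcommons-storage index.
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  https://training.mlcommons-storage.org/metadata/dlrmv2-preprocessed-criteo-click-logs.uri
```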

graph_neural_network/download_data.sh
Lines changed: 2 additions & 0 deletions

@@ -1,5 +1,7 @@
 #!/bin/bash
 
+# Deprecated: Use instructions in https://training.mlcommons-storage.org/index.html to download the dataset.
+
 DATA_DIR="./igbh/full/processed"
 
 # Capture MLCube parameter
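Editor's note: with this script deprecated, the IGBH data is fetched manually, so it helps to confirm the files end up where the script's `DATA_DIR` pointed. A minimal sanity-check sketch, assuming the data should still live under `./igbh/full/processed`:

```bash
#!/bin/bash
# Sketch: check that manually downloaded IGBH data sits where the deprecated
# download_data.sh expected it.
DATA_DIR="./igbh/full/processed"
if [ -d "$DATA_DIR" ] && [ -n "$(ls -A "$DATA_DIR" 2>/dev/null)" ]; then
  echo "IGBH data found in $DATA_DIR"
else
  echo "IGBH data missing; follow https://training.mlcommons-storage.org/index.html" >&2
  exit 1
fi
```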

graph_neural_network/download_igbh_full.sh
Lines changed: 2 additions & 0 deletions

@@ -1,5 +1,7 @@
 #!/bin/bash
 
+# Deprecated: Use instructions in https://training.mlcommons-storage.org/index.html to download the dataset.
+
 #https://github.com/IllinoisGraphBenchmark/IGB-Datasets/blob/main/igb/download_igbh600m.sh
 echo "IGBH600M download starting"
 cd ../../data/
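Editor's note: both GNN scripts record the deprecation only as a comment. Purely as an illustration (not something this commit adds), a run-time notice in the same spirit could sit near the top of either script:

```bash
# Illustration only: warn anyone who still runs the deprecated script and
# point them at the new download instructions.
echo "WARNING: this script is deprecated." >&2
echo "Download the dataset via https://training.mlcommons-storage.org/index.html instead." >&2
```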

recommendation_v2/torchrec_dlrm/README.MD
Lines changed: 36 additions & 21 deletions

@@ -138,32 +138,16 @@ A module which can be used for DLRM inference exists [here](https://github.com/p
 
 # Running the MLPerf DLRM v2 benchmark
 
-## Create the synthetic multi-hot dataset
-### Step 1: Download and uncompressing the [Criteo 1TB Click Logs dataset](https://ailab.criteo.com/ressources/criteo-1tb-click-logs-dataset-for-mlperf)
-
-### Step 2: Run the 1TB Criteo Preprocess script.
-Example usage:
+## Download the preprocessed Criteo multi-hot dataset
+You can download this data from the bucket using the [MLCommons R2 Downloader](https://github.com/mlcommons/r2-downloader). Navigate in the terminal to your desired download directory and run the following command to download the dataset. More information about the MLCommons R2 Downloader, including how to run it on Windows, can be found [here](https://training.mlcommons-storage.org).
 
-```
-bash ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh \
-./criteo_1tb/raw_input_dataset_dir \
-./criteo_1tb/temp_intermediate_files_dir \
-./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir
-```
+```bash
 
-The script requires 700GB of RAM and takes 1-2 days to run. MD5 checksums for the output dataset files are in md5sums_preprocessed_criteo_click_logs_dataset.txt.
+bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://training.mlcommons-storage.org/metadata/dlrmv2-preprocessed-criteo-click-logs.uri
 
-### Step 3: Run the `materialize_synthetic_multihot_dataset.py` script
-```
-python materialize_synthetic_multihot_dataset.py \
---in_memory_binary_criteo_path $PREPROCESSED_CRITEO_1TB_CLICK_LOGS_DATASET_PATH \
---output_path $MATERIALIZED_DATASET_PATH \
---num_embeddings_per_feature 40000000,39060,17295,7424,20265,3,7122,1543,63,40000000,3067956,405282,10,2209,11938,155,4,976,14,40000000,40000000,40000000,590152,12973,108,36 \
---multi_hot_sizes 3,2,1,2,6,1,1,1,1,7,3,8,1,6,9,5,1,1,1,12,100,27,10,3,1,1 \
---multi_hot_distribution_type uniform
 ```
 
-### Step 4: Run the MLPerf DLRM v2 benchmark, which uses the materialized multi-hot dataset
+## Run the MLPerf DLRM v2 benchmark, which uses the materialized multi-hot dataset
 Example running 8 GPUs:
 ```
 export TOTAL_TRAINING_SAMPLES=4195197692 ;

@@ -214,3 +198,34 @@ torchx run -s local_cwd dist.ddp -j 1x8 --script dlrm_main.py -- \
 --learning_rate 0.005 \
 --multi_hot_distribution_type uniform \
 --multi_hot_sizes=3,2,1,2,6,1,1,1,1,7,3,8,1,6,9,5,1,1,1,12,100,27,10,3,1,1
+```
+
+# Appendix
+
+## Steps to create the synthetic multi-hot dataset
+
+This is not required if you follow the instructions above to download the preprocessed dataset from [mlcommons-storage](https://training.mlcommons-storage.org).
+
+### Step 1: Download and uncompress the [Criteo 1TB Click Logs dataset](https://ailab.criteo.com/ressources/criteo-1tb-click-logs-dataset-for-mlperf)
+
+### Step 2: Run the 1TB Criteo Preprocess script.
+Example usage:
+
+```
+bash ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh \
+./criteo_1tb/raw_input_dataset_dir \
+./criteo_1tb/temp_intermediate_files_dir \
+./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir
+```
+
+The script requires 700GB of RAM and takes 1-2 days to run. MD5 checksums for the output dataset files are in md5sums_preprocessed_criteo_click_logs_dataset.txt.
+
+### Step 3: Run the `materialize_synthetic_multihot_dataset.py` script
+```
+python materialize_synthetic_multihot_dataset.py \
+--in_memory_binary_criteo_path $PREPROCESSED_CRITEO_1TB_CLICK_LOGS_DATASET_PATH \
+--output_path $MATERIALIZED_DATASET_PATH \
+--num_embeddings_per_feature 40000000,39060,17295,7424,20265,3,7122,1543,63,40000000,3067956,405282,10,2209,11938,155,4,976,14,40000000,40000000,40000000,590152,12973,108,36 \
+--multi_hot_sizes 3,2,1,2,6,1,1,1,1,7,3,8,1,6,9,5,1,1,1,12,100,27,10,3,1,1 \
+--multi_hot_distribution_type uniform
+```
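Editor's note: the appendix says MD5 checksums for the preprocessed dataset files live in md5sums_preprocessed_criteo_click_logs_dataset.txt. Assuming that file uses the standard md5sum list format and the data files sit in the current directory (check both before relying on this), a quick integrity check could look like:

```bash
# Sketch: verify downloaded or freshly preprocessed Criteo files against the
# checksum list from the torchrec_dlrm directory. Run from the directory that
# contains the dataset files; adjust the path to the checksum file as needed.
md5sum -c md5sums_preprocessed_criteo_click_logs_dataset.txt
```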
