README.md (2 additions & 2 deletions)
@@ -25,7 +25,7 @@ Each reference implementation provides the following:
* Code that implements the model in at least one framework.
* A Dockerfile which can be used to run the benchmark in a container.
-* A script which downloads the appropriate dataset.
+* Instructions to download the appropriate dataset.
* A script which runs and times training the model.
* Documentation on the dataset, model, and machine setup.
@@ -34,7 +34,7 @@ Each reference implementation provides the following:
Follow the instructions in the README of each benchmark. Generally, a benchmark can be run with the following steps:
1. Set up docker & dependencies. There is a shared script (`install_cuda_docker.sh`) to do this. Some benchmarks have additional setup, mentioned in their READMEs.
-2. Download the dataset using `./download_dataset.sh`. This should be run outside of docker, on your host machine, from the directory it is in (it may make assumptions about CWD).
+2. Download the dataset from [mlcommons-storage](https://training.mlcommons-storage.org/index.html). This should be done outside of docker, on your host machine, from the target download directory (the commands may make assumptions about CWD).
3. Optionally, run `verify_dataset.sh` to ensure the dataset was successfully downloaded.
4. Build and run the docker image; the command to do this is included with each benchmark.
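
Assembled from the steps above, a generic run looks roughly like the sketch below. Only the two shared script names come from this README; the image name, tag, and docker flags are placeholders, not commands from any benchmark.

```bash
# Illustrative flow only; anything marked <...> is an assumption.
./install_cuda_docker.sh                   # step 1: shared docker & CUDA setup
# step 2: download the dataset per https://training.mlcommons-storage.org/index.html
./verify_dataset.sh                        # step 3: optional integrity check
docker build -t mlperf/<benchmark> .       # step 4: build the image...
docker run --gpus all -v "$(pwd):/data" mlperf/<benchmark>   # ...and run it
```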
recommendation_v2/torchrec_dlrm/README.MD (36 additions & 21 deletions)
@@ -138,32 +138,16 @@ A module which can be used for DLRM inference exists [here](https://github.com/p
# Running the MLPerf DLRM v2 benchmark
-## Create the synthetic multi-hot dataset
-### Step 1: Download and uncompressing the [Criteo 1TB Click Logs dataset](https://ailab.criteo.com/ressources/criteo-1tb-click-logs-dataset-for-mlperf)
-
-### Step 2: Run the 1TB Criteo Preprocess script.
-Example usage:
+## Download the preprocessed Criteo multi-hot dataset
+You can download this data from the bucket using the [MLCommons R2 Downloader](https://github.com/mlcommons/r2-downloader). Navigate in the terminal to your desired download directory and run the following commands to download the dataset. More information about the MLCommons R2 Downloader, including how to run it on Windows, can be found [here](https://training.mlcommons-storage.org).
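
The download commands themselves are truncated from this hunk. As an unverified sketch of what such an invocation can look like (the downloader script path and the `<dataset-URI>` placeholder are assumptions, not taken from this diff; the authoritative commands are printed on the dataset's page at training.mlcommons-storage.org):

```bash
# Sketch only; the fetch path and URI below are assumptions.
mkdir -p criteo_multihot && cd criteo_multihot
curl -sLO https://raw.githubusercontent.com/mlcommons/r2-downloader/main/mlc-r2-downloader.sh
bash mlc-r2-downloader.sh <dataset-URI>    # URI is listed on the download page
```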
The script requires 700GB of RAM and takes 1-2 days to run. MD5 checksums for the output dataset files are in md5sums_preprocessed_criteo_click_logs_dataset.txt.
## Steps to create the synthetic multi-hot dataset
+
+This is not required if you follow the instructions above to download the preprocessed dataset from [mlcommons-storage](https://training.mlcommons-storage.org).
+
+### Step 1: Download and uncompress the [Criteo 1TB Click Logs dataset](https://ailab.criteo.com/ressources/criteo-1tb-click-logs-dataset-for-mlperf)
The script requires 700GB of RAM and takes 1-2 days to run. MD5 checksums for the output dataset files are in md5sums_preprocessed_criteo_click_logs_dataset.txt.
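
For reference, the checksums in that file can be checked with the standard `md5sum` tool; the sketch below assumes you run it from the directory containing both the checksum file and the output dataset files.

```bash
# Verify the preprocessed output against the shipped checksum list.
md5sum -c md5sums_preprocessed_criteo_click_logs_dataset.txt
```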
+
+
### Step 3: Run the `materialize_synthetic_multihot_dataset.py` script
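
The invocation for Step 3 is not shown in this excerpt. As a hedged sketch of what a run might look like (the argument names are assumptions for illustration, not confirmed against the script, so check its `--help` first):

```bash
# Inspect the script's real interface before running it:
python materialize_synthetic_multihot_dataset.py --help
# Hypothetical run; flag names and paths are placeholders:
python materialize_synthetic_multihot_dataset.py \
    --in_memory_binary_criteo_path <preprocessed_1hot_dir> \
    --output_path <multihot_output_dir>
```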