Added documentation for hybrid logging example from resnet - alcf polaris #11

Merged
merged 5 commits into from
Oct 10, 2023
86 changes: 85 additions & 1 deletion docs/examples.rst
@@ -547,4 +547,88 @@ Example of running this configurations are:
# Set the mode to PRELOAD
export DLIO_PROFILER_INIT=PRELOAD
# Enable profiler
DLIO_PROFILER_ENABLE=1


ResNet50 with PyTorch and torchvision example from ALCF Polaris:
******************************************************************

Create a separate conda environment for the application and install dlio-profiler:

.. code-block:: bash
   :linenos:

   #!/bin/bash +x
   set -e
   set -x
   export MODULEPATH=/soft/modulefiles/conda/:$MODULEPATH
   module load 2023-10-04  # This is the latest conda module on Polaris

   export ML_ENV=$PWD/PolarisAT/conda-envs/ml_workload_latest_conda_2  # Change this path as needed

   if [[ -e "$ML_ENV" ]]; then
       # The environment already exists, so just activate it
       conda activate "$ML_ENV"
   else
       # Clone the base conda environment and install the profiler into it
       conda create -p "$ML_ENV" --clone /soft/datascience/conda/2023-10-04/mconda3/
       conda activate "$ML_ENV"
       yes | MPICC="cc -shared -target-accel=nvidia80" pip install --force-reinstall --no-cache-dir --no-binary=mpi4py mpi4py
       yes | pip install --no-cache-dir git+https://github.com/hariharan-devarajan/dlio-profiler.git
       pip uninstall -y torch horovod
       yes | pip install --no-cache-dir horovod
       # Install any other missing dependencies here
   fi
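
Once the environment is active, a quick check along the following lines can show which packages still have to be installed by hand. This is only a sketch: the module list is an assumption based on the packages named in the setup script and in the application code, and ``missing_modules`` is a hypothetical helper, not part of dlio-profiler.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages used by the example.
    required = ["mpi4py", "dlio_profiler", "horovod", "torch", "torchvision"]
    missing = missing_modules(required)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Running the script inside the activated environment prints the names that still need a ``pip install``.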

Since torchvision.datasets.ImageFolder spawns separate Python processes to parallelize data loading in torch, we use the `HYBRID MODE` of the DLIO Profiler. In this mode the application uses both APP and PRELOAD modes, so that I/O from all dynamically spawned processes is logged together with function-level profiling from the application.

The following dlio_profiler code is added to profile the application at the function level.
Note: the location of the Python-level log file is set inside the Python code, in the dlio_logger.initialize_log() call, whereas the location of the log file for POSIX and STDIO calls is set in the job script through the environment variable `DLIO_PROFILER_LOG_FILE`.

.. code-block:: python
   :linenos:

   ...
   # From the preamble
   import os
   from dlio_profiler.logger import dlio_logger as logger, fn_interceptor as dlp_event_logging
   dlp_pid = os.getpid()
   # One Python-level log file per process, keyed by PID
   log_inst = logger.initialize_log(f"./resnet50/dlio_log_py_level-{dlp_pid}.pfw", "", dlp_pid)
   compute_dlp = dlp_event_logging("Compute")
   io_dlp = dlp_event_logging("IO", name="real_IO")
   ...
   # From the train() function
   for i, (images, target) in io_dlp.iter(enumerate(train_loader)):
       with dlp_event_logging("communication-except-io", name="cpu-gpu-transfer", step=i, epoch=epoch) as transfer:
           images = images.to(device)
           target = target.to(device)
       with dlp_event_logging("compute", name="model-compute-forward-prop", step=i, epoch=epoch) as compute:
           output = model(images)
           loss = criterion(output, target)
       with dlp_event_logging("compute", name="model-compute-backward-prop", step=i, epoch=epoch) as compute:
           acc1, acc5 = accuracy(output, target, topk=(1, 5))
           losses.update(loss.item(), images.size(0))
           top1.update(acc1[0], images.size(0))
           top5.update(acc5[0], images.size(0))

   ...
   # At the end of main function
   log_inst.finalize()
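
To try the instrumentation pattern without the profiler installed, a stand-in along the following lines may help. It is a minimal sketch of the general idea (wrapping each phase of the training loop in a context manager and timing each fetch from the data loader) and does not reproduce dlio-profiler's behavior or its log format. The names ``timed_event``, ``timed_iter`` and ``events`` are hypothetical.

```python
import time
from contextlib import contextmanager

events = []  # collected (category, name, duration_seconds, metadata) records


@contextmanager
def timed_event(category, name, **metadata):
    """Hypothetical stand-in: record how long the enclosed block takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        events.append((category, name, time.perf_counter() - start, metadata))


def timed_iter(category, name, iterable):
    """Hypothetical stand-in: time each fetch of the next item from an iterable."""
    iterator = iter(iterable)
    while True:
        start = time.perf_counter()
        try:
            item = next(iterator)
        except StopIteration:
            return
        events.append((category, name, time.perf_counter() - start, {}))
        yield item


if __name__ == "__main__":
    batches = [[1, 2], [3, 4]]  # placeholder for a real data loader
    for step, batch in timed_iter("IO", "real_IO", enumerate(batches)):
        with timed_event("compute", "forward", step=step):
            total = sum(batch)
    for category, name, duration, metadata in events:
        print(f"{category:8s} {name:10s} {duration:.6f}s {metadata}")
```

Timing the iterator separately from the loop body is what lets the time spent waiting for data be told apart from the time spent computing on it.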

Job submission script:

.. code-block:: bash
   :linenos:

   export MODULEPATH=/soft/modulefiles/conda/:$MODULEPATH
   module load 2023-10-04
   # Path to the conda environment (adjust as needed)
   env_path=./dlio_ml_workloads/PolarisAT/conda-envs/ml_workload_latest_conda
   conda activate $env_path

   export LD_LIBRARY_PATH=$env_path/lib/:$LD_LIBRARY_PATH
   export DLIO_PROFILER_LOG_LEVEL=ERROR
   export DLIO_PROFILER_ENABLE=1
   export DLIO_PROFILER_INC_METADATA=1
   export DLIO_PROFILER_INIT=PRELOAD
   export DLIO_PROFILER_DATA_DIR=./resnet_original_data  # Path to the original ResNet50 dataset
   export DLIO_PROFILER_LOG_FILE=./dlio_log_posix_level.pfw

   LD_PRELOAD=$env_path/lib/libdlio_profiler_preload.so aprun -n 4 -N 4 python resnet_hvd_dlio.py --batch-size 64 --epochs 1 > dlio_log 2>&1

   cat *.pfw > combined_logs.pfw  # Combine the logs into a single pfw file
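
Where a shell one-liner is inconvenient, the final combining step could be done from Python instead. The sketch below assumes, as the ``cat`` command does, that the ``.pfw`` files can simply be concatenated. ``combine_pfw`` is a hypothetical helper, not part of dlio-profiler.

```python
import glob
import os
import shutil


def combine_pfw(log_dir, output_name="combined_logs.pfw"):
    """Concatenate every .pfw file in log_dir into one file; return the inputs used."""
    output_path = os.path.join(log_dir, output_name)
    # Skip the output file itself so that a rerun does not append it to itself.
    inputs = sorted(
        path
        for path in glob.glob(os.path.join(log_dir, "*.pfw"))
        if os.path.abspath(path) != os.path.abspath(output_path)
    )
    with open(output_path, "wb") as combined:
        for path in inputs:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, combined)
    return inputs
```

Calling ``combine_pfw(".")`` in the job directory writes ``combined_logs.pfw`` there and returns the list of files that were merged.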