Skip to content

Commit

Permalink
Merge branch 'dev' into feature/thread_id
Browse files Browse the repository at this point in the history
  • Loading branch information
hariharan-devarajan authored Oct 10, 2023
2 parents cb003e4 + 3c2c45a commit c1e85c7
Showing 1 changed file with 93 additions and 2 deletions.
95 changes: 93 additions & 2 deletions docs/examples.rst
Original file line number Diff line number Diff line change
Expand Up @@ -481,10 +481,11 @@ Example of running this configurations are:
export DLIO_PROFILER_ENABLE=1
.. _python-hybrid-mode:

Hybrid Example:
**************************


.. code-block:: python
:linenos:
Expand Down Expand Up @@ -549,6 +550,93 @@ Example of running this configurations are:
# Enable profiler
DLIO_PROFILER_ENABLE=1
----------------------------------------------------------------
Resnet50 with pytorch and torchvision example from ALCF Polaris:
----------------------------------------------------------------

Create a separate conda environment for the application and install dlio-profiler

.. code-block:: bash
:linenos:
#!/bin/bash +x
set -e
set -x
export MODULEPATH=/soft/modulefiles/conda/:$MODULEPATH
module load 2023-10-04 # This is the latest conda module on Polaris
export ML_ENV=$PWD/PolarisAT/conda-envs/ml_workload_latest_conda_2 # Please change the following path accordingly
if [[ -e $ML_ENV ]]; then
conda activate $ML_ENV
else
conda create -p $ML_ENV --clone /soft/datascience/conda/2023-10-04/mconda3/
conda activate $ML_ENV
yes | MPICC="cc -shared -target-accel=nvidia80" pip install --force-reinstall --no-cache-dir --no-binary=mpi4py mpi4py
yes | pip install --no-cache-dir git+https://github.com/hariharan-devarajan/dlio-profiler.git
pip uninstall -y torch horovod
yes | pip install --no-cache-dir horovod
#INSTALL OTHER MISSING FILES
fi
Since, torchvision.datasets.ImageFolder spawns separate python processes to help the parallel data loading in torch, we will be using the `HYBRID MODE` of the DLIO Profiler (e.g., see
:ref:`Python Hybrid mode <python-hybrid-mode>`), so that the application can use both APP and PRELOAD Mode to log I/O from all dynamically spawned processes and function profiling from application.

The following dlio_profiler code is added to profile the application at the function level.
Note: dlio-profiler python level log file location is provided inside the python code in the dlio_logger.initialize_log() function and the POSIX or STDIO calls level log file location is provided in the job scirpt environment variable `DLIO_PROFILER_LOG_FILE`

.. code-block:: python
:linenos:
...
# From the preamble
from dlio_profiler.logger import dlio_logger as logger, fn_interceptor as dlp_event_logging
dlp_pid=os.getpid()
log_inst=logger.initialize_log(f"./resnet50/dlio_log_py_level-{dlp_pid}.pfw", "", dlp_pid)
compute_dlp = dlp_event_logging("Compute")
io_dlp = dlp_event_logging("IO", name="real_IO")
...
# From the train() function
for i, (images, target) in io_dlp.iter(enumerate(train_loader)):
with dlp_event_logging("communication-except-io", name="cpu-gpu-transfer", step=i, epoch=epoch) as transfer:
images = images.to(device)
target = target.to(device)
with dlp_event_logging("compute", name="model-compute-forward-prop", step=i, epoch=epoch) as compute:
output = model(images)
loss = criterion(output, target)
with dlp_event_logging("compute", name="model-compute-backward-prop", step=i, epoch=epoch) as compute:
acc1, acc5 = accuracy(output, target, topk=(1, 5))
losses.update(loss.item(), images.size(0))
top1.update(acc1[0], images.size(0))
top5.update(acc5[0], images.size(0))
...
# At the end of main function
log_inst.finalize()
Job submition script

.. code-block:: bash
:linenos:
export MODULEPATH=/soft/modulefiles/conda/:$MODULEPATH
module load 2023-10-04
conda activate./dlio_ml_workloads/PolarisAT/conda-envs/ml_workload_latest_conda
export LD_LIBRARY_PATH=$env_path/lib/:$LD_LIBRARY_PATH
export DLIO_PROFILER_LOG_LEVEL=ERROR
export DLIO_PROFILER_ENABLE=1
export DLIO_PROFILER_INC_METADATA=1
export DLIO_PROFILER_INIT=PRELOAD
export DLIO_PROFILER_DATA_DIR=./resnet_original_data #Path to the orignal resnet 50 dataset
export DLIO_PROFILER_LOG_FILE=./dlio_log_posix_level.pfw
LD_PRELOAD=./dlio_ml_workloads/PolarisAT/conda-envs/ml_workload_latest_conda/lib/libdlio_profiler_preload.so aprun -n 4 -N 4 python resnet_hvd_dlio.py --batch-size 64 --epochs 1 > dlio_log 2>&1
cat *.pfw > combined_logs.pfw # To combine to a single pfw file.
***********************
Integrated Applications
***********************
Expand All @@ -557,6 +645,7 @@ Here is the list applications that currently use DLIO Profiler.

1. `DLIO Benchmark <https://github.com/argonne-lcf/dlio_benchmark>`_
2. MuMMI
3. Resnet50 with pytorch and torchvision

****************************
Example Chrome Tracing Plots
Expand All @@ -567,4 +656,6 @@ Here, we can see that we can get application level calls (e.g., ``train`` and ``

.. image:: images/tracing/trace.png
:width: 400
:alt: Unet3D applications
:alt: Unet3D applications


0 comments on commit c1e85c7

Please sign in to comment.