diff --git a/starter/ProfilerReports/benchmark/profiler-report.html b/starter/ProfilerReports/benchmark/profiler-report.html
new file mode 100644
index 00000000..cdc457ba
--- /dev/null
+++ b/starter/ProfilerReports/benchmark/profiler-report.html
@@ -0,0 +1,15030 @@
+SageMaker Debugger auto generated this report. You can generate similar reports on all supported training jobs. The report provides summary of training job, system resource usage statistics, framework metrics, rules summary, and detailed analysis from each rule. The graphs and tables are interactive.
+Legal disclaimer: This report and any recommendations are provided for informational purposes only and are not definitive. You are responsible for making your own independent assessment of the information.
+# Parameters
+processing_job_arn = "arn:aws:sagemaker:us-east-1:598348623909:processing-job/pytorch-training-2023-04-1-profilerreport-5f46c0a2"
+
Rule | Description | Recommendation | Number of times rule triggered | Number of datapoints | Rule parameters
---|---|---|---|---|---
LowGPUUtilization | Checks if the GPU utilization is low or fluctuating. This can happen due to bottlenecks, blocking calls for synchronizations, or a small batch size. | Check if there are bottlenecks, minimize blocking calls, change distributed training strategy, or increase the batch size. | 14 | 1751 | threshold_p95:70 threshold_p5:10 window:500 patience:1000
BatchSize | Checks if GPUs are underutilized because the batch size is too small. To detect this problem, the rule analyzes the average GPU memory footprint, the CPU and the GPU utilization. | The batch size is too small, and GPUs are underutilized. Consider running on a smaller instance type or increasing the batch size. | 14 | 1750 | cpu_threshold_p95:70 gpu_threshold_p95:70 gpu_memory_threshold_p95:70 patience:1000 window:500
Dataloader | Checks how many data loaders are running in parallel and whether the total number is equal the number of available CPU cores. The rule triggers if number is much smaller or larger than the number of available cores. If too small, it might lead to low GPU utilization. If too large, it might impact other compute intensive operations on CPU. | Change the number of data loader processes. | 1 | 8373 | min_threshold:70 max_threshold:200
MaxInitializationTime | Checks if the time spent on initialization exceeds a threshold percent of the total training time. The rule waits until the first step of training loop starts. The initialization can take longer if downloading the entire dataset from Amazon S3 in File mode. The default threshold is 20 minutes. | Initialization takes too long. If using File mode, consider switching to Pipe mode in case you are using TensorFlow framework. | 0 | 0 | threshold:20
GPUMemoryIncrease | Measures the average GPU memory footprint and triggers if there is a large increase. | Choose a larger instance type with more memory if footprint is close to maximum available memory. | 0 | 1751 | increase:5 patience:1000 window:10
CPUBottleneck | Checks if the CPU utilization is high and the GPU utilization is low. It might indicate CPU bottlenecks, where the GPUs are waiting for data to arrive from the CPUs. The rule evaluates the CPU and GPU utilization rates, and triggers the issue if the time spent on the CPU bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Consider increasing the number of data loaders or applying data pre-fetching. | 0 | 3514 | threshold:50 cpu_threshold:90 gpu_threshold:10 patience:1000
LoadBalancing | Detects workload balancing issues across GPUs. Workload imbalance can occur in training jobs with data parallelism. The gradients are accumulated on a primary GPU, and this GPU might be overused with regard to other GPUs, resulting in reducing the efficiency of data parallelization. | Choose a different distributed training strategy or a different distributed training framework. | 0 | 1751 | threshold:0.2 patience:1000
IOBottleneck | Checks if the data I/O wait time is high and the GPU utilization is low. It might indicate IO bottlenecks where GPU is waiting for data to arrive from storage. The rule evaluates the I/O and GPU utilization rates and triggers the issue if the time spent on the IO bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Pre-fetch data or choose different file formats, such as binary formats that improve I/O performance. | 0 | 3514 | threshold:50 io_threshold:50 gpu_threshold:10 patience:1000
StepOutlier | Detects outliers in step duration. The step duration for forward and backward pass should be roughly the same throughout the training. If there are significant outliers, it may indicate a system stall or bottleneck issues. | Check if there are any bottlenecks (CPU, I/O) correlated to the step outliers. | 0 | 0 | threshold:3 mode:None n_outliers:10 stddev:3
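The LowGPUUtilization parameters in the table above (`threshold_p95:70`, `threshold_p5:10`, `window:500`, `patience:1000`) suggest a percentile check over windows of utilization samples. The following is a minimal sketch of that idea, not SageMaker Debugger's actual implementation. The function names, the nearest-rank percentile, the non-overlapping windows, and the reading of `patience` as a number of leading datapoints to skip are all assumptions made for illustration.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty sequence, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def classify_gpu_window(utilization, threshold_p95=70.0, threshold_p5=10.0):
    """Classify one window of GPU utilization samples (percent, 0 to 100).

    Hypothetical helper: the labels and the exact comparison rules are
    assumptions based on the parameter names in the report.
    """
    p95 = percentile(utilization, 95)
    p5 = percentile(utilization, 5)
    if p95 < threshold_p95:
        # Even the busiest samples stay below the threshold.
        return "underutilized"
    if p5 < threshold_p5:
        # High peaks combined with near-idle troughs.
        return "fluctuating"
    return "ok"


def count_flagged_windows(samples, window=500, patience=1000, **thresholds):
    """Count windows that are not "ok", after skipping `patience` samples.

    Assumption: windows do not overlap; the real rule may slide instead.
    """
    evaluated = list(samples)[patience:]
    flagged = 0
    for start in range(0, len(evaluated) - window + 1, window):
        chunk = evaluated[start:start + window]
        if classify_gpu_window(chunk, **thresholds) != "ok":
            flagged += 1
    return flagged
```

Under these assumptions, a trace that alternates between idle and fully busy is labeled "fluctuating", while a trace that never rises above the p95 threshold is labeled "underutilized". The trigger counts in the report come from the real rule and need not match this sketch.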
+SageMaker Debugger auto generated this report. You can generate similar reports on all supported training jobs. The report provides summary of training job, system resource usage statistics, framework metrics, rules summary, and detailed analysis from each rule. The graphs and tables are interactive.
+Legal disclaimer: This report and any recommendations are provided for informational purposes only and are not definitive. You are responsible for making your own independent assessment of the information.
+# Parameters
+processing_job_arn = "arn:aws:sagemaker:us-east-1:598348623909:processing-job/pytorch-training-2023-04-1-profilerreport-644ed05c"
+
Rule | Description | Recommendation | Number of times rule triggered | Number of datapoints | Rule parameters
---|---|---|---|---|---
LowGPUUtilization | Checks if the GPU utilization is low or fluctuating. This can happen due to bottlenecks, blocking calls for synchronizations, or a small batch size. | Check if there are bottlenecks, minimize blocking calls, change distributed training strategy, or increase the batch size. | 28 | 2660 | threshold_p95:70 threshold_p5:10 window:500 patience:1000
BatchSize | Checks if GPUs are underutilized because the batch size is too small. To detect this problem, the rule analyzes the average GPU memory footprint, the CPU and the GPU utilization. | The batch size is too small, and GPUs are underutilized. Consider running on a smaller instance type or increasing the batch size. | 28 | 2659 | cpu_threshold_p95:70 gpu_threshold_p95:70 gpu_memory_threshold_p95:70 patience:1000 window:500
Dataloader | Checks how many data loaders are running in parallel and whether the total number is equal the number of available CPU cores. The rule triggers if number is much smaller or larger than the number of available cores. If too small, it might lead to low GPU utilization. If too large, it might impact other compute intensive operations on CPU. | Change the number of data loader processes. | 1 | 13041 | min_threshold:70 max_threshold:200
CPUBottleneck | Checks if the CPU utilization is high and the GPU utilization is low. It might indicate CPU bottlenecks, where the GPUs are waiting for data to arrive from the CPUs. The rule evaluates the CPU and GPU utilization rates, and triggers the issue if the time spent on the CPU bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Consider increasing the number of data loaders or applying data pre-fetching. | 0 | 5337 | threshold:50 cpu_threshold:90 gpu_threshold:10 patience:1000
StepOutlier | Detects outliers in step duration. The step duration for forward and backward pass should be roughly the same throughout the training. If there are significant outliers, it may indicate a system stall or bottleneck issues. | Check if there are any bottlenecks (CPU, I/O) correlated to the step outliers. | 0 | 0 | threshold:3 mode:None n_outliers:10 stddev:3
IOBottleneck | Checks if the data I/O wait time is high and the GPU utilization is low. It might indicate IO bottlenecks where GPU is waiting for data to arrive from storage. The rule evaluates the I/O and GPU utilization rates and triggers the issue if the time spent on the IO bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Pre-fetch data or choose different file formats, such as binary formats that improve I/O performance. | 0 | 5337 | threshold:50 io_threshold:50 gpu_threshold:10 patience:1000
MaxInitializationTime | Checks if the time spent on initialization exceeds a threshold percent of the total training time. The rule waits until the first step of training loop starts. The initialization can take longer if downloading the entire dataset from Amazon S3 in File mode. The default threshold is 20 minutes. | Initialization takes too long. If using File mode, consider switching to Pipe mode in case you are using TensorFlow framework. | 0 | 0 | threshold:20
GPUMemoryIncrease | Measures the average GPU memory footprint and triggers if there is a large increase. | Choose a larger instance type with more memory if footprint is close to maximum available memory. | 0 | 2660 | increase:5 patience:1000 window:10
LoadBalancing | Detects workload balancing issues across GPUs. Workload imbalance can occur in training jobs with data parallelism. The gradients are accumulated on a primary GPU, and this GPU might be overused with regard to other GPUs, resulting in reducing the efficiency of data parallelization. | Choose a different distributed training strategy or a different distributed training framework. | 0 | 2660 | threshold:0.2 patience:1000