Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nsys report analyzer #65

Closed
wants to merge 19 commits into from
Closed

Add nsys report analyzer #65

wants to merge 19 commits into from

Conversation

FindHao
Copy link
Member

@FindHao FindHao commented Nov 21, 2024

This PR add a nsys report analyzer providing metrics

nsys_metrics_to_reports = {
    # the sum of kernel execution time
    "nsys_gpu_kernel_sum": ["cuda_gpu_kern_sum", "nvtx_sum"],
    # the overhead of kernel launch
    "nsys_launch_overhead": ["cuda_gpu_kern_sum", "nvtx_sum"],
    # the names of kernels
    "nsys_kernel_names": ["cuda_gpu_kern_sum"],
    # the durations of kernels
    "nsys_kernel_durations": ["cuda_gpu_kern_sum"],
    # the duration of nvtx range
    "nsys_nvtx_range_duration": ["nvtx_sum"],
    # the number of kernels
    "nsys_num_of_kernels": ["cuda_gpu_kern_sum"],
}

nsys_gpu_kernel_sum is the sum of total GPU kernel execution time on GPUs, the nsys_nvtx_range_duration is the total execution time of the operator, and the nsys_launch_overhead is their difference which indicates the launch overhead. This is one way to measure execution time mentioned in #50

Fix #67

Test Plan:

% python run.py --op rope  --num-inputs 1  --metrics nsys_gpu_kernel_sum,nsys_launch_overhead,nsys_kernel_names,nsys_kernel_durations,nsys_nvtx_range_duration,nsys_num_of_kernels --csv --dump-csv
  0%|                                                                                                                                                         | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
  0%|          | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-531e.qdstrm'
[1/1] [0%                          ] nsys_output.nsys-repProcessing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated:
    /tmp/tritonbench/rope/nsys_traces/apply_rotary_pos_emb_0/nsys_output.nsys-rep
  0%|          | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-39ea.qdstrm'
[1/1] [0%                          ] nsys_output.nsys-repProcessing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated:
    /tmp/tritonbench/rope/nsys_traces/liger_rotary_pos_emb_0/nsys_output.nsys-rep
  0%|          | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-e8bf.qdstrm'
[1/1] [0%                          ] nsys_output.nsys-repProcessing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated:
    /tmp/tritonbench/rope/nsys_traces/inductor_rotary_pos_emb_full_op_0/nsys_output.nsys-rep
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:52<00:00, 52.40s/it]
(H, T);apply_rotary_pos_emb-nsys_kernel_names;apply_rotary_pos_emb-nsys_kernel_durations;apply_rotary_pos_emb-nsys_gpu_kernel_sum;apply_rotary_pos_emb-nsys_num_of_kernels;apply_rotary_pos_emb-nsys_launch_overhead;apply_rotary_pos_emb-nsys_nvtx_range_duration;liger_rotary_pos_emb-nsys_kernel_names;liger_rotary_pos_emb-nsys_kernel_durations;liger_rotary_pos_emb-nsys_gpu_kernel_sum;liger_rotary_pos_emb-nsys_num_of_kernels;liger_rotary_pos_emb-nsys_launch_overhead;liger_rotary_pos_emb-nsys_nvtx_range_duration;inductor_rotary_pos_emb_full_op-nsys_kernel_names;inductor_rotary_pos_emb_full_op-nsys_kernel_durations;inductor_rotary_pos_emb_full_op-nsys_gpu_kernel_sum;inductor_rotary_pos_emb_full_op-nsys_num_of_kernels;inductor_rotary_pos_emb_full_op-nsys_launch_overhead;inductor_rotary_pos_emb_full_op-nsys_nvtx_range_duration
(8192, 1024);['void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::BinaryFunctor<float, float, float, at::native::binary_internal::MulFunctor<float>>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)', 'void at::native::<unnamed>::CatArrayBatchedCopy<at::native::<unnamed>::OpaqueType<(unsigned int)4>, unsigned int, (int)4, (int)64, (int)64>(T1 *, at::native::<unnamed>::CatArrInputTensorMetadata<T1, T2, T4, T5>, at::native::<unnamed>::TensorSizeStride<T2, (unsigned int)4>, int, T2)', 'void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::CUDAFunctor_add<float>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)', 'void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::neg_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 2)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)'];0.090065;0.351364;4;0.4534;0.804764;['_triton_rope'];0.049281;0.049281;1;0.176437;0.225718;['triton_poi_fused_add_cat_mul_0', 'triton_poi_fused_add_cat_mul_1'];0.0266885;0.053377;2;0.444969;0.498346
[TritonBench] Dumped csv to /tmp/tritonbench/op_rope__z_yqmrz.csv

@facebook-github-bot
Copy link
Contributor

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

reports_required.extend(nsys_metrics_to_reports[metric])
reports_required = list(set(reports_required))
assert reports_required, "No nsys reports required"
cmd = f"nsys stats --report {','.join(reports_required)} --force-export=true --format csv --output . --force-overwrite=true {report_path} > /dev/null 2>&1"
Copy link
Member Author

@FindHao FindHao Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple ways to obtain reports and collect data from them. 1. run nsys with desired reports via --report . It will automatically generate the csv files for each report. 2. only generate sqlite and utilize report scripts in nsys/target-linux-x64/reports. See discussion here.

In summary, we always need one system command call for nsys to either generate sqlite or csv reports, and the report scripts in nsys folder also do csv process, so we can directly utilize nsys to obtain csv report files rather than processing by ourselves.

@FindHao
Copy link
Member Author

FindHao commented Nov 21, 2024

Some results are not correct. working on the fix.

@facebook-github-bot
Copy link
Contributor

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

if "nvtx_sum" in csv_contents:
# It is supposed to be only one row. The nvtx range is `:tritonbench_range`
assert len(csv_contents["nvtx_sum"]) == 1
nvtx_range_duration = (
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nsys returns this number in us rather than ns in some cases. it's not our bug. left this comment for future check.

@facebook-github-bot
Copy link
Contributor

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@FindHao merged this pull request in deb9183.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

We need to add support for string lists as metrics
3 participants