
Fix cudagraph mem #112

Closed
wants to merge 3 commits

Conversation

@FindHao (Member) commented Dec 11, 2024

Fix #110

Test Plan:

% python run.py --op kl_div --mode fwd  --precision fp32 --metrics latency,speedup,gpu_peak_mem,mem_footprint_compression_ratio --csv --cudagraph
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:29<00:00,  4.85s/it]
(B, T, V);torch_kl_div-gpu_peak_mem;torch_kl_div-latency;liger_kl_div-mem_footprint_compression_ratio;liger_kl_div-gpu_peak_mem;liger_kl_div-speedup;liger_kl_div-latency;inductor_kl_div-mem_footprint_compression_ratio;inductor_kl_div-gpu_peak_mem;inductor_kl_div-speedup;inductor_kl_div-latency
(8, 512, 4096);0.335545344;0.2913801074028015;2.499664352734763;0.13423616;4.345591424995377;0.06705188751220703;2.4999408737712234;0.134221312;4.5710188874722935;0.06374511122703552
(8, 512, 8192);0.671089664;0.5571430921554565;2.4998321648446384;0.268453888;4.4568456168076676;0.12500838935375214;2.4999704364909068;0.26843904;4.131225595689207;0.13486145436763763
(8, 512, 16384);1.342178304;1.0886905193328857;2.4999160795413364;0.536889344;4.558430863427795;0.238830104470253;2.499985218146775;0.536874496;4.128684738124056;0.26368942856788635
(8, 512, 32768);2.684355584;2.14784574508667;2.499958039050386;1.073760256;4.621623748214942;0.4647383391857147;2.499992609048718;1.073745408;4.12963063769356;0.5201060175895691
(8, 512, 65536);5.368710144;4.265157222747803;2.4999790193451172;2.14750208;4.651345931803366;0.9169726967811584;2.499996304518191;2.147487232;4.136367624897469;1.0311359167099
(8, 512, 131072);10.737419264;8.51184368133545;2.499989509627539;4.294985728;4.678422853171112;1.819383144378662;2.499998152257554;4.29497088;4.1419456155050405;2.05503511428833
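As a sanity check on the metrics above, mem_footprint_compression_ratio appears to be the eager baseline's gpu_peak_mem divided by the candidate backend's gpu_peak_mem. Assuming that definition (my reading of the output, not something stated in the PR), it can be reproduced directly from the first row:

```python
# Sanity check, assuming:
#   mem_footprint_compression_ratio == baseline peak mem / candidate peak mem.
# Numbers are the gpu_peak_mem values from the (8, 512, 4096) row above.
torch_kl_div_peak = 0.335545344   # torch_kl_div-gpu_peak_mem (eager baseline)
liger_kl_div_peak = 0.13423616    # liger_kl_div-gpu_peak_mem

print(torch_kl_div_peak / liger_kl_div_peak)
# ~2.4997, matching liger_kl_div-mem_footprint_compression_ratio
```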

This reverts commit 9102236.
@FindHao (Member, Author) commented Dec 11, 2024

% python run.py --op kl_div --mode fwd_bwd  --precision fp32 --metrics latency,speedup --csv  --num-inputs 1 --cudagraph
  0%|                                                                                                                           | 0/1 [00:00<?, ?it/s]
Caught exception, terminating early with partial results
Traceback (most recent call last):
  File "/scratch/yhao/miniconda3/envs/pta_gil/lib/python3.10/site-packages/triton/testing.py", line 79, in do_bench_cudagraph
    fn()
  File "/scratch/yhao/pta/tritonbench/tritonbench/utils/triton_op.py", line 708, in <lambda>
    fwd_bwd_fn = lambda: (fwd_fn(), bwd_fn())
  File "/scratch/yhao/pta/tritonbench/tritonbench/operators/kl_div/operator.py", line 59, in <lambda>
    return lambda: y.backward(retain_graph=True)
  File "/scratch/yhao/pta/pytorch/torch/_tensor.py", line 626, in backward
    torch.autograd.backward(
  File "/scratch/yhao/pta/pytorch/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/scratch/yhao/pta/pytorch/torch/autograd/graph.py", line 823, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: operation would make the legacy stream depend on a capturing blocking stream
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/yhao/pta/tritonbench/tritonbench/utils/triton_op.py", line 800, in run
    y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce(
  File "/scratch/yhao/pta/tritonbench/tritonbench/utils/triton_op.py", line 788, in _reduce_benchmarks
    acc[bm_name] = self._do_bench(
  File "/scratch/yhao/pta/tritonbench/tritonbench/utils/triton_op.py", line 1037, in _do_bench
    metrics.latency = do_bench_wrapper(
  File "/scratch/yhao/pta/tritonbench/tritonbench/components/do_bench/run.py", line 16, in do_bench_wrapper
    return triton.testing.do_bench_cudagraph(
  File "/scratch/yhao/miniconda3/envs/pta_gil/lib/python3.10/site-packages/triton/testing.py", line 74, in do_bench_cudagraph
    with torch.cuda.graph(g):
  File "/scratch/yhao/pta/pytorch/torch/cuda/graphs.py", line 186, in __exit__
    self.cuda_graph.capture_end()
  File "/scratch/yhao/pta/pytorch/torch/cuda/graphs.py", line 84, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(B, T, V)

Tests that include the backward pass (fwd_bwd) still fail.
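For reference, here is a rough, self-contained sketch of the pattern that fails (illustrative only, not the actual triton.testing or tritonbench code; shapes and the kl_div call are placeholders). do_bench_cudagraph warms the callable up on a side stream and then captures it with torch.cuda.graph; in fwd_bwd mode that callable ends in y.backward(retain_graph=True), and backward work issued by the autograd engine can touch the default ("legacy") stream, which is not allowed while a capture is in progress. That is what the "operation would make the legacy stream depend on a capturing blocking stream" error above reports.

```python
# Rough sketch of the failing capture pattern (illustrative only;
# not the tritonbench/triton.testing implementation).
import torch
import torch.nn.functional as F

x = torch.randn(8, 512, 4096, device="cuda", requires_grad=True)
target = torch.softmax(torch.randn(8, 512, 4096, device="cuda"), dim=-1)

def fwd():
    return F.kl_div(F.log_softmax(x, dim=-1), target, reduction="batchmean")

def fwd_bwd():
    # Mirrors fwd_bwd_fn = lambda: (fwd_fn(), bwd_fn()) with
    # bwd_fn = lambda: y.backward(retain_graph=True)
    y = fwd()
    y.backward(retain_graph=True)

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        fwd_bwd()
torch.cuda.current_stream().wait_stream(s)
torch.cuda.synchronize()

# Capturing the forward pass alone works; including the backward pass is what
# currently raises the "legacy stream" capture error shown in the traceback.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    fwd_bwd()
```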

@FindHao (Member, Author) commented Dec 11, 2024

Let's focus on the forward fix for now. I will use another PR to fix the backward issue.

@facebook-github-bot (Contributor) commented:

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@xuzhao9 (Contributor) left a comment:

LGTM

@facebook-github-bot (Contributor) commented:

@FindHao merged this pull request in f58c0b6.

Successfully merging this pull request may close these issues:

Cudagraph doesn't work anymore