
Fix cudagraph mem #112

Closed
wants to merge 3 commits

Conversation

@FindHao (Member) commented Dec 11, 2024

Fix #110

Test Plan:

% python run.py --op kl_div --mode fwd  --precision fp32 --metrics latency,speedup,gpu_peak_mem,mem_footprint_compression_ratio --csv --cudagraph
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:29<00:00,  4.85s/it]
(B, T, V);torch_kl_div-gpu_peak_mem;torch_kl_div-latency;liger_kl_div-mem_footprint_compression_ratio;liger_kl_div-gpu_peak_mem;liger_kl_div-speedup;liger_kl_div-latency;inductor_kl_div-mem_footprint_compression_ratio;inductor_kl_div-gpu_peak_mem;inductor_kl_div-speedup;inductor_kl_div-latency
(8, 512, 4096);0.335545344;0.2913801074028015;2.499664352734763;0.13423616;4.345591424995377;0.06705188751220703;2.4999408737712234;0.134221312;4.5710188874722935;0.06374511122703552
(8, 512, 8192);0.671089664;0.5571430921554565;2.4998321648446384;0.268453888;4.4568456168076676;0.12500838935375214;2.4999704364909068;0.26843904;4.131225595689207;0.13486145436763763
(8, 512, 16384);1.342178304;1.0886905193328857;2.4999160795413364;0.536889344;4.558430863427795;0.238830104470253;2.499985218146775;0.536874496;4.128684738124056;0.26368942856788635
(8, 512, 32768);2.684355584;2.14784574508667;2.499958039050386;1.073760256;4.621623748214942;0.4647383391857147;2.499992609048718;1.073745408;4.12963063769356;0.5201060175895691
(8, 512, 65536);5.368710144;4.265157222747803;2.4999790193451172;2.14750208;4.651345931803366;0.9169726967811584;2.499996304518191;2.147487232;4.136367624897469;1.0311359167099
(8, 512, 131072);10.737419264;8.51184368133545;2.499989509627539;4.294985728;4.678422853171112;1.819383144378662;2.499998152257554;4.29497088;4.1419456155050405;2.05503511428833
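As a sanity check on the metrics above, mem_footprint_compression_ratio appears to be the eager baseline's gpu_peak_mem divided by the candidate backend's gpu_peak_mem. Assuming that definition (my reading of the output, not something stated in the PR), it can be reproduced directly from the first row:

```python
# Sanity check, assuming:
#   mem_footprint_compression_ratio == baseline peak mem / candidate peak mem.
# Numbers are the gpu_peak_mem values from the (8, 512, 4096) row above.
torch_kl_div_peak = 0.335545344   # torch_kl_div-gpu_peak_mem (eager baseline)
liger_kl_div_peak = 0.13423616    # liger_kl_div-gpu_peak_mem

print(torch_kl_div_peak / liger_kl_div_peak)
# ~2.4997, matching liger_kl_div-mem_footprint_compression_ratio
```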

This reverts commit 9102236.
@FindHao (Member, Author) commented Dec 11, 2024

% python run.py --op kl_div --mode fwd_bwd  --precision fp32 --metrics latency,speedup --csv  --num-inputs 1 --cudagraph
  0%|                                                                                                                           | 0/1 [00:00<?, ?it/s]
Caught exception, terminating early with partial results
Traceback (most recent call last):
  File "/scratch/yhao/miniconda3/envs/pta_gil/lib/python3.10/site-packages/triton/testing.py", line 79, in do_bench_cudagraph
    fn()
  File "/scratch/yhao/pta/tritonbench/tritonbench/utils/triton_op.py", line 708, in <lambda>
    fwd_bwd_fn = lambda: (fwd_fn(), bwd_fn())
  File "/scratch/yhao/pta/tritonbench/tritonbench/operators/kl_div/operator.py", line 59, in <lambda>
    return lambda: y.backward(retain_graph=True)
  File "/scratch/yhao/pta/pytorch/torch/_tensor.py", line 626, in backward
    torch.autograd.backward(
  File "/scratch/yhao/pta/pytorch/torch/autograd/__init__.py", line 347, in backward
    _engine_run_backward(
  File "/scratch/yhao/pta/pytorch/torch/autograd/graph.py", line 823, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: operation would make the legacy stream depend on a capturing blocking stream
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/yhao/pta/tritonbench/tritonbench/utils/triton_op.py", line 800, in run
    y_vals: Dict[str, BenchmarkOperatorMetrics] = functools.reduce(
  File "/scratch/yhao/pta/tritonbench/tritonbench/utils/triton_op.py", line 788, in _reduce_benchmarks
    acc[bm_name] = self._do_bench(
  File "/scratch/yhao/pta/tritonbench/tritonbench/utils/triton_op.py", line 1037, in _do_bench
    metrics.latency = do_bench_wrapper(
  File "/scratch/yhao/pta/tritonbench/tritonbench/components/do_bench/run.py", line 16, in do_bench_wrapper
    return triton.testing.do_bench_cudagraph(
  File "/scratch/yhao/miniconda3/envs/pta_gil/lib/python3.10/site-packages/triton/testing.py", line 74, in do_bench_cudagraph
    with torch.cuda.graph(g):
  File "/scratch/yhao/pta/pytorch/torch/cuda/graphs.py", line 186, in __exit__
    self.cuda_graph.capture_end()
  File "/scratch/yhao/pta/pytorch/torch/cuda/graphs.py", line 84, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(B, T, V)

Tests that include the backward pass (fwd_bwd) still fail.
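For reference, here is a rough, self-contained sketch of the pattern that fails (illustrative only, not the actual triton.testing or tritonbench code; shapes and the kl_div call are placeholders). do_bench_cudagraph warms the callable up on a side stream and then captures it with torch.cuda.graph; in fwd_bwd mode that callable ends in y.backward(retain_graph=True), and backward work issued by the autograd engine can touch the default ("legacy") stream, which is not allowed while a capture is in progress. That is what the "operation would make the legacy stream depend on a capturing blocking stream" error above reports.

```python
# Rough sketch of the failing capture pattern (illustrative only;
# not the tritonbench/triton.testing implementation).
import torch
import torch.nn.functional as F

x = torch.randn(8, 512, 4096, device="cuda", requires_grad=True)
target = torch.softmax(torch.randn(8, 512, 4096, device="cuda"), dim=-1)

def fwd():
    return F.kl_div(F.log_softmax(x, dim=-1), target, reduction="batchmean")

def fwd_bwd():
    # Mirrors fwd_bwd_fn = lambda: (fwd_fn(), bwd_fn()) with
    # bwd_fn = lambda: y.backward(retain_graph=True)
    y = fwd()
    y.backward(retain_graph=True)

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        fwd_bwd()
torch.cuda.current_stream().wait_stream(s)
torch.cuda.synchronize()

# Capturing the forward pass alone works; including the backward pass is what
# currently raises the "legacy stream" capture error shown in the traceback.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    fwd_bwd()
```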

@FindHao (Member, Author) commented Dec 11, 2024

Let's focus on the forward fix for now. I will use another PR to fix the backward issue.

@facebook-github-bot (Contributor) commented:

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@xuzhao9 (Contributor) left a comment:

LGTM

@facebook-github-bot (Contributor) commented:

@FindHao merged this pull request in f58c0b6.

Successfully merging this pull request may close these issues:

Cudagraph doesn't work anymore