GPU support for training geo-level models #149

hansbantilan · 2024-05-13T17:07:11Z

Context

Our team uses LightweightMMM and Robyn concurrently, and we are evaluating performance gains from upgrading LightweightMMM to Meridian.

We find that national-level models perform about the same as what we have come to expect from LightweightMMM, but we see performance gains in geo-level models.

Bug

We can only train geo-level models on CPU. Training on even a single GPU, we find that the _xla_windowed_adaptive_nuts() call on L1084 in ~/meridian/model/model.py results in a RET_CHECK failure.

File "~/meridian/model/model.py", line 1084, in sample_posterior
    mcmc = _xla_windowed_adaptive_nuts(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "~/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: RET_CHECK failure (external/local_xla/xla/service/gpu/gemm_rewriter_triton.cc:1269) fusion_inputs.size() - operand_count_before <= TritonFusionAnalysis::kMaxParameterPerScope  [Op:__inference__xla_windowed_adaptive_nuts_93438]

We have trained on proprietary data even for these small tests, because of the 404 errors others have encountered (c.f. #144) in trying to access the getting-started notebooks
https://developers.google.com/meridian/notebook/meridian-getting-started

Working links to these notebooks and the publicly available data and shared geo-level model training code snippets would help you reproduce these errors.

Our tests were local runs on three different machines, with a desktop RTX 3080 running popOS, a laptop RTX 3080 running fedora, and a laptop RTX 4090 running archlinux, respectively, all resulting in the same RET_CHECK failure.

These were run with the TensorFlow version pulled by the recommended

pip install --upgrade git+https://{github_token}@github.com/google/meridian.git

with CUDA 12.3.1-2 and cuDNN 8.9.7.29-1

Please advise.

The text was updated successfully, but these errors were encountered:

viktoriias · 2024-05-13T18:29:28Z

For the access, please reach out to [email protected] with subject: "[Access Issue]" and someone will help you out with this.

Tmhamm3 · 2024-05-15T20:58:06Z

Hi Hans,

Could you provide some background information related to your testing, such as the inputs you're using for meridian.sample_posterior(), as well as the size of the dataset you are using e.g. number of geos, time periods, and channels?

hansbantilan · 2024-05-20T13:13:45Z

Hi Viktoriia,

Thank you for the meridian-comms pointer, your colleagues kindly gave me and the others in my team access to important pages that helped us get to the root of this issue.

Hi Tmhamm3,

In our tests we kept the number of geos artificially small, to less than half a dozen. This turns out to be key in how we stumbled upon this RET_CHECK failure bug for GPU runs, although the same run will succeed with a CPU.

May I suggest we stick to that toy dataset you provide in

https://developers.google.com/meridian/notebook/meridian-getting-started

since it's one that everyone has access to. There are 40 geos in that toy dataset. Empirically, we find that truncating the dataset to fewer than a dozen geos will result in the same RET_CHECK failure that we see for GPU runs.

i.e., for datasets that have fewer than a dozen geos, it appears that GPU support for the NUTS MCMC sampling will fail.

Tmhamm3 · 2024-05-20T23:26:57Z

Hi Hans,

Thank you for the additional information.
Have you tried using the free tier V100 on Colab, or are you just getting this error on your RTX GPUs?

hansbantilan · 2024-05-22T17:30:43Z

Hi Travis,

Thank you for the suggestion. I've run the same test with a colab free tier T4 instance, truncating your toy dataset in exactly the same way as described above.

The _xla_windowed_adaptive_nuts() calls succeeded on that T4 instance even when the number geos were truncated to less than 12.

For our reference, the CUDA version is 12.2, cuDNN version is 8.9 on that T4 instance, matching the versions used in the local tests.

This suggests that not all GPU hardware is supported by the _xla_windowed_adaptive_nuts() call in sample_posterior()

May I ask, what do you think?

Tmhamm3 · 2024-05-29T21:01:00Z

Hi Hans,

I'm hesitant to say that not all GPU hardware is supported by the tfp.experimental.mcmc.windowed_adaptive_nuts() function, especially since you are saying it works with certain data sizes. I see you previously mentioned the following versions: CUDA 12.3.1-2 and cuDNN 8.9.7.29-1. Could you provide the driver version you have installed on your machines as well?

santoso-wijaya added the question Further information is requested label Sep 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU support for training geo-level models #149

GPU support for training geo-level models #149

hansbantilan commented May 13, 2024 •

edited

Loading

viktoriias commented May 13, 2024

Tmhamm3 commented May 15, 2024

hansbantilan commented May 20, 2024 •

edited

Loading

Tmhamm3 commented May 20, 2024

hansbantilan commented May 22, 2024

Tmhamm3 commented May 29, 2024

GPU support for training geo-level models #149

GPU support for training geo-level models #149

Comments

hansbantilan commented May 13, 2024 • edited Loading

Context

Bug

viktoriias commented May 13, 2024

Tmhamm3 commented May 15, 2024

hansbantilan commented May 20, 2024 • edited Loading

Tmhamm3 commented May 20, 2024

hansbantilan commented May 22, 2024

Tmhamm3 commented May 29, 2024

hansbantilan commented May 13, 2024 •

edited

Loading

hansbantilan commented May 20, 2024 •

edited

Loading