GPU support for training geo-level models #149
Comments
For access, please reach out to [email protected] with the subject "[Access Issue]" and someone will help you out with this.
Hi Hans, Could you provide some background information related to your testing, such as the inputs you're using for meridian.sample_posterior(), as well as the size of the dataset you are using, e.g., number of geos, time periods, and channels?
Hi Viktoriia, Thank you for the meridian-comms pointer; your colleagues kindly gave me and the others on my team access to important pages that helped us get to the root of this issue. Hi Tmhamm3, In our tests we kept the number of geos artificially small, to fewer than half a dozen. This turns out to be key to how we stumbled upon this RET_CHECK failure for GPU runs, although the same run succeeds on CPU. May I suggest we stick to the toy dataset you provide, since it's one that everyone has access to. There are 40 geos in that toy dataset. Empirically, we find that truncating the dataset to fewer than a dozen geos results in the same RET_CHECK failure we see for GPU runs. That is, for datasets with fewer than a dozen geos, GPU-based NUTS MCMC sampling appears to fail.
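For concreteness, this is roughly how we truncate the toy dataset before handing it to Meridian. The CSV path and the "geo" column name are assumptions based on our recollection of the getting-started data, so please read this as a sketch rather than our exact script:

```python
import pandas as pd

# Hypothetical paths; substitute the actual getting-started toy CSV.
RAW_CSV = "geo_media_toy_data.csv"
TRUNCATED_CSV = "geo_media_toy_data_small.csv"
N_GEOS = 6  # fewer than a dozen geos reproduces the RET_CHECK failure on GPU for us

df = pd.read_csv(RAW_CSV)

# Keep only the first N_GEOS geos; the "geo" column name is an assumption
# based on the getting-started schema.
keep_geos = df["geo"].drop_duplicates().iloc[:N_GEOS]
df_small = df[df["geo"].isin(keep_geos)]
df_small.to_csv(TRUNCATED_CSV, index=False)

# The truncated CSV is then loaded and fit exactly as in the getting-started
# notebook (CsvDataLoader -> Meridian -> sample_posterior).
```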
Hi Hans, Thank you for the additional information.
Hi Travis, Thank you for the suggestion. I've run the same test on a Colab free-tier T4 instance, truncating your toy dataset exactly as described above. The _xla_windowed_adaptive_nuts() calls succeeded on that T4 instance even when the number of geos was truncated to fewer than 12. For reference, the CUDA version is 12.2 and the cuDNN version is 8.9 on that T4 instance, matching the versions used in the local tests. This suggests that not all GPU hardware is supported by the _xla_windowed_adaptive_nuts() call in sample_posterior(). May I ask, what do you think?
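For completeness, this is roughly how we checked the Colab instance; note that tf.sysconfig.get_build_info() reports the CUDA/cuDNN versions TensorFlow was built against, not necessarily what the runtime has installed, so treat it as a sanity check:

```python
import tensorflow as tf

# GPUs visible to TensorFlow on this instance (e.g. the Colab T4).
print("GPUs:", tf.config.list_physical_devices("GPU"))

# CUDA / cuDNN versions this TensorFlow build was compiled against.
build = tf.sysconfig.get_build_info()
print("CUDA (build):", build.get("cuda_version"))
print("cuDNN (build):", build.get("cudnn_version"))
```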
Hi Hans, I'm hesitant to say that not all GPU hardware is supported by the tfp.experimental.mcmc.windowed_adaptive_nuts() function, especially since you are saying it works with certain data sizes. I see you previously mentioned the following versions: CUDA 12.3.1-2 and cuDNN 8.9.7.29-1. Could you provide the driver version you have installed on your machines as well?
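The driver version being asked about can be read from nvidia-smi; a minimal sketch, assuming the NVIDIA driver utilities are on PATH:

```python
import subprocess

# Query the installed NVIDIA driver version via nvidia-smi.
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print("NVIDIA driver:", driver)
```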
Context
Our team uses LightweightMMM and Robyn concurrently, and we are evaluating the performance gains from upgrading from LightweightMMM to Meridian.
We find that national-level models perform about the same as what we have come to expect from LightweightMMM, but we see performance gains in geo-level models.
Bug
We can only train geo-level models on CPU. When training on even a single GPU, we find that the _xla_windowed_adaptive_nuts() call on L1084 in ~/meridian/model/model.py results in a RET_CHECK failure.
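To confirm the CPU/GPU contrast, we hide the GPUs from TensorFlow before building the model; a minimal sketch (the hiding has to happen before any op is placed on a GPU, i.e. before the Meridian model object is constructed):

```python
import tensorflow as tf

# Hiding the GPUs forces the sampler onto CPU for this process.
tf.config.set_visible_devices([], "GPU")
print("Visible GPUs:", tf.config.get_visible_devices("GPU"))  # expect []
```

With the GPUs hidden this way, the same sample_posterior() call that RET_CHECKs on GPU completes for us.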
We have trained on proprietary data even for these small tests, because of the 404 errors others have encountered (cf. #144) when trying to access the getting-started notebooks at
https://developers.google.com/meridian/notebook/meridian-getting-started
Working links to these notebooks and the publicly available data, together with shared geo-level model training code snippets, would make it easier for you to reproduce these errors.
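The kind of training snippet we have in mind is roughly the following. The column names, loader arguments, and sampler settings are assumptions based on our recollection of the getting-started notebook, so treat this as a sketch rather than a verbatim repro:

```python
from meridian.data import load
from meridian.model import model, spec

# Column mapping for the toy CSV; the column and channel names are assumptions.
coord_to_columns = load.CoordToColumns(
    time="time",
    geo="geo",
    population="population",
    kpi="conversions",
    revenue_per_kpi="revenue_per_conversion",
    controls=["GQV"],
    media=["Channel0_impression"],
    media_spend=["Channel0_spend"],
)

loader = load.CsvDataLoader(
    csv_path="geo_media_toy_data_small.csv",  # truncated file from the sketch above
    kpi_type="non_revenue",
    coord_to_columns=coord_to_columns,
    media_to_channel={"Channel0_impression": "Channel0"},
    media_spend_to_channel={"Channel0_spend": "Channel0"},
)
data = loader.load()

mmm = model.Meridian(input_data=data, model_spec=spec.ModelSpec())
mmm.sample_prior(500)

# This is the call that fails with a RET_CHECK error on our GPUs;
# it goes through _xla_windowed_adaptive_nuts() internally.
mmm.sample_posterior(n_chains=4, n_adapt=500, n_burnin=500, n_keep=1000)
```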
Our tests were local runs on three different machines: a desktop RTX 3080 running Pop!_OS, a laptop RTX 3080 running Fedora, and a laptop RTX 4090 running Arch Linux, all resulting in the same RET_CHECK failure.
These were run with the TensorFlow version pulled in by the recommended installation, with CUDA 12.3.1-2 and cuDNN 8.9.7.29-1.
Please advise.