GSoC 2024 projects
New contributors should first read the contributing guide and learn the basics of PyTensor. They should also read through some of the examples in the PyMC docs.
To be considered as a GSoC student, you should make a PR to PyMC / PyTensor. It can be something small, like a doc fix or simple bug fix. Some beginner friendly issues can be found here.
If you are a student interested in participating, please contact us via our Discourse site.
Below is a list of possible topics for your GSoC project. We are also open to other topics; contact us on Discourse. Keep in mind that these are only ideas and that some of them can't be completely solved in a single GSoC project. When writing your proposal, choose some specific tasks and make sure your proposal is adequate for the GSoC time commitment. We expect all projects to be 350h projects. If you'd like to be considered for a 175h project, you must reach out on Discourse. We will not accept 175h applications from people who have not discussed their time commitment with us before submitting the application.
- Add PyTorch backend to PyTensor
- Improve PyTensor linear algebra support
- Extending McBackend related features
- Pathfinder variational inference
- Spatial modeling
- Extend automatic marginalization functionality in PyMC-experimental
- Implement New Statespace Models
- Improve BART performance
- Improve log-probability inference of order statistics
Add PyTorch backend to PyTensor
PyMC uses PyTensor as its computational backend. PyTensor can convert a computational graph into C, JAX, or Numba functions, depending on the user's needs. This allows interoperability with libraries from different ecosystems, like Rust-based NUTS sampling via nutpie (Numba backend), JAX-based sampling and model building via NumPyro and BlackJAX and wrapping of arbitrary JAX operations/libraries (JAX backend), and, of course, PyMC (mostly C backend).
We would like to explore PyTorch as a new backend. Being one of the most popular ML frameworks in the Python ecosystem, this would provide a bridge between the large PyTorch ecosystem and that of PyTensor/PyMC. Linking to PyTorch would further facilitate the usage of GPU hardware, which is currently only available to PyTensor users via the JAX backend. The recent addition of static graph compilation in PyTorch 2.0 may also provide an alternative high-level bridge to the promising IREE compiler, possibly with fewer/distinct constraints than those currently imposed by JAX (like static shapes and fixed-length looping).
The existing infrastructure for the Numba and JAX backends should provide a good blueprint for the introduction of PyTorch. An AI-generated PR was opened some time ago that may be helpful (or safe to ignore completely): https://github.com/pymc-devs/pytensor/pull/457
The project would consist of an early proof-of-concept transpilation of a small PyTensor graph, composed of a handful of Operations, to PyTorch. Work will be needed to assess the specific requirements of the PyTorch backend regarding complex operations like looping and branching, and specialized variable types like random-number generators and sparse tensors. The goal of the project is not to implement full PyTorch coverage of the current PyTensor offerings, but rather to lay the foundations that would guide further community contributions to this aim.
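As a rough illustration of what transpiling a small graph involves, the toy sketch below converts a graph of a handful of operations into a Python callable through per-operation dispatch. It uses only the standard library: the classes `Var`, `Add`, `Mul`, `Exp` and the function `funcify` are hypothetical names chosen for this example, not PyTensor or PyTorch APIs. A real backend would register conversions for PyTensor Ops and emit PyTorch calls instead.

```python
import math
from dataclasses import dataclass
from functools import singledispatch


# Hypothetical graph node types, standing in for backend-agnostic operations.
@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Mul:
    left: object
    right: object


@dataclass(frozen=True)
class Exp:
    arg: object


@singledispatch
def funcify(node):
    """Convert a graph node into a function of an environment dict."""
    # Unregistered node types fail loudly, which shows which operations
    # still lack a conversion.
    raise NotImplementedError(f"No conversion registered for {type(node).__name__}")


@funcify.register
def _(node: Var):
    return lambda env: env[node.name]


@funcify.register
def _(node: Add):
    left, right = funcify(node.left), funcify(node.right)
    return lambda env: left(env) + right(env)


@funcify.register
def _(node: Mul):
    left, right = funcify(node.left), funcify(node.right)
    return lambda env: left(env) * right(env)


@funcify.register
def _(node: Exp):
    arg = funcify(node.arg)
    return lambda env: math.exp(arg(env))


# exp(x * y + x), built as a graph and then converted once into a callable.
graph = Exp(Add(Mul(Var("x"), Var("y")), Var("x")))
compiled = funcify(graph)
print(compiled({"x": 1.0, "y": 2.0}))
```

One reason to dispatch per operation is that coverage can grow incrementally: each newly supported operation is a single registration, and anything unsupported raises a clear error.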
As such, this project should also have a strong focus on developer-facing documentation and long-term planning of milestones. Engagement with the growing PyTensor developer community would be highly commendable!
- Ricardo Vieira
- Hours: 350
- Expected outcome: Initial framework for linking to PyTorch from PyTensor.
- Skills required: Python, PyTorch
- Difficulty: Hard
Improve PyTensor linear algebra support
PyMC uses PyTensor as its computational backend. PyTensor can convert a computational graph into C, JAX, or Numba functions, depending on the user's needs. PyTensor also automatically applies graph rewrites to improve code efficiency and numerical stability. These range from simple, like a / a -> 1, to more sophisticated, like inv(A) b -> solve(A, b). The recent CoLA package and paper give a roadmap for graph rewrites and optimizations that exploit matrix structure. This project would dramatically speed up many model families in PyMC, like state space models, time series, Gaussian processes, and other spatio-temporal models. This project is related to the Fast Exact GP project.
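To make the two rewrites mentioned above concrete, here is a minimal sketch of a bottom-up rewriter over expressions encoded as nested tuples. The tuple encoding and the `rewrite` function are assumptions made for this example. PyTensor's actual rewrite machinery works on its own graph objects and is considerably more involved.

```python
def rewrite(expr):
    """Apply two illustrative rewrites bottom-up to a nested-tuple expression.

    Expressions are either leaves (variable names or constants) or tuples
    of the form (op_name, arg1, arg2, ...).
    """
    if not isinstance(expr, tuple):
        return expr  # leaf: nothing to rewrite
    op, *args = expr
    # Rewrite the inputs first so that patterns can match simplified children.
    args = [rewrite(a) for a in args]
    # a / a -> 1 (purely symbolic; the a == 0 case is not considered here)
    if op == "div" and args[0] == args[1]:
        return 1
    # inv(A) @ b -> solve(A, b), which avoids forming the explicit inverse
    if op == "matmul" and isinstance(args[0], tuple) and args[0][0] == "inv":
        return ("solve", args[0][1], args[1])
    return (op, *args)


expr = ("add", ("div", "x", "x"), ("matmul", ("inv", "K"), "y"))
print(rewrite(expr))
```

Structure-exploiting rewrites of the kind described in CoLA follow the same pattern-match-and-replace idea, but additionally need to know properties of the matrices involved (for example, that a matrix is diagonal or triangular).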
- Bill Engels
- Jesse Grabowski
- Ricardo Vieira
- Hours: 350
- Expected outcome: Improvement to PyTensor's linear algebra functionality
- Skills required: Python, linear algebra
- Difficulty: Medium
Extending McBackend related features
Since v5.1.1, PyMC has optional support for using McBackend as a storage backend for MCMC draws & stats during sampling. This enables sampling of very big models, and when used with its ClickHouse backend, also live streaming of draws & stats to a high-performance database.
A GSoC project could explore different directions in this area:
- Streaming/dashboarding visualizations of live MCMC with inspiration from this PyMCon presentation and this Stan project.
- Implementing an HDF5 backend that writes data to disk as an alternative to streaming to a database. This could enable sparse read operations if the HDF5 has the same structure as the one written by ArviZ.
- Extending McBackend API to make it more powerful/convenient, for example as described in this Discourse post.
- Prototyping how a PyArrow backend could be implemented, thereby enabling interop with non-Python samplers such as nutpie.
- Refactoring of the step method interfaces to be more standardized and stateless, thereby paving the road for making MCMCs resumable.
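To illustrate why a stateless step interface makes a chain resumable, the sketch below implements a toy random-walk Metropolis step as a pure function of an explicit state object. Everything here (`ChainState`, `step`, `run`, the standard normal target) is hypothetical and unrelated to the actual PyMC step method or McBackend APIs. The only point is that once all state, including the random number generator state, is carried explicitly, stopping and resuming gives the same draws as an uninterrupted run.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainState:
    position: float
    rng_state: tuple  # snapshot obtained from random.Random.getstate()


def log_density(x):
    """Standard normal log-density, up to an additive constant."""
    return -0.5 * x * x


def step(state, scale=1.0):
    """One Metropolis step as a pure function: old state in, new state out."""
    rng = random.Random()
    rng.setstate(state.rng_state)
    proposal = state.position + rng.gauss(0.0, scale)
    log_accept = log_density(proposal) - log_density(state.position)
    # Accept with probability min(1, exp(log_accept)).
    accepted = rng.random() < math.exp(min(0.0, log_accept))
    position = proposal if accepted else state.position
    return ChainState(position, rng.getstate())


def run(state, n_steps):
    """Advance the chain and return the final state together with the draws."""
    draws = []
    for _ in range(n_steps):
        state = step(state)
        draws.append(state.position)
    return state, draws


initial = ChainState(0.0, random.Random(123).getstate())

# One uninterrupted run of 100 steps ...
_, full = run(initial, 100)

# ... versus 50 steps, a pause (the state could be written to disk), and 50 more.
midpoint, first_half = run(initial, 50)
_, second_half = run(midpoint, 50)

print(first_half + second_half == full)
```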
When writing an application for this project, please make it clear which direction you would like to focus on and why. We will also expect you to have some hands-on experience with the McBackend interface, for example from using it with one of your models.
- Michael Osthege
- Colin Carroll
- Osvaldo Martin
- Adrian Seyboldt
- Hours: 350
- Expected outcome: Addition of new features, or performance improvements that strengthen the position & visibility of PyMC as the go-to library for running Bayesian inference.
- Skills required: Python, NumPy, ArviZ/xarray
- Difficulty: Moderate
Pathfinder variational inference
I propose the implementation of the Pathfinder algorithm for variational inference in PyMC. The Pathfinder algorithm is a recent advancement in the field of approximate Bayesian inference, offering a scalable and efficient approach for approximating posterior distributions. Integrating this algorithm into PyMC, a popular probabilistic programming library, would provide users with a powerful tool for conducting Bayesian inference on complex models. The Pathfinder algorithm's ability to handle high-dimensional parameter spaces and large datasets makes it particularly well-suited for applications where traditional Markov chain Monte Carlo methods may be computationally expensive or impractical. This addition to PyMC would enhance its capabilities, making it more versatile and accessible for a broader range of Bayesian modeling scenarios.
- Chris Fonnesbeck
- Hours: 350
- Expected outcome: An implementation of the Pathfinder algorithm in PyMC
- Skills required/preferred: Python, JAX, statistics, optimization
- Difficulty: Moderate
Spatial modeling
This project will build on previous GSoC projects to continue improving PyMC's support for modeling spatial processes. There are many possible algorithms one may choose to work on, such as Gaussian process based methods for point processes like Nearest Neighbor GPs or the Vecchia approximation, and models that are types of Gaussian Markov Random Fields, like CAR, ICAR and BYM models. Implementations of these can be found in the R packages CARBayes and INLA.
- Bill Engels
- Chris Fonnesbeck
- Hours: 350
- Expected outcome: An implementation of one or more of the methods listed above, along with one or more notebook examples that can be added to the PyMC docs demonstrating these techniques.
- Skills required: Python, statistics, GPs
- Difficulty: Medium
Extend automatic marginalization functionality in PyMC-experimental
PyMC-Experimental includes a specialized PyMC MarginalModel subclass that can marginalize (and recover) finite discrete univariate variables for more efficient MCMC sampling. Recently we also added support for marginalization of DiscreteMarkovChain, yielding automatically derived HiddenMarkovModels.
A non-trivial example using this functionality in a multiple changepoint model can be found in this gist.
This project would aim to extend this functionality in several ways:
- Support marginalization of truncated versions of other discrete distributions like Truncated Binomial or Truncated Poisson.
- Support marginalization of variables with closed form solution such as Beta + Binomial = BetaBinomial
- Support marginalization of HMM models defined via Scan operations
- Integrate automatic marginalization with automatic probability derivation, rendering the MarginalModel class unnecessary.
- Contribute new pymc-examples showcasing the new/existing functionality.
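The closed-form case mentioned in the list, Beta + Binomial = BetaBinomial, can be checked numerically with a short standard-library sketch. It compares the Beta-Binomial probability mass function with a direct numerical integration of the Binomial likelihood against the Beta prior. The function names and the parameter values (`n = 10`, `a = 2`, `b = 3`) are chosen only for illustration, and this is not how MarginalModel performs marginalization.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, a, b):
    """Closed form: C(n, k) * B(k + a, n - k + b) / B(a, b)."""
    return math.comb(n, k) * math.exp(log_beta(k + a, n - k + b) - log_beta(a, b))


def marginalized_pmf(k, n, a, b, grid=20_000):
    """Integrate Binomial(k | n, p) * Beta(p | a, b) over p with the midpoint rule."""
    total = 0.0
    for i in range(grid):
        p = (i + 0.5) / grid  # midpoint of the i-th subinterval of (0, 1)
        binom = math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        beta = math.exp(
            (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p) - log_beta(a, b)
        )
        total += binom * beta
    return total / grid


n, a, b = 10, 2.0, 3.0
for k in range(n + 1):
    closed = beta_binomial_pmf(k, n, a, b)
    numeric = marginalized_pmf(k, n, a, b)
    print(k, round(closed, 6), round(numeric, 6))
```

Recognizing such conjugate pairs symbolically lets the continuous variable be integrated out exactly, without any numerical quadrature.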
These points are suggestions and not an exhaustive list. Not all points must be tackled in the proposed project.
This project will require interacting with PyTensor, which is the backend used by PyMC. See https://www.pymc.io/projects/docs/en/v5.0.2/learn/core_notebooks/pymc_pytensor.html for more details. An understanding of probability theory is helpful but not a requirement (you can learn as you go).
- Hours: 350
- Expected outcome: Extend the functionality of MarginalModel and, ultimately, deprecate it so that PyMC users can benefit from it without having to engage with a Model subclass.
- Skills required: Python, Probability
- Difficulty: Hard
- Ricardo Vieira
- Rob Zinkov
Implement New Statespace Models
Linear state space models offer a general framework for implementing a huge number of time series models in PyMC. PyMC-Experimental currently has a statespace module that implements SARIMAX, VARMAX, and structural models. The module helps users with estimation, forecasting, and causal analysis using these models.
Currently the module does not cover all statespace models offered in the statsmodels.tsa.statespace module. In particular, it lacks dynamic factor models and linear exponential smoothing models (https://www.statsmodels.org/dev/statespace.html#linear-exponential-smoothing-models). This project could implement one or both of these models in the existing statespace framework.
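For readers new to the topic, the simplest linear state space model is the local level model, in which an unobserved level follows a random walk and is observed with noise. The sketch below is a scalar Kalman filter for that model using only the standard library. It is not the API of the statespace module, and the function name and arguments are assumptions for this example. The update step has the form of exponential smoothing, with the Kalman gain acting as the smoothing weight, which hints at the link between state space models and exponential smoothing.

```python
import math


def local_level_filter(observations, obs_var, level_var, init_mean=0.0, init_var=1e6):
    """Kalman filter for y_t = level_t + noise, level_{t+1} = level_t + shock.

    Returns the filtered levels, the Kalman gains, and the log-likelihood.
    """
    mean, var = init_mean, init_var
    levels, gains = [], []
    loglik = 0.0
    for y in observations:
        # Predict: the level is a random walk, so only its variance grows.
        pred_mean = mean
        pred_var = var + level_var
        # Update: blend the prediction and the observation using the gain.
        innovation = y - pred_mean
        innovation_var = pred_var + obs_var
        gain = pred_var / innovation_var
        mean = pred_mean + gain * innovation  # = (1 - gain) * pred_mean + gain * y
        var = (1.0 - gain) * pred_var
        # Gaussian log-density of the one-step-ahead prediction error.
        loglik += -0.5 * (
            math.log(2.0 * math.pi * innovation_var) + innovation**2 / innovation_var
        )
        levels.append(mean)
        gains.append(gain)
    return levels, gains, loglik


data = [math.sin(0.3 * t) + 0.1 * t for t in range(50)]
levels, gains, loglik = local_level_filter(data, obs_var=1.0, level_var=1.0)
print(gains[0], gains[-1], loglik)
```

With a very diffuse initial variance, the first gain is close to 1 (the filter trusts the first observation almost completely), and the gain then settles to a constant that depends only on the ratio of the two variances.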
In addition, the project would produce an example notebook showing how to do analysis with the new model, similar to the SARIMAX notebook found here.
This project will require interacting with PyTensor, which is the backend used by PyMC. See https://www.pymc.io/projects/docs/en/v5.0.2/learn/core_notebooks/pymc_pytensor.html for more details. An understanding of time series analysis is also helpful, but not a requirement (you can learn as you go).
- Jesse Grabowski
- Hours: 350
- Expected outcome: New statespace model(s) in the pymc_experimental.statespace module
- Skills required: Python; time series econometrics
- Difficulty: Medium
Improve BART performance
Bayesian Additive Regression Trees (BART) is a non-parametric regression model based on the sum of simple decision trees. In practice, it has proven useful for its flexibility and good statistical results even with minimal user intervention. In PyMC, we can create BART models through the PyMC-BART extension, which allows BART to be included as a component of other models. In this project we aim to improve the computational performance of BART, both increasing its speed and reducing its memory footprint. The exact approach is open to discussion. While there may be room for significant optimization at the Python/NumPy/Numba code level, more aggressive approaches may be more productive. One option is Cythonization of the tree structures used by PyMC-BART, which is the approach scikit-learn uses for its tree-based methods. Another is reimplementation in a language like Rust (see here for an outdated version of PyMC-BART ported to Rust). Other options can be discussed too.
In addition, the project would produce at least one example notebook showing how to analyze data with PyMC-BART.
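As a reminder of what "sum of simple decision trees" means computationally, here is a minimal sketch of predicting with a small tree ensemble. The `Leaf` and `Split` classes and the example trees are hypothetical and do not reflect the tree structures used inside PyMC-BART. They only show the kind of traversal whose cost the project would aim to reduce.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: float


@dataclass
class Split:
    feature: int      # index of the covariate to test
    threshold: float  # go left if x[feature] <= threshold
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, Split):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def predict_ensemble(trees, x):
    """A sum-of-trees prediction is the sum of the individual tree outputs."""
    return sum(predict_tree(tree, x) for tree in trees)


trees = [
    Split(0, 0.5, Leaf(-1.0), Leaf(1.0)),
    Split(1, 2.0, Leaf(0.25), Leaf(0.75)),
    Leaf(0.5),
]
print(predict_ensemble(trees, [0.2, 3.0]))
```

In a pure Python representation like this one, every prediction involves attribute lookups and type checks per node, which is part of the motivation for moving tree structures to compiled code.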
- Osvaldo Martin
- Christopher Fonnesbeck
- Hours: 350
- Expected outcome: A more performant version of PyMC-BART
- Skills required: Python (and maybe familiarity with Cython, Numba, Rust)
- Difficulty: Medium
Improve log-probability inference of order statistics
PyMC's fast-performing sampling procedure relies on taking gradients of log-probability functions inferred from random graphs. With PyTensor (PyMC's computational backend) allowing automatic differentiation, PyMC's capability to automatically derive the log-likelihood expression of various random graphs is at the core of many of PyMC's advanced functionalities, such as pm.Censored, pm.GaussianRandomWalk, etc. Building upon recent work that allows inference for max and min operators, we would like to extend pymc.logprob.order.py to handle graphs of arbitrary order statistics of i.i.d. and, eventually, non-i.i.d. random variables. This long-term goal can be achieved with incremental progress and, for GSoC 2024, we propose the following projects:
- Add log-probability functionality for $j$-th order statistics of i.i.d. random variables (see issue #7121);
- Add log-probability functionality for maximum and minimum of non-i.i.d. random variables (see issue #7120).
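For the i.i.d. continuous case, the density of the $j$-th order statistic of $n$ variables with density $f$ and distribution function $F$ is the standard result $\frac{n!}{(j-1)!\,(n-j)!} F(x)^{j-1} (1 - F(x))^{n-j} f(x)$. The sketch below evaluates its logarithm for Exponential(1) variables using only the standard library. The function names are made up for this example and this is not the interface of the logprob submodule, which would derive such expressions symbolically from a graph.

```python
import math


def order_stat_logp(x, j, n, logpdf, logcdf, logsf):
    """Log-density of the j-th smallest of n i.i.d. continuous variables.

    Assumes x lies strictly inside the support, so that logcdf(x) and
    logsf(x) are finite.
    """
    log_coef = math.lgamma(n + 1) - math.lgamma(j) - math.lgamma(n - j + 1)
    return log_coef + (j - 1) * logcdf(x) + (n - j) * logsf(x) + logpdf(x)


# Exponential(1) building blocks, valid for x > 0.
def exp_logpdf(x):
    return -x


def exp_logcdf(x):
    return math.log(-math.expm1(-x))  # log(1 - exp(-x))


def exp_logsf(x):
    return -x


n, x = 5, 0.7
minimum = order_stat_logp(x, 1, n, exp_logpdf, exp_logcdf, exp_logsf)
maximum = order_stat_logp(x, n, n, exp_logpdf, exp_logcdf, exp_logsf)
median = order_stat_logp(x, 3, n, exp_logpdf, exp_logcdf, exp_logsf)
print(minimum, median, maximum)
```

Setting $j = 1$ or $j = n$ recovers the minimum and maximum cases. For example, the minimum of $n$ Exponential(1) variables is Exponential($n$), which gives a convenient check.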
- Larry Dong (primary)
- Ricardo Vieira
- Hours: 350
- Expected outcome: Enhancements to the logprob submodule, in particular pymc.logprob.order.py
- Skills required: Probability, Python
- Difficulty: Medium