GSoC_2025_Projects
`sbi` is participating in the 2025 Google Summer of Code (GSoC) under the NumFOCUS GSoC application. It is a good project to contribute to if you are interested in applying machine learning to solve real-world scientific problems and in working with experts in the field.
`sbi` is an open source library for simulation-based inference (SBI).

What is SBI? Many researchers use computer simulations to understand complex systems in fields like neuroscience, astrophysics, and epidemiology. Often, these simulations are so complex that it's difficult or impossible to use traditional Bayesian inference methods. `sbi` offers a way to perform Bayesian inference without needing a likelihood function, by using the simulator itself. It uses modern probabilistic machine learning methods, like normalizing flows and neural posterior estimation, to make this possible. `sbi` is built on PyTorch and is designed to be useful for both research and practical applications.
The timeline of the GSoC program is available on the GSoC website.
If you're interested in contributing to `sbi` through Google Summer of Code, a great first step is to get familiar with our project. We recommend starting by reading our contributing guide. Since `sbi` is built on PyTorch, having a basic understanding of PyTorch will be very helpful. You can also learn a lot by exploring the tutorials in the `sbi` documentation – they show how to use the package in practice.
A key part of the GSoC application process for `sbi` is making a contribution to the `sbi` codebase. This doesn't have to be a major feature; a small documentation improvement or a fix for a minor bug is perfectly fine. You can find good starting points by looking at our beginner-friendly issues. This contribution (your pull request) shows us that you can work with the code and that you are serious about contributing.
For any questions, or to discuss project ideas, please reach out to us via Discord (get access here). We're happy to help!
We've listed some project ideas below to get you started, but we're also open to your own suggestions related to SBI! If you have an idea that's not on the list, please get in touch with us on Discord before submitting a proposal. We won't consider proposals on unlisted topics from applicants who haven't discussed their idea with us first.
The ideas below are intentionally broad. Some might be too large for a single GSoC project. Your proposal should focus on a specific, manageable set of tasks that you can realistically complete within the GSoC timeframe. We have added estimates for the complexity and duration for each project, but of course this also depends on your background. Please feel free to reach out to us to discuss your application or any questions!
Project 1: Using SBI-learned likelihoods in probabilistic programming languages

- Complexity: Intermediate. Requires solid Python programming skills, a basic understanding of PyTorch, and basic knowledge of Bayesian inference concepts (likelihood, posterior, prior, MCMC, variational inference).
- Duration: 175-350h, depending on background and prior experience.
- Mentors: @janfb (@manuelgloeckler)
This project aims to connect the power of simulation-based inference (SBI) with the flexibility of probabilistic programming languages (PPLs) like PyMC and Pyro. SBI lets us perform Bayesian inference even when we don't have a traditional likelihood function – instead, we use a simulator to generate data. SBI learns an approximate likelihood from these simulations. PPLs, on the other hand, provide powerful tools for building and performing inference with complex Bayesian models, including hierarchical models. This project will bridge the gap, making it easier to use SBI-learned likelihoods within the rich modeling environment of PPLs.
Why is this important?
Many real-world problems involve hierarchical structures – for example, analyzing data from multiple individuals, each with multiple measurements, or modeling variations across different experimental conditions. PPLs excel at handling these hierarchical models. Currently, using SBI with these kinds of complex models is challenging. This project will unlock the ability to combine the strengths of both approaches: using SBI to handle the simulator, and a PPL to handle the hierarchical structure and advanced inference techniques.
Project Goals:
The core idea is to make the "synthetic" (approximate) likelihoods learned by SBI compatible with the model specification and inference engines of PPLs like Pyro. This will allow users to:
- Define a simulator (as usual in SBI).
- Train an SBI method (NLE or NRE) to learn an approximate likelihood.
- Use this learned likelihood within a Pyro model, just like any other likelihood function.
- Use this learned likelihood within a (hierarchical) Bayesian model.
- Perform inference (e.g., MCMC or Variational Inference) using the PPL's built-in tools.
Expected Output:
- Minimal Goal: A working example demonstrating inference on a simple model (e.g., inferring the mean of a Gaussian distribution) within Pyro, where the likelihood is a "synthetic likelihood" learned by SBI (a minimal sketch of this setup is shown below).
- Main Goal: A working example demonstrating inference in a multi-level hierarchical model (e.g., a model with group-level and individual-level parameters) within Pyro, using a synthetic likelihood learned by SBI. This will showcase the power of the integration for more complex, real-world scenarios.
- Stretch Goal: Provide infrastructure for defining models and samplers, and integrate this new functionality into `sbi` itself.
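To make the minimal goal concrete, here is a rough sketch of how a likelihood learned with `sbi`'s NLE could be plugged into a Pyro model via `pyro.factor`, which adds an arbitrary log-density term to the model. The `log_prob(x, condition=...)` signature and the tensor shapes follow recent `sbi` releases and may need adjusting for your version; treat this as a starting point, not a finished recipe:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS
from sbi.inference import NLE  # called SNLE in older sbi releases
from sbi.utils import BoxUniform

# Toy simulator for the minimal goal: Gaussian with unknown mean theta.
def simulator(theta):
    return theta + torch.randn_like(theta)

# 1) Learn a synthetic likelihood p(x | theta) with sbi's NLE.
prior = BoxUniform(low=-2.0 * torch.ones(1), high=2.0 * torch.ones(1))
theta_train = prior.sample((2000,))
likelihood_estimator = (
    NLE(prior=prior)
    .append_simulations(theta_train, simulator(theta_train))
    .train()
)

# 2) Use the learned likelihood inside a Pyro model; pyro.factor lets the
#    PPL treat the synthetic likelihood like any other likelihood term.
def model(x_obs):
    theta = pyro.sample("theta", dist.Uniform(-2.0, 2.0)).reshape(1, 1)
    log_lik = likelihood_estimator.log_prob(x_obs, condition=theta)
    pyro.factor("synthetic_likelihood", log_lik.sum())

# 3) Run inference with Pyro's built-in NUTS sampler.
x_obs = torch.tensor([[0.5]])
mcmc = MCMC(NUTS(model), num_samples=500, warmup_steps=200)
mcmc.run(x_obs)
theta_samples = mcmc.get_samples()["theta"]
```

The same `pyro.factor` pattern extends to the main goal: in a hierarchical model, the group- and individual-level parameters are declared with ordinary `pyro.sample` statements, and the learned likelihood is attached to each individual's observations.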
Project 2: Implementing the Simformer in `sbi`

- Complexity: High. Requires strong Python programming skills, a solid understanding of PyTorch, familiarity with deep learning concepts (e.g., neural networks, transformers, backpropagation, optimization), and basic knowledge of Bayesian inference.
- Duration: 350h, depending on background and prior experience.
- Mentors: @manuelgloeckler (@deismic)
This project focuses on implementing the Simformer algorithm, a novel approach to SBI introduced by Gloeckler et al. (2024), within the `sbi` Python package. The Simformer offers a unified, "all-in-one" framework for SBI, integrating posterior estimation, likelihood estimation, and posterior predictions. This project involves translating the theoretical concepts and mathematical formulation of the Simformer into a robust, well-tested, and user-friendly PyTorch implementation within `sbi`.
This project provides a chance to implement and contribute a state-of-the-art deep learning algorithm for simulation-based inference, gaining practical experience with Transformer architectures and contributing to a growing area of research.
Background: What is the Simformer?
Traditional SBI methods often specialize in either posterior estimation (e.g., NPE) or likelihood estimation (e.g., NLE). The Simformer breaks this dichotomy by framing SBI as a sequence modeling problem. It leverages a Transformer architecture, commonly used in natural language processing, to model the joint distribution of parameters and simulation outputs. This allows the Simformer, at inference time, to perform arbitrary conditioning of the joint distribution: e.g., condition on the observed data to approximate the posterior, or condition on the parameters to approximate the likelihood, all within a single, unified framework. It also allows applying SBI to problems with dynamic dimensionality (i.e., functional variables). The architecture is described in detail in Gloeckler et al. (2024) and in a helpful blog post (https://transferlab.ai/pills/2024/all-in-one-simulation-based-inference/). The original implementation is available in JAX at https://github.com/mackelab/simformer.
Project Goals:
- Implement the Core Simformer components in PyTorch: Translate the mathematical formulation of the Simformer (as described in the paper and blog post) into Python code using PyTorch, using the JAX implementation as a guideline. This includes:
  - Designing and implementing the Transformer architecture, specifically tailored for SBI, in PyTorch. This will likely involve adapting components from existing Transformer implementations. In particular, it should allow adjusting the attention mask to incorporate information about the dependence structure present in the task.
  - Developing efficient data loading and preprocessing routines to handle simulation data in a format suitable for the Transformer. Specifically, this requires implementing a (learnable) tokenizer that lowers all variables in the joint distribution to a token representation that can be processed by the Transformer (see the sketch after this list).
- Integrate with the `sbi` package: Ensure the PyTorch implementation of the Simformer seamlessly integrates with the existing `sbi` framework. This involves:
  - Implementing the appropriate loss functions for training the Simformer (as described in the paper). Ideally, these integrate seamlessly with the current infrastructure for training diffusion models within the `sbi` package.
  - Creating a user-friendly API for training and using the Simformer, consistent with other `sbi` inference methods.
  - Writing comprehensive unit tests to ensure the correctness and robustness of the implementation.
- Demonstrate Functionality and Performance:
  - Create tutorial notebooks showcasing how to use the Simformer for various inference tasks (parameter estimation, etc.).
  - Compare the performance of the Simformer to existing SBI methods (e.g., NPE, NLE, NRE) using the `mini-sbibm` benchmark available in `sbi`. This will demonstrate the advantages and potential limitations of the Simformer in different scenarios.
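The following is a hypothetical, minimal PyTorch sketch (names and architecture are illustrative, not the actual Simformer) of the two ingredients above: a tokenizer that lifts each variable of the joint to a token carrying its value, its identity, and an "observed?" flag, plus a standard Transformer encoder whose attention mask can encode the task's dependence structure:

```python
import torch
import torch.nn as nn

class ToyTokenizer(nn.Module):
    """Each scalar variable of the joint (theta, x) becomes one token:
    value embedding + variable-identity embedding + embedding of a flag
    marking whether the variable is conditioned on (hypothetical sketch)."""

    def __init__(self, num_variables: int, dim: int):
        super().__init__()
        self.value_proj = nn.Linear(1, dim)
        self.id_embed = nn.Embedding(num_variables, dim)
        self.cond_embed = nn.Embedding(2, dim)  # 0 = latent, 1 = observed

    def forward(self, values, condition_mask):
        # values: (batch, num_variables); condition_mask: same shape, bool.
        ids = torch.arange(values.shape[1], device=values.device)
        return (
            self.value_proj(values.unsqueeze(-1))
            + self.id_embed(ids)
            + self.cond_embed(condition_mask.long())
        )  # -> (batch, num_variables, dim)

dim, num_vars = 32, 3  # e.g. one parameter and two data dimensions
tokenizer = ToyTokenizer(num_vars, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)

values = torch.randn(8, num_vars)
# Conditioning on the two data dimensions targets the posterior; flipping
# the mask to condition on the parameter would target the likelihood.
condition_mask = torch.tensor([False, True, True]).expand(8, num_vars)
# Boolean attention mask (True = "may not attend"). All-False is fully
# connected; known independencies in the simulator would be encoded here.
attn_mask = torch.zeros(num_vars, num_vars, dtype=torch.bool)
tokens = tokenizer(values, condition_mask)
out = encoder(tokens, mask=attn_mask)  # (8, num_vars, dim)
```

In the actual Simformer, such token representations feed a score-based diffusion model over the latent entries; this sketch only illustrates the tokenization and masking interfaces the project would need to design.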
Expected Output:
- MVP: A working PyTorch implementation of the core Simformer algorithm, integrated with `sbi`, which can be trained on a simple benchmark problem (e.g., inferring the parameters of a Gaussian distribution).
- Main Goal: A fully functional and well-tested PyTorch-based Simformer implementation within `sbi`, with a user-friendly API, comprehensive documentation, and example notebooks demonstrating its use on various inference tasks and comparing its performance to other SBI methods.
- Stretch Goal: Explore and implement extensions to the Simformer: improving training stability and efficiency with alternative training objectives, supporting joint distributions with dynamic dimension, and building flexible, structured tokenizers that allow assigning embedding nets to variables of a certain data type (e.g., a CNN for an image), compressing each to a single token (or a few tokens).
Project 3: From stringly typed to strongly typed arguments

- Complexity: Intermediate. Requires strong Python programming skills, e.g., understanding of Python data structures like dictionaries and lists, familiarity with object-oriented programming, and understanding of Python type hints; basic knowledge of Bayesian inference.
- Duration: 175-350h, depending on background and prior experience.
- Mentors: @janosg (@janfb)
This project aims to improve the robustness and developer experience of the `sbi` package by transitioning from "stringly typed" to "strongly typed" arguments in key functions and classes (see this blog post or this talk for details). Currently, `sbi` often uses strings to specify options, particularly for algorithm choices (e.g., `density_estimator="maf"`) and configuration dictionaries (e.g., `mcmc_parameters: Dict[str, Any]`). While flexible, this approach is prone to errors like typos, lacks autocompletion in IDEs, and makes it harder to discover available options. This project will introduce a more structured and type-safe way to specify these options, inspired by techniques used in Rust and outlined in the optimagic enhancement proposal (https://optimagic.readthedocs.io/en/latest/development/ep-02-typing.html).
This project offers a hands-on opportunity to learn and apply best practices in modern Python development, focusing on type safety and API design, skills that are highly valuable in any software engineering role.
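For illustration, the current stringly typed pattern looks roughly like this (the call mirrors recent `sbi` releases; exactly where and when the typo surfaces depends on the version):

```python
import torch
from sbi.inference import NPE  # called SNPE in older sbi releases
from sbi.utils import BoxUniform

prior = BoxUniform(low=-torch.ones(2), high=torch.ones(2))

# The typo "mfa" (meant "maf") is accepted at the call site; it can only
# fail later, once the string is dispatched on, and the IDE offers no
# autocompletion or checking for the set of valid values.
inference = NPE(prior=prior, density_estimator="mfa")
```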
Project Goals:
- Identify Key Areas for Improvement: Analyze the `sbi` codebase to identify functions and classes where stringly typed arguments are prevalent and could be replaced with stronger typing. This includes, but is not limited to:
  - `density_estimator` arguments in inference methods.
  - `mcmc_method` arguments.
  - Configuration dictionaries like `mcmc_parameters`.
- Implement Strong Typing: Replace string-based options with more robust alternatives, such as:
  - Enums: For choices with a fixed set of valid options (e.g., different density estimators or MCMC methods). This provides autocompletion and prevents typos, following the pattern discussed in the blog post linked above.
  - Pydantic models (or dataclasses): For configuration dictionaries, replacing `Dict[str, Any]` with structured classes that define the expected fields and types. This provides validation and autocompletion for configuration parameters and is inspired by the optimagic enhancement proposal (see the sketch after this list).
- Update Documentation and Tests: Thoroughly update the documentation and unit tests to reflect the changes in the API. Ensure backward compatibility where possible, or provide clear deprecation warnings and migration instructions.
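As a sketch of the target pattern (all names below are illustrative, not `sbi`'s actual API): an enum for a fixed set of estimator choices and a dataclass replacing a `Dict[str, Any]` configuration. Inheriting the enum from `str` keeps plain strings working during a deprecation period:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class DensityEstimatorChoice(str, Enum):
    """Fixed set of options: IDEs autocomplete, typos fail at call time."""
    MAF = "maf"
    NSF = "nsf"
    MDN = "mdn"

@dataclass
class MCMCParameters:
    """Structured replacement for mcmc_parameters: Dict[str, Any]."""
    num_chains: int = 1
    warmup_steps: int = 200
    thin: int = 1

def train(
    density_estimator: DensityEstimatorChoice = DensityEstimatorChoice.MAF,
    mcmc_parameters: Optional[MCMCParameters] = None,
) -> None:
    mcmc_parameters = mcmc_parameters or MCMCParameters()
    # str-backed enums accept legacy strings: DensityEstimatorChoice("maf")
    # is DensityEstimatorChoice.MAF, while DensityEstimatorChoice("mfa")
    # raises ValueError immediately at the call site.
    density_estimator = DensityEstimatorChoice(density_estimator)

# Typed call sites get validation and autocompletion:
train(
    density_estimator=DensityEstimatorChoice.NSF,
    mcmc_parameters=MCMCParameters(num_chains=4),
)
```

A Pydantic `BaseModel` would additionally coerce and validate field types at construction time; plain dataclasses keep the dependency footprint smaller.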
Expected Output:
- MVP: A refactored version of at least one key function (e.g., a function accepting a `density_estimator` argument) that uses enums instead of strings for option selection. This should include updated documentation and tests.
- Main Goal: A significant portion of the `sbi` codebase refactored to use enums and Pydantic models (or dataclasses) for argument specification, leading to improved type safety, better developer experience, and reduced risk of user errors. This should include comprehensive documentation and updated unit tests.
- Stretch Goal: Explore generating documentation automatically from the typed definitions.