
Google Summer of Code with SBI

Introduction

sbi is participating in the 2025 Google Summer of Code (GSoC) under the NumFOCUS GSoC application. It is a good project to contribute to if you are interested in applying machine learning to real-world scientific problems and in working with experts in the field.

sbi is an open source library for simulation-based inference (SBI). What is SBI? Many researchers use computer simulations to understand complex systems, in fields like neuroscience, astrophysics, and epidemiology. Often, these simulations are so complex that it's difficult or impossible to use traditional Bayesian inference methods. sbi offers a way to perform Bayesian inference without needing a likelihood function, by using the simulator itself. It uses modern probabilistic machine learning methods, like normalizing flows and neural posterior estimation, to make this possible. sbi is built on PyTorch and is designed to be useful for both research and practical applications.

Timeline

The timeline of the GSoC internships is available at the GSoC website.

How to apply

If you're interested in contributing to sbi through Google Summer of Code, a great first step is to get familiar with our project. We recommend starting by reading our contributing guide. Since sbi is built on PyTorch, having a basic understanding of PyTorch will be very helpful. You can also learn a lot by exploring our tutorials in the sbi documentation – they show how to use the package in practice.

A key part of the GSoC application process for sbi is making a contribution to the sbi codebase. This doesn't have to be a major feature; a small documentation improvement or a fix for a minor bug is perfectly fine. You can find good starting points by looking at our beginner-friendly issues. This contribution (your pull request) shows us that you can work with the code and are serious about contributing.

For any questions, or to discuss project ideas, please reach out to us via Discord (get access here). We're happy to help!

Project ideas

We've listed some project ideas below to get you started, but we're also open to your own suggestions related to SBI! If you have an idea that's not on the list, please get in touch with us on Discord before submitting a proposal. We won't consider proposals on unlisted topics from applicants who haven't discussed their idea with us first.

The ideas below are intentionally broad. Some might be too large for a single GSoC project. Your proposal should focus on a specific, manageable set of tasks that you can realistically complete within the GSoC timeframe. We have added estimates for the complexity and duration for each project, but of course this also depends on your background. Please feel free to reach out to us to discuss your application or any questions!

1) Bridging Simulation-Based Inference (SBI) and Probabilistic Programming

Project Overview

  • Complexity: Intermediate. Requires solid Python programming skills, a basic understanding of PyTorch, and basic knowledge of Bayesian inference concepts (likelihood, posterior, prior, MCMC, variational inference).
  • Duration: 175-350h, depending on background and prior experience.
  • Mentors: @janfb (@manuelgloeckler)

This project aims to connect the power of simulation-based inference (SBI) with the flexibility of probabilistic programming languages (PPLs) like PyMC and Pyro. SBI lets us perform Bayesian inference even when we don't have a traditional likelihood function – instead, we use a simulator to generate data. SBI learns an approximate likelihood from these simulations. PPLs, on the other hand, provide powerful tools for building and performing inference with complex Bayesian models, including hierarchical models. This project will bridge the gap, making it easier to use SBI-learned likelihoods within the rich modeling environment of PPLs.

Why is this important?

Many real-world problems involve hierarchical structures – for example, analyzing data from multiple individuals, each with multiple measurements, or modeling variations across different experimental conditions. PPLs excel at handling these hierarchical models. Currently, using SBI with these kinds of complex models is challenging. This project will unlock the ability to combine the strengths of both approaches: using SBI to handle the simulator, and a PPL to handle the hierarchical structure and advanced inference techniques.

Project Goals:

The core idea is to make the "synthetic" (approximate) likelihoods learned by SBI compatible with the model specifications and inference engines of PPLs like Pyro. This will allow users to:

  1. Define a simulator (as usual in SBI).
  2. Train an SBI method (e.g., NLE or NRE) to learn an approximate likelihood.
  3. Use this learned likelihood within a Pyro model, just like any other likelihood function.
  4. Use this learned likelihood within a (hierarchical) Bayesian model.
  5. Perform inference (e.g., MCMC or variational inference) using the PPL's built-in tools (a minimal sketch of this workflow follows below).
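
To make this concrete, here is a minimal, hedged sketch of steps 1-3 and 5. It assumes sbi's NLE interface and that the trained density estimator exposes a log_prob(x, condition=theta) method; exact class names and call signatures vary across sbi versions, so treat this as an illustration rather than the final API.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS
from sbi.inference import NLE
from sbi.utils import BoxUniform

# 1) A toy simulator: Gaussian with unknown mean.
def simulator(theta):
    return theta + torch.randn_like(theta)

prior = BoxUniform(low=-3 * torch.ones(1), high=3 * torch.ones(1))
theta = prior.sample((2000,))
x = simulator(theta)

# 2) Train a neural likelihood estimator on (theta, x) pairs.
nle = NLE(prior=prior)
density_estimator = nle.append_simulations(theta, x).train()

# 3) Wrap the learned likelihood as a factor in a Pyro model.
x_obs = torch.tensor([[1.5]])

def pyro_model():
    theta = pyro.sample("theta", dist.Uniform(-3.0, 3.0))
    # Synthetic log-likelihood from the trained estimator (assumed API).
    log_lik = density_estimator.log_prob(x_obs, condition=theta.reshape(1, 1))
    pyro.factor("sbi_likelihood", log_lik.sum())

# 5) Run Pyro's NUTS sampler on the combined model.
mcmc = MCMC(NUTS(pyro_model), num_samples=500, warmup_steps=200)
mcmc.run()
posterior_samples = mcmc.get_samples()["theta"]
```

The key design question of the project is hidden in step 3: how to expose the learned estimator so that Pyro can differentiate through its log-probability and treat it like any native likelihood.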

Expected Output:

  • Minimal Goal: A working example demonstrating inference on a simple model (e.g., inferring the mean of a Gaussian distribution) within Pyro, where the likelihood is a "synthetic likelihood" learned by SBI.
  • Main Goal: A working example demonstrating inference in a multi-level hierarchical model (e.g., a model with group-level and individual-level parameters) within Pyro, using a synthetic likelihood learned by SBI (see the hierarchical sketch after this list). This will showcase the power of the integration for more complex, real-world scenarios.
  • Stretch goal: Provide infrastructure for defining models and samplers, and integrate this new functionality into sbi itself.
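
As an illustration of the main goal, the following hedged sketch extends the example above to a two-level hierarchy, reusing the density_estimator trained there. The plate and shape handling is schematic and would need care in a real implementation.

```python
import torch
import pyro
import pyro.distributions as dist

x_obs = torch.randn(10, 1)  # one (toy) observation per individual

def hierarchical_model():
    # Group-level parameter shared across individuals.
    mu_group = pyro.sample("mu_group", dist.Normal(0.0, 3.0))
    with pyro.plate("individuals", x_obs.shape[0]):
        # Individual-level parameters drawn around the group mean.
        theta_i = pyro.sample("theta_i", dist.Normal(mu_group, 1.0))
        # Each observation is scored by the SBI-learned likelihood
        # (density_estimator from the previous sketch; assumed API).
        log_lik = density_estimator.log_prob(
            x_obs, condition=theta_i.reshape(-1, 1)
        )
        pyro.factor("sbi_likelihood", log_lik)
```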

2) Implementing the Simformer Algorithm for Simulation-Based Inference

Project Overview

  • Complexity: High. Requires strong Python programming skills, solid understanding of PyTorch, familiarity with deep learning concepts (e.g., neural networks, transformers, backpropagation, optimization), basic knowledge of Bayesian inference.
  • Duration: 350h, depending on background and prior experience.
  • Mentors: @manuelgloeckler (@deismic)

This project focuses on implementing the Simformer algorithm, a novel approach to SBI introduced by Gloeckler et al. (2024), within the sbi Python package. The Simformer offers a unified, "all-in-one" framework for SBI, integrating posterior estimation, likelihood estimation, and posterior predictions. This project involves translating the theoretical concepts and mathematical formulation of the Simformer into a robust, well-tested, and user-friendly PyTorch implementation within sbi.

This project provides a chance to implement and contribute a state-of-the-art deep learning algorithm for simulation-based inference, gaining practical experience with Transformer architectures and contributing to a growing area of research.

Background: What is the Simformer?

Traditional SBI methods often specialize in either posterior estimation (e.g., NPE) or likelihood estimation (e.g., NLE). The Simformer breaks this dichotomy by framing SBI as a sequence modeling problem. It leverages a Transformer architecture, commonly used in natural language processing, to model the joint distribution of parameters and simulation outputs. This allows the Simformer to perform arbitrary conditioning of the joint at inference time, e.g., conditioning on observed data to approximate the posterior, or on parameters to approximate the likelihood, all within a single, unified framework. It also allows applying sbi to problems with dynamic dimensionality (e.g., functional variables). The architecture is described in detail in Gloeckler et al. (2024) and in a helpful blog post (https://transferlab.ai/pills/2024/all-in-one-simulation-based-inference/). The original implementation is available in JAX at https://github.com/mackelab/simformer.
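
To give a flavor of the core idea, here is a hedged PyTorch sketch of the tokenization step: each variable in the joint (parameters and data) becomes one token built from its value, its identity, and a flag marking it as latent or conditioned. All names here are illustrative, not the simformer or sbi API.

```python
import torch
import torch.nn as nn

class JointTokenEmbedding(nn.Module):
    """Embed (value, variable id, condition flag) into one token per variable."""

    def __init__(self, num_vars: int, dim: int = 64):
        super().__init__()
        self.value_proj = nn.Linear(1, dim)           # lift scalar values
        self.var_embed = nn.Embedding(num_vars, dim)  # which variable this is
        self.cond_embed = nn.Embedding(2, dim)        # latent (0) vs. observed (1)

    def forward(self, values, condition_mask):
        # values: (batch, num_vars); condition_mask: (batch, num_vars), bool
        ids = torch.arange(values.shape[1], device=values.device)
        return (
            self.value_proj(values.unsqueeze(-1))
            + self.var_embed(ids)
            + self.cond_embed(condition_mask.long())
        )

# A standard Transformer encoder then operates on these tokens; changing the
# condition mask at inference time switches between posterior and likelihood.
embed = JointTokenEmbedding(num_vars=5)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
values = torch.randn(8, 5)
cond = torch.tensor([[0, 0, 1, 1, 1]] * 8, dtype=torch.bool)  # condition on data
tokens = encoder(embed(values, cond))  # (8, 5, 64), input to a score/density head
```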

Project Goals:

  1. Implement the Core Simformer components in PyTorch: Translate the mathematical formulation of the Simformer (as described in the paper and blog post) into Python code using PyTorch, using the JAX implementation as a guideline. This includes:

    • Designing and implementing the Transformer architecture, specifically tailored for SBI, in PyTorch. This will likely involve adapting components from existing Transformer implementations. In particular, it should allow adjusting the attention mask to incorporate information about the dependence structure of the task (see the mask sketch after this list).
    • Developing efficient data loading and preprocessing routines to handle simulation data in a format suitable for the Transformer. Specifically, this requires implementing a (learnable) tokenizer that maps all variables in the joint distribution to token representations that the Transformer can process.
  2. Integrate with the sbi Package: Ensure the PyTorch implementation of the Simformer seamlessly integrates with the existing sbi framework. This involves:

    • Implementing the appropriate loss functions for training the Simformer (as described in the paper). Ideally, these should integrate seamlessly with the current infrastructure for training diffusion models within the sbi package.
    • Creating a user-friendly API for training and using the Simformer, consistent with other sbi inference methods.
    • Writing comprehensive unit tests to ensure the correctness and robustness of the implementation.
  3. Demonstrate Functionality and Performance:

    • Create tutorial notebooks showcasing how to use the Simformer for various inference tasks (parameter estimation, etc.).
    • Compare the performance of the Simformer to existing SBI methods (e.g., NPE, NLE, NRE) using the mini-sbibm benchmark available in sbi. This will demonstrate the advantages and potential limitations of the Simformer in different scenarios.
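
As a companion to goal 1, the following hedged sketch shows how a known dependence structure could be turned into a Transformer attention mask. The helper name and the masking convention are assumptions for illustration, not the simformer implementation.

```python
import torch

def attention_mask_from_graph(adjacency: torch.Tensor) -> torch.Tensor:
    """Build an attention mask from a dependency graph (illustrative helper).

    adjacency[i, j] = True means variable i directly influences variable j.
    Token j is allowed to attend to token i only if i influences j or i == j.
    """
    num_vars = adjacency.shape[0]
    allowed = adjacency.T | torch.eye(num_vars, dtype=torch.bool)
    # PyTorch boolean attention masks mark *blocked* positions with True.
    return ~allowed

# Example: theta -> x1 and theta -> x2 (variables ordered theta, x1, x2).
adj = torch.tensor([
    [False, True, True],
    [False, False, False],
    [False, False, False],
])
mask = attention_mask_from_graph(adj)  # pass as `mask=` to a TransformerEncoder
```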

Expected Output:

  • MVP: A working PyTorch implementation of the core Simformer algorithm, integrated with sbi, which can be trained on a simple benchmark problem (e.g., inferring parameters of a Gaussian distribution).
  • Main Goal: A fully functional and well-tested PyTorch-based Simformer implementation within sbi, with a user-friendly API, comprehensive documentation, and example notebooks demonstrating its use on various inference tasks and comparing its performance to other SBI methods.
  • Stretch Goal: Explore and implement extensions to the Simformer, e.g., improving training stability/efficiency with alternative training objectives, supporting joint distributions with dynamic dimension, and adding flexible, structured tokenizers that allow assigning embedding nets to variables of a certain data type (e.g., a CNN for an image), compressing each to a single token (or a few tokens).

3) From "Stringly Typed" to "Strongly Typed" Arguments in sbi

Project Overview

  • Complexity: Intermediate. Requires strong Python programming skills, e.g., an understanding of Python data structures like dictionaries and lists, familiarity with object-oriented programming, and an understanding of Python type hints.
  • Duration: 175-350h, depending on background and prior experience.
  • Mentors: @janosg (@janfb)

This project aims to improve the robustness and developer experience of the sbi package by transitioning from "stringly typed" arguments to "strongly typed" arguments in key functions and classes (see this blog post or this talk for details). Currently, sbi often uses strings to specify options, particularly for algorithm choices (e.g., density_estimator="maf") and configuration dictionaries (e.g., mcmc_parameters: Dict[str, Any]). While flexible, this approach is prone to errors like typos, lacks autocompletion in IDEs, and makes it harder to discover available options. This project will introduce a more structured and type-safe way to specify these options, inspired by techniques used in Rust and outlined in the optimagic enhancement proposal (https://optimagic.readthedocs.io/en/latest/development/ep-02-typing.html).

This project offers a hands-on opportunity to learn and apply best practices in modern Python development, focusing on type safety and API design, skills that are highly valuable in any software engineering role.

Project Goals:

  1. Identify Key Areas for Improvement: Analyze the sbi codebase to identify functions and classes where stringly typed arguments are prevalent and could be replaced with stronger typing. This includes, but is not limited to:

    • density_estimator arguments in inference methods.
    • mcmc_method arguments.
    • Configuration dictionaries like mcmc_parameters.
  2. Implement Strong Typing: Replace string-based options with more robust alternatives (sketched after this list), such as:

    • Enums: For choices with a fixed set of valid options (e.g., different density estimators or MCMC methods). This provides autocompletion and prevents typos. This follows the pattern discussed in the provided blog post.
    • Pydantic Models (or Dataclasses): For configuration dictionaries, replacing Dict[str, Any] with structured classes that define the expected fields and types. This provides validation and autocompletion for configuration parameters. This is inspired by the optimagic enhancement proposal.
  3. Update Documentation and Tests: Thoroughly update the documentation and unit tests to reflect the changes in the API. Ensure backward compatibility where possible, or provide clear deprecation warnings and migration instructions.
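
To illustrate goal 2, here is a hedged sketch of what the refactor could look like; DensityEstimatorChoice, MCMCConfig, and build_posterior are hypothetical names, not the current sbi API.

```python
from dataclasses import dataclass
from enum import Enum

class DensityEstimatorChoice(Enum):
    """Fixed set of valid density estimators (illustrative)."""
    MAF = "maf"
    NSF = "nsf"
    MDN = "mdn"

@dataclass
class MCMCConfig:
    """Structured replacement for an mcmc_parameters dict (illustrative)."""
    method: str = "slice_np"
    num_chains: int = 1
    thin: int = 10
    warmup_steps: int = 200

def build_posterior(
    density_estimator: DensityEstimatorChoice = DensityEstimatorChoice.MAF,
    mcmc: MCMCConfig | None = None,
) -> None:
    mcmc = mcmc or MCMCConfig()
    # IDEs can autocomplete the options, and a typo such as
    # DensityEstimatorChoice.MFA fails immediately with an AttributeError,
    # whereas density_estimator="mfa" would only fail deep inside training.
    print(f"Using {density_estimator.value} with {mcmc.num_chains} chains")

build_posterior(DensityEstimatorChoice.NSF, MCMCConfig(num_chains=4))
```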

Expected Output:

  • MVP: A refactored version of at least one key function (e.g., a function accepting a density_estimator argument) that uses enums instead of strings for option selection. This should include updated documentation and tests.
  • Main Goal: A significant portion of the sbi codebase refactored to use enums and Pydantic models (or dataclasses) for argument specification, leading to improved type safety, better developer experience, and reduced risk of user errors. This should include comprehensive documentation and updated unit tests.
  • Stretch Goal: Explore generating documentation automatically from the typed definitions.