GSoC 2025 Projects
AiiDA is a Python framework for managing computational science workflows, with roots in computational materials science. It helps researchers manage large numbers of simulations (10k, 100k, 1M, ...) and complex workflows involving multiple executables. At the same time, it records the provenance of the entire simulation pipeline with the aim of making it fully reproducible.
AiiDA is used in research projects at universities, research institutes and companies (see SciPy 2020 talk, SciPy 2022 talk, publications, and testimonials).
- Help accelerate the transition to open (computational) science
- Help fix the reproducibility crisis. Computational science is a good place to start.
- Work with a team of computational scientists (mostly physics backgrounds) who are passionate about both science and coding.
- We have an active Discourse community & biweekly developer meetings.
A background in materials science is not needed, but a basic interest in materials science topics will make things easier for you.
To be considered as a GSoC student, we ask you to make a small pull request to `aiida-core` or to any active repository in the aiidateam and aiidalab organizations. This could be a simple bug fix, a documentation improvement, etc. See e.g. (for `aiida-core`)
Say hi on our GSOC 2025 topic on Discourse.
level Mid
Expected Size 350h
Summary:
A significant portion of AiiDA’s source code is currently excluded from `mypy` checks, leading to inconsistent or missing type annotations across different modules. The goal of this project is to systematically add and refine type annotations throughout AiiDA’s codebase, ensuring that all functions, classes, and methods are correctly typed. Standardizing type annotations will improve code readability, maintainability, and reduce future bugs.
- Comprehensive and consistent type annotations across all AiiDA modules.
- Removal of unnecessary exclusions from `mypy` checks.
- Improved developer experience and reduced ambiguity when extending or refactoring AiiDA’s code.
- Basic familiarity with Python.
- Understanding of typing in Python (e.g., `typing` module, type hints, generics).
- Familiarity with static type checking tools (e.g., `mypy`) is helpful but not strictly required, as it can be learned during the project.
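As a rough illustration of the kind of annotations this project involves, the sketch below types a small generic container and a helper function. The names `Registry` and `total_walltime` are hypothetical and do not come from the AiiDA codebase; they only show how type hints, `Optional`, and generics from the `typing` module fit together so that a checker such as `mypy` can verify call sites.

```python
from typing import Dict, Generic, Iterable, Optional, TypeVar

T = TypeVar("T")


class Registry(Generic[T]):
    """Hypothetical container mapping string labels to values of one type."""

    def __init__(self) -> None:
        self._items: Dict[str, T] = {}

    def add(self, label: str, item: T) -> None:
        """Store an item under the given label."""
        self._items[label] = item

    def get(self, label: str) -> Optional[T]:
        # Optional makes the "label not found" case explicit to the type checker.
        return self._items.get(label)


def total_walltime(durations: Iterable[float], scale: float = 1.0) -> float:
    """Sum durations (in seconds) and multiply the result by `scale`."""
    return sum(durations) * scale


if __name__ == "__main__":
    registry = Registry[int]()
    registry.add("n_jobs", 3)
    print(registry.get("n_jobs"), total_walltime([1.5, 2.5], scale=2.0))
```

With annotations like these in place, a call such as `registry.add("n_jobs", "three")` would be reported by the type checker before the code is ever run.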
level Advanced
Expected Size 350h
AiiDA defines running jobs as AiiDA processes, which are state machines whose states are stored on the file system, allowing them to recover after reboots and continue executing remote jobs. It uses Python’s asyncio library to handle I/O-bound tasks, such as submitting jobs and waiting for remote execution. We avoid exposing the asynchronous syntax by providing only synchronous entry points when launching jobs, shielding end users from the complexities of asynchronous programming. However, synchronous Python functions within AiiDA run in a blocking manner within the event loop of the main thread. Enabling multithreading would significantly boost AiiDA’s throughput by allowing these blocking processes to run concurrently.
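The blocking behaviour described above can be reproduced with a minimal sketch that uses only the standard library. It is not AiiDA's actual process machinery and does not involve `greenlet`; `blocking_task` is a hypothetical stand-in for a synchronous function, and the default thread pool of `loop.run_in_executor` is just one assumed way of moving such calls off the event loop.

```python
import asyncio
import time


def blocking_task(seconds: float) -> float:
    """Stand-in for a synchronous function that blocks its thread."""
    time.sleep(seconds)
    return seconds


async def run_in_loop(n: int, seconds: float) -> float:
    """Call the blocking function directly: each call stalls the event loop."""
    start = time.perf_counter()
    for _ in range(n):
        blocking_task(seconds)
    return time.perf_counter() - start


async def run_in_threads(n: int, seconds: float) -> float:
    """Offload each call to a worker thread so that the calls can overlap."""
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    # Passing None selects the loop's default ThreadPoolExecutor.
    futures = [loop.run_in_executor(None, blocking_task, seconds) for _ in range(n)]
    await asyncio.gather(*futures)
    return time.perf_counter() - start


if __name__ == "__main__":
    serial = asyncio.run(run_in_loop(3, 0.2))
    threaded = asyncio.run(run_in_threads(3, 0.2))
    print(f"in event loop: {serial:.2f}s, in threads: {threaded:.2f}s")
```

In the first variant the elapsed time grows with the number of calls, because nothing else can run on the loop while a call sleeps. In the second, the calls overlap in worker threads and the loop stays free for other coroutines.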
- Streamline event loop management.
- Deprecate `nest-asyncio` in favor of `greenlet`.
- Enable synchronous function execution in newly spawned threads.
- Familiarity with Python
- Experience with asynchronous and parallel programming
level Mid / Advanced
Expected Size 350h
One of the most powerful aspects of using AiiDA to run your workflows is that the automatically generated provenance can be used to flexibly query for the data that the user is interested in. However, while the `QueryBuilder` provides this flexibility, it can be challenging to learn and even time-consuming for experienced users.
To address this issue, this project aims to develop a tool utilizing large language models (LLMs) to generate queries from natural language prompts. Additionally, LLM-based code generation can be useful for broader AiiDA applications, so the project may optionally be extended to support general AiiDA code generation.
While existing LLMs like ChatGPT already perform fairly well, their generated code is often incorrect or outdated. By creating a diverse and maintainable dataset of query prompts and corresponding Python code, and either fine-tuning a dedicated LLM or designing carefully engineered prompts, we aim to improve the accuracy of generated queries, providing users with a powerful tool for extracting relevant results more effectively. The student will have the opportunity to actively participate in decision-making to determine the best approach.
By the end of this project, we aim to have a lightweight tool that can reliably generate a correct `QueryBuilder` instance from a user prompt.
This will require:
- A database that maps natural language prompts to the corresponding queries, which can be easily and incrementally expanded.
- An LLM trained on this database that converts prompts into a `QueryBuilder` instance, and optionally into more general AiiDA code.
- A user-friendly interface that can be installed locally in the form of a Python package and optionally an online tool integrated with the Materials Cloud.
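One possible shape for the prompt-to-query database and the prompt-engineering route is sketched below. It assumes a JSON Lines file as storage and a plain few-shot prompt; the names `PromptExample`, `append_example`, `load_examples`, and `build_few_shot_prompt` are hypothetical, and the stored `QueryBuilder` snippets are illustrative strings that are not executed here. The actual design is left open by the project description.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class PromptExample:
    """One dataset entry: a natural language request and the matching code."""

    prompt: str
    code: str


def append_example(path: Path, example: PromptExample) -> None:
    """Append one entry as a JSON line, so the dataset can grow incrementally."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(example)) + "\n")


def load_examples(path: Path) -> List[PromptExample]:
    """Read all entries back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [PromptExample(**json.loads(line)) for line in handle if line.strip()]


def build_few_shot_prompt(examples: List[PromptExample], question: str) -> str:
    """Assemble a few-shot prompt that ends where the model should continue."""
    parts = ["Translate the request into AiiDA QueryBuilder code."]
    for example in examples:
        parts.append(f"Request: {example.prompt}\nCode:\n{example.code}")
    parts.append(f"Request: {question}\nCode:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dataset = Path(tmp) / "examples.jsonl"
        append_example(
            dataset,
            PromptExample(
                prompt="List the UUIDs of all structures.",
                # Illustrative snippet stored as text; not run by this sketch.
                code="qb = QueryBuilder()\nqb.append(StructureData, project=['uuid'])",
            ),
        )
        prompt = build_few_shot_prompt(load_examples(dataset), "Count all calculation jobs.")
        print(prompt)
```

Keeping each entry on its own line means new prompt/query pairs can be added or reviewed individually, and the same file could serve either as few-shot context or as fine-tuning data.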
- We expect you to be familiar with object-oriented programming in Python.
- You need to have experience in working with large language models.
- Additionally, a first look into AI engineering is required.
This project poses an exciting challenge for both students and mentors. While the AiiDA team may not have extensive experience with LLMs, we eagerly anticipate students bringing their knowledge to the table. We are ready to provide expertise for the actual queries and more, making this collaboration a dynamic and enriching opportunity.
If you're already familiar with AiiDA and have your own idea on how to improve it, we're happy to consider it (you may also want to check the development roadmap for further interesting project ideas). In this case, please think about the steps you would take to attack the problem and contact us in advance so that we can draw up a rough work plan.
Please use the GSOC 2025 topic on Discourse
- The timeline