GSoC 2020 Projects
AiiDA is a Python framework for managing computational science workflows, with roots in computational materials science. It helps researchers manage large numbers of simulations (1k, 10k, 100k, ...) and complex workflows involving multiple executables. At the same time, it records the provenance of the entire simulation pipeline with the aim of making it fully reproducible.
AiiDA is used in research projects at universities, research institutes and companies (examples of recent works using AiiDA are [1] [2] [3]).
To be considered as a GSoC student, we ask you to make a small pull request to aiida-core; it could be a simple bug fix, a documentation improvement, etc. See e.g.
- Help accelerate the transition to open (computational) science
- Contribute to fixing the reproducibility crisis. Computational science is a good place to start.
- Work with a team of computational scientists (mostly physicists) who are passionate about both science and coding
A background in materials science is not needed, but a basic interest in materials science topics will make things easier for you.
Level: intermediate
AiiDA has its own front-end Object-Relational Mapper (ORM) that maps Python objects to the corresponding records in the (PostgreSQL) database. This ORM allows users to create and manage objects (e.g. AiiDA nodes in the provenance graph) through the AiiDA Python API.
While an ORM provides a useful abstraction for the user, it adds overhead that can become a bottleneck when operating on large numbers of objects at once. For example, the export/import functionality in AiiDA, which allows users to export (parts of) a provenance graph and import it into another database, requires dealing with lots of database objects in one go.
The goal of this project is to speed up these processes by implementing a generic ORM API for bulk object creation that works with both low-level ORM backends supported by AiiDA (the Django and SqlAlchemy libraries).
This project will
- implement bulk-insertion functionality in the AiiDA ORM to make data import as efficient as possible
- make the data import implementation independent of the low-level ORM backend (Django / SQLAlchemy) by passing through AiiDA's front-end ORM (see the sketch below)
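As a rough illustration (not AiiDA's actual API), the sketch below shows the kind of single-statement bulk insertion that such a front-end call could delegate to on the SQLAlchemy backend; the `DbNode` model and the inserted rows are simplified placeholders.

```python
# Minimal sketch (not AiiDA's actual API): how a backend-agnostic bulk-insert
# call might map onto SQLAlchemy, one of the two low-level backends.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class DbNode(Base):
    """Simplified stand-in for AiiDA's node table."""
    __tablename__ = 'db_dbnode'
    id = Column(Integer, primary_key=True)
    node_type = Column(String)
    label = Column(String)

engine = create_engine('sqlite://')  # a real AiiDA profile uses PostgreSQL
Base.metadata.create_all(engine)

# Insert many rows in one statement instead of creating, storing and
# committing one ORM object at a time.
rows = [{'node_type': 'data.dict.Dict.', 'label': f'node-{i}'} for i in range(10_000)]

with Session(engine) as session:
    session.bulk_insert_mappings(DbNode, rows)
    session.commit()
```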
This project will require the participant to work with AiiDA's ORM, so an understanding of Object-Relational Mappers is important. AiiDA's ORM is implemented with two different libraries, Django and SQLAlchemy, so previous experience with these is desirable but not required. Finally, AiiDA uses PostgreSQL as its Relational Database Management System (RDBMS), so basic knowledge of SQL-type databases will be of benefit.
Level: intermediate
AiiDA stores all calculation executions (including detailed information on inputs and outputs) in the form of a directed acyclic graph, where each calculation is a node linked to the data nodes representing its inputs and the outputs it created. Outputs, in turn, can be inputs of new calculations. This graph is generated automatically by AiiDA; by inspecting all the "ancestors" of a given data node in the graph, we have a complete description of the "provenance" of that data node, i.e. the full sequence of calculations (with their inputs) that led to its generation.
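As a minimal sketch of this idea (assuming a configured AiiDA profile; the node PK is a placeholder), the provenance of a result can be reconstructed by recursively following the incoming links of its node:

```python
# Minimal sketch: walk all ancestors of a data node by following incoming
# links, i.e. reconstruct its provenance. Assumes an AiiDA profile is set up;
# the PK 1234 is a placeholder for an existing node.
from aiida import load_profile
from aiida.orm import load_node

load_profile()

def ancestors(node, seen=None):
    """Yield every node (calculations and data) that contributed to `node`."""
    if seen is None:
        seen = set()
    for link in node.get_incoming().all():
        parent = link.node
        if parent.pk not in seen:
            seen.add(parent.pk)
            yield parent
            yield from ancestors(parent, seen)

result = load_node(1234)
for parent in ancestors(result):
    print(parent.pk, parent.node_type)
```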
Each external code execution, typically on a remote high-performance computing (HPC) resource for the majority of AiiDA use cases, needs to specify a "code" among its inputs. Currently, codes in AiiDA are a "symlink" to an existing executable on a given computer (i.e. they contain a reference to the computer on which they are installed and the full path to the executable, together with some metadata, such as which dynamic libraries to load at runtime). This makes it non-trivial to reproduce the outcome of a simulation on a different computer, as one would need the exact same code, code version and runtime environment.
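For illustration, the snippet below sketches how a code is set up today against the AiiDA 1.x API, tied to one computer and one absolute path; the computer label, executable path and prepend text are placeholders.

```python
# Rough sketch of the current situation (AiiDA 1.x style): a code points to
# one computer and one absolute path, plus some runtime metadata.
# 'localhost' and '/usr/local/bin/pw.x' are placeholder values.
from aiida import load_profile
from aiida.orm import Code, load_computer

load_profile()

computer = load_computer('localhost')
code = Code(remote_computer_exec=(computer, '/usr/local/bin/pw.x'))
code.label = 'pw-6.5'
code.set_prepend_text('module load quantum-espresso')  # e.g. libraries to load at runtime
code.store()
```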
With containers becoming more widely adopted in the HPC domain as well, we want to fully support codes that are not already compiled on the remote computer, but are stored in a container (e.g. Docker, Shifter or Singularity) and can be deployed at run time on the supercomputer.
In this way, we will achieve full reproducibility of any scientific workflow managed by AiiDA: the inputs of every calculation are already stored in the AiiDA graph, and the codes will be fully containerised and reproducible. Moreover, the relations between the outputs of a code and the inputs of the next workflow steps are also encoded in the AiiDA graph.
With this project, it will become possible, for the first time, to have a fully reproducible simulation environment, not only when codes are run on a single given computer, but also when the simulation steps are run on many different remote (super)computers.
This project will
- extend the Code class/interface in AiiDA to define a code that is not necessarily already installed on a supercomputer, but can be pulled from a container registry (e.g. DockerHub or a local registry at the supercomputer centre); a hypothetical sketch is given after this list
- implement routines to re-run an existing workflow (already run with AiiDA) with no parameters specified except the computer on which to rerun it (the codes will be retrieved from the container registry and reused, and the inputs will be taken from the AiiDA provenance graph).
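To make the first point concrete, here is a purely hypothetical sketch of what a container-aware code could look like; the class name, fields and image URL are illustrative and do not exist in aiida-core.

```python
# Purely hypothetical sketch: a code identified by a container image rather
# than by a path on a single computer. None of these names exist in aiida-core.
from dataclasses import dataclass

@dataclass
class ContainerizedCode:
    image: str       # e.g. 'docker.io/someorg/quantum-espresso:6.5' (placeholder)
    engine: str      # 'docker', 'shifter' or 'singularity'
    entrypoint: str  # executable to invoke inside the container

    def run_line(self) -> str:
        """Command line a scheduler job would execute, independent of the computer."""
        if self.engine == 'singularity':
            return f'singularity exec {self.image} {self.entrypoint}'
        return f'{self.engine} run --rm {self.image} {self.entrypoint}'

code = ContainerizedCode(
    image='docker.io/someorg/quantum-espresso:6.5',
    engine='singularity',
    entrypoint='pw.x',
)
print(code.run_line())
```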
This project will require the participant to work with the engine of AiiDA. This requires advanced Python experience, including some basic experience with coroutines. Moreover, experience with containerisation systems (Docker, Shifter, Singularity) is important. Some basic experience with job schedulers on clusters or supercomputers will also help you become quickly productive on the project.
If you're already familiar with AiiDA and have your own idea on how to improve it, we're happy to consider it. In this case, please think about the steps of how you would go about attacking the problem so that we can draw up a rough work plan.
Available co-mentors are
We have an active Slack workspace & biweekly developer meetings.