
GSoC 2020 Projects

Leopold Talirz edited this page Mar 3, 2020 · 17 revisions

Getting started with AiiDA

AiiDA is a Python framework for managing computational science workflows, with roots in computational materials science. It helps researchers manage large numbers of simulations (1k, 10k, 100k, ...) and complex workflows involving multiple executables. At the same time, it records the provenance of the entire simulation pipeline with the aim of making it fully reproducible.

AiiDA is used in research projects at universities, research institutes and companies (examples of recent works using AiiDA are [1] [2] [3]).

To be considered as a GSoC student, please make a small pull request to aiida-core, for example a simple bug fix or a documentation improvement. See e.g. GitHub issues by-label.

Why work on AiiDA?

  • Help accelerate the transition to open (computational) science
  • Contribute to fixing the reproducibility crisis. Computational science is a good place to start.
  • Work with a team of computational scientists (mostly physicists) who are passionate about both science and coding

A background in materials science is not needed, but a basic interest in materials science topics will make things easier for you.

Project 1 - Performance optimizations at the ORM level

Level: intermediate

AiiDA has its own front-end Object Relational Mapper (ORM) to map Python objects to the corresponding records in the (PostgreSQL) database. This ORM allows users to create and manage objects (e.g. AiiDA nodes in the provenance graph) through the AiiDA Python API.

While an ORM provides a useful abstraction for the user, it adds overhead that can become a bottleneck when operating on large numbers of objects at once. For example, the export/import functionality in AiiDA, which allows users to export (parts of) a provenance graph and import it into another database, requires dealing with many database objects in one go.

The goal of this project is to speed up these processes by implementing a generic ORM API for bulk object creation that works with both low-level ORM backends supported by AiiDA (the Django and SQLAlchemy libraries).

Expected outcomes

This project will

  1. implement bulk insertion functionality in the AiiDA ORM to make the importing of data as efficient as possible
  2. make the data import implementation independent of the low-level ORM backend (Django / SQLAlchemy) by passing through AiiDA's front-end ORM
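
To illustrate why bulk insertion matters, here is a minimal sketch that contrasts per-object inserts with a single batched insert. It is not AiiDA code: it uses Python's built-in `sqlite3` module instead of PostgreSQL, and the `node` table, its columns and the function names are assumptions made up for this example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical 'node' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, label TEXT, node_type TEXT)"
    )
    return conn


def insert_one_by_one(conn, rows):
    """One INSERT statement and one commit per object (the slow pattern)."""
    for label, node_type in rows:
        conn.execute(
            "INSERT INTO node (label, node_type) VALUES (?, ?)",
            (label, node_type),
        )
        conn.commit()


def insert_bulk(conn, rows):
    """All rows in one batched call inside a single transaction."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO node (label, node_type) VALUES (?, ?)", rows
        )


def count_nodes(conn):
    """Return the number of rows currently stored in the 'node' table."""
    return conn.execute("SELECT COUNT(*) FROM node").fetchone()[0]


if __name__ == "__main__":
    rows = [(f"node-{i}", "data") for i in range(10_000)]

    slow = make_db()
    insert_one_by_one(slow, rows)

    fast = make_db()
    insert_bulk(fast, rows)

    # Both strategies store the same data; only the number of round trips differs.
    print(count_nodes(slow), count_nodes(fast))
```

Both functions leave the database in the same state. The bulk variant avoids the per-object statement and commit overhead, which is the kind of saving a bulk creation API in the front-end ORM would aim for.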

Skills

This project will require the participant to work with the ORM of AiiDA, so an understanding of Object Relational Mappers is important. AiiDA's ORM is implemented with two different libraries, Django and SQLAlchemy, so previous experience with those is desirable but not required. Finally, AiiDA uses PostgreSQL as the Relational Database Management System (RDBMS), so basic knowledge and understanding of an SQL-type database would be of benefit.

Project 2 - Built-in support for codes encapsulated in containers (docker, shifter, singularity, ...)

Level: intermediate

AiiDA stores all calculation executions (including detailed information on inputs and outputs) in the form of a directed acyclic graph, where each calculation is represented as a node and is linked to the data nodes representing its inputs and the outputs that it created. Outputs, in turn, can then be inputs of new calculations. This graph is generated automatically by AiiDA; by inspecting all the "ancestors" of a given data node in the graph, we have a complete description of the "provenance" of that data node, i.e. the full sequence of calculations (with their inputs) that led to its generation.
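
The idea of reading provenance off the graph can be sketched in a few lines of plain Python. This is not the AiiDA API: the graph is a hand-written dictionary, and the node names and the `ancestors` helper are hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical provenance graph: each node maps to its direct inputs.
# Data nodes point to the calculation that created them;
# calculation nodes point to the data nodes they consumed.
PARENTS = {
    "structure": [],
    "parameters": [],
    "calc_relax": ["structure", "parameters"],
    "relaxed_structure": ["calc_relax"],
    "calc_bands": ["relaxed_structure", "parameters"],
    "bands": ["calc_bands"],
}


def ancestors(node, parents):
    """Return every node reachable by following input links backwards."""
    seen = set()
    queue = deque(parents.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # shared inputs are visited only once
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


if __name__ == "__main__":
    # The provenance of "bands" is the full chain of calculations and inputs.
    print(sorted(ancestors("bands", PARENTS)))
```

Because the graph is acyclic, the traversal always terminates, and the returned set contains exactly the calculations and data that contributed to the chosen node.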

When a calculation is performed by an external code (e.g. a binary on a remote high-performance computer (HPC)), the code is included as an input of the calculation. As of today, codes in AiiDA are represented as "symlinks" to an existing executable on a remote computer, i.e., they contain a reference to the computer on which they are installed, and the full path to the executable (plus some additional metadata, such as which dynamic libraries to load at runtime).

Recent years have seen an increasing adoption of containers (using technologies such as docker, singularity, shifter or sarus), including in the HPC domain, where executables are no longer compiled on the target machine but are compiled once and run in a portable, encapsulated environment. The encapsulation of the full run-time environment, together with the availability of global container registries, constitutes a major step forward in terms of reproducibility: storing the identifier of the container in the AiiDA graph makes it possible to directly re-run existing workflows without access to the computer where they were originally executed.

This project will make containerized codes first-class citizens in the AiiDA provenance graph, making it possible to re-run recorded workflows, even if simulation steps are run on different remote (super)computers.

Expected outcomes

This project will

  1. extend the Code class/interface in AiiDA, to define a code that is not necessarily already installed on a supercomputer, but may be pulled from a container registry on demand (e.g. DockerHub or some local registry in the supercomputer centre)
  2. implement routines to re-run workflows recorded in an existing AiiDA graph, with no parameters other than the computer on which to run.
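
One possible shape for the first outcome is sketched below. It is only an illustration under stated assumptions: `InstalledCode` and `ContainerizedCode` are hypothetical classes, not AiiDA's `Code` interface, and the computer name, path and image name are placeholders. The sketch builds the command line only and does not run anything.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InstalledCode:
    """Hypothetical stand-in for a code tied to one computer and one path."""
    computer: str
    executable_path: str

    def command(self, args: List[str]) -> List[str]:
        return [self.executable_path, *args]


@dataclass(frozen=True)
class ContainerizedCode:
    """Hypothetical code identified by a container image, not a computer."""
    image: str            # registry identifier, e.g. "example/sim-code:1.0"
    executable: str       # executable name inside the container
    engine: str = "docker"

    def command(self, args: List[str]) -> List[str]:
        if self.engine == "docker":
            prefix = ["docker", "run", "--rm", self.image]
        elif self.engine == "singularity":
            # Singularity can pull images from a Docker registry via docker://
            prefix = ["singularity", "exec", f"docker://{self.image}"]
        else:
            raise ValueError(f"unsupported container engine: {self.engine}")
        return [*prefix, self.executable, *args]


if __name__ == "__main__":
    installed = InstalledCode("cluster-a", "/opt/sim/bin/sim.x")
    contained = ContainerizedCode("example/sim-code:1.0", "sim.x", "singularity")
    print(installed.command(["-in", "input.in"]))
    print(contained.command(["-in", "input.in"]))
```

In this sketch the containerized code carries no reference to a computer, so the same recorded object could in principle be launched on any machine that provides the chosen container engine.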

Skills

The participant will need to work with the workflow engine of AiiDA. This requires advanced Python knowledge (including a basic understanding of coroutines), as well as prior experience with container technologies (docker or singularity). Experience with job schedulers on clusters/supercomputers will be beneficial.

Project N - Your Idea Here

If you're already familiar with AiiDA and have your own idea on how to improve it, we're happy to consider it. In this case, please think about the steps of how you would go about attacking the problem so that we can draw up a rough work plan.

Mentorship

Available co-mentors are

We have an active Slack workspace & biweekly developer meetings.