diff --git a/docs/conf.py b/docs/conf.py index 2014094..ed7a8ad 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -120,6 +120,7 @@ linkcheck_exclude_documents = [ # ignore old (pre-2020) blog pages r"news/posts/201[0-9].*", + r"news/posts/2025-09-05-orm.md", ] linkcheck_ignore = [ r"https://www.youtube.com/.*", @@ -157,6 +158,7 @@ r"https://juser.fz-juelich.de/record/873744", r"https://openbis.ch", # Not accessibla as of 06-06-2025. Might come back r"https://www.eosc-hub.eu/", # Privacy warning when accessing manually + r"http://bit.ly/2HRjTWo", # Link now broken. Maybe remove fully in future? ] aiida_ignore_event_checks = [ diff --git a/docs/events.yaml b/docs/events.yaml index 9a1d65b..4fce6f1 100644 --- a/docs/events.yaml +++ b/docs/events.yaml @@ -85,7 +85,6 @@ - name: AiiDA tutorial at the BIG-MAP workshop start: 2020-12-02 location: Online - url: https://multiscale-modelling.eu/BigMapWorkshop2020 participants: ~80 - name: Materials Science Q&A @ SciPy 2020 diff --git a/docs/news/posts/2025-09-05-orm.md b/docs/news/posts/2025-09-05-orm.md new file mode 100644 index 0000000..1bf2f5b --- /dev/null +++ b/docs/news/posts/2025-09-05-orm.md @@ -0,0 +1,476 @@ +--- +blogpost: true +category: Blog +tags: architecture +author: Julian Geiger +date: 2025-09-05 +--- + +# Exploring AiiDA's ORM architecture + +In this post, we'll take a deep dive into the implementation of AiiDA's Object-Relational Mapping (ORM) system. +We'll explore how AiiDA leverages [SQLAlchemy](https://github.com/sqlalchemy/sqlalchemy) to create a flexible ORM backend that abstracts database operations and separates concerns between the user-facing Python objects and the underlying database persistence. + +## Architecture overview: The multi-layer design + +AiiDA's ORM follows a four-layer architecture that provides clean separation between the user interface, business logic, and data persistence: + +``` +┌─────────────────────┐ +│ User interface │ ← Node (Python ORM class) +│ (orm/nodes) │ +├─────────────────────┤ +│ Abstract interface │ ← BackendNode (Abstract base class) +│ (implementation) │ +├─────────────────────┤ +│ ORM backend │ ← SqlaNode (SQLAlchemy implementation) +│ (psql_dos/sqlite) │ +├─────────────────────┤ +│ Database models │ ← DBNode (SQLAlchemy models) +│ (models) │ +└─────────────────────┘ +``` + +Each layer serves a distinct purpose: + +* __User interface__ (`Node`): Provides a clean, Pythonic API that hides database complexity +* __Abstract interface__ (`BackendNode`): Defines contracts that all backend implementations must follow +* __ORM backend__ (`SqlaNode`): Backend connection to the different database systems +* __Database model__ (`DbNode`): Defines the actual table schemas and relationships using SQLAlchemy's declarative approach + +Importantly, you, the user, will typically not have to interact directly with database-specific code. +Instead, you can work with the high-level `Node` class (and its child classes), while AiiDA automatically delegates database operations to the appropriate backend implementation. +The design further allows AiiDA to support multiple database backends (currently PostgreSQL and SQLite), while providing a unified Python interface. + +So let's start from the bottom and work our way up, shall we? + +## The database models: `DbNode` + +AiiDA uses SQLAlchemy's declarative approach to define database tables. +This means the database schema is defined using Python classes rather than raw SQL. +Here's how the core +[`DbNode`](https://github.com/aiidateam/aiida-core/blob/313f342f5d28eeba5967fec8196ed6fce393a77a/src/aiida/storage/psql_dos/models/node.py) +model is constructed[^1]: + +```python +from sqlalchemy.dialects.postgresql import JSONB, UUID +from sqlalchemy.orm import relationship +from sqlalchemy.schema import Column, ForeignKey +from sqlalchemy.types import DateTime, Integer, String, Text + +class DbNode(Base): + """Database model to store data for :py:class:`aiida.orm.Node`.""" + + __tablename__ = 'db_dbnode' + + # Primary key and unique identifier + id = Column(Integer, primary_key=True) + uuid = Column(UUID(as_uuid=True), default=get_new_uuid, nullable=False, unique=True) + + # Node classification + node_type = Column(String(255), default='', nullable=False, index=True) + process_type = Column(String(255), index=True) + + # Metadata + label = Column(String(255), nullable=False, default='', index=True) + description = Column(Text(), nullable=False, default='') + ctime = Column(DateTime(timezone=True), default=timezone.now, nullable=False, index=True) + mtime = Column(DateTime(timezone=True), default=timezone.now, onupdate=timezone.now, nullable=False, index=True) + + # JSON storage for flexible data + attributes = Column(JSONB, default=dict) + extras = Column(JSONB, default=dict) + repository_metadata = Column(JSONB, nullable=False, default=dict) + + # Foreign key relationships + dbcomputer_id = Column(Integer, ForeignKey('db_dbcomputer.id'), nullable=True, index=True) + user_id = Column(Integer, ForeignKey('db_dbuser.id'), nullable=False, index=True) + + # SQLAlchemy relationships + dbcomputer = relationship('DbComputer', backref='dbnodes') + user = relationship('DbUser', backref='dbnodes') +``` + +The `Base` class from which `DbNode` inherits is obtained via SQLAlchemy's `declarative_base()`, which provides the metaclass and base functionality that allows Python classes to be automatically mapped to database tables: + +```python +Base = declarative_base( + cls=Model, + name='Model', + metadata=MetaData(naming_convention=dict(naming_convention)) +) +``` + +This declarative approach means that table schemas, constraints, relationships, and indexes are all defined in Python code rather than separate SQL files, making the database structure self-documenting and version-controllable alongside the application logic. + +Some of the key features of AiiDA's SQLAlchemy models, that can already be seen from the `DbNode` definition above, are: + +1. __Use of JSON(B) for flexibility__: The use of PostgreSQL's `JSONB` (and SQLite's `JSON`) type for `attributes`, `extras`, and `repository_metadata` provides schema flexibility while maintaining query performance. + This is crucial for scientific computing where one can't predict what data users will want to store. + Traditional relational databases would require creating new columns for every new property, but JSON columns allow storing arbitrary structured data (given that it's JSON-serializable). +2. __UUID-based identity__: Each node has both an integer primary key (`id` in the table, `pk` in the Python API[^2]) for database efficiency and a UUID for global uniqueness and portability. + The integer `id` is user-friendly, fast for database joins and indexing, while the UUID allows nodes to be moved between different AiiDA installations without conflicts. +3. __Automatic timestamps__: Creation and modification times are automatically managed through SQLAlchemy's `default` and `onupdate` parameters. +4. __Indexing__: Important columns like `node_type`, `process_type`, and timestamps are indexed for query performance. + Without indexes, queries like "find all calculation nodes from last month" would be extremely slow. + The indexes make these common queries fast even with millions of nodes. + +## The abstract interface: `BackendNode` + +Above the database layer sits the abstract +[`BackendNode`](https://github.com/aiidateam/aiida-core/blob/313f342f5d28eeba5967fec8196ed6fce393a77a/src/aiida/orm/implementation/nodes.py) +class, which defines the interface contract that all database backend implementations must follow: + +```python +class BackendNode(BackendEntity, BackendEntityExtrasMixin, metaclass=abc.ABCMeta): + """Backend implementation for the `Node` ORM class. + + A node stores data input or output from a computation. + """ + + @abc.abstractmethod + def clone(self: BackendNodeType) -> BackendNodeType: + """Return an unstored clone of ourselves.""" + + @property + @abc.abstractmethod + def uuid(self) -> str: + """Return the node UUID.""" + + @property + @abc.abstractmethod + def node_type(self) -> str: + """Return the node type.""" + + # ... more abstract properties and methods +``` + +This abstract base class (ABC) ensures that regardless of which ORM backend is used, the same interface is available to higher-level code (currently, AiiDA uses only SQLAlchemy, but also Django ORM was supported in the past[^3]). +It defines essential properties like `uuid`, `node_type`, `process_type`, and methods for managing attributes, links, and storage operations. + +The abstract class serves as a __contract__ —any backend implementation must provide all these methods and properties. + +## The ORM backend: `SqlaNode` + +The +[`SqlaNode`](https://github.com/aiidateam/aiida-core/blob/313f342f5d28eeba5967fec8196ed6fce393a77a/src/aiida/storage/psql_dos/orm/nodes.py) +class bridges the abstract `BackendNode` interface with the concrete SQLAlchemy models: + +```python +class SqlaNode(entities.SqlaModelEntity[models.DbNode], ExtrasMixin, BackendNode): + """SQLA Node backend entity""" + + MODEL_CLASS = models.DbNode + USER_CLASS = SqlaUser + COMPUTER_CLASS = SqlaComputer + LINK_CLASS = models.DbLink + + def __init__(self, backend, node_type, user, computer=None, **kwargs): + """Construct a new `BackendNode` instance wrapping a new `DbNode` instance.""" + super().__init__(backend) + + arguments = { + 'node_type': node_type, + 'user': user.bare_model, + 'label': kwargs.get('label', ''), + 'description': kwargs.get('description', ''), + } + + if computer: + arguments['dbcomputer'] = computer.bare_model + + self._model = sqla_utils.ModelWrapper(self.MODEL_CLASS(**arguments), backend) + + # ... more properties and methods + + def store(self, links=None, clean=True): + session = self.backend.get_session() + + if clean: + self.clean_values() + + session.add(self.model) + + if links: + for link_triple in links: + self._add_link(*link_triple) + + if not session.in_nested_transaction(): + try: + session.commit() + except SQLAlchemyError: + session.rollback() + raise + else: + session.flush() + + return self +``` + +`SqlaNode` wraps a raw `DbNode` SQLAlchemy model in a +[`ModelWrapper`](https://github.com/aiidateam/aiida-core/blob/313f342f5d28eeba5967fec8196ed6fce393a77a/src/aiida/storage/psql_dos/orm/utils.py) +that handles session management automatically. +This wrapping provides two levels of access: `node.model` for the wrapped model with automatic session tracking, and `node.bare_model` for direct access to the raw SQLAlchemy model when you need to bypass AiiDA's management. +Beyond model wrapping, `SqlaNode` is responsible for actual database storage via its `store` method (see above). +The design further uses generic typing to ensure type safety and proper relationships between related objects like users and computers. + +## The user interface: `Node` + +Finally, we reach the top level, the `Node` class, which is the main interaction point for users. +The implementation uses composition to wrap an `SqlaNode` instance, while providing a Pythonic interface to users: + +```python +class Node(Entity['BackendNode', NodeCollection], metaclass=AbstractNodeMeta): + """Base class for all nodes in AiiDA.""" + + def __init__(self, backend=None, user=None, computer=None, extras=None, **kwargs): + backend = backend or get_manager().get_profile_storage() + user = user if user else backend.default_user + + backend_entity = backend.nodes.create( + node_type=self.class_node_type, + user=user.backend_entity, + computer=computer.backend_entity if computer else None, + **kwargs + ) + super().__init__(backend_entity) +``` + +The creation magic happens in `backend.nodes.create()` which returns an `SqlaNode` instance[^4] +that gets stored in the `backend_entity` attribute. +The `Node` class then delegates all database operations to this contained `SqlaNode` through the `backend_entity` property. +From the user's perspective, just a `Node` is created—the backend selection and delegation are transparent. + +Here's how all of this looks like in action in a `verdi shell`: + +```python +In [1]: n = Int(1).store() # of type: aiida.orm.nodes.data.int.Int + +In [2]: n +Out[2]: + +In [3]: n.backend_entity +Out[3]: + +In [4]: n.backend_entity.bare_model +Out[4]: + +In [5]: n.backend_entity.model +Out[5]: +``` + +__Namespacing__ + +AiiDA further uses a namespace pattern on the `Node` class to organize functionality: + +```python +class Node(Entity['BackendNode', NodeCollection], metaclass=AbstractNodeMeta): + """Base class for all nodes in AiiDA.""" + + # ... more properties and methods + + @cached_property + def base(self) -> NodeBase: + """Return the node base namespace.""" + return NodeBase(self) + + # ... more properties and methods + +class NodeBase: + """A namespace for node related functionality.""" + + @cached_property + def repository(self) -> 'NodeRepository': + """Return the repository for this node.""" + return NodeRepository(self._node) + + @cached_property + def attributes(self) -> 'NodeAttributes': + """Return an interface to interact with the attributes.""" + return NodeAttributes(self._node) + + @cached_property + def links(self) -> 'NodeLinks': + """Return an interface to interact with the links.""" + return NodeLinks(self._node) +``` + +This namespace pattern was introduced because having all properties and methods directly on the `Node` class (as was the case in the past) made the API cluttered and prone to name conflicts. + +The namespace approach further groups related functionality together, e.g.: +- `node.base.attributes -> NodeAttributes` - attribute management +- `node.base.repository -> NodeRepository` - file repository operations +- `node.base.links -> NodeLinks` - provenance links +and the use of `@cached_property` ensures that these namespace objects are created only once per node instance[^5]. + +## Honorable mentions + +### Link management: modeling relationships + +AiiDA models relationships between nodes through a separate `DbLink` table: + +```python +class DbLink(Base): + """Database model to store links between nodes.""" + + __tablename__ = 'db_dblink' + + id = Column(Integer, primary_key=True) + input_id = Column(Integer, ForeignKey('db_dbnode.id'), nullable=False, index=True) + output_id = Column(Integer, ForeignKey('db_dbnode.id', ondelete='CASCADE'), nullable=False, index=True) + + input = relationship('DbNode', primaryjoin='DbLink.input_id == DbNode.id') + output = relationship('DbNode', primaryjoin='DbLink.output_id == DbNode.id') + + label = Column(String(255), nullable=False, index=True) + type = Column(String(255), nullable=False, index=True) +``` + +Rather than storing relationships as simple foreign keys, each link is a first-class entity with its own metadata. +The `label` field describes the specific role of the link (e.g., "structure", "parameters", "output_data"), while the `type` field categorizes the kind of relationship (e.g., "input", "output", "call"). +This metadata enables complex provenance queries like "find all structures that were relaxed using PBE exchange-correlation functional." +The `ondelete='CASCADE'` on the output relationship ensures that when a node is deleted, all its incoming links are also removed, maintaining referential integrity. + +### Pydantic models: Modern serialization + +Recently, +[`pydantic`](https://github.com/pydantic/pydantic) +models were integrated into AiiDA's ORM for modern data validation and serialization: + +```python +class Node(Entity['BackendNode', NodeCollection]): + """Base class for all nodes in AiiDA with Pydantic model support.""" + + class Model(Entity.Model): + """Pydantic model for Node serialization and validation.""" + + uuid: Optional[str] = MetadataField(None, description='The UUID of the node') + node_type: Optional[str] = MetadataField(None, description='The type of the node') + + attributes: Optional[Dict[str, Any]] = MetadataField( + None, + description='The node attributes', + orm_to_model=lambda node, _: node.base.attributes.all, + is_subscriptable=True, + exclude_to_orm=True, + ) + + repository_content: Optional[dict[str, bytes]] = MetadataField( + None, + description='Dictionary of file repository content encoded as base64', + orm_to_model=lambda node, _: { + key: base64.encodebytes(content) + for key, content in node.base.repository.serialize_content().items() + }, + exclude_to_orm=True, + ) + + def serialize(self, repository_path: Optional[pathlib.Path] = None) -> dict[str, Any]: + """Serialize the entity instance to JSON.""" + if repository_path is None: + repository_path = pathlib.Path(tempfile.mkdtemp()) / f'./aiida_serialization/{self.pk}/' + repository_path.mkdir(parents=True) + return self._to_model(repository_path).model_dump() + + @classmethod + def from_serialized(cls, **kwargs: dict[str, Any]) -> 'Node': + """Construct an entity instance from JSON serialized data.""" + return cls._from_model(cls.Model(**kwargs)) +``` + +The integration of `pydantic` brings various additional features and advantages: +- __Automatic validation__: Pydantic automatically validates data types and constraints based on type annotations +- __Serialization__: ORM objects can be automatically converted to/from `JSON` for export/import +- __API integration__: Pydantic's automatic serialization and validation capability allows for easy creation standard APIs (e.g., [aiida-restapi](https://github.com/aiidateam/aiida-restapi)) and integration with frameworks like FastAPI + +### The QueryBuilder + +The +[`QueryBuilder`](https://github.com/aiidateam/aiida-core/blob/313f342f5d28eeba5967fec8196ed6fce393a77a/src/aiida/orm/querybuilder.py) +is AiiDA's main Python API to retrieve data from the database. +It provides a uniform, backend-agnostic interface: + +```python +# Simple node query +qb = QueryBuilder() +qb.append(Node, filters={'ctime': {'>': datetime(2025, 1, 1)}}) +results = qb.all() + +# Complex relationship query +qb = QueryBuilder() +qb.append(StructureData, tag='structure') +qb.append(CalcJobNode, with_incoming='structure', tag='calc') +qb.append(FolderData, with_incoming='calc', project=['*']) +``` + +The `QueryBuilder` automatically handles several complex translation tasks behind the scenes. +Most importantly, it converts the high-level query syntax into the appropriate SQL statement (while respecting the specific dialect of the database backend), ensuring that one can write database-agnostic queries while still leveraging each backend's specific capabilities. +For this, among other operations, it converts ORM classes like `Node` to their corresponding database entities, translates relationship specifications such as `with_incoming` and `with_outgoing` into proper SQL joins, and automatically adds filters for node types and subtypes based on the classes specified in the query. + +### The case for `NodeCollection` + +In addition to the `QueryBuilder`, AiiDA also implements the `NodeCollection` class which provides a clean interface for managing ...wait for it... collections of nodes: + +```python +class NodeCollection(EntityCollection[NodeType], Generic[NodeType]): + """The collection of nodes.""" + + def delete(self, pk: int) -> None: + """Delete a Node from the collection with the given id""" + node = self.get(id=pk) + + if node.base.links.get_incoming().all(): + raise exceptions.InvalidOperation(f'cannot delete Node<{node.pk}> because it has incoming links') + + if node.base.links.get_outgoing().all(): + raise exceptions.InvalidOperation(f'cannot delete Node<{node.pk}> because it has outgoing links') + + self._backend.nodes.delete(pk) +``` + +The collection pattern provides several benefits that complement the `QueryBuilder`'s capabilities. +On one hand, while the `QueryBuilder` is suitable for complex analytical queries and relationship traversal, collections can handle the simpler but essential task of managing individual entities efficiently. +Thus, collections enable efficient bulk operations and batch processing that would be cumbersome with multiple `QueryBuilder` calls. +Finally, they also provide essential validation logic, such as preventing deletion of nodes with existing links to maintain provenance integrity, and again hide database-specific operations behind a clean interface. + +## Key Takeaways + +AiiDA's ORM architecture provides the foundation for all derived data types, for both, data and processes. More specialized types like `orm.Int`, `orm.CalcJobNode`, and `orm.SinglefileData` all build on `Node` and the functionality it provides. +The implementation makes use of several important design principles: + +1. __Multi-layer abstraction__: The multi-layer design achieves a clean separation and allows for a uniform API, while leveraging backend-specific optimizations. +1. __Multi-database support__: The architecture supports both PostgreSQL and SQLite backends through SQLAlchemy's dialect system. +1. __Use of `JSON` columns__: JSON(B) columns provide schema flexibility without sacrificing query performance. +1. __Namespace organization__: The namespace pattern on the user-facing `Node` class keeps the API clean by structuring it in distinct, nested categories. +1. __Modern integration__: Pydantic model integration brings modern Python data validation and serialization capabilities to the ORM. +1. __Universal query interface__: The `QueryBuilder` allows for identical query syntax across database backends. +1. __Collection patterns__: The collection system provides a consistent, intuitive interface for data access that complements the `QueryBuilder`. + +This architecture allows AiiDA to provide a powerful and flexible ORM that supports different database backends while maintaining a consistent user experience. + +__Footnotes__ + +[^1]: + Note that code snippets in this blog post can be simplified or slightly modified from the actual implementation (e.g., omitting type hints) for better readability or to highlight the specific aspects discussed in the text. + +[^2]: + To avoid confusion with Python's `id()` function. + +[^3]: + In practice, the `BackendNode` abstract base class is currently only used as the parent class for `SqlaNode`, making this layer of abstraction appear redundant. + The reasons for its existence are historical: In its early days, AiiDA implemented two ORM backends: Django ORM and SQLAlchemy, and the abstract `BackendNode` provided a unified interface for both. + However, the Django backend was eventually removed due to maintenance overhead and performance limitations, leaving only the SQLAlchemy implementation. + While this abstraction layer now serves only one concrete implementation, it still provides some architectural benefit, as it clearly separates the public API contract from implementation details. + It would also facilitate adding new backends in the future, however, the practical value of maintaining this abstraction versus the added complexity is debatable—it represents a classic case of YAGNI (You Aren't Gonna Need It). + Disregarding the historical context, one could also go by without the `BackendNode` class. + +[^4]: + For the SQLite backend, an `SqliteNode` is actually created, which, however, inherits from `SqlaNode`. + +[^5]: + Although we do understand that access via `node.base.` rather than direct `node.` can be counter-intuitive for new users. + + + + + diff --git a/docs/sections/team.md b/docs/sections/team.md index 199b237..d00ebd2 100644 --- a/docs/sections/team.md +++ b/docs/sections/team.md @@ -122,6 +122,6 @@ Then you might be a candidate for the AiiDA team! We advertise openings for positions on [discourse](https://aiida.discourse.group/c/announcements/) and the [news section](https://aiida.net/news/index/tag/positions.html), when they become available. But excellent candidates are encouraged to apply throughout the year (you can get contact via our emails). -(you can also check open positions on [PSI website](https://www.psi.ch/en/hr/job-opportunities), [THEOS](https://theos-wiki.epfl.ch/) and [psi-k mailing list](https://psi-k.net/list/)). +(you can also check open positions on [PSI website](https://www.psi.ch/en/hr/job-opportunities), [THEOS](https://theos-wiki.epfl.ch/) and [psi-k](https://psi-k.net) mailing list). **Use a 1-page cover letter to explain what motivates you and why you would fit into the team** – please note that we don’t consider blanket applications.