BrainKB Design Document

Author: Tek Raj Chhetri | [email protected]

Overview

BrainKB serves as a knowledge base platform that provides scientists worldwide with tools for searching, exploring, and visualizing Neuroscience knowledge represented by knowledge graphs (KGs). Moreover, BrainKB provides cutting-edge tools that enable scientists to contribute new information (or knowledge) to the platform and is expected to be a go-to destination for all neuroscience-related research needs.

The main objective of BrainKB is to represent neuroscience knowledge as a knowledge graph so that it can be used for downstream tasks, such as making predictions and drawing new inferences, in addition to querying and viewing information. The expected outcomes of BrainKB include the following:

(Semi-)Automated extraction of neuroscience knowledge from structured, semi-structured, and unstructured sources, and representing the knowledge via KGs.
Visualization of the KGs.
Platform to perform different analytics operations over the BrainKB KGs.
(Semi-)Automated validation of the BrainKB KGs to ensure the high quality of the content.
Ability to ingest data in batch or streaming mode for the automated extraction of KGs.

Why BrainKB?

Limited Availability of Platforms for Integrating Neuroscience Data into Knowledge Graphs: In fields such as biomedicine, many platforms exist, such as SPOKE and CIViC. However, such resources are comparatively limited in the domain of neuroscience. LinkRBrain, a web-based platform that integrates anatomical, functional, and genetic knowledge, is one of the few. BrainKnow, the most recent, is designed to synthesize and integrate neuroscience knowledge from scientific literature. Additionally, projects like DANDI, EBRAINS, and the Open Metadata Initiative are making strides by enabling the sharing of neurophysiology data together with its metadata.
Lack of Support for Heterogeneous Data Sources: The current platforms in neuroscience are limited in their ability to handle a diverse range of data sources. For instance, LinkRBrain can only integrate knowledge from 41 databases, whereas BrainKnow solely focuses on scientific literature. However, knowledge is not restricted to just databases or scientific literature, and there is a need for platforms that can accommodate a wider variety of sources (e.g., structured, semi-structured and unstructured sources).

Principles

Structured (Modeled)

All information stored in the KG has associated data models or can be extracted to models. The information will be linked to formal ontologies and linked across datasets. All data models will have well defined schemas and descriptors for human and programmatic consumption.

Extensible (Read/Write)

The KG will allow for both information retrieval and upload. This involves a set of services and an API layer that allows for curation of information. The curation of information will reflect the data models. In addition, the KG will link, ingest, or cache other authoritative sources of information.

Curated (Expertise)

To support being an authoritative source, information entering the KG will indicate levels of curation. Such curation may take the form of expertise that is embedded into algorithms (e.g., quality metrics, alignment, mapping), is incorporated into data models (e.g., genes, anatomy), and is derived from computational and human analysis (e.g., atlases as outputs of working groups).

Usable (Utilitarian)

The architecture of the KG will be usable by humans and computational entities. The application interfaces will provide user interactivity and programmatic access. The KG will support competencies needed by the community.

Transparent (Basis)

To increase trust, the provenance of all information in the KG shall be maintained, including a record of where provenance is absent, and made available through the KG interfaces.

Programmable (Computable)

The information stored will lend itself to compute through appropriate APIs, data formats, and services. The KG shall connect to computational services to generate and provide inferred or derived information relevant to scientists.

Features

Data Ingestion

BrainKB will support knowledge extraction from various sources in different data formats (e.g., texts, JSON (JavaScript Object Notation)) via the BrainKB user interface (UI) and the application programming interface (API) endpoints. Both batch and streaming data ingestion modes will be supported.

Schema Flexibility

KGs evolve over time. Consider, for example, the president of a country: the office holder changes over time, and a KG that stores this information has to be updated accordingly. There are many similar cases in neuroscience and other domains. Knowledge may change as new research findings emerge, making previous knowledge obsolete or factually incorrect. Changes might also occur in the schema due to standardization, alignment, or updates. While schema changes may not always be necessary, they may be required to accommodate new information. Therefore, BrainKB will support this evolution by allowing the addition (or removal) of entities and relationships.

Example: In fields like biology, newer findings can invalidate existing terms, requiring flexibility in the schema to account for future changes.
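One way to picture this kind of schema evolution is a term that is retired in favor of a newer one, with existing triples rewritten to match. The following is a minimal sketch only: the `Schema` class, its methods, and the example terms are hypothetical and are not part of BrainKB's API.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    """Toy schema: a set of active terms plus a map of retired terms."""
    terms: set = field(default_factory=set)
    deprecated: dict = field(default_factory=dict)  # old term -> replacement or None

    def add_term(self, term):
        self.terms.add(term)

    def deprecate(self, old, replacement=None):
        """Retire `old`; optionally record the term that supersedes it."""
        self.terms.discard(old)
        if replacement is not None:
            self.terms.add(replacement)
        self.deprecated[old] = replacement

    def migrate(self, triples):
        """Rewrite triples that use a retired predicate; drop those with no replacement."""
        migrated = set()
        for s, p, o in triples:
            if p in self.deprecated:
                p = self.deprecated[p]
                if p is None:
                    continue
            migrated.add((s, p, o))
        return migrated


# Hypothetical terms, for illustration only.
schema = Schema({"hasCellType", "locatedIn"})
schema.deprecate("hasCellType", "hasCellClass")
triples = {
    ("neuron:1", "hasCellType", "type:A"),
    ("neuron:1", "locatedIn", "region:X"),
}
migrated = schema.migrate(triples)
```

Keeping the retired term in `deprecated` rather than deleting it outright means older data can still be interpreted after the schema changes.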

Maintainability

BrainKB shall be maintainable, allowing operations such as KG enrichment and validation to be performed easily. When we mention "validation to be performed easily," we are referring to processes that require minimal human intervention, either through semi-automated or fully-automated methods.

Curation

BrainKB will allow community-driven curation of the KGs as well as (semi-)automated extraction and construction of KGs from external sources, e.g., scientific literature.

Accuracy, Completeness and Consistency (ACC)

BrainKB shall check the accuracy of the knowledge through multi-step (semi-)automated validation. Additionally, checks will be performed to ensure that the KG triples are complete, i.e., that the mandatory information is present. To further ensure accuracy and completeness, BrainKB shall guarantee that the addition of new facts (or KG triples) will not introduce inconsistencies (see Figure 1) with existing knowledge due to factual errors, data inconsistencies, or incompleteness.

Figure 1: KGs. The image on the left shows the original knowledge graph, while the image on the right demonstrates the updated knowledge graph. The green highlighted box indicates new knowledge that has been added, while the red highlighted box indicates any inconsistencies caused by factual changes, i.e., incorrect date of birth.
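The completeness and consistency checks described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: triples are plain Python tuples, the function names are hypothetical, and treating `dateOfBirth` as a single-valued predicate is an assumption made to mirror the date-of-birth example in Figure 1.

```python
def missing_mandatory(triples, subject, mandatory_predicates):
    """Return the mandatory predicates that `subject` has no triple for."""
    present = {p for s, p, _ in triples if s == subject}
    return set(mandatory_predicates) - present


def find_conflicts(existing, new, single_valued):
    """Return new triples that contradict an existing single-valued fact."""
    current = {(s, p): o for s, p, o in existing if p in single_valued}
    return [
        (s, p, o) for s, p, o in new
        if p in single_valued and (s, p) in current and current[(s, p)] != o
    ]


# Hypothetical data, for illustration only.
existing = {
    ("person:1", "name", "Jane Doe"),
    ("person:1", "dateOfBirth", "1980-01-01"),
}
new = [
    ("person:1", "affiliation", "Example University"),  # new knowledge, no clash
    ("person:1", "dateOfBirth", "1981-01-01"),          # clashes with the stored value
]

conflicts = find_conflicts(existing, new, single_valued={"dateOfBirth"})
gaps = missing_mandatory(existing, "person:1", {"name", "dateOfBirth", "affiliation"})
```

Here `conflicts` holds the triple that would be highlighted in red in Figure 1, and `gaps` lists mandatory information that is still missing for the subject.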

The ACC process will ensure that human-centricity is maintained alongside automated validation. Figure 2 shows a high-level overview of the framework used for the automated extraction and validation of KG triples. Each agent performs an individual task. For example, Agent 1 and Agent 2 extract KG triples from raw text and align them with the schema (or ontology). Similarly, validator agents 1, 2, and 3 validate the aggregated KG triples. Each of these validator agents uses a different source for validation, and validator agent 4 takes the results from the three and makes the final decision. If validator agent 4 is unable to make a decision for any reason, such as an unresolvable conflict, it will trigger an alert to the user, who will then perform the manual validation (or confirm the validation). Although the framework in Figure 2 shows the complete pipeline for extraction and validation, each task can be performed independently. For example, to validate an existing KG triple, one can use the validation component alone.

Figure 2: KGs. A Multi-Agent Framework for Neuroscience Knowledge Graph Construction & Validation
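The role of validator agent 4 can be illustrated with a small aggregation routine. The document does not specify how the final decision is made, so the majority vote below is an assumption, as are the names `final_decision`, `validate`, and `alert_user`; the sketch only shows the shape of "decide if possible, otherwise escalate to a human".

```python
from collections import Counter

VALID, INVALID, UNSURE = "valid", "invalid", "unsure"


def final_decision(verdicts):
    """Combine validator verdicts; return None when a human must decide."""
    counts = Counter(v for v in verdicts if v != UNSURE)
    if not counts:
        return None  # no validator could reach a verdict
    ranked = counts.most_common()
    # A tie between the top verdicts is treated as an unresolvable conflict.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def validate(triple, validators, alert_user):
    """Run every validator, then decide automatically or escalate to the user."""
    verdicts = [validator(triple) for validator in validators]
    decision = final_decision(verdicts)
    if decision is None:
        decision = alert_user(triple, verdicts)
    return decision
```

Because `validate` takes the validators as plain callables, the validation step can be used on an existing triple without running the extraction agents first.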

Provenance

To enable trust, the provenance, i.e., documentation of the source and the curators (in case of manual curation) of all the information, shall be maintained. The provenance conflict resolution mechanism will also be implemented to ensure the accuracy of the provenance information.
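As a rough illustration of what maintaining provenance could look like, the sketch below attaches a source and an optional curator to each triple. The `Provenance` and `ProvenanceLog` names and their fields are hypothetical; the actual BrainKB provenance model and its conflict resolution mechanism are not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    source: str                    # where the statement came from
    curator: Optional[str] = None  # set only for manual curation


class ProvenanceLog:
    """Keeps every provenance record attached to a triple."""

    def __init__(self):
        self._records = {}

    def record(self, triple, provenance):
        self._records.setdefault(triple, []).append(provenance)

    def lookup(self, triple):
        """Return all provenance records; an empty list means none is known."""
        return list(self._records.get(triple, []))


# Hypothetical triple and sources, for illustration only.
log = ProvenanceLog()
triple = ("gene:1", "expressedIn", "region:X")
log.record(triple, Provenance(source="publication:example"))
log.record(triple, Provenance(source="manual-entry", curator="curator:alice"))
```

Returning an empty list for an unknown triple makes the absence of provenance explicit rather than an error.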

Querying and Reasoning

BrainKB shall support querying of and reasoning over the KGs. It shall also support other downstream analytics tasks, such as link prediction (see Figure 3) using machine learning techniques.

Figure 3: Link prediction. The figure on the left indicates a KG with a missing link (or relation) indicated by dotted lines and the figure on the right displays the KG after the link prediction.
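To make the idea of link prediction concrete, the sketch below scores unlinked node pairs by how many neighbors they share. This common-neighbors heuristic is a deliberately simple stand-in, not a machine learning model and not BrainKB's method; the function name and the example nodes are hypothetical.

```python
from itertools import combinations


def common_neighbour_scores(triples):
    """Score each unlinked node pair by the number of neighbours they share."""
    neighbours = {}
    for s, _, o in triples:
        neighbours.setdefault(s, set()).add(o)
        neighbours.setdefault(o, set()).add(s)
    scores = {}
    for a, b in combinations(sorted(neighbours), 2):
        if b in neighbours[a]:
            continue  # already linked, nothing to predict
        shared = len(neighbours[a] & neighbours[b])
        if shared:
            scores[(a, b)] = shared
    return scores


# Hypothetical graph, for illustration only.
triples = [
    ("A", "relatedTo", "C"),
    ("B", "relatedTo", "C"),
    ("A", "relatedTo", "D"),
    ("B", "relatedTo", "D"),
    ("C", "relatedTo", "E"),
]
scores = common_neighbour_scores(triples)
```

Pairs with higher scores, such as `("A", "B")` here, are candidates for the missing link drawn as a dotted line in Figure 3. A learned model would replace the scoring function but keep the same input and output shape.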

Integration and Interoperability

To ensure interoperability and ease of integration, BrainKB will focus on using standardized ontologies or schemas. However, standardized ontologies or schemas are not always available. In such cases, other schemas or ontologies must be used, and alignment will be performed where necessary to preserve interoperability.

Minimize Cognitive Burden and Data Fatigue

Because BrainKB will provide analytics features in addition to querying the information (or knowledge), special emphasis shall be placed on ensuring that the information presented to the user does not cause cognitive burden or data fatigue. Cognitive burden occurs when the brain must exert more effort to understand information, typically as a result of an overload of visual content. For example, in the figure below, the image on the left places a greater cognitive burden on the viewer than the image on the right.

Other considerations

Assumption: We operate under the open-world assumption (OWA), not the closed-world assumption (CWA). Under OWA, we make no assumptions about the absence of statements, whereas under CWA an absent statement is evaluated as false.

Example: Let's consider a university scenario. We want to determine if Jane Doe is enrolled in the AI 101 course.

Under CWA, if Jane Doe's enrollment information for AI 101 is not present in the university database, this absence of information is interpreted as Jane Doe not being enrolled in the course.
Conversely, under OWA, the absence of Jane Doe's enrollment information for AI 101 means only that the information is missing, and it remains uncertain whether Jane Doe is enrolled in the course.
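The difference between the two assumptions can be shown with the same enrollment example. This is a small sketch: the function names and the extra student "John Roe" are hypothetical, and `None` is used to stand for "unknown".

```python
from typing import Optional

# Known enrollment statements; Jane Doe does not appear.
enrollments = {("John Roe", "AI 101")}


def is_enrolled_closed_world(student, course) -> bool:
    """Closed world: a statement that is not recorded is taken to be false."""
    return (student, course) in enrollments


def is_enrolled_open_world(student, course) -> Optional[bool]:
    """Open world: a statement that is not recorded is unknown (None)."""
    return True if (student, course) in enrollments else None


closed = is_enrolled_closed_world("Jane Doe", "AI 101")
open_ = is_enrolled_open_world("Jane Doe", "AI 101")
```

Under the closed-world reading the query answers `False`, while under the open-world reading it answers `None`, leaving room for the enrollment to be asserted later without contradicting an earlier answer.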

Architecture

The figure below (Figure 4) shows BrainKB's architecture. It is divided into three layers: the application layer (layer 1), the service layer (layer 2), and the resource layer (layer 3).

Application: The application layer (or layer 1) is the entry point that provides access to BrainKB, such as via the UI.

Service: The service layer (or layer 2) implements the core logic and is broken down into multiple services based on functionality (e.g., the ingestion service, which allows ingestion of data and is shown in Figure 5).

Resource: The resource layer (or layer 3) provides the computational resources that BrainKB requires to deliver its services.

Figure 4: Architecture of BrainKB

Additionally, Figure 5 shows the architecture of the ingest service, a component of BrainKB.

Figure 5: Ingest Service, one of the service components of BrainKB

Target Audience

Neuroscience researchers: BrainKB's primary audience will be neuroscience researchers, who will be able to use the platform to integrate, visualize, and analyze neuroscience data. They will be able to capitalize on the platform's ability to synthesize their data (or knowledge) into KGs.
Research Labs and Academic Institutions: BrainKB will be an invaluable resource for teaching in academic contexts specializing in neuroscience. It offers convenient access to integrated neuroscience data for faculty and students.
Policy Makers: Neurology policymakers will be able to use the neuroscience knowledge that BrainKB hosts to make policy decisions.
Healthcare Professionals: Healthcare professionals in neurology (or clinical neuroscience) may use BrainKB knowledge to understand and improve neurological disease outcomes.
Neuroscience-related Companies: Companies specializing in developing drugs for neurological diseases will be able to use the platform's KGs to gain insights into neurological conditions and treatments.

Usage Scenario

Actor: Alice (Neuroscientist)

Task: Alice wants to know if she can gain new insights from her newly collected neuroscience data.

Precondition: The newly collected neuroscience dataset, which includes demographics, gene expression maps, and structural and functional MRI scans, is usable and uncorrupted.

Flow:

Alice uploads the data into the BrainKB platform through the BrainKB UI (User Interface).
BrainKB, the system, then analyzes the data. If it finds an error, e.g., an unsupported file format, it returns the error; otherwise, it proceeds to the next step, knowledge extraction.
The system will perform the knowledge extraction, validation, and alignment operations. If a validation or alignment issue cannot be resolved automatically, the extracted knowledge, represented as a KG, is flagged for expert review. Upon successful review, the KG is integrated into (or stored in) the BrainKB storage and becomes available for visualization and analysis.

Postcondition: Alice discovers new insights through the integration of diverse knowledge sources represented in BrainKB's KGs.

Use Cases

Extraction/Integration/Refinement: BrainKB will provide features to extract knowledge from diverse sources, such as raw text and scientific publications, and integrate it with the knowledge represented via KGs. Additionally, BrainKB will also provide features to refine the extracted knowledge, e.g., through humans in the loop.
Cards: The BrainKB web application allows easy visualization of the knowledge of interest to scientists/researchers stored in KGs and their corresponding interconnected knowledge. Figure 6 shows a snippet of the entity card from the BrainKB web application, which can be accessed at http://beta.brainkb.org.

Figure 6: Snippet of Entity card from BrainKB web application
Causal Inference: Causal inference helps distinguish causation from correlation, which is particularly important in domains like neuroscience [1,2]. BrainKB, which stores knowledge represented via KGs, can support causal inference because KGs can encode (causal) relationships between entities and enable (causal) reasoning [2].

[1] Danks, D. and Davis, I., 2023. Causal inference in cognitive neuroscience. Wiley Interdisciplinary Reviews: Cognitive Science, 14(5), p.e1650.
[2] Huang, H. and Vidal, M.E., 2024. CauseKG: A Framework Enhancing Causal Inference with Implicit Knowledge Deduced from Knowledge Graphs. IEEE Access.
Human in the loop: BrainKB allows the creation of KGs constructed from heterogeneous sources, e.g., text and CSV files, in a (semi-)automated fashion (e.g., using NLP) and through community contribution. BrainKB includes human-in-the-loop features, which ensure quality control of the KGs. The human in the loop is also a step in the maturity model for operations in neuroscience [1], helping to optimize KG curation.
- Example:
  - When new evidence is submitted, it is placed in a queued (or hold) stage and progressed upon the moderators' review. Changes might be required based on the review before it appears in the evidence entity card.
  - If the KGs are manually or automatically created, the moderators will review the concepts' alignment and determine whether the resolution (e.g., entity resolution) has been performed correctly.
[1] Johnson, E.C., Nguyen, T.T., Dichter, B.K., Zappulla, F., Kosma, M., Gunalan, K., Halchenko, Y.O., Neufeld, S.Q., Schirner, M., Ritter, P. and Martone, M.E., 2023. A maturity model for operations in neuroscience research. arXiv preprint arXiv:2401.00077.
Compare Atlases: BrainKB also integrates knowledge from diverse knowledge platform services if available for integration, providing the feature to compare knowledge from across different atlases (e.g., Allen Brain Atlases).
Find/correct Errors: BrainKB will provide a feature to search existing knowledge and correct errors if any.
Add information/API: BrainKB offers API endpoints that enable integration with its platform. These endpoints facilitate data ingestion from various sources, such as CSV files or raw text, for constructing KGs, performing search operations, and conducting analyses on the stored KGs.
Doing meta-analysis: Meta-analysis is a knowledge-intensive task that requires significant time and effort to find related studies, identify evidence items, annotate the contents, and aggregate the results [1]. BrainKB, which stores knowledge from diverse data sources, including scientific publications, facilitates the meta-analysis.

[1] Tiddi, I., Balliet, D. and ten Teije, A., 2020. Fostering scientific meta-analyses with knowledge graphs: a case-study. In The Semantic Web: 17th International Conference, ESWC 2020, Heraklion, Crete, Greece, May 31–June 4, 2020, Proceedings 17 (pp. 287-303). Springer International Publishing.

Models

Models currently used in BrainKB:

Genome Annotation Registry Service (GARS) Model
Anatomical Structure Reference Service (AnSRS) Model
Library Generation Model

Detailed descriptions of the models above are available at https://brain-bican.github.io/models/.

Technology

FastAPI
Docker
RabbitMQ
Serverless (OpenFaaS)
Python
Language models (e.g., Google BERT and LLaMA)
SPARQL

Sequence Diagram

The sequence diagram below shows the interactions between different service components for the KG construction.

sequenceDiagram
    autonumber

    participant User
    participant UI
    participant KG_Construction as (Semi-) structured KG construction
    participant Mapping as Mapping & Annotation
    participant Alignment as Alignment & resolution
    participant Validation as Validation & Quality assurance
    participant Expert
    participant Triplestore
    User->>+ UI: Upload CSV
    UI->>+KG_Construction: Forward uploaded CSV
    KG_Construction->>+KG_Construction: Perform initial check, e.g., presence of required columns

    alt is invalid
        KG_Construction-->>+UI: Return Error message
        UI-->>+User: Return Error message
    else is valid
        KG_Construction->>+ Mapping: Perform mapping & annotation as necessary
        Mapping->>+ Validation: Perform validation of KG triples
        Validation->>+Validation: Validation checks, e.g., SHACL, provenance conflict
        Validation->>+ Alignment: Resolve conflicts
        alt conflict identified, perform resolution
            Alignment->>+ Alignment: Perform automated conflict resolution and alignment operation
            Alignment-->>+ Validation: Return response (triples with resolved conflicts)
            

        else conflict identified, perform resolution-requires human oversight
            Alignment->>+ Expert: Send to expert for manual conflict resolution
            Expert-->>+Alignment: Return response (triples with resolved conflicts)
            Alignment-->>+ Validation: Return response (triples with resolved conflicts) 
            
        end 
            Validation-->>+ Mapping: Return response (validated and conflict resolved KG triples)
            Mapping-->>+ KG_Construction: Updated KG triples
            KG_Construction->>+Triplestore: Store KG in database
            Triplestore-->>+KG_Construction: Return acknowledgement
            KG_Construction-->>+UI: Return response (operation status notification)
            UI-->>+User:Send notification
    end
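
The initial check in the sequence diagram above (presence of required columns) could look roughly like the following. It is a minimal sketch using Python's standard `csv` module; the column names in `REQUIRED_COLUMNS` and the `check_csv` function are hypothetical, since the actual required columns depend on the model being ingested.

```python
import csv
import io

# Hypothetical required columns; the real set depends on the target model.
REQUIRED_COLUMNS = {"subject", "predicate", "object"}


def check_csv(text, required=REQUIRED_COLUMNS):
    """Return (ok, message) for the initial check on an uploaded CSV."""
    reader = csv.DictReader(io.StringIO(text))
    header = set(reader.fieldnames or [])  # fieldnames is None for empty input
    missing = sorted(set(required) - header)
    if missing:
        return False, "Missing required columns: " + ", ".join(missing)
    return True, "OK"


ok, message = check_csv("subject,predicate\nneuron:1,locatedIn\n")
```

When the check fails, the message is what would travel back along the "Return Error message" arrows to the UI and the user; when it passes, processing continues with mapping and annotation.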

Hosting Infrastructure

We recognize the current beta site's issues and are working to resolve them. In particular, we are addressing the following problem.

Performance: We are currently using the free version of GraphDB, which is limited to two simultaneous queries. Because fetching details, such as a Library Aliquot and its interrelated information, requires running more than two queries, this limitation affects the beta site's knowledge base page. We are currently considering the premium version of GraphDB and alternative open-source triple stores. We expect the performance of future versions of BrainKB to improve significantly.

GitHub Repository

Source code
- https://github.com/sensein/BrainKB - Backend services
- https://github.com/sensein/brainkb-ui/tree/admin-ui - UI
Developer documentation
- https://github.com/sensein/brainkbdocs

Work Plan

Features	Status
UI with NextJS	Implementation in progress
KG construction from scientific publication	Designed the approach and implementation in progress
Structured models (or ontology) design	Implementation in progress
BrainKB documentation including deployment instructions and lessons learned	Practically complete. Updates will be made as the work progresses

Timelines

Date	Event
2024-03-26	Project Conceptualization
2024-04-05	Initial Architecture Design Phase Completed
2024-04-23	Work on Design Document
2024-04-25	Development Phase Started
2024-05-25	First version of BrainKB
2024-12-25	Second version of BrainKB
2025-04-10	First complete version of BrainKB with all conceptualized features

Status

Status	Event
Completed	~~Project Conceptualization~~
Completed, updated the architecture	~~Initial Architecture Design Phase Completed~~
Initial version completed and is updating	Work on Design Document
Completed	~~Development Phase Started~~
Completed and first version has been deployed to AWS	~~First version of BrainKB~~
2024-12-25	Second version of BrainKB
2025-04-10	First complete version of BrainKB with all conceptualized features
