Tackling global challenges, including rare disease and climate change, requires the integration of a large number of disparate data sources. Due to the decentralised nature of data standardisation, where different data providers inevitably employ different controlled vocabularies and ontologies to standardise their data, it becomes crucial to integrate such "semantic spaces" (i.e. data spaces that are described using divergent sets of ontologies).
Linking entities across often huge semantic spaces at scale is crucial. For example, integrating genetic associations for disease provided by a disease data resource such as the Online Mendelian Inheritance in Man (OMIM) with the phenotypic associations to the same disease provided by Orphanet requires mapping different disease identifiers that refer to the exact same real-world disease concept. Manually mapping thousands of disease concepts for two semantic spaces is potentially doable manually, but in the real world, dozens of resources providing information about the same data type (disease, genes, environment, organisms) need to be integrated, which makes a purely manual approach infeasible.
Semantic entity matching is the process of associating a term or identifier A in one semantic space to one or more terms or identifiers B in another, where A and B refer to the same or related real-world concepts. A common method to automate semantic entity matching is to use lexical methods, in particular matching on primary or alternative labels (synonyms) that have been assigned to concepts, sometimes in combination with lexical normalization. These methods can often provide very high recall, but low precision, due to lexical ambiguity. Examples are provided in @tbl:example-matches, including a false match between an aeroplane part and an insect part due to sharing the same name (wing) based on analogous function.
Resource A | Concept A | Resource B | Concept B | Predicted | True Predicate |
---|---|---|---|---|---|
UK Auto Ontology | Car | Industrial Ontology | Automobile | n/a | exactMatch |
Train Ontology | Car | Industrial Ontology | RailwayCarriage | n/a | closeMatch |
Fly ontology | Wing | Industrial ontology | Wing | exactMatch |
differentFrom |
Table: Example of entity matching problem {#tbl:example-matches}
An example of this approach is the LOOM algorithm used in the Bioportal ontology resource [@pubmed:20351849], which provides very high recall mappings across over a thousand ontologies and other controlled vocabularies.
A number of approaches can give higher precision mappings, many of these make use of other relationships or properties in the ontology. The Ontology Alignment Evaluation Initiative (OAEI) provides a yearly evaluation of different methods for ontology matching. One of the top-performing methods in OAEI is the LogMap tool, which makes use of logical axioms in the ontology to assist in mapping.
Deep learning approaches and in particular Language Models (LMs) have been applied to ontology matching tasks. Some methods make use of embedding distance OntoEmma [@doi:10.48550/arXiv.1806.07976], DeepAlignment [@doi:10.18653/v1/N18-1072], VeeAlign [@doi:10.48550/arXiv.2010.11721]. More recently the Truveta Mapper [@doi:10.48550/arXiv.2301.09767] treats matching as a Translation task and involves pre-training on ontology structures.
The most recent development in LMs are so-called Large Language Models (LLMs), exemplified by ChatGPT, which involved billions of parameters and pre-training on instruction-prompting tasks. The resulting models have generalizable abilities to perform a wide range of tasks, including question answering, information extraction. However, one challenge with LLMs is the problem of hallucination. An hallucination describes a situation where an AI model “fabricates” information that does not directly correspond to the provided input. He et al (2023) [@doi:10.48550/arXiv.2309.07172] have noted the need for investigating LLM's for the Ontology Matching problem, and determined that their use is "promising", while some challenges (like prompt tuning and overall framework design) still need to be addressed.
Given their performance on any tasks related to the understanding and generation natural language, it seems obvious to employ LLMs directly as a powerful, scalable alternative to current state-of-the-art (SOTA) methods for entity matching. One possibility is using LLMs like the ones employed by ChatGPT to generate mappings de-novo. However, the problem of hallucination makes this unreliable, in particular, due to the propensity for LLMs to hallucinate database or ontology identifiers when these are requested.
We devised an alternative approach called MapperGPT that does not use GPT models to generate mappings de-novo, but instead works in concert with existing high-recall methods such as LOOM [@pubmed:20351849]. We use a GPT model to refine and predict relationships (predicates) as a post-processing step, essentially for the purpose of isolating and removing false positive mappings. We use an in-context knowledge-driven semantic approach, in which examples of different mapping categories and information about the two concepts in a mapping is provided to the model to determine an appropriate mapping relationship. We use the Simple Standard for Sharing Ontological Mapping (SSSOM) [@doi:10.1093/database/baac035] for sharing and comparing entity mappings across systems.
We evaluated this on a series of alignment tasks from different domains, including anatomy, developmental biology, and renal diseases. We devised a collection of tasks that are designed to be particularly challenging for lexical methods. We show that when used in combination with high-recall methods such as LOOM or OAK LexMatch [@doi:10.5281/zenodo.8310471], MapperGPT can provide a substantial improvement in accuracy beating SOTA methods such as LogMap.
Our contributions are as follows:
- The creation of a series of new matching tasks expressed using the SSSOM standard
- An algorithm and tool, MapperGPT, that uses a GPT model to predict relationships between concepts