Fix chain implementation and improve docs (#66)
cthoyt authored Sep 7, 2023
1 parent 2cf7923 commit d10d4b7
Showing 9 changed files with 532 additions and 74 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/tests.yml
@@ -70,6 +70,9 @@ jobs:
- name: Test with pytest and generate coverage file
run:
tox run -e py-pydantic${{ matrix.pydantic }}
- name: Doctests
run:
tox run -e doctests
- name: Upload coverage report to codecov
uses: codecov/codecov-action@v1
if: success()
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -10,6 +10,7 @@ prune docs/source/api
recursive-include docs/source *.py
recursive-include docs/source *.rst
recursive-include docs/source *.png
recursive-include docs/source *.svg

global-exclude *.py[cod] __pycache__ *.so *.dylib .DS_Store *.gpickle

42 changes: 41 additions & 1 deletion README.md
@@ -74,10 +74,50 @@ will return `GO:0032571` instead of `OBO:GO_0032571`.

Full documentation is available at [curies.readthedocs.io](https://curies.readthedocs.io).

### Chaining

This package implements a chain operation, `curies.chain`, that is configurable for case
sensitivity and takes all prefix and URI prefix synonyms into account.

`chain()` prioritizes converters in the order given. Therefore, if two converters
define the same prefix with different URI prefixes, the first URI prefix is retained as the
preferred one and the second is kept as a synonym:

```python
>>> from curies import Converter, chain
>>> # c1 comes first, so its URI prefix is the preferred one
>>> c1 = Converter.from_prefix_map({"GO": "http://purl.obolibrary.org/obo/GO_"})
>>> c2 = Converter.from_prefix_map({"GO": "https://identifiers.org/go:"})
>>> converter = chain([c1, c2])
>>> converter.expand("GO:1234567")
'http://purl.obolibrary.org/obo/GO_1234567'
>>> converter.compress("http://purl.obolibrary.org/obo/GO_1234567")
'GO:1234567'
>>> # The URI prefix from c2 is kept as a synonym, so it still compresses
>>> converter.compress("https://identifiers.org/go:1234567")
'GO:1234567'
```

`chain()` is useful if you want to override parts of an existing extended
prefix map. For example, if you want to use most of the Bioregistry, but you
would like to specify a custom URI prefix (e.g., using Identifiers.org), you
can do the following:

```python
>>> from curies import Converter, chain, get_bioregistry_converter
>>> # Put the overrides first so they take priority over the Bioregistry
>>> overrides = Converter.from_prefix_map({"pubmed": "https://identifiers.org/pubmed:"})
>>> bioregistry_converter = get_bioregistry_converter()
>>> converter = chain([overrides, bioregistry_converter])
>>> converter.expand("pubmed:1234")
'https://identifiers.org/pubmed:1234'
```
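
To make the priority rule concrete, here is a minimal pure-Python sketch of the behaviour described above. It is not how `curies.chain` is implemented: the `chain_prefix_maps` helper and the record layout are hypothetical, and the sketch ignores case sensitivity and CURIE prefix synonyms.

```python
def chain_prefix_maps(prefix_maps):
    """Merge plain prefix maps in priority order.

    Hypothetical helper for illustration only; not part of the curies API.
    """
    merged = {}
    for prefix_map in prefix_maps:
        for prefix, uri_prefix in prefix_map.items():
            if prefix not in merged:
                # First occurrence wins and becomes the preferred URI prefix
                merged[prefix] = {"uri_prefix": uri_prefix, "uri_prefix_synonyms": []}
                continue
            record = merged[prefix]
            known = [record["uri_prefix"], *record["uri_prefix_synonyms"]]
            if uri_prefix not in known:
                # Later, conflicting URI prefixes are kept as synonyms
                record["uri_prefix_synonyms"].append(uri_prefix)
    return merged


merged = chain_prefix_maps(
    [
        {"GO": "http://purl.obolibrary.org/obo/GO_"},
        {"GO": "https://identifiers.org/go:"},
    ]
)
print(merged["GO"]["uri_prefix"])
print(merged["GO"]["uri_prefix_synonyms"])
```

Reversing the order of the two input maps would swap which URI prefix is preferred, which is why overrides go first in the list.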

### Standardization

The `curies.Converter` data structure supports prefix and URI prefix synonyms.
The following example demonstrates
using these synonyms to support standardizing prefixes, CURIEs, and URIs. Note below,
the colloquial prefix `gomf`, sometimes used to represent the subspace in the
[Gene Ontology (GO)](https://obofoundry.org/ontology/go) corresponding to molecular
117 changes: 115 additions & 2 deletions docs/source/struct.rst
@@ -1,4 +1,117 @@
Data Structures
===============
To do: add an explanation of prefix maps, bimaps, reverse prefix maps, extended prefix maps.
In the meantime, see https://cthoyt.com/2023/01/10/curies-package.html.
A *semantic space* is a collection of identifiers for concepts. For example,
the Chemical Entities of Biological Interest (ChEBI) database has a semantic space
including identifiers for chemicals. Within ChEBI's semantic space,
``138488`` corresponds to the chemical `alsterpaullone <https://www.ebi.ac.uk/chebi/searchId.do?chebiId=138488>`_.

.. warning::

    ``138488`` is a *local unique identifier*. Other semantic spaces might use the same local
    unique identifier to refer to a different concept in their respective domain.

Therefore, local unique identifiers should be qualified with additional information stating which
semantic space they come from. The two common formalisms for doing this are Uniform Resource
Identifiers (URIs) and Compact URIs (CURIEs):

.. image:: syntax_demo.svg
    :alt: Demo of URI and CURIE for alsterpaullone.

In many applications, it's important to be able to convert between CURIEs and URIs.
Therefore, we need a data structure that connects the CURIE prefixes like ``CHEBI``
to the URI prefixes like ``http://purl.obolibrary.org/obo/CHEBI_``.

Prefix Maps
-----------
A prefix map is a dictionary data structure where keys represent CURIE prefixes
and their associated values represent URI prefixes. Ideally, these are constrained
to be bijective (i.e., no duplicate keys, no duplicate values), but this is not always
done in practice. Here's an example prefix map containing information about semantic
spaces from a small selection of OBO Foundry ontologies:

.. code-block:: json

    {
        "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
        "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
        "GO": "http://purl.obolibrary.org/obo/GO_"
    }

Prefix maps have the benefit of being simple and straightforward.
They appear in many linked data applications, including:

- the ``@prefix`` declarations at the top of Turtle (RDF) documents and the ``PREFIX`` declarations in SPARQL queries
- `JSON-LD <https://www.w3.org/TR/json-ld11/#prefix-definitions>`_
- XML documents
- OWL ontologies

.. note::

    Prefix maps can be loaded using :meth:`curies.Converter.from_prefix_map`.
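
As a rough illustration of what a prefix map enables, the following pure-Python sketch expands CURIEs and compresses URIs with a plain dictionary and checks the bijectivity constraint mentioned above. The helper names ``is_bijective``, ``expand``, and ``compress`` are made up for this sketch and are not the :mod:`curies` implementation.

```python
PREFIX_MAP = {
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "GO": "http://purl.obolibrary.org/obo/GO_",
}


def is_bijective(prefix_map):
    """Return True if no two CURIE prefixes share a URI prefix."""
    return len(set(prefix_map.values())) == len(prefix_map)


def expand(curie, prefix_map):
    """Turn a CURIE into a URI, or return None if the prefix is unknown."""
    prefix, separator, identifier = curie.partition(":")
    if not separator or prefix not in prefix_map:
        return None
    return prefix_map[prefix] + identifier


def compress(uri, prefix_map):
    """Turn a URI into a CURIE, or return None if no URI prefix matches."""
    # Check longer URI prefixes first so the most specific match wins
    by_length = sorted(prefix_map.items(), key=lambda item: len(item[1]), reverse=True)
    for prefix, uri_prefix in by_length:
        if uri.startswith(uri_prefix):
            return f"{prefix}:{uri[len(uri_prefix):]}"
    return None


print(is_bijective(PREFIX_MAP))
print(expand("CHEBI:138488", PREFIX_MAP))
print(compress("http://purl.obolibrary.org/obo/GO_0032571", PREFIX_MAP))
```

Because each CURIE prefix maps to exactly one URI prefix here, a URI written with any other URI prefix for the same semantic space cannot be compressed.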

*However*, prefix maps have the main limitation that they do not have first-class support for
synonyms of CURIE prefixes or URI prefixes. In practice, a variety of synonyms are used
for both. For example, the NCBI Taxonomy database appears with many different CURIE prefixes:

==============  ====================================
CURIE Prefix    Resource(s)
==============  ====================================
``taxonomy``    Identifiers.org, Name-to-Thing
``taxon``       Gene Ontology Registry
``NCBITaxon``   OBO Foundry, Prefix Commons, OntoBee
``NCBITAXON``   BioPortal
``NCBI_TaxID``  Cellosaurus
``ncbitaxon``   OLS
``P685``        Wikidata
``fj07xj``      FAIRsharing
==============  ====================================

Similarly, many different URI prefixes can be used to construct URIs for the same ChEBI local
unique identifier, such as the one for alsterpaullone. These include (many omitted):

====================================================  ===================
URI Prefix                                            Provider
====================================================  ===================
``https://www.ebi.ac.uk/chebi/searchId.do?chebiId=``  ChEBI (first-party)
``https://identifiers.org/CHEBI:``                    Identifiers.org
``https://identifiers.org/CHEBI/``                    Identifiers.org
``http://identifiers.org/CHEBI:``                     Identifiers.org
``http://identifiers.org/CHEBI/``                     Identifiers.org
``http://purl.obolibrary.org/obo/CHEBI_``             OBO Foundry
``https://n2t.net/chebi:``                            Name-to-Thing
====================================================  ===================

In practice, we need to support the many CURIE prefixes and URI prefixes that exist
for most semantic spaces, and to specify which CURIE prefix and URI prefix are the
"preferred" ones in a given context. Prefix maps, unfortunately, have no way to
address this. Therefore, we're going to introduce a new data structure.
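
One way to picture the requirement is a lookup from every known CURIE prefix to a single preferred one. The sketch below uses the NCBI Taxonomy prefixes from the table above; choosing ``NCBITaxon`` as the preferred prefix is an assumption made purely for illustration, and ``standardize_curie`` is a hypothetical helper rather than a :mod:`curies` function.

```python
# Assumption for this sketch: NCBITaxon is the preferred prefix in this context
PREFERRED_PREFIX = "NCBITaxon"
PREFIX_SYNONYMS = ["taxonomy", "taxon", "NCBITAXON", "NCBI_TaxID", "ncbitaxon"]

# Map every synonym, and the preferred prefix itself, to the preferred prefix
TO_PREFERRED = {synonym: PREFERRED_PREFIX for synonym in PREFIX_SYNONYMS}
TO_PREFERRED[PREFERRED_PREFIX] = PREFERRED_PREFIX


def standardize_curie(curie):
    """Rewrite a CURIE to use the preferred prefix, or return None if unknown."""
    prefix, separator, identifier = curie.partition(":")
    preferred = TO_PREFERRED.get(prefix)
    if not separator or preferred is None:
        return None
    return f"{preferred}:{identifier}"


print(standardize_curie("taxon:9606"))
print(standardize_curie("NCBI_TaxID:9606"))
```

A plain prefix map cannot carry this information, because it has no place to record which of several equivalent prefixes is the preferred one.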

Extended Prefix Maps
--------------------
Extended Prefix Maps (EPMs) address the issues with prefix maps by including explicit
fields for CURIE prefix synonyms and URI prefix synonyms while maintaining an explicit
field for the preferred CURIE prefix and URI prefix. An abbreviated example (just
containing an entry for ChEBI) looks like:

.. code-block:: json

    [
        {
            "prefix": "CHEBI",
            "uri_prefix": "http://purl.obolibrary.org/obo/CHEBI_",
            "prefix_synonyms": ["chebi"],
            "uri_prefix_synonyms": [
                "https://identifiers.org/chebi:"
            ]
        }
    ]

EPMs have the benefit that they are still encoded in JSON and can easily be encoded in
YAML, TOML, RDF, and other formats.

.. note::

    We are introducing EPMs as a new standard in the :mod:`curies` package. They
    can be loaded using :meth:`curies.Converter.from_extended_prefix_map`.
    We provide a Pydantic model representing an EPM. Later, we hope to have an external,
    stable definition of this data schema.
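
To show how the extra fields can be used, here is a small pure-Python sketch that builds two lookup tables from EPM records and uses them to expand and compress through synonyms. It is only an illustration under simplifying assumptions: ``build_indexes``, ``expand``, and ``compress`` are hypothetical helpers, not the :mod:`curies` implementation.

```python
EPM = [
    {
        "prefix": "CHEBI",
        "uri_prefix": "http://purl.obolibrary.org/obo/CHEBI_",
        "prefix_synonyms": ["chebi"],
        "uri_prefix_synonyms": ["https://identifiers.org/chebi:"],
    }
]


def build_indexes(records):
    """Index every prefix and URI prefix, including synonyms, by its preferred partner."""
    prefix_to_uri_prefix = {}
    uri_prefix_to_prefix = {}
    for record in records:
        for prefix in [record["prefix"], *record.get("prefix_synonyms", [])]:
            prefix_to_uri_prefix[prefix] = record["uri_prefix"]
        for uri_prefix in [record["uri_prefix"], *record.get("uri_prefix_synonyms", [])]:
            uri_prefix_to_prefix[uri_prefix] = record["prefix"]
    return prefix_to_uri_prefix, uri_prefix_to_prefix


PREFIX_INDEX, URI_PREFIX_INDEX = build_indexes(EPM)


def expand(curie):
    """Expand a CURIE written with any known prefix to the preferred URI."""
    prefix, separator, identifier = curie.partition(":")
    if not separator or prefix not in PREFIX_INDEX:
        return None
    return PREFIX_INDEX[prefix] + identifier


def compress(uri):
    """Compress a URI written with any known URI prefix to the preferred CURIE."""
    # Check longer URI prefixes first so the most specific match wins
    for uri_prefix in sorted(URI_PREFIX_INDEX, key=len, reverse=True):
        if uri.startswith(uri_prefix):
            return f"{URI_PREFIX_INDEX[uri_prefix]}:{uri[len(uri_prefix):]}"
    return None


print(expand("chebi:138488"))
print(compress("https://identifiers.org/chebi:138488"))
```

Both calls resolve through a synonym yet produce output in the preferred form, which is the behaviour that a plain prefix map cannot express.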
1 change: 1 addition & 0 deletions docs/source/syntax_demo.svg