Handling of CURIEs that include square brackets doesn't conform to W3C specs and is incompatible with semantic web tools #103

Open
cmungall opened this issue Feb 15, 2024 · 1 comment

Comments

@cmungall
Contributor

cmungall commented Feb 15, 2024

There are some cases of bioregistry "CURIEs" allowing square brackets in the local id. This is questionable if we follow the (IMO frustratingly opaque) W3C specs.

Here are some examples of what is permitted in bioregistry

(it is of course a stretch to call these IDs (biopragmatics/bioregistry#460))

These work perfectly well in the context of bioregistry; clicking on this will resolve to a nice picture of a molecule, which is what most bioregistry users want.

https://bioregistry.io/reference/smiles:CC(=O)NC([H])(C)C(=O)O

Let's see what happens when we try and use this with tooling that actually supports W3C specs:

{
  "@context": {
    "@base": "http://example.org",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "smiles": "https://bioregistry.io/smiles:"
  },
  "@id": "smiles:CC(=O)NC([H])(C)C(=O)O",
  "@type": "Molecule",
  "rdfs:label": "Acetaminophen"
}

using Jena:

riot --strict smiles.jsonld
16:33:06 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
16:33:06 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
<https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Molecule> .
<https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> <http://www.w3.org/2000/01/rdf-schema#label> "Acetaminophen" .

Not pretty, but Jena does process it, even in strict mode.

However, it refuses to validate it:

riot --validate smiles.jsonld || echo fail
16:38:10 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
16:38:10 WARN  riot            :: Bad IRI: <https://bioregistry.io/smiles:CC(=O)NC([H])(C)C(=O)O> Code: 0/ILLEGAL_CHARACTER in PATH: The character violates the grammar rules for URIs/IRIs.
fail

In contrast, https://json-ld.org/playground/ does not complain.

I suspect the Rust toolchains are stricter.

Removing or escaping the []s allows it to validate (note that ()s are frequently URL-encoded, but they are still valid unencoded).
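
As a rough illustration of the escaping route, the snippet below percent-encodes the square brackets in the local identifier using only Python's standard library. The URI prefix is the one from the JSON-LD context above; the helper name `encode_local_id` and the choice of which characters to leave unencoded are assumptions made for this sketch, not anything defined by curies or bioregistry.

```python
from urllib.parse import quote, unquote

# URI prefix taken from the JSON-LD context in this issue
SMILES_PREFIX = "https://bioregistry.io/smiles:"


def encode_local_id(local_id: str) -> str:
    """Percent-encode everything except unreserved characters, "(", ")" and "=".

    Hypothetical helper. This is deliberately conservative: it also encodes
    some characters that are legal in a path (e.g. "@"), which is harmless.
    """
    return quote(local_id, safe="()=")


local_id = "CC(=O)NC([H])(C)C(=O)O"
encoded = encode_local_id(local_id)
print(SMILES_PREFIX + encoded)
# prints https://bioregistry.io/smiles:CC(=O)NC(%5BH%5D)(C)C(=O)O

# The encoding is reversible, so the original string can be recovered
assert unquote(encoded) == local_id
```

Only the brackets change here; the parentheses and `=` are left alone.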

What are our options?

  1. Make curies always strict. Forbid [] or encodings thereof. These are poor choices for bona fide IDs. Don't try to overload the CURIE concept for languages like HGVS, UCUM, SMILES, InChI, etc.
  2. Go your own way. Explicitly document that curies isn't for CURIEs as defined by W3C specs; it's just prefixed IDs that expand to URLs that work in browsers, with no commitments to any specifications outside those in this repo.
  3. Make curies conform to W3C specs, and force []s to be encoded (as the UOM people are doing for UCUM; see "Discussion about how to improve UCUM", bioregistry#648). This could retroactively break things, and confuse people who want to use curies in its intended YOLO fashion.
  4. Attempt some formalization where we have loose CURIEs and strict CURIEs and a formal mapping between them (basically URL encoding []s, probably spaces while we are at it)

I think these are all horrible but then I've always said the decision to couple identifiers to networking protocols was a terrible one.

I think 4 is likely the most practical, but this will take some careful planning. There will essentially be the following transforms:

 looseCURIE <-> strictCURIE
     ^      \  /      ^
     |       X        |
     v      /  \      v
 looseURI   <-> strictURI

(likely implemented with flags on existing expand/contract, with new methods for like-to-like)
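
To make the square of transforms a bit more concrete, here is a minimal Python sketch. It is not the curies API: the function names, the `encode`/`decode` flags, the `PREFIX_MAP` dict and the set of characters left unencoded are all assumptions for illustration. The `expand`/`contract` functions cover the vertical and diagonal arrows via a flag, and two separate helpers cover the like-to-like (horizontal) arrows.

```python
from urllib.parse import quote, unquote

# Hypothetical prefix map; only the smiles entry comes from this issue
PREFIX_MAP = {"smiles": "https://bioregistry.io/smiles:"}

# Characters left unencoded in the strict form (an assumption for this sketch)
SAFE = "()="


def strictify_curie(curie: str) -> str:
    """looseCURIE -> strictCURIE (like-to-like)."""
    prefix, local_id = curie.split(":", 1)
    return f"{prefix}:{quote(local_id, safe=SAFE)}"


def loosen_curie(curie: str) -> str:
    """strictCURIE -> looseCURIE (like-to-like)."""
    prefix, local_id = curie.split(":", 1)
    return f"{prefix}:{unquote(local_id)}"


def expand(curie: str, *, encode: bool = False) -> str:
    """CURIE -> URI. With encode=True, a loose CURIE becomes a strict URI."""
    prefix, local_id = curie.split(":", 1)
    if encode:
        local_id = quote(local_id, safe=SAFE)
    return PREFIX_MAP[prefix] + local_id


def contract(uri: str, *, decode: bool = False) -> str:
    """URI -> CURIE. With decode=True, a strict URI becomes a loose CURIE."""
    for prefix, uri_prefix in PREFIX_MAP.items():
        if uri.startswith(uri_prefix):
            local_id = uri[len(uri_prefix):]
            if decode:
                local_id = unquote(local_id)
            return f"{prefix}:{local_id}"
    raise ValueError(f"no prefix matches {uri!r}")


loose_curie = "smiles:CC(=O)NC([H])(C)C(=O)O"
strict_curie = strictify_curie(loose_curie)
strict_uri = expand(loose_curie, encode=True)  # diagonal arrow

assert expand(strict_curie) == strict_uri      # vertical arrow, strict side
assert contract(strict_uri, decode=True) == loose_curie
assert loosen_curie(strict_curie) == loose_curie
```

In this sketch the caller has to know whether the input is loose or strict; passing an already-strict CURIE with `encode=True` would double-encode the `%` signs.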

What is annoying is that there is, AFAICT, no way to get JSON-LD contexts to specify the diagonal conversion.

@cthoyt
Member

cthoyt commented Feb 27, 2024

@cmungall thanks for the comment.

I don't think that making this package strict by default will make many people happy; almost everyone in this space is in YOLO mode.

However, CURIEs can be used in both a "correct" way and an incorrect way, and this is the user's choice. We can try to help them make better choices by providing an alternate implementation of the Converter class that follows strict rules and also provides some appropriate utilities for encoding CURIEs.


2 participants