Upgrade to datacite v4.5 serialization from inveniosoftware #261
base: master
Codecov Report
All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##           master     #261      +/-   ##
==========================================
- Coverage   97.71%   97.66%   -0.06%
==========================================
  Files          16       16
  Lines        1753     1753
==========================================
- Hits         1713     1712       -1
- Misses         40       41       +1
8aebc6d
to
6ef121e
Compare
dandischema/datacite.py
Outdated
attributes["publisher"] = {
    "name": "DANDI Archive",
    "schemeUri": "https://scicrunch.org/resolver/",
    "publisherIdentifier": "https://scicrunch.org/resolver/RRID:SCR_017571",
is this really the identifier? or just SCR_017571
in all of the examples https://datacite-metadata-schema.readthedocs.io/en/4.5/properties/publisher/#a-publisheridentifier they used a URL, so I just followed the trend... it is indeed quite an odd schema
some memory bank item says something about alternateIdentifiers that we implemented and then reverted because the API service did not match the schema requirements. so simply putting a flag here for checking.
FWIW -- the search comes up empty: https://github.com/search?q=repo%3Adandi%2Fdandi-schema%20alternateIdentifiers&type=code :-/ I could be wrong, but I think we are unit-testing this against the jsonschema and the test Fabrica of DataCite, right @djarecka?
we have some unit tests that were previously based on the schema that allowed for an additional field.
I also just want to clarify, that the datacite documentation v4.5 still have so it's not completely clear to me if we should follow the schema from |
Situation with "identifiers" is messy. We relied on it, but it was not in the datacite schema; it was only allowed by the API: https://support.datacite.org/docs/what-is-the-identifiers-attribute-in-the-rest-api

> When creating or updating DOI alternateIdentifier metadata, the REST API accepts values in either the alternateIdentifiers or identifiers attributes. Including metadata in either attribute will populate the identifiers and alternateIdentifiers attributes in the REST API response and the alternateIdentifiers property in DataCite XML.

And in the jsonschema serialization of 4.5 "identifiers" was removed; see more in inveniosoftware/datacite#81 (comment) and therein. But I guess the currently used 4.3 from datacite (not inveniosoftware) still requires "identifiers", and hence this commit/solution is incomplete since it does fail validation (see below). "identifiers" was removed from required only in 4.5 from inveniosoftware.

❯ python -m pytest -s -v dandischema/tests/test_datacite.py
========================= test session starts =========================
platform linux -- Python 3.12.6, pytest-8.3.3, pluggy-1.5.0 -- /home/yoh/proj/dandi/dandischema/venv/3/bin/python
cachedir: .pytest_cache
rootdir: /home/yoh/proj/dandi/dandischema
configfile: tox.ini
plugins: rerunfailures-14.0, cov-6.0.0
collected 14 items

dandischema/tests/test_datacite.py::test_datacite[000004] FAILED
dandischema/tests/test_datacite.py::test_datacite[000008] FAILED
dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta0-datacite_checks0] FAILED
dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta1-datacite_checks1] FAILED
dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta2-datacite_checks2] FAILED
dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta3-datacite_checks3] FAILED
dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta4-datacite_checks4] FAILED
dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta5-datacite_checks5] FAILED
dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta6-datacite_checks6] FAILED
dandischema/tests/test_datacite.py::test_datacite_publish PASSED
dandischema/tests/test_datacite.py::test_datacite_related_res_url[related_res_url0-related_ident_exp0] PASSED
dandischema/tests/test_datacite.py::test_datacite_related_res_url[related_res_url1-related_ident_exp1] PASSED
dandischema/tests/test_datacite.py::test_datacite_related_res_url[related_res_url2-related_ident_exp2] PASSED
dandischema/tests/test_datacite.py::test_datacite_related_res_url[related_res_url3-related_ident_exp3] PASSED

=============================== FAILURES ===============================
________________________ test_datacite[000004] ________________________
dandischema/tests/test_datacite.py:160: in test_datacite
    datacite = to_datacite(meta=meta, validate=True)
dandischema/datacite.py:238: in to_datacite
    validate_datacite(datacite_dict)
dandischema/datacite.py:258: in validate_datacite
    validator.validate(datacite_dict["data"]["attributes"])
venv/3/lib/python3.12/site-packages/jsonschema/validators.py:451: in validate
    raise error
E jsonschema.exceptions.ValidationError: 'identifiers' is a required property
E
E Failed validating 'required' in schema:
E {'$schema': 'http://json-schema.org/draft-07/schema#',
E 'definitions': {'nameType': {'type': 'string',
E 'enum': ['Organizational', 'Personal']},
E 'nameIdentifiers': {'type': 'array',
E 'items': {'type': 'object',
E 'properties': {'nameIdentifier': {'type': 'string'},
E 'nameIdentifierScheme': {'type': 'string'},
E 'schemeURI': {'type': 'string',
E 'format': 'uri'}},
E 'required': ['nameIdentifier',
E 'nameIdentifierScheme']},
…eniosoftware

Done in hope to see "non-standard" identifiers being gone but immediate fail is

___ test_dandimeta_datacite[additional_meta6-datacite_checks6] _
dandischema/tests/test_datacite.py:407: in test_dandimeta_datacite
    validator.validate(datacite["data"]["attributes"])
venv/3/lib/python3.12/site-packages/jsonschema/validators.py:451: in validate
    raise error
E jsonschema.exceptions.ValidationError: 'DANDI Archive' is not of type 'object'
E
E Failed validating 'type' in schema['properties']['publisher']:
E {'type': 'object',
E 'additionalProperties': False,
E 'properties': {'name': {'type': 'string'},
E 'publisherIdentifier': {'type': 'string'},
E 'publisherIdentifierScheme': {'type': 'string'},
E 'schemeUri': {'type': 'string', 'format': 'uri'},
E 'lang': {'type': 'string'}},
E 'required': ['name']}
E
E On instance['publisher']:
E 'DANDI Archive'

So we need to standardize "publisher" better
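The traceback above shows the 4.5 jsonschema expecting `publisher` to be an object rather than a bare string. A minimal sketch of that conversion follows, with property names copied from the schema excerpt in the error. `normalize_publisher` and `check_publisher` are hypothetical helpers, not part of dandischema, and the hand-rolled check only mirrors the `required`, `additionalProperties` and string-type rules quoted above; it is not a replacement for real jsonschema validation.

```python
from typing import Any, Dict, Union

# Property names copied from the schema excerpt in the traceback above.
ALLOWED_PUBLISHER_KEYS = {
    "name",
    "publisherIdentifier",
    "publisherIdentifierScheme",
    "schemeUri",
    "lang",
}


def normalize_publisher(publisher: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return publisher as an object; wrap a bare string as {"name": ...}."""
    if isinstance(publisher, str):
        return {"name": publisher}
    return dict(publisher)


def check_publisher(publisher: Dict[str, Any]) -> None:
    """Hypothetical mirror of the rules quoted in the traceback (not jsonschema)."""
    if "name" not in publisher:
        raise ValueError("'name' is a required property")
    extra = set(publisher) - ALLOWED_PUBLISHER_KEYS
    if extra:
        raise ValueError(f"unexpected properties: {sorted(extra)}")
    for key, value in publisher.items():
        if not isinstance(value, str):
            raise ValueError(f"{key!r} is not of type 'string'")


# The value that failed validation above, converted to the object form.
publisher = normalize_publisher("DANDI Archive")
check_publisher(publisher)
```

A fuller object, such as the one in the diff under review, passes through `normalize_publisher` unchanged and can be checked the same way.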
6ef121e
to
dc19168
Compare
but there is no |
yes, you're right, |
yes, as I think we have run into on a number of occasions, they have differences between the schema, the documentation, and the jsonschema serializations, since unfortunately the documentation and the jsonschema are not automatically produced from the schema, IIRC. Hence all the differences. I think it would be best to rely on the schema (not the documentation) as the "ultimate ground truth", with the hope that the jsonschema would eventually become autogenerated/fully reflective of the original schema. FWIW -- tests pass here. That means that we are in compliance with the test Fabrica of DataCite and can proceed, right?
FTR, here is
|
@yarikoptic - so are you going to revert back to identifiers? was there a rationale for changing it?
no, I do not see the point of reverting to accommodate API interface for which we do not have a validator. The rationale for my change is to be able to validate our metadata against jsonschema of datacite record we have access to. Do you have concerns/reservations?
as i said, in the past we did this against the schema and they said it's not supported in the api. since we test against the api, that's what i would hold as our functional target. i believe this happened as we were suggesting changes to the schema. at least at that time, almost 5 years ago, they considered their documentation to be the standard, not the schema. and their api reflected the documentation. if things have changed, i would go with that. but i don't see how see the schema here: https://datacite-metadata-schema.readthedocs.io/en/4.5/properties/identifier/ they used to consider the schema and the XSD file as what they support. anything json schema was secondary. if they have changed that, go ahead. but i would check the XSD. that's what we did and it did ensure API success.
XSD 4.5
In json schema
Here is the XML of the datacite record we get for a random sample dandiset we have published
❯ curl -LH "Accept: application/vnd.datacite.datacite+xml" https://doi.org/10.48324/DANDI.000897/0.240605.1710
<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd">
<identifier identifierType="DOI">10.48324/DANDI.000897/0.240605.1710</identifier>
<creators>
<creator>
<creatorName nameType="Personal">Neupane, Sujaya</creatorName>
<givenName>Sujaya</givenName>
<familyName>Neupane</familyName>
<nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0002-0052-3122</nameIdentifier>
</creator>
<creator>
<creatorName nameType="Personal">Fiete, Ila</creatorName>
<givenName>Ila</givenName>
<familyName>Fiete</familyName>
<nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0003-4738-2539</nameIdentifier>
</creator>
<creator>
<creatorName nameType="Personal">Jazayeri, Mehrdad</creatorName>
<givenName>Mehrdad</givenName>
<familyName>Jazayeri</familyName>
<nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0002-9764-6961</nameIdentifier>
</creator>
</creators>
<titles>
<title>Neupane_Fiete_Jazayeri_Mental navigation_NHP_EntorhinalCortex</title>
</titles>
<publisher>DANDI Archive</publisher>
<publicationYear>2024</publicationYear>
<resourceType resourceTypeGeneral="Dataset">Neural Data</resourceType>
<subjects>
<subject>entorhinal cortex, cognitive map, mental navigation,</subject>
</subjects>
<contributors>
<contributor contributorType="ContactPerson">
<contributorName nameType="Personal">Neupane, Sujaya</contributorName>
<givenName>Sujaya</givenName>
<familyName>Neupane</familyName>
<nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0002-0052-3122</nameIdentifier>
</contributor>
<contributor contributorType="ContactPerson">
<contributorName nameType="Personal">Jazayeri, Mehrdad</contributorName>
<givenName>Mehrdad</givenName>
<familyName>Jazayeri</familyName>
<nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0002-9764-6961</nameIdentifier>
</contributor>
</contributors>
<alternateIdentifiers>
<alternateIdentifier alternateIdentifierType="URL">https://identifiers.org/DANDI:000897/0.240605.1710</alternateIdentifier>
<alternateIdentifier alternateIdentifierType="URL">https://dandiarchive.org/dandiset/000897/0.240605.1710</alternateIdentifier>
</alternateIdentifiers>
<sizes/>
<formats/>
<version/>
<rightsList>
<rights rightsIdentifier="cc_by_40" rightsIdentifierScheme="SPDX"/>
</rightsList>
<descriptions>
<description descriptionType="Abstract">The dataset contains electrophysiology data recorded from the entorhinal cortex of two NHPs performing a mental navigation task. The recording probes used were V-probe with 32 channels or 64 channels, manufactured by Plexon Inc. </description>
</descriptions>
<fundingReferences>
<fundingReference>
<funderName>National Institute of Mental Health</funderName>
<funderIdentifier funderIdentifierType="ROR">https://ror.org/05xj56w78</funderIdentifier>
<awardNumber>NIMH-MH129046</awardNumber>
</fundingReference>
<fundingReference>
<funderName>Natural Science and Engineering Council of Canada</funderName>
<funderIdentifier funderIdentifierType="ROR">https://ror.org/01h531d29</funderIdentifier>
<awardNumber>NSERC PDF-516867-2018</awardNumber>
</fundingReference>
</fundingReferences>
</resource>
so you can see that
AFAIK we are testing submission of our datacite records against the test Fabrica, so switching to alternateIdentifiers here should work just fine.
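For checking other published dandisets the same way, here is a minimal Python sketch of the `curl` content negotiation above plus extraction of the `alternateIdentifier` entries from the returned XML. The helper names `build_request` and `alternate_identifiers` are hypothetical; the namespace and the `Accept` value are copied from the record and command above, and the sample is a trimmed copy of that record so that nothing here needs network access.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace taken from the xmlns attribute of the <resource> element above.
DATACITE_NS = {"dc": "http://datacite.org/schema/kernel-4"}
DATACITE_XML = "application/vnd.datacite.datacite+xml"


def build_request(doi: str) -> urllib.request.Request:
    """Build the equivalent of: curl -LH "Accept: ..." https://doi.org/<doi>"""
    return urllib.request.Request(
        f"https://doi.org/{doi}", headers={"Accept": DATACITE_XML}
    )


def alternate_identifiers(xml_text: str) -> List[Tuple[str, str]]:
    """Return (type, value) pairs for every alternateIdentifier in a record."""
    root = ET.fromstring(xml_text)
    path = "dc:alternateIdentifiers/dc:alternateIdentifier"
    return [
        (el.get("alternateIdentifierType", ""), (el.text or "").strip())
        for el in root.findall(path, DATACITE_NS)
    ]


# Trimmed copy of the record above, so the sketch runs offline.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.48324/DANDI.000897/0.240605.1710</identifier>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="URL">https://identifiers.org/DANDI:000897/0.240605.1710</alternateIdentifier>
    <alternateIdentifier alternateIdentifierType="URL">https://dandiarchive.org/dandiset/000897/0.240605.1710</alternateIdentifier>
  </alternateIdentifiers>
</resource>"""

pairs = alternate_identifiers(SAMPLE)
request = build_request("10.48324/DANDI.000897/0.240605.1710")
```

To query the live record, pass `request` to `urllib.request.urlopen` and hand the response bytes to `ET.fromstring`; `urlopen` follows redirects by default, which covers the `-L` flag of the `curl` call.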
thank you @yarikoptic for the explanation. i'm now aligned.
    "name": "DANDI Archive",
    "schemeUri": "https://scicrunch.org/resolver/",
    "publisherIdentifier": "https://scicrunch.org/resolver/RRID:SCR_017571",
    "publisherIdentifierScheme": "RRID",
RRID technically isn't supported yet, though it should be very soon (when 4.6 is released). It doesn't fail out on the jsonschema because of a bug inveniosoftware/datacite#103
It may work right now because DataCite is rolling out 4.6 support, but is worth double checking it goes through to Fabrica.
We checked, nothing in 4.6 PR seems to suggest for it to become a controlled dictionary, just a "string" ATM, so I think we should keep it.
FWIW -- I did open and close
I guess we could have switched to https://www.re3data.org/repository/r3d100013638 which also points to the RRID. But it also shows that maybe here it could be a publisherIdentifiers list of records? (I don't want even to suggest that "officially" ;-) )
Ah, you're right. I got messed up by the examples and by the fact that RRID is going to be supported as a related identifier type in 4.6. I was just on the ROR call today, where they showed that the data in this field is a mess. But that's what happens without a controlled list.
This looks like the right changes to me. From the inveniosoftware/datacite side, we test that all the example json records make it to Fabrica. Around 4.3-4.4 there were issues with parts of the json representations not making it through (it's why we never did a 4.4 jsonschema), but I think all those issues other than GeoLocationPolygon have been resolved. If you find anything weird, though, just open an issue and we'll fix it. The goal is that anything that validates with the jsonschema goes through to Fabrica.
This makes it all consistent with funderIdentifier, alternateIdentifier and maybe others. rightsIdentifier was found to be different (there is rightsURI, no schemeUri)
It was kind of historically dragged through time from the original 1056afb49fd945afc471e200d782bbff9de43cf1 in dandi-cli, where it was added to "identifiers". As the wise @tmorell has mentioned, since it is DataCite that provides the DOI, it would ignore DOI "alternateIdentifiers". I think it makes sense overall, although there could potentially be multiple DOIs for a single dandiset -- nothing in DOI principles forbids it. But since the original purpose here is not clear, we better just strip it away, since the DOI minted by DataCite Fabrica should be the one for the PublishedDandiset
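The change described above, moving entries from "identifiers" to "alternateIdentifiers" and stripping the DOI entry, can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `to_alternate_identifiers` and `migrate_attributes` are hypothetical names, and the legacy entries are assumed to use `identifier`/`identifierType` keys.

```python
from typing import Any, Dict, List


def to_alternate_identifiers(identifiers: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Map legacy 'identifiers' entries to 'alternateIdentifiers', skipping DOIs."""
    result = []
    for item in identifiers:
        # Assumed legacy key names: "identifier" and "identifierType".
        if item.get("identifierType", "").upper() == "DOI":
            continue  # the DOI is minted by DataCite, so it is not repeated here
        result.append(
            {
                "alternateIdentifier": item["identifier"],
                "alternateIdentifierType": item.get("identifierType", ""),
            }
        )
    return result


def migrate_attributes(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of attributes without 'identifiers', merged into 'alternateIdentifiers'."""
    migrated = {k: v for k, v in attributes.items() if k != "identifiers"}
    converted = to_alternate_identifiers(attributes.get("identifiers", []))
    if converted:
        existing = list(attributes.get("alternateIdentifiers", []))
        migrated["alternateIdentifiers"] = existing + converted
    return migrated


legacy = {
    "publisher": {"name": "DANDI Archive"},
    "identifiers": [
        {"identifier": "10.48324/dandi.000897/0.240605.1710", "identifierType": "DOI"},
        {
            "identifier": "https://dandiarchive.org/dandiset/000897/0.240605.1710",
            "identifierType": "URL",
        },
    ],
}
migrated = migrate_attributes(legacy)
```

The input dict is left unmodified, which makes it easy to compare old and new serializations side by side in a test.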
Some extra info on "dichotomy" of datacite json serializations could be found in
and therein. Part of this effort is refactoring the use of "identifiers" (removed in the 4.5 serialization)
Might want to wait for addressing need to include schemas as stated in #260 (comment)
@djarecka you were looking into upgrades also on Friday -- what other items were due?