Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to datacite v4.5 serialization from inveniosoftware #261

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

yarikoptic
Copy link
Member

Some extra info on "dichotomy" of datacite json serializations could be found in

and there in. Part of this effort is RFing use of "identifiers" (removed in 4.5 serialization)

Might want to wait for addressing need to include schemas as stated in #260 (comment)

@djarecka you were looking into upgrades also on Friday -- what other items were due?

Copy link

codecov bot commented Nov 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.66%. Comparing base (9d068b9) to head (c48079c).
Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #261      +/-   ##
==========================================
- Coverage   97.71%   97.66%   -0.06%     
==========================================
  Files          16       16              
  Lines        1753     1753              
==========================================
- Hits         1713     1712       -1     
- Misses         40       41       +1     
Flag Coverage Δ
unittests 97.66% <100.00%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

attributes["publisher"] = {
"name": "DANDI Archive",
"schemeUri": "https://scicrunch.org/resolver/",
"publisherIdentifier": "https://scicrunch.org/resolver/RRID:SCR_017571",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really the identifier? or just SCR_017571

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in all of the examples https://datacite-metadata-schema.readthedocs.io/en/4.5/properties/publisher/#a-publisheridentifier they used a URL, so I just followed the trend... it is indeed quite an odd schema

Copy link
Member

@satra satra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some memory bank item says something about alternateIdentifiers that we implemented and then reverted because the API service did not match the schema requirements. so simply putting a flag here for checking.

@yarikoptic
Copy link
Member Author

FWIW -- search comes empty https://github.com/search?q=repo%3Adandi%2Fdandi-schema%20alternateIdentifiers&type=code :-/ I could be wrong but I think we are unittesting this against jsonschema and test fabric of datacite, right @djarecka ?

@djarecka
Copy link
Member

FWIW -- search comes empty https://github.com/search?q=repo%3Adandi%2Fdandi-schema%20alternateIdentifiers&type=code :-/ I could be wrong but I think we are unittesting this against jsonschema and test fabric of datacite, right @djarecka ?

we have some unit test that were previously based on the schema that allowed for additional field.
We also upload to testing datacite api, but as I mentioned before, I had some memories that they validation at the testing api was not exactly the same as the one for proper datacite

@djarecka
Copy link
Member

djarecka commented Nov 12, 2024

I also just want to clarify, that the datacite documentation v4.5 still have Identifier: https://datacite-metadata-schema.readthedocs.io/_/downloads/en/4.5/pdf/ (page 11)

so it's not completely clear to me if we should follow the schema from inveniosoftware

Situation with "identifiers" is messy.  We relied on it but it was not in
datacite schema, but was allowed by API:

https://support.datacite.org/docs/what-is-the-identifiers-attribute-in-the-rest-api

> When creating or updating DOI alternateIdentifier metadata, the REST API accepts values in either the alternateIdentifiers or identifiers attributes. Including metadata in either attribute will populate the identifiers and alternateIdentifiers attributes in the REST API response and the alternateIdentifiers property in DataCite XML.

And in jsonschema serialization of 4.5 "identifiers" was removed, see more in

inveniosoftware/datacite#81 (comment)

and there in.  But I guess currently used 4.3 from datacite (not
inveniosoftware) is still requiring identifiers, and hence this commit/solution
is incomplete since does fail validation (see below). "identifiers" was removed
from required only in 4.5 from inveniosoftware.

	❯ python -m pytest -s -v dandischema/tests/test_datacite.py
	============================================================= test session starts ==============================================================
	platform linux -- Python 3.12.6, pytest-8.3.3, pluggy-1.5.0 -- /home/yoh/proj/dandi/dandischema/venv/3/bin/python
	cachedir: .pytest_cache
	rootdir: /home/yoh/proj/dandi/dandischema
	configfile: tox.ini
	plugins: rerunfailures-14.0, cov-6.0.0
	collected 14 items

	dandischema/tests/test_datacite.py::test_datacite[000004] FAILED
	dandischema/tests/test_datacite.py::test_datacite[000008] FAILED
	dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta0-datacite_checks0] FAILED
	dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta1-datacite_checks1] FAILED
	dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta2-datacite_checks2] FAILED
	dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta3-datacite_checks3] FAILED
	dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta4-datacite_checks4] FAILED
	dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta5-datacite_checks5] FAILED
	dandischema/tests/test_datacite.py::test_dandimeta_datacite[additional_meta6-datacite_checks6] FAILED
	dandischema/tests/test_datacite.py::test_datacite_publish PASSED
	dandischema/tests/test_datacite.py::test_datacite_related_res_url[related_res_url0-related_ident_exp0] PASSED
	dandischema/tests/test_datacite.py::test_datacite_related_res_url[related_res_url1-related_ident_exp1] PASSED
	dandischema/tests/test_datacite.py::test_datacite_related_res_url[related_res_url2-related_ident_exp2] PASSED
	dandischema/tests/test_datacite.py::test_datacite_related_res_url[related_res_url3-related_ident_exp3] PASSED

	=================================================================== FAILURES ===================================================================
	____________________________________________________________ test_datacite[000004] _____________________________________________________________
	dandischema/tests/test_datacite.py:160: in test_datacite
		datacite = to_datacite(meta=meta, validate=True)
	dandischema/datacite.py:238: in to_datacite
		validate_datacite(datacite_dict)
	dandischema/datacite.py:258: in validate_datacite
		validator.validate(datacite_dict["data"]["attributes"])
	venv/3/lib/python3.12/site-packages/jsonschema/validators.py:451: in validate
		raise error
	E   jsonschema.exceptions.ValidationError: 'identifiers' is a required property
	E
	E   Failed validating 'required' in schema:
	E       {'$schema': 'http://json-schema.org/draft-07/schema#',
	E        'definitions': {'nameType': {'type': 'string',
	E                                     'enum': ['Organizational', 'Personal']},
	E                        'nameIdentifiers': {'type': 'array',
	E                                            'items': {'type': 'object',
	E                                                      'properties': {'nameIdentifier': {'type': 'string'},
	E                                                                     'nameIdentifierScheme': {'type': 'string'},
	E                                                                     'schemeURI': {'type': 'string',
	E                                                                                   'format': 'uri'}},
	E                                                      'required': ['nameIdentifier',
	E                                                                   'nameIdentifierScheme']},
…eniosoftware

Done in hope to see "non-standard" identifiers being gone but immediate fail is

	___ test_dandimeta_datacite[additional_meta6-datacite_checks6] _
	dandischema/tests/test_datacite.py:407: in test_dandimeta_datacite
		validator.validate(datacite["data"]["attributes"])
	venv/3/lib/python3.12/site-packages/jsonschema/validators.py:451: in validate
		raise error
	E   jsonschema.exceptions.ValidationError: 'DANDI Archive' is not of type 'object'
	E
	E   Failed validating 'type' in schema['properties']['publisher']:
	E       {'type': 'object',
	E        'additionalProperties': False,
	E        'properties': {'name': {'type': 'string'},
	E                       'publisherIdentifier': {'type': 'string'},
	E                       'publisherIdentifierScheme': {'type': 'string'},
	E                       'schemeUri': {'type': 'string', 'format': 'uri'},
	E                       'lang': {'type': 'string'}},
	E        'required': ['name']}
	E
	E   On instance['publisher']:
	E       'DANDI Archive'

So we need to standardize "publisher" better
@yarikoptic
Copy link
Member Author

yarikoptic commented Nov 12, 2024

but there is no identifiers (note: plural) defined in the schema, hence switching to alternateIdentifiers as a list of alternateIdentifier entities which is defined in the schema (source/meta/kernel-4/metadata.xsd: <xs:element name="alternateIdentifiers" minOccurs="0">).

@djarecka
Copy link
Member

yes, you're right, identifier is mandatory but singular, see your point.
But this actually has not changed since v4.3 (i mean the official documentation), so again the validations they used are not fully compatible with their documentation.

@yarikoptic
Copy link
Member Author

yes, as I think we ran into before on a number of occasions they have differences between schema, documentation, and jsonschema serializations since documentation and jsonschema are not automatically produced from the schema unfortunately IIRC. Thus all the differences. I think it would be best to rely on schema (not documentation) as the "ultimate ground truth" with the hope that jsonschema would eventually become autogenerated/fully reflective of the original schema.

FWIW -- tests pass here. That means that we are in compliance with test fabric of datacite and can proceed right?

@yarikoptic
Copy link
Member Author

FTR, here is

@satra
Copy link
Member

satra commented Nov 18, 2024

datacite API doc which says that it also accepts "identifiers" as alternative to "alternateIdentifiers": https://support.datacite.org/docs/what-is-the-identifiers-attribute-in-the-rest-api

@yarikoptic - so are you going to revert back to identifiers? was there a rationale for changing it?

@yarikoptic
Copy link
Member Author

no, I do not see the point of reverting to accommodate API interface for which we do not have a validator.

The rationale for my change is to be able to validate our metadata against jsonschema of datacite record we have access to. Do you have concerns/reservations?

@satra
Copy link
Member

satra commented Nov 18, 2024

as i said, in the past we did this against the schema and they said it's not supported in the api. since we test against the api, that's what i would hold as our functional target. i believe this happened as we were suggesting changes to the schema. at least at that time, almost 5 years ago, they considered their documentation to be the standard, not the schema. and their api reflected the documentation. if things have changed, i would go with that. but i don't see how alternateIdentifier replacing identifier seems like a reasonable change.

see the schema here: https://datacite-metadata-schema.readthedocs.io/en/4.5/properties/identifier/
identifer is mandatory
alternateidentifier is optional

they used to consider the schema and the XSD file as what they support. anything json schema was secondary. if they have changed that, go ahead. but i would check the XSD. that's what we did and it did ensure API success.

@yarikoptic
Copy link
Member Author

identifer is mandatory

identifier is singular and thus Occurrences: 1.
We are talking about identifiers (plural) for which there is only indeed that https://datacite-metadata-schema.readthedocs.io/en/4.5/properties/alternateidentifier/ which allows for multiple Occurrences: 0-n.

XSD 4.5

In json schema

Here is the XML of the datacite record we get for a random sample dandiset we have published
❯ curl -LH "Accept: application/vnd.datacite.datacite+xml" https://doi.org/10.48324/DANDI.000897/0.240605.1710 
<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd">
  <identifier identifierType="DOI">10.48324/DANDI.000897/0.240605.1710</identifier>
  <creators>
    <creator>
      <creatorName nameType="Personal">Neupane, Sujaya</creatorName>
      <givenName>Sujaya</givenName>
      <familyName>Neupane</familyName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0002-0052-3122</nameIdentifier>
    </creator>
    <creator>
      <creatorName nameType="Personal">Fiete, Ila</creatorName>
      <givenName>Ila</givenName>
      <familyName>Fiete</familyName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0003-4738-2539</nameIdentifier>
    </creator>
    <creator>
      <creatorName nameType="Personal">Jazayeri, Mehrdad</creatorName>
      <givenName>Mehrdad</givenName>
      <familyName>Jazayeri</familyName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0002-9764-6961</nameIdentifier>
    </creator>
  </creators>
  <titles>
    <title>Neupane_Fiete_Jazayeri_Mental navigation_NHP_EntorhinalCortex</title>
  </titles>
  <publisher>DANDI Archive</publisher>
  <publicationYear>2024</publicationYear>
  <resourceType resourceTypeGeneral="Dataset">Neural Data</resourceType>
  <subjects>
    <subject>entorhinal cortex, cognitive map, mental navigation,</subject>
  </subjects>
  <contributors>
    <contributor contributorType="ContactPerson">
      <contributorName nameType="Personal">Neupane, Sujaya</contributorName>
      <givenName>Sujaya</givenName>
      <familyName>Neupane</familyName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0002-0052-3122</nameIdentifier>
    </contributor>
    <contributor contributorType="ContactPerson">
      <contributorName nameType="Personal">Jazayeri, Mehrdad</contributorName>
      <givenName>Mehrdad</givenName>
      <familyName>Jazayeri</familyName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org/">0000-0002-9764-6961</nameIdentifier>
    </contributor>
  </contributors>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="URL">https://identifiers.org/DANDI:000897/0.240605.1710</alternateIdentifier>
    <alternateIdentifier alternateIdentifierType="URL">https://dandiarchive.org/dandiset/000897/0.240605.1710</alternateIdentifier>
  </alternateIdentifiers>
  <sizes/>
  <formats/>
  <version/>
  <rightsList>
    <rights rightsIdentifier="cc_by_40" rightsIdentifierScheme="SPDX"/>
  </rightsList>
  <descriptions>
    <description descriptionType="Abstract">The dataset contains electrophysiology data recorded from the entorhinal cortex of two NHPs performing a mental navigation task. The recording probes used were V-probe with 32 channels or 64 channels, manufactured by Plexon Inc. </description>
  </descriptions>
  <fundingReferences>
    <fundingReference>
      <funderName>National Institute of Mental Health</funderName>
      <funderIdentifier funderIdentifierType="ROR">https://ror.org/05xj56w78</funderIdentifier>
      <awardNumber>NIMH-MH129046</awardNumber>
    </fundingReference>
    <fundingReference>
      <funderName>Natural Science and Engineering Council of Canada</funderName>
      <funderIdentifier funderIdentifierType="ROR">https://ror.org/01h531d29</funderIdentifier>
      <awardNumber>NSERC PDF-516867-2018</awardNumber>
    </fundingReference>
  </fundingReferences>
</resource>

so you can see that

  • "identifier" is the DOI assigned by datacite
  • "alternateIdentifiers" is what we were providing via identifiers list of "identifier" records

AFAIK we are testing submission of our datacite records against test fabric, so AFAIK switching to alternateIdentifiers here should work just fine.

@satra
Copy link
Member

satra commented Nov 19, 2024

thank you @yarikoptic for the explanation. i'm now aligned.

"name": "DANDI Archive",
"schemeUri": "https://scicrunch.org/resolver/",
"publisherIdentifier": "https://scicrunch.org/resolver/RRID:SCR_017571",
"publisherIdentifierScheme": "RRID",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RRID technically isn't supported yet, though it should be very soon (when 4.6 is released). It doesn't fail out on the jsonschema because of a bug inveniosoftware/datacite#103

It may work right now because DataCite is rolling out 4.6 support, but is worth double checking it goes through to Fabrica.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We checked, nothing in 4.6 PR seems to suggest for it to become a controlled dictionary, just a "string" ATM, so I think we should keep it.

FWIW -- I did open and close

I guess we could have switched to https://www.re3data.org/repository/r3d100013638 which also points to the RRID. But it also shows that may be here it could be publisherIdentifiers list of records? (I don't want even to suggest that "officially" ;-) )

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right. I got messed up by the examples and that RRID is going to be supported as a related identifier type in 4.6. I was just on the ROR call today, where they showed the data in this field is a mess. But that's that happens without a controlled list.

@tmorrell
Copy link

This looks like the right changes to me. From the inveniosoftware/datacite side, we test that all the example json records make it to Fabrica. Around 4.3-4.4 there were issues with parts of the json representations not making it through (it's why we never did a 4.4 jsonschema), but I think all those issues other then GeoLocationPolygon have been resolved. If you find anything weird though just open an issue and we'll fix it. The goal is that anything that validates with the jsonschema goes through to Fabrica.

This makes it all consistend with funderIdentifier, alternateIdentifier and may be others.
rightsIdentifier was found to be different (thereis rightsURI, no schemeUri)
It kinda historically dragged through time from the original
1056afb49fd945afc471e200d782bbff9de43cf1 in dandi-cli where it
was added to "identifiers".  As the wise @tmorell has mentioned,
since it is the datacite which is to provide DOI, it would ignore
DOI "alternateIdentifiers".  I think it makes sensse overall, although
there could potentially be multiple DOIs for a single dandiset -- nothing
in DOI principle forbids it. But since original purpose here is not
clear -- we better just strip it away since it should be the DOI
minted by datacite fabrica as the one for the PublishedDandiset
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants