Incorrect summaries (at least number of subjects) in some dandisets #172

yarikoptic · 2023-04-03T22:12:10Z

It was mentioned by the dandiset author for https://dandiarchive.org/dandiset/000212 that there is only 1013 files but summary has 1097 subjects!

I did check across all dandisets for which we have datalad representations:

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ for d in 0*; do ( cd $d; n1=$(awk '/numberOfSubjects:/{print $2}' dandiset.yaml || echo 0 ); n2=$(/bin/ls -1d sub-* 2>/dev/null | wc -l || echo 0); [ -n "$n1" ] && { [ "$n1" = "$n2" ] || echo "`pwd`: Mismatch $n1 != $n2"; } ); done
/mnt/backup/dandi/dandisets/000025: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000043: Mismatch 22 != 13
/mnt/backup/dandi/dandisets/000051: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000052: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000058: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000168: Mismatch 30 != 0
/mnt/backup/dandi/dandisets/000212: Mismatch 1097 != 1013
/mnt/backup/dandi/dandisets/000228: Mismatch 64 != 32
/mnt/backup/dandi/dandisets/000246: Mismatch 12 != 11
/mnt/backup/dandi/dandisets/000288: Mismatch 12 != 9
/mnt/backup/dandi/dandisets/000293: Mismatch 201 != 121
/mnt/backup/dandi/dandisets/000297: Mismatch 197 != 118
/mnt/backup/dandi/dandisets/000341: Mismatch 310 != 0
/mnt/backup/dandi/dandisets/000362: Mismatch 11 != 0
/mnt/backup/dandi/dandisets/000454: Mismatch 5 != 4

so it seems that the issue manifests itself quite wildly.

satra · 2023-04-04T00:29:03Z

let's check this with the dandischema aggregate function run locally on asset metadata to ensure this is not a dandischema issue. the aggregate summary function works a list of asset metadata.

then we can circle back here to see if this is some postgres query issue that should not be including some assets. also we should ask submitters to make sure there is a match between local and remote files for complete dandisets, or that they confirm the right number of files.

satra · 2023-04-04T00:32:24Z

seems like a schema issue, while loading assets.jsonld from s3 bucket.

In [1]: from dandischema.metadata import aggregate_assets_summary

In [2]: import json

In [3]: with open("assets.jsonld") as fp:
   ...:     data = json.load(fp)
   ...: 

In [4]: len(data)
Out[4]: 1013

In [5]: aggregate_assets_summary(data)
Out[5]: 
{'schemaKey': 'AssetsSummary',
 'numberOfBytes': 9004401256,
 'numberOfFiles': 1013,
 'numberOfSubjects': 1097,
 'dataStandard': [{'schemaKey': 'StandardsType',
   'identifier': 'RRID:SCR_015242',
   'name': 'Neurodata Without Borders (NWB)'}],
 'approach': [{'schemaKey': 'ApproachType', 'name': 'behavioral approach'}],
 'measurementTechnique': [{'schemaKey': 'MeasurementTechniqueType',
   'name': 'analytical technique'},
  {'schemaKey': 'MeasurementTechniqueType', 'name': 'behavioral technique'}],
 'variableMeasured': ['Position', 'ProcessingModule', 'SpatialSeries'],
 'species': [{'schemaKey': 'SpeciesType',
   'identifier': 'http://purl.obolibrary.org/obo/NCBITaxon_7227',
   'name': 'Drosophila melanogaster - Fruit fly'},
  {'schemaKey': 'SpeciesType',
   'identifier': 'http://purl.obolibrary.org/obo/NCBITaxon_28584',
   'name': 'Drosophila suzukii'}]}

satra · 2023-04-04T00:56:25Z

@yarikoptic - it's very likely happening because of normalization of subject identifier. perhaps something has changed between the way we were normalizing before and now. dandischema assumes subject id in path can be generated simply by replacing "_" with "-". i suspect cli is doing more than that and stripping other characters.

that normalization has to be moved to dandi schema and used in the asset summary generation. moving this issue to dandi-schema.

yarikoptic · 2023-04-04T18:48:11Z

sorry -- I am not following how normalization has anything to do with it since we have higher number in the summary than there is really of sub- folders on the drive (1097 > 1013). So needs to look on where it comes up with "extra subjects", likely counting the same files twice...

yarikoptic · 2023-04-04T18:58:37Z

I see now what you mean about normalization e.g. '0p8--1p4--CS-fly#-16', '0p8%-1p4%-CS-fly#-19',...

satra · 2023-04-06T13:08:27Z

@yarikoptic - can you point me to the function in dandi-cli where the filename normalization occurs? as a start i will first simply copy the function to dandi-schema, or you could do it and replace the place where it is used.

this is where the mismatches are:

from asset metadata record:

dandi-schema/dandischema/metadata.py

Line 302 in b310e3e

subject = value["identifier"].replace("_", "-")
the other place we are getting subjects from is the asset path:

dandi-schema/dandischema/metadata.py

Line 314 in b310e3e

for part in Path(assetmeta["path"]).name.split(".")[0].split("_"):

the reason we do the last bit is that the bids-asset stream does not populate asset metadata properly with participant and other goodies from the bids structure.

yarikoptic · 2023-04-06T16:18:07Z

do you mean where those % could come from? isn't all the code involved for the used above aggregate_assets_summary is in dandi-schema, thus gotcha is somewhere there?

but as for what creates filenames in dandi-cli -- AFAIK only organize and the code is at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/organize.py#L68:5

2. the other place we are getting subjects from is the asset path:

dandi-schema/dandischema/metadata.py

Line 314 in b310e3e

for part in Path(assetmeta["path"]).name.split(".")[0].split("_"):

the reason we do the last bit is that the bids-asset stream does not populate asset metadata properly with participant and other goodies from the bids structure.

that is where I thought (didn't yet) to check/propose fix -- to not go through all parts of the file path, but only look at the top level folder names, since with above, some inconsistent dandiset with sub-1/sub-2.dat would result in 2 subjects.

satra · 2023-04-06T16:28:10Z

it's this function that is taking subject from metadata and sanitizing for a string in a filename: https://github.com/dandi/dandi-cli/blob/eef8443a16f0968891f4ddfca43d663df3f07f2b/dandi/organize.py#L392

i'll copy that over. that should at least temporarily fix the issue.

regarding the reading subject from a file path, this entire function assumes the dandiset is valid as its supposed to provide a summary of such a state. it can do certain things even with invalid dandisets, but i wouldn't consider the summary valid if the dandiset is invalid. that part was created as we did not have any bids validation verification in place nor metadata extraction. so indeed cli should extract and populate relevant metadata from a bids asset path or other relevant files. but there are a few things this function is doing that should be done by asset metadata extractor.

yarikoptic · 2023-04-06T16:59:29Z

regarding the reading subject from a file path, this entire function assumes the dandiset is valid as its supposed to provide a summary of such a state. it can do certain things even with invalid dandisets, but i wouldn't consider the summary valid if the dandiset is invalid.

well, we generate summaries for invalid ones as well. The question is on how robust we should be in such cases. And IMHO relying just on sub- at top level sounds like the easiest and most robust approach.

yarikoptic · 2023-04-06T18:01:50Z

the only cons (or may be pros) -- we would not pick up subjects from within under the folders like derivatives/ etc, so some datasets of that kind would not be considered. So the "robustification" could be -- take first subject indicator found in the path, not all of them ;)

Also added assertion so we do not count incorrectly. But may be should be just a warning? Closes #172

This seems to Closes #172 as well

yarikoptic · 2024-04-10T15:51:28Z

I don't think we neither fixed or boiled this issue down.

looking down at current output -- we have some times number greater and some time lesser - but overall we have more inconsistencies reported than before!

dandi@drogon:/mnt/backup/dandi/dandisets$ for d in 0*; do ( cd $d; n1=$(awk '/numberOfSubjects:/{print $2}' dandiset.yaml || echo 0 ); n2=$(/bin/ls -1d sub-* 2>/dev/null | wc -l || echo 0); [ -n "$n1" ] && { [ "$n1" = "$n2" ] || echo "`pwd`: Mismatch $n1 != $n2"; } ); done
/mnt/backup/dandi/dandisets/000025: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000043: Mismatch 22 != 13
/mnt/backup/dandi/dandisets/000051: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000052: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000058: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000168: Mismatch 30 != 0
/mnt/backup/dandi/dandisets/000212: Mismatch 1097 != 1013
/mnt/backup/dandi/dandisets/000228: Mismatch 64 != 32
/mnt/backup/dandi/dandisets/000246: Mismatch 146 != 101
/mnt/backup/dandi/dandisets/000288: Mismatch 12 != 9
/mnt/backup/dandi/dandisets/000293: Mismatch 201 != 121
/mnt/backup/dandi/dandisets/000297: Mismatch 197 != 118
/mnt/backup/dandi/dandisets/000341: Mismatch 310 != 0
/mnt/backup/dandi/dandisets/000342: Mismatch 3 != 2
/mnt/backup/dandi/dandisets/000344: Mismatch 4 != 3
/mnt/backup/dandi/dandisets/000361: Mismatch 37 != 30
/mnt/backup/dandi/dandisets/000362: Mismatch 11 != 0
/mnt/backup/dandi/dandisets/000454: Mismatch 5 != 4
/mnt/backup/dandi/dandisets/000541: Mismatch 19 != 21
/mnt/backup/dandi/dandisets/000575: Mismatch 12 != 13
/mnt/backup/dandi/dandisets/000582: Mismatch 14 != 15
/mnt/backup/dandi/dandisets/000686: Mismatch 22 != 18
/mnt/backup/dandi/dandisets/000687: Mismatch 8 != 10
/mnt/backup/dandi/dandisets/000692: Mismatch 7 != 9
/mnt/backup/dandi/dandisets/000696: Mismatch 5 != 6
/mnt/backup/dandi/dandisets/000710: Mismatch 4 != 6
/mnt/backup/dandi/dandisets/000714: Mismatch 8 != 9
/mnt/backup/dandi/dandisets/000715: Mismatch 9 != 10
/mnt/backup/dandi/dandisets/000719: Mismatch 1 != 3
/mnt/backup/dandi/dandisets/000724: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000731: Mismatch 2 != 0
/mnt/backup/dandi/dandisets/000766: Mismatch 3 != 2
/mnt/backup/dandi/dandisets/000776: Mismatch 37 != 38
/mnt/backup/dandi/dandisets/000873: Mismatch 1 != 2
/mnt/backup/dandi/dandisets/000874: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000875: Mismatch 12 != 9
/mnt/backup/dandi/dandisets/000888: Mismatch 104 != 52
/mnt/backup/dandi/dandisets/000889: Mismatch 104 != 52
/mnt/backup/dandi/dandisets/000939: Mismatch 28 != 29
/mnt/backup/dandi/dandisets/000946: Mismatch 37 != 38

I improved script to https://github.com/dandi/dandisets/blob/draft/tools/check_numberOfSubjects so it states also last recent published version, which might not correspond to draft but still -- good hint that we assumed it all kosher . And we have

dandi@drogon:/mnt/backup/dandi/dandisets$ tools/check_numberOfSubjects | grep -v 'version: null'
/mnt/backup/dandi/dandisets/000293: Mismatch 201 != 121   MRP version: "0.220708.1652"
/mnt/backup/dandi/dandisets/000454: Mismatch 5 != 4   MRP version: "0.230302.2331"
/mnt/backup/dandi/dandisets/000575: Mismatch 12 != 13   MRP version: "0.231010.1811"
/mnt/backup/dandi/dandisets/000692: Mismatch 7 != 9   MRP version: "0.240402.2118"
/mnt/backup/dandi/dandisets/000714: Mismatch 8 != 9   MRP version: "0.240402.2115"
/mnt/backup/dandi/dandisets/000939: Mismatch 28 != 29   MRP version: "0.240327.2229"

and for https://dandiarchive.org/dandiset/000293/draft we do not have any warnings etc. So I think we simply do not validate consistency of "dandi-layout" and nwb metadata.

but overall it is indeed a "wild west" there on how people seems simply do not care to make their dandisets "valid".
I think we would benefit from some formalized way to provide feedback to the dandiset owners for all those which remain invalid and authors do not really concern themselves.

my attempt to figure out why on sample /mnt/backup/dandi/dandisets/000293: Mismatch 201 != 121 -- and check if there is different subject ids somehow

(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ for sub in 000293/sub-*; do echo -ne "$sub\t"; dandi -l ERROR ls -f json -r --metadata all dandi://dandi/$sub/ | jq -r '.[] | .metadata.wasAttributedTo[] | select(.schemaKey == "Participant") | .identifier' | uniq ; done
000293/sub-1801-18129009        1801-18129009
000293/sub-1802-18201004        1802-18201004
000293/sub-1802-18201011        1802-18201011
000293/sub-1803-18220008        1803-18220008
000293/sub-1803-18220019        1803-18220019
000293/sub-1808-18320005        1808-18320005
000293/sub-1808-18320021        1808-18320021
000293/sub-1809-18320031        1809-18320031
000293/sub-1813-18329014        1813-18329014
000293/sub-1813-18329044        1813-18329044
000293/sub-1813-18329051        1813-18329051
000293/sub-1813-18329062        1813-18329062
000293/sub-1815-18426017        1815-18426017
000293/sub-1822-18o22001        1822-18o22001
000293/sub-1822-18o22010        1822-18o22010
000293/sub-1822-18o22020        1822-18o22020
000293/sub-1902-19128006        1902-19128006
000293/sub-1902-19128040        1902-19128040
000293/sub-1903-19129058        1903-19129058
000293/sub-1903-19129072        1903-19129072
000293/sub-1906-19228044        1906-19228044
000293/sub-1906-19228058        1906-19228058
000293/sub-1906-19228068        1906-19228068
000293/sub-1907-19319025        1907-19319025
000293/sub-1908-19320022        1908-19320022
000293/sub-1909-19328001        1909-19328001
000293/sub-1909-19328009        1909-19328009
000293/sub-1909-19328019        1909-19328019
000293/sub-1909-19328034        1909-19328034
000293/sub-1909-19328046        1909-19328046
000293/sub-1911-19o10010        1911-19o10010
000293/sub-1911-19o10045        1911-19o10045
000293/sub-1911-19o10054        1911-19o10054
000293/sub-1911-19o10065        1911-19o10065
000293/sub-1912-2019-11-04-0001 1912-2019_11_04_0001
000293/sub-1912-2019-11-04-0083 1912-2019_11_04_0083
000293/sub-1913-2019-11-26-0006 1913-2019_11_26_0006
000293/sub-1913-2019-11-26-0019 1913-2019_11_26_0019
000293/sub-1913-2019-11-26-0037 1913-2019_11_26_0037
000293/sub-1914-2019-11-28-0010 1914-2019_11_28_0010
000293/sub-1914-2019-11-28-0038 1914-2019_11_28_0038
000293/sub-X2015-10-08-15o08002 X2015.10.08-15o08002
000293/sub-X2015-10-08-15o08007 X2015.10.08-15o08007
000293/sub-X2015-10-08-15o08011 X2015.10.08-15o08011
000293/sub-X2015-10-08-15o08017 X2015.10.08-15o08017
000293/sub-X2015-10-08-15o08020 X2015.10.08-15o08020
000293/sub-X2015-10-08-15o08022 X2015.10.08-15o08022
000293/sub-X2015-10-08-15o08032 X2015.10.08-15o08032
000293/sub-X2015-10-08-15o08038 X2015.10.08-15o08038
000293/sub-X2015-11-09-2015-11-09-0003  X2015.11.09-2015_11_09_0003
000293/sub-X2015-11-09-2015-11-09-0017  X2015.11.09-2015_11_09_0017
000293/sub-X2015-11-09-2015-11-09-0053  X2015.11.09-2015_11_09_0053
000293/sub-X2015-11-09-2015-11-09-0078  X2015.11.09-2015_11_09_0078
000293/sub-X2015-11-09-2015-11-09-0106  X2015.11.09-2015_11_09_0106
000293/sub-X2015-11-09-2015-11-09-0107  X2015.11.09-2015_11_09_0107
000293/sub-X2016-01-28-2016-01-28-0004  X2016.01.28-2016_01_28_0004
000293/sub-X2016-01-28-2016-01-28-0012  X2016.01.28-2016_01_28_0012
000293/sub-X2016-01-28-2016-01-28-0019  X2016.01.28-2016_01_28_0019
000293/sub-X2016-02-04-2016-02-04-0009  X2016.02.04-2016_02_04_0009
000293/sub-X2016-02-04-2016-02-04-0015  X2016.02.04-2016_02_04_0015
000293/sub-X2016-02-04-2016-02-04-0018  X2016.02.04-2016_02_04_0018
000293/sub-X2016-02-04-2016-02-04-0021  X2016.02.04-2016_02_04_0021
000293/sub-X2016-02-04-2016-02-04-0029  X2016.02.04-2016_02_04_0029
000293/sub-X2016-02-04-2016-02-04-0033  X2016.02.04-2016_02_04_0033
000293/sub-X2016-02-04-2016-02-04-0042  X2016.02.04-2016_02_04_0042
000293/sub-X2016-02-04-2016-02-04-0045  X2016.02.04-2016_02_04_0045
000293/sub-X2016-02-25-2016-02-25-0005  X2016.02.25-2016_02_25_0005
000293/sub-X2016-02-25-2016-02-25-0007  X2016.02.25-2016_02_25_0007
000293/sub-X2016-02-25-2016-02-25-0073  X2016.02.25-2016_02_25_0073
000293/sub-X2016-02-25-2016-02-25-0082  X2016.02.25-2016_02_25_0082
000293/sub-X2016-02-25-2016-02-25-0134  X2016.02.25-2016_02_25_0134
000293/sub-X2016-02-25-2016-02-25-0255  X2016.02.25-2016_02_25_0255
000293/sub-X2016-02-29-2016-02-29-0032  X2016.02.29-2016_02_29_0032
000293/sub-X2016-02-29-2016-02-29-0065  X2016.02.29-2016_02_29_0065
000293/sub-X2016-03-01-2016-03-01-0000  X2016.03.01-2016_03_01_0000
000293/sub-X2016-03-01-2016-03-01-0047  X2016.03.01-2016_03_01_0047
000293/sub-X2016-03-03-2016-03-03-0002  X2016.03.03-2016_03_03_0002
000293/sub-X2016-03-03-2016-03-03-0054  X2016.03.03-2016_03_03_0054
000293/sub-X2016-03-03-2016-03-03-0100  X2016.03.03-2016_03_03_0100
000293/sub-X2016-03-03-2016-03-03-0103  X2016.03.03-2016_03_03_0103
000293/sub-X2019-01-22-19122003 X2019.01.22-19122003
000293/sub-X2019-01-22-19122017 X2019.01.22-19122017
000293/sub-X2019-01-22-19122024 X2019.01.22-19122024
000293/sub-X2019-01-22-19122026 X2019.01.22-19122026
000293/sub-X2019-01-28-19128003 X2019.01.28-19128003
000293/sub-X2019-01-28-19128061 X2019.01.28-19128061
000293/sub-X2019-01-28-19128068 X2019.01.28-19128068
000293/sub-X2019-01-29-19129004 X2019.01.29-19129004
000293/sub-X2019-01-29-19129015 
....

so that relates to discussion above on sanitizaiton -- but it leads to doublecount now if metadata subj id is different from the filename one, and that is a bug.

satra transferred this issue from dandi/dandi-archive Apr 4, 2023

yarikoptic added a commit that referenced this issue Apr 6, 2023

Count only for the first hit for subject or tissue within filename

fecfb26

Also added assertion so we do not count incorrectly. But may be should be just a warning? Closes #172

yarikoptic mentioned this issue Apr 6, 2023

Count only for the first hit for subject or tissue within filename #173

Open

yarikoptic added a commit that referenced this issue Apr 6, 2023

Count only for the first hit for subject or tissue within filename

7b9a460

Also added assertion so we do not count incorrectly. But may be should be just a warning? Closes #172

yarikoptic added a commit that referenced this issue Apr 6, 2023

Count only for the first hit for subject or tissue within filename

ed02dec

Also added assertion so we do not count incorrectly. But may be should be just a warning? Closes #172

yarikoptic added a commit that referenced this issue Apr 6, 2023

ENH: use sanitize_value helper instead of adhoc replacement of _

e8aba8e

This seems to Closes #172 as well

yarikoptic mentioned this issue Apr 6, 2023

ENH: adopt sanitize_value from dandi-cli and use for sanitization of identifier #175

Merged

2 tasks

dandibot pushed a commit to dandi/dandisets that referenced this issue Apr 10, 2024

A script for dandi/dandi-schema#172 to check consistency of nsubjects

fddb7fb

yarikoptic closed this as completed in #175 Aug 13, 2024

yarikoptic closed this as completed in 61c374d Aug 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect summaries (at least number of subjects) in some dandisets #172

Incorrect summaries (at least number of subjects) in some dandisets #172

yarikoptic commented Apr 3, 2023

satra commented Apr 4, 2023

satra commented Apr 4, 2023

satra commented Apr 4, 2023

yarikoptic commented Apr 4, 2023

yarikoptic commented Apr 4, 2023

satra commented Apr 6, 2023

yarikoptic commented Apr 6, 2023

satra commented Apr 6, 2023

yarikoptic commented Apr 6, 2023

yarikoptic commented Apr 6, 2023

yarikoptic commented Apr 10, 2024

Incorrect summaries (at least number of subjects) in some dandisets #172

Incorrect summaries (at least number of subjects) in some dandisets #172

Comments

yarikoptic commented Apr 3, 2023

satra commented Apr 4, 2023

satra commented Apr 4, 2023

satra commented Apr 4, 2023

yarikoptic commented Apr 4, 2023

yarikoptic commented Apr 4, 2023

satra commented Apr 6, 2023

yarikoptic commented Apr 6, 2023

satra commented Apr 6, 2023

yarikoptic commented Apr 6, 2023

yarikoptic commented Apr 6, 2023

yarikoptic commented Apr 10, 2024