-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect summaries (at least number of subjects) in some dandisets #172
Comments
let's check this with the dandischema aggregate function run locally on asset metadata to ensure this is not a dandischema issue. the aggregate summary function works a list of asset metadata. then we can circle back here to see if this is some postgres query issue that should not be including some assets. also we should ask submitters to make sure there is a match between local and remote files for complete dandisets, or that they confirm the right number of files. |
seems like a schema issue, while loading
|
@yarikoptic - it's very likely happening because of normalization of subject identifier. perhaps something has changed between the way we were normalizing before and now. dandischema assumes subject id in path can be generated simply by replacing "_" with "-". i suspect cli is doing more than that and stripping other characters. that normalization has to be moved to dandi schema and used in the asset summary generation. moving this issue to dandi-schema. |
sorry -- I am not following how normalization has anything to do with it since we have higher number in the summary than there is really of sub- folders on the drive (1097 > 1013). So needs to look on where it comes up with "extra subjects", likely counting the same files twice... |
I see now what you mean about normalization e.g. |
@yarikoptic - can you point me to the function in dandi-cli where the filename normalization occurs? as a start i will first simply copy the function to dandi-schema, or you could do it and replace the place where it is used. this is where the mismatches are:
the reason we do the last bit is that the bids-asset stream does not populate asset metadata properly with participant and other goodies from the bids structure. |
do you mean where those but as for what creates filenames in dandi-cli -- AFAIK only
that is where I thought (didn't yet) to check/propose fix -- to not go through all parts of the file path, but only look at the top level folder names, since with above, some inconsistent dandiset with |
it's this function that is taking subject from metadata and sanitizing for a string in a filename: https://github.com/dandi/dandi-cli/blob/eef8443a16f0968891f4ddfca43d663df3f07f2b/dandi/organize.py#L392 i'll copy that over. that should at least temporarily fix the issue. regarding the reading subject from a file path, this entire function assumes the dandiset is valid as its supposed to provide a summary of such a state. it can do certain things even with invalid dandisets, but i wouldn't consider the summary valid if the dandiset is invalid. that part was created as we did not have any bids validation verification in place nor metadata extraction. so indeed cli should extract and populate relevant metadata from a bids asset path or other relevant files. but there are a few things this function is doing that should be done by asset metadata extractor. |
well, we generate summaries for invalid ones as well. The question is on how robust we should be in such cases. And IMHO relying just on |
the only cons (or may be pros) -- we would not pick up subjects from within under the folders like |
Also added assertion so we do not count incorrectly. But may be should be just a warning? Closes #172
Also added assertion so we do not count incorrectly. But may be should be just a warning? Closes #172
Also added assertion so we do not count incorrectly. But may be should be just a warning? Closes #172
I don't think we neither fixed or boiled this issue down. looking down at current output -- we have some times number greater and some time lesser - but overall we have more inconsistencies reported than before!dandi@drogon:/mnt/backup/dandi/dandisets$ for d in 0*; do ( cd $d; n1=$(awk '/numberOfSubjects:/{print $2}' dandiset.yaml || echo 0 ); n2=$(/bin/ls -1d sub-* 2>/dev/null | wc -l || echo 0); [ -n "$n1" ] && { [ "$n1" = "$n2" ] || echo "`pwd`: Mismatch $n1 != $n2"; } ); done
/mnt/backup/dandi/dandisets/000025: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000043: Mismatch 22 != 13
/mnt/backup/dandi/dandisets/000051: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000052: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000058: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000168: Mismatch 30 != 0
/mnt/backup/dandi/dandisets/000212: Mismatch 1097 != 1013
/mnt/backup/dandi/dandisets/000228: Mismatch 64 != 32
/mnt/backup/dandi/dandisets/000246: Mismatch 146 != 101
/mnt/backup/dandi/dandisets/000288: Mismatch 12 != 9
/mnt/backup/dandi/dandisets/000293: Mismatch 201 != 121
/mnt/backup/dandi/dandisets/000297: Mismatch 197 != 118
/mnt/backup/dandi/dandisets/000341: Mismatch 310 != 0
/mnt/backup/dandi/dandisets/000342: Mismatch 3 != 2
/mnt/backup/dandi/dandisets/000344: Mismatch 4 != 3
/mnt/backup/dandi/dandisets/000361: Mismatch 37 != 30
/mnt/backup/dandi/dandisets/000362: Mismatch 11 != 0
/mnt/backup/dandi/dandisets/000454: Mismatch 5 != 4
/mnt/backup/dandi/dandisets/000541: Mismatch 19 != 21
/mnt/backup/dandi/dandisets/000575: Mismatch 12 != 13
/mnt/backup/dandi/dandisets/000582: Mismatch 14 != 15
/mnt/backup/dandi/dandisets/000686: Mismatch 22 != 18
/mnt/backup/dandi/dandisets/000687: Mismatch 8 != 10
/mnt/backup/dandi/dandisets/000692: Mismatch 7 != 9
/mnt/backup/dandi/dandisets/000696: Mismatch 5 != 6
/mnt/backup/dandi/dandisets/000710: Mismatch 4 != 6
/mnt/backup/dandi/dandisets/000714: Mismatch 8 != 9
/mnt/backup/dandi/dandisets/000715: Mismatch 9 != 10
/mnt/backup/dandi/dandisets/000719: Mismatch 1 != 3
/mnt/backup/dandi/dandisets/000724: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000731: Mismatch 2 != 0
/mnt/backup/dandi/dandisets/000766: Mismatch 3 != 2
/mnt/backup/dandi/dandisets/000776: Mismatch 37 != 38
/mnt/backup/dandi/dandisets/000873: Mismatch 1 != 2
/mnt/backup/dandi/dandisets/000874: Mismatch 1 != 0
/mnt/backup/dandi/dandisets/000875: Mismatch 12 != 9
/mnt/backup/dandi/dandisets/000888: Mismatch 104 != 52
/mnt/backup/dandi/dandisets/000889: Mismatch 104 != 52
/mnt/backup/dandi/dandisets/000939: Mismatch 28 != 29
/mnt/backup/dandi/dandisets/000946: Mismatch 37 != 38 I improved script to https://github.com/dandi/dandisets/blob/draft/tools/check_numberOfSubjects so it states also last recent published version, which might not correspond to draft but still -- good hint that we assumed it all kosher . And we have
and for https://dandiarchive.org/dandiset/000293/draft we do not have any warnings etc. So I think we simply do not validate consistency of "dandi-layout" and nwb metadata. but overall it is indeed a "wild west" there on how people seems simply do not care to make their dandisets "valid". my attempt to figure out why on sample /mnt/backup/dandi/dandisets/000293: Mismatch 201 != 121 -- and check if there is different subject ids somehow(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ for sub in 000293/sub-*; do echo -ne "$sub\t"; dandi -l ERROR ls -f json -r --metadata all dandi://dandi/$sub/ | jq -r '.[] | .metadata.wasAttributedTo[] | select(.schemaKey == "Participant") | .identifier' | uniq ; done
000293/sub-1801-18129009 1801-18129009
000293/sub-1802-18201004 1802-18201004
000293/sub-1802-18201011 1802-18201011
000293/sub-1803-18220008 1803-18220008
000293/sub-1803-18220019 1803-18220019
000293/sub-1808-18320005 1808-18320005
000293/sub-1808-18320021 1808-18320021
000293/sub-1809-18320031 1809-18320031
000293/sub-1813-18329014 1813-18329014
000293/sub-1813-18329044 1813-18329044
000293/sub-1813-18329051 1813-18329051
000293/sub-1813-18329062 1813-18329062
000293/sub-1815-18426017 1815-18426017
000293/sub-1822-18o22001 1822-18o22001
000293/sub-1822-18o22010 1822-18o22010
000293/sub-1822-18o22020 1822-18o22020
000293/sub-1902-19128006 1902-19128006
000293/sub-1902-19128040 1902-19128040
000293/sub-1903-19129058 1903-19129058
000293/sub-1903-19129072 1903-19129072
000293/sub-1906-19228044 1906-19228044
000293/sub-1906-19228058 1906-19228058
000293/sub-1906-19228068 1906-19228068
000293/sub-1907-19319025 1907-19319025
000293/sub-1908-19320022 1908-19320022
000293/sub-1909-19328001 1909-19328001
000293/sub-1909-19328009 1909-19328009
000293/sub-1909-19328019 1909-19328019
000293/sub-1909-19328034 1909-19328034
000293/sub-1909-19328046 1909-19328046
000293/sub-1911-19o10010 1911-19o10010
000293/sub-1911-19o10045 1911-19o10045
000293/sub-1911-19o10054 1911-19o10054
000293/sub-1911-19o10065 1911-19o10065
000293/sub-1912-2019-11-04-0001 1912-2019_11_04_0001
000293/sub-1912-2019-11-04-0083 1912-2019_11_04_0083
000293/sub-1913-2019-11-26-0006 1913-2019_11_26_0006
000293/sub-1913-2019-11-26-0019 1913-2019_11_26_0019
000293/sub-1913-2019-11-26-0037 1913-2019_11_26_0037
000293/sub-1914-2019-11-28-0010 1914-2019_11_28_0010
000293/sub-1914-2019-11-28-0038 1914-2019_11_28_0038
000293/sub-X2015-10-08-15o08002 X2015.10.08-15o08002
000293/sub-X2015-10-08-15o08007 X2015.10.08-15o08007
000293/sub-X2015-10-08-15o08011 X2015.10.08-15o08011
000293/sub-X2015-10-08-15o08017 X2015.10.08-15o08017
000293/sub-X2015-10-08-15o08020 X2015.10.08-15o08020
000293/sub-X2015-10-08-15o08022 X2015.10.08-15o08022
000293/sub-X2015-10-08-15o08032 X2015.10.08-15o08032
000293/sub-X2015-10-08-15o08038 X2015.10.08-15o08038
000293/sub-X2015-11-09-2015-11-09-0003 X2015.11.09-2015_11_09_0003
000293/sub-X2015-11-09-2015-11-09-0017 X2015.11.09-2015_11_09_0017
000293/sub-X2015-11-09-2015-11-09-0053 X2015.11.09-2015_11_09_0053
000293/sub-X2015-11-09-2015-11-09-0078 X2015.11.09-2015_11_09_0078
000293/sub-X2015-11-09-2015-11-09-0106 X2015.11.09-2015_11_09_0106
000293/sub-X2015-11-09-2015-11-09-0107 X2015.11.09-2015_11_09_0107
000293/sub-X2016-01-28-2016-01-28-0004 X2016.01.28-2016_01_28_0004
000293/sub-X2016-01-28-2016-01-28-0012 X2016.01.28-2016_01_28_0012
000293/sub-X2016-01-28-2016-01-28-0019 X2016.01.28-2016_01_28_0019
000293/sub-X2016-02-04-2016-02-04-0009 X2016.02.04-2016_02_04_0009
000293/sub-X2016-02-04-2016-02-04-0015 X2016.02.04-2016_02_04_0015
000293/sub-X2016-02-04-2016-02-04-0018 X2016.02.04-2016_02_04_0018
000293/sub-X2016-02-04-2016-02-04-0021 X2016.02.04-2016_02_04_0021
000293/sub-X2016-02-04-2016-02-04-0029 X2016.02.04-2016_02_04_0029
000293/sub-X2016-02-04-2016-02-04-0033 X2016.02.04-2016_02_04_0033
000293/sub-X2016-02-04-2016-02-04-0042 X2016.02.04-2016_02_04_0042
000293/sub-X2016-02-04-2016-02-04-0045 X2016.02.04-2016_02_04_0045
000293/sub-X2016-02-25-2016-02-25-0005 X2016.02.25-2016_02_25_0005
000293/sub-X2016-02-25-2016-02-25-0007 X2016.02.25-2016_02_25_0007
000293/sub-X2016-02-25-2016-02-25-0073 X2016.02.25-2016_02_25_0073
000293/sub-X2016-02-25-2016-02-25-0082 X2016.02.25-2016_02_25_0082
000293/sub-X2016-02-25-2016-02-25-0134 X2016.02.25-2016_02_25_0134
000293/sub-X2016-02-25-2016-02-25-0255 X2016.02.25-2016_02_25_0255
000293/sub-X2016-02-29-2016-02-29-0032 X2016.02.29-2016_02_29_0032
000293/sub-X2016-02-29-2016-02-29-0065 X2016.02.29-2016_02_29_0065
000293/sub-X2016-03-01-2016-03-01-0000 X2016.03.01-2016_03_01_0000
000293/sub-X2016-03-01-2016-03-01-0047 X2016.03.01-2016_03_01_0047
000293/sub-X2016-03-03-2016-03-03-0002 X2016.03.03-2016_03_03_0002
000293/sub-X2016-03-03-2016-03-03-0054 X2016.03.03-2016_03_03_0054
000293/sub-X2016-03-03-2016-03-03-0100 X2016.03.03-2016_03_03_0100
000293/sub-X2016-03-03-2016-03-03-0103 X2016.03.03-2016_03_03_0103
000293/sub-X2019-01-22-19122003 X2019.01.22-19122003
000293/sub-X2019-01-22-19122017 X2019.01.22-19122017
000293/sub-X2019-01-22-19122024 X2019.01.22-19122024
000293/sub-X2019-01-22-19122026 X2019.01.22-19122026
000293/sub-X2019-01-28-19128003 X2019.01.28-19128003
000293/sub-X2019-01-28-19128061 X2019.01.28-19128061
000293/sub-X2019-01-28-19128068 X2019.01.28-19128068
000293/sub-X2019-01-29-19129004 X2019.01.29-19129004
000293/sub-X2019-01-29-19129015
.... so that relates to discussion above on sanitizaiton -- but it leads to doublecount now if metadata subj id is different from the filename one, and that is a bug. |
It was mentioned by the dandiset author for https://dandiarchive.org/dandiset/000212 that there is only 1013 files but summary has 1097 subjects!
I did check across all dandisets for which we have datalad representations:
so it seems that the issue manifests itself quite wildly.
The text was updated successfully, but these errors were encountered: