Validating assets calls github repeatedly #153

danlamanna · 2023-01-09T18:06:34Z

When calling dandischema.metadata.validate on n assets, n requests are made to GitHub to fetch the schema, so validating assets takes significantly longer than it should. The request also has no default timeout, meaning a call to validate can hang indefinitely.

dandi-schema/dandischema/metadata.py

Lines 184 to 187 in d34658c

    schema = requests.get(
        f"https://raw.githubusercontent.com/dandi/schema/"
        f"master/releases/{schema_version}/{schema_filename}"
    ).json()

Can dandi-schema be modified to avoid relying on the network for validation? Either by bundling the schemas from dandi/schema into package data, allowing the caller of validate to pass a schema directly, or some other means?
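To make the two suggestions concrete, here is a minimal sketch of a lookup that prefers a caller-supplied schema, then a local bundle, and only then the network. It is not dandischema's API: `resolve_schema`, `load_bundled_schema`, the `bundle_dir` layout (`<version>/<filename>`), and the `fetch` callback are all hypothetical names chosen for illustration. In a real package the bundle directory would presumably be package data rather than an arbitrary path.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Optional

Schema = Dict[str, Any]
Fetcher = Callable[[str, str], Schema]


def load_bundled_schema(
    bundle_dir: Path, schema_version: str, schema_filename: str
) -> Schema:
    """Read <bundle_dir>/<schema_version>/<schema_filename> from local disk.

    The directory layout is an assumption made for this sketch.
    """
    path = Path(bundle_dir) / schema_version / schema_filename
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def resolve_schema(
    schema_version: str,
    schema_filename: str,
    *,
    schema: Optional[Schema] = None,
    bundle_dir: Optional[Path] = None,
    fetch: Optional[Fetcher] = None,
) -> Schema:
    """Pick a schema source in order: explicit argument, local bundle, fetcher."""
    if schema is not None:
        # The caller passed a schema directly, so no I/O is needed.
        return schema
    if bundle_dir is not None:
        try:
            return load_bundled_schema(bundle_dir, schema_version, schema_filename)
        except FileNotFoundError:
            pass  # this version is not bundled; fall through to the fetcher
    if fetch is None:
        raise LookupError(
            f"no local schema for {schema_version}/{schema_filename} "
            "and no fetcher was provided"
        )
    return fetch(schema_version, schema_filename)
```

With this shape, a caller validating many assets could load the schema once and pass it in via `schema=`, and the network path is only taken when the caller explicitly opts in by supplying `fetch`.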

FWIW this problem appears to exist with migrate as well.

satra · 2023-01-10T03:51:01Z

@danlamanna - we could probably package the schemas with dandi-schema. an alternative would be to cache the request on the server side. are the requests all in isolated processes or would a cache to keep a schema once downloaded work?

it's also the case that this download happens when an asset is using a different schema than the current one. this is true for many assets currently that were submitted a while back, but should not in theory be true for new assets being uploaded. i.e. the schema version should be the latest.

we have been planning to run a metadata update by processing the files with the latest extractor, but this hasn't been rolled into action.

danlamanna · 2023-01-10T15:02:06Z

> we could probably package the schemas with dandi-schema. an alternative would be to cache the request on the server side. are the requests all in isolated processes or would a cache to keep a schema once downloaded work?

Avoiding network requests altogether would be best for maximizing reliability. A cache combined with giving the caller control over how network requests are performed (timeouts, retries, etc) would be the next best option.
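As a rough illustration of that "next best option", the sketch below pairs an in-process cache with a fetcher whose timeout and retry count are chosen by the caller. `SchemaCache` and `make_requests_fetcher` are hypothetical names, not part of dandischema; the URL is the one from the snippet quoted in the issue. The cache lives in memory only, so isolated processes would still each fetch a given schema once.

```python
import threading
from typing import Any, Callable, Dict, Tuple

Schema = Dict[str, Any]
Fetcher = Callable[[str, str], Schema]


def make_requests_fetcher(timeout: float = 10.0, retries: int = 3) -> Fetcher:
    """Build a fetcher with an explicit timeout and connection-level retries."""
    # Imported lazily so the cache below has no hard dependency on requests.
    import requests
    from requests.adapters import HTTPAdapter

    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retries))

    def fetch(schema_version: str, schema_filename: str) -> Schema:
        response = session.get(
            "https://raw.githubusercontent.com/dandi/schema/"
            f"master/releases/{schema_version}/{schema_filename}",
            timeout=timeout,  # never wait indefinitely
        )
        response.raise_for_status()
        return response.json()

    return fetch


class SchemaCache:
    """Keep each (version, filename) schema in memory after the first fetch."""

    def __init__(self, fetch: Fetcher) -> None:
        self._fetch = fetch
        self._schemas: Dict[Tuple[str, str], Schema] = {}
        self._lock = threading.Lock()

    def get(self, schema_version: str, schema_filename: str) -> Schema:
        key = (schema_version, schema_filename)
        # Holding the lock during the fetch serializes downloads, which keeps
        # the sketch simple and guarantees one request per key.
        with self._lock:
            if key not in self._schemas:
                self._schemas[key] = self._fetch(schema_version, schema_filename)
            return self._schemas[key]
```

A caller would construct one `SchemaCache(make_requests_fetcher(timeout=5.0))` and reuse it across all n assets, so n validations against the same schema version cost a single request.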

satra · 2023-01-10T16:22:30Z

dandischema in general requires access to online resources to carry out its general work, so it will never be a network free library. but we can optimize it in some ways. we didn't want to make assumptions about availability of storage, persistence etc when we wrote that component, but i can try a few changes. @djarecka and @sooyounga - is this something you folks could take a stab at? happy to discuss details.

waxlamp · 2023-01-11T17:05:19Z

@satra, Dan's idea has a lot of merit: even if the goal is to always be validating against the newest schema version, we are not there yet, and keeping the allowed schema versions as static package data would gain us an immediate and obvious win (while we are still litigating, so to speak, schema autoupgrades etc.).

Dan can create a quick proof of concept so we can observe the benefits/drawbacks of the approach. He can coordinate this idea with whatever Dorota and Sooyoung are looking into as well.

satra · 2023-01-11T18:33:07Z

@waxlamp - i have no issues with a proof of concept.

danlamanna mentioned this issue Jan 18, 2023

Statically bundle dandi/schema json #155

Open
