Validating assets calls github repeatedly #153

Open
danlamanna opened this issue Jan 9, 2023 · 5 comments
@danlamanna (Contributor) commented:

When calling dandischema.metadata.validate on n assets, n requests are made to GitHub to fetch the schema. This makes validating assets take significantly longer than it should. The request also has no default timeout, so a call to validate can hang indefinitely.

schema = requests.get(
    f"https://raw.githubusercontent.com/dandi/schema/"
    f"master/releases/{schema_version}/{schema_filename}"
).json()

Can dandi-schema be modified to avoid relying on the network for validation? For example, by bundling the schemas from dandi/schema into package data, by allowing the caller of validate to pass a schema directly, or by some other means?
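For illustration, a minimal sketch of the bundling option: schemas shipped as package data and read with importlib.resources, so validation never touches the network. The package name, directory layout, and file names below are hypothetical (a throwaway package is built on the fly just to make the sketch self-contained); the real dandi/schema layout may differ.

```python
import json
import sys
import tempfile
from importlib import resources
from pathlib import Path

# Build a throwaway package with one bundled schema so the sketch runs
# anywhere; in practice the files would ship inside dandischema itself.
tmp = Path(tempfile.mkdtemp())
pkg_dir = tmp / "demo_schema_pkg" / "schemas" / "0.6.0"
pkg_dir.mkdir(parents=True)
(tmp / "demo_schema_pkg" / "__init__.py").write_text("")
(pkg_dir / "dandiset.json").write_text(json.dumps({"schemaVersion": "0.6.0"}))
sys.path.insert(0, str(tmp))

def load_bundled_schema(schema_version: str, schema_filename: str) -> dict:
    # importlib.resources reads files installed with the package,
    # so no HTTP request (and no timeout problem) is involved.
    root = resources.files("demo_schema_pkg")
    return json.loads(
        (root / "schemas" / schema_version / schema_filename).read_text()
    )

schema = load_bundled_schema("0.6.0", "dandiset.json")
```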

FWIW this problem appears to exist with migrate as well.

@satra (Member) commented Jan 10, 2023:

@danlamanna - we could probably package the schemas with dandi-schema. an alternative would be to cache the request on the server side. are the requests all in isolated processes, or would a cache that keeps each schema once downloaded work?

it's also the case that this download happens when an asset uses a different schema than the current one. this is true for many assets that were submitted a while back, but in theory should not be true for new assets being uploaded, i.e. new assets should use the latest schema version.

we have been planning to run a metadata update by reprocessing the files with the latest extractor, but this hasn't been put into action yet.

@danlamanna (Contributor, issue author) commented:

> we could probably package the schemas with dandi-schema. an alternative would be to cache the request on the server side. are the requests all in isolated processes or would a cache to keep a schema once downloaded work?

Avoiding network requests altogether would be best for maximizing reliability. A cache combined with giving the caller control over how network requests are performed (timeouts, retries, etc.) would be the next best option.
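As a sketch of the "caller control" option (function and parameter names are illustrative, not dandischema API): a `requests.Session` configured with bounded retries, with the timeout supplied per request so a hung connection can't stall validation indefinitely.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Build a Session that retries transient server errors with backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,
        status_forcelist=[500, 502, 503, 504],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

# A validate() that accepted a session (and timeout) would let callers
# bound worst-case latency, e.g.:
#   session.get(schema_url, timeout=10).json()
session = make_session()
```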

@satra (Member) commented Jan 10, 2023:

dandischema in general requires access to online resources to carry out its work, so it will never be a network-free library. but we can optimize it in some ways. we didn't want to make assumptions about storage availability, persistence, etc. when we wrote that component, but i can try a few changes. @djarecka and @sooyounga - is this something you folks could take a stab at? happy to discuss details.

@waxlamp (Member) commented Jan 11, 2023:

@satra, Dan's idea has a lot of merit: even if the goal is to always be validating against the newest schema version, we are not there yet, and keeping the allowed schema versions as static package data would gain us an immediate and obvious win (while we are still litigating, so to speak, schema autoupgrades etc.).

Dan can create a quick proof of concept so we can observe the benefits and drawbacks of the approach, and coordinate it with whatever Dorota and Sooyoung are looking into as well.

@satra (Member) commented Jan 11, 2023:

@waxlamp - i have no issues with a proof of concept.
