GSoC 2025: Investigating Schema Normalization #857
I love this, and I think it has some interesting overlap with my linting proposal: #856. The things that should be normalised can make very good linting rules that we can aim to auto-fix for schemas. If I can help in any way, please count me in.
Thank you for your submission, Julian.
Hello, I'm Vishwapriya, currently in the 3rd year of my BTech. I want to contribute to this idea in this GSoC; this is my very first attempt at an open-source contribution. @Julian or anyone, please let me know where and how to start.
This sounds like a great idea, here's what I found and think. After a bit of digging around, I found some common patterns where different schemas produce identical validation outcomes.
Existing Normalization Approaches: Though I haven't yet gone through them in depth, I'd examine the existing approaches mentioned in the description, with a particular focus on:
Proposed Normalization Rules: Here's an initial set of normalization rules I would implement:
Implementation Approach: I would develop a Python library for it, but right now I'm still thinking about an approach. I would love to get some constructive criticism of this idea and some good feedback.
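To make the rule idea concrete, here is a minimal sketch (Python 3.9+; the function names and the particular rules are illustrative assumptions, not taken from any existing library) of two transformations that preserve boolean validation results; note that dropping annotation keywords does change annotation output:

```python
from typing import Any

# Keywords that never affect whether an instance validates.
ANNOTATION_KEYWORDS = {"title", "description", "examples", "$comment"}

def drop_annotations(schema: dict[str, Any]) -> dict[str, Any]:
    """Remove annotation-only keywords; validation outcomes are unchanged."""
    return {k: v for k, v in schema.items() if k not in ANNOTATION_KEYWORDS}

def drop_vacuous_bounds(schema: dict[str, Any]) -> dict[str, Any]:
    """Remove bounds that every instance already satisfies, e.g. "minItems": 0."""
    vacuous = {"minItems": 0, "minProperties": 0, "minLength": 0}
    return {k: v for k, v in schema.items()
            if not (k in vacuous and v == vacuous[k])}

if __name__ == "__main__":
    schema = {"title": "My Schema", "required": ["foo"], "minProperties": 0}
    print(drop_vacuous_bounds(drop_annotations(schema)))
    # -> {'required': ['foo']}
```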
Hi @Illucious! Your exploration is a good starting point! However, I do have some other thoughts:
Additionally, I hope to get some clarifications as well, if possible. @Julian What does
Hey @Julian and the team 👋 I'm Idan Levi, a software engineering undergrad with a passion for JSON Schema and open source. I've worked with both front-end and back-end tech like Next.js, Flask, and MongoDB, and I've used JSON Schema in some of my projects as well. I really enjoy diving into technologies, understanding how they work, and contributing to improve them. Additionally, I have a small question about the project: could we clarify what "better" means in this context? Are we optimizing for code readability, historical precedent, performance in validation engines, or something else? For example, Thanks in advance for your help! 😊
Hi everyone, I have a question about handling the empty object `{}`.
I'd like to know which approach you think is most reasonable, and what factors might be considered in different contexts.
I think keeping it `{}` is most appropriate because it follows the JSON Schema specification and keeps the schema as minimal as possible.
Yeah, I think it makes sense; that's why I marked it as the number one implementation.
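For what it's worth, it is easy to spot-check with the `jsonschema` package (assuming it is installed) that the empty schema and the boolean schema `true` accept the same instances, so a normalizer is free to pick either spelling as its canonical form:

```python
from jsonschema import Draft202012Validator

empty = Draft202012Validator({})           # the empty object schema
boolean_true = Draft202012Validator(True)  # the boolean schema `true`

# Both validators accept every probe instance.
for instance in [None, 0, "x", [1, 2], {"foo": "bar"}]:
    assert empty.is_valid(instance)
    assert boolean_true.is_valid(instance)
```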
Hi @Julian. My understanding of JSON Schema normalization: to me, schema normalization means transforming different JSON Schemas that mean the same thing into a standard (canonical) form. For example, normalization may involve:
Implementation approach: I am thinking a Python-based library could be made for this.
Is my understanding of schema normalization and the approach correct?
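One way such a library might be structured (a rough sketch; the names and the example rule are assumptions, not an existing API) is as a set of pure rewrite functions applied repeatedly until the schema stops changing:

```python
def dedupe_enum(schema):
    """Duplicate enum entries never change which instances are valid."""
    if isinstance(schema, dict) and "enum" in schema:
        deduped = []
        for value in schema["enum"]:
            if value not in deduped:
                deduped.append(value)
        return {**schema, "enum": deduped}
    return schema

def normalize(schema, rules):
    """Apply every rule until a fixed point is reached."""
    while True:
        new = schema
        for rule in rules:
            new = rule(new)
        if new == schema:
            return new
        schema = new

print(normalize({"enum": [1, 1, 2]}, [dedupe_enum]))  # -> {'enum': [1, 2]}
```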
@Julian
Hi all! Thanks for your interest, glad to hear some are excited! Here's a qualification task for this idea: review the Hypothesis normalizer that I linked and document as many of the transformation rules you see it doing as you can. If you manage that, another good thing to review is this egg tutorial, which we'll likely review more carefully during the project itself, as I'd like to understand whether it can help with either the normalization process itself or else with doing optimizations on top of whatever we define. As for those of you asking for precise definitions of "better": that is up to us to define as part of the project, and it is likely there are multiple different normalizations and rulesets we can and will define. "Fastest for boolean validation" is certainly the most obvious one, and "fastest for validation while preserving equivalent annotation results" is likely another. Good luck to all of you, and thanks again for your interest.
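One very rough way to picture a cost-based notion of "better" (purely a toy illustration; a real metric would have to account for how validators actually evaluate keywords) is a function that an optimizer built on rewrite rules would try to minimize over equivalent forms:

```python
def keyword_count(schema):
    """Toy cost: total number of keywords, recursing into subschemas."""
    if isinstance(schema, dict):
        return len(schema) + sum(keyword_count(v) for v in schema.values())
    if isinstance(schema, list):
        return sum(keyword_count(v) for v in schema)
    return 0

# The schema without the annotation keyword is "cheaper" under this toy metric.
assert keyword_count({"title": "My Schema", "required": ["foo"]}) > keyword_count({"required": ["foo"]})
```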
Hi @Julian! I can't wait to start the qualification task! Would you prefer submissions directly in this issue thread (for example, a gist link or a directly attached PDF file), or through another channel?
The easiest is probably if you respond with a gist or repository, I suppose, yep!
This project seems very interesting. Here are the normalisation rules I found in the linked repository: https://gist.github.com/tthijm/a2c2d16db2242753ae940d0a763aef6c.
Do we have to share the gist in public?
I suppose yep, as this is an open-source project.
Hey @Julian!
@cbum-dev it should not really matter as long as you share the gist link here.
Hi all, here is my gist for the first task: Normalization Rules. For the second task, I reviewed the egg crate, and I think that an e-graph-based approach could be beneficial for the normalization process. E-graphs are widely used in similar problems such as compiler optimization.
Useful links I found for reference on e-graphs: Link 1, Link 2
Hello @Julian, I have completed the qualification task by reviewing the Hypothesis normalizer and documenting its transformation rules. Now I am exploring how egg can be used to apply or optimize these rules.
Hi @Julian, I’ve been working on the proposal for this project and wanted to ask how you’d prefer the qualification task to be presented. Would you like a dedicated section summarizing my findings and approach, or would a brief mention with a link to the gist work better? Also, are there any specific aspects you’d like me to highlight? Thanks!
Either is fine! If you shared a link with me already, you can just put the link there, yeah; no need to repeat what you found (but if not, putting it in the proposal is also good with me).
@Julian
Brief Description
JSON Schema is a rich language for expressing constraints on JSON data. If we strictly consider JSON Schema validation (rather than any other use of JSON Schema), in many cases there are multiple ways to express the same constraints. For example, the schema:
will have the same validation outcome on all instances as the schema:
One might say that this second schema is "better" than the first one in some way that could be made precise.
The same is true for the schemas `{"required": ["foo"]}` and `{"title": "My Schema", "required": ["foo"]}`, and one might say the first one is "better" than the second for the purpose of validation.

We can define two schemas to be "equivalent" if they have this property that any instance is valid under one if and only if it is valid under the other, and if we have two equivalent schemas `S` and `S'` we might wish to define an algorithm for transforming these schemas into a form which is "canonical" or "normal", such as above.

There are existing attempts to do this for various use cases, but no central place where a self-contained set of normalization rules is written down and a self-contained tool exists to perform the procedure. Let's try and write a simple one!
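A cheap way to falsify (though never to prove) the "equivalent" relation above is to compare two validators on a handful of probe instances. Here is a small sketch using the `jsonschema` package on the two `required` schemas mentioned above; the probes are arbitrary and chosen only for illustration:

```python
from jsonschema import Draft202012Validator

s1 = {"required": ["foo"]}
s2 = {"title": "My Schema", "required": ["foo"]}

v1, v2 = Draft202012Validator(s1), Draft202012Validator(s2)

# Both schemas should agree on every probe: objects with and without "foo",
# plus non-objects, for which `required` is simply ignored.
probes = [{}, {"foo": 1}, {"bar": 2}, [1, 2], "foo", 42, None]
assert all(v1.is_valid(p) == v2.is_valid(p) for p in probes)
```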
Expected Outcomes
Skills Required
Mentors
@Julian
Expected Difficulty
medium
Expected Time Commitment
175 hours