
GSoC 2025: Investigating Schema Normalization #857

Open
Julian opened this issue Jan 16, 2025 · 26 comments
Labels
gsoc Google Summer of Code Project Idea

Comments

@Julian
Member

Julian commented Jan 16, 2025

Brief Description
JSON Schema is a rich language for expressing constraints on JSON data. If we strictly consider JSON Schema validation (rather than any other use of JSON Schema), in many cases there are multiple ways to express the same constraints. For example, the schema:

{
  "oneOf": [
    {"const": "foo"},
    {"const": "bar"}
  ]
}

will have the same validation outcome on all instances as the schema:

{"enum": ["foo", "bar"]}

One might say that this second schema is "better" than the first in some way that could be made precise.

The same is true for the schemas {"required": ["foo"]} and {"title": "My Schema", "required": ["foo"]}, and one might say the first one is "better" than the second for the purpose of validation.

We can define two schemas to be "equivalent" if they have this property that any instance is valid under one if and only if it is valid under the other, and if we have two equivalent schemas S and S' we might wish to define an algorithm for transforming these schemas into a form which is "canonical" or "normal" such as above.

There are existing attempts to do this for various use cases, but no central place where a self-contained set of normalization rules is written down, and no self-contained tool that performs the procedure. Let's try to write a simple one!
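As a rough sketch of what one such rule might look like in Python (the function name and structure here are hypothetical, not taken from any existing tool), the oneOf-of-const collapse above could be written as:

```python
def normalize_one_of_consts(schema):
    """Collapse {"oneOf": [{"const": ...}, ...]} into an equivalent {"enum": [...]}.

    Hypothetical rule sketch. Duplicate consts are deliberately left alone:
    "oneOf" requires exactly one subschema to match, so a duplicated value
    would fail it, while the corresponding "enum" would accept the value.
    """
    subschemas = schema.get("oneOf")
    if not (isinstance(subschemas, list) and subschemas):
        return schema
    if not all(isinstance(s, dict) and set(s) == {"const"} for s in subschemas):
        return schema
    values = [s["const"] for s in subschemas]
    # Pairwise comparison rather than set(), since const values may be unhashable.
    if any(values[i] == values[j]
           for i in range(len(values)) for j in range(i + 1, len(values))):
        return schema  # duplicates present: the rewrite would not be equivalence-preserving
    return {"enum": values}
```

Note the duplicate-const guard: it is exactly the kind of edge case that makes a test suite of equivalent/non-equivalent schema pairs valuable.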

Expected Outcomes

  • Investigate the existing implementations of normalization in the wild. There are at least two known ones, one being here.
  • Define a set of normalization rules, with configurability for cases where there are multiple reasonable canonical forms
  • Define a set of test cases for schemas which are equivalent under these rules, and for the target canonical form for each set of schemas
  • Write a Python library which performs the normalization and emits the normalized schema
  • Empirically test our normalization procedure by running normalized schemas through Bowtie and comparing whether a given implementation returns the same results

Skills Required

  • An existing understanding of JSON Schema's keywords, which can be used to think about areas which might create possible "denormalization" (e.g. keywords which when used together overlap)
  • Familiarity writing Python, and ideally using JSON Schema from Python
  • Experience testing pieces of software by writing test cases, here likely in the form of writing JSON Schema + instance examples
  • Careful diligence in reading and understanding the existing procedures used (in the link above, as well as in a number of JSON Schema journal articles) and the ability to compare the previous work with each other

Mentors
@Julian

Expected Difficulty
medium

Expected Time Commitment
175

@jviotti
Member

jviotti commented Jan 16, 2025

I love this, and I think it has some interesting overlap with my linting proposal: #856. The things that should be normalised can make very good linting rules that we can aim to auto-fix for schemas. If I can help in any way, please count me in.

@benjagm benjagm added the gsoc Google Summer of Code Project Idea label Jan 17, 2025
@Honyii
Contributor

Honyii commented Jan 18, 2025

Thank you for your submission Julian

@VishwapriyaVelumula

Hello, I'm Vishwapriya, currently in the third year of my BTech. I want to contribute to this idea for GSoC; this is my very first attempt at an open-source contribution. @Julian or anyone, please let me know where and how to start.
Thank you in advance!

@Illucious

This sounds like a great idea, here's what I found and think.

After a bit of digging around, I found some common patterns where different schemas produce identical validation outcomes:

  1. Enumeration Equivalences
     {"oneOf": [{"const": "foo"}, {"const": "bar"}]} ≡ {"enum": ["foo", "bar"]}
     {"anyOf": [{"const": "foo"}, {"const": "bar"}]} ≡ {"enum": ["foo", "bar"]}

  2. Type Equivalences
     {"type": ["string", "number"]} ≡ {"anyOf": [{"type": "string"}, {"type": "number"}]}

  3. Required Properties
     {"properties": {"foo": {}}, "required": ["foo"]} ≡ {"properties": {"foo": {"type": ["null", "string", "number", "integer", "object", "array", "boolean"]}}, "required": ["foo"]}

  4. Non-validation Keywords
     {"title": "My Schema", "description": "...", "required": ["foo"]} ≡ {"required": ["foo"]}

  5. Default Values
     {"type": "string", "default": "foo"} ≡ {"type": "string"} (for validation purposes only)

  6. Empty Schemas
     {} ≡ {"additionalProperties": true}
     {} ≡ {"type": ["null", "string", "number", "integer", "object", "array", "boolean"]}

Existing Normalization Approaches

Though I haven't yet gone through them in depth, I'd examine the existing approaches mentioned in the description, with a particular focus on:

  1. How they handle nested schemas
  2. Their treatment of metadata keywords
  3. Special handling of boolean schemas

Proposed Normalization Rules

Here's an initial set of normalization rules I would implement:

  1. Enumeration Normalization: Convert any oneOf or anyOf with only const schemas to an enum
  2. Type Consolidation: Normalize type specifications to the most concise form
  3. Redundant Keyword Removal: Strip keywords that don't affect validation outcomes
  4. Property Requirement Simplification: Standardize ways of expressing required properties
  5. Logical Operation Flattening: Flatten nested logical operators when possible
  6. Empty Schema Normalization: Standardize representation of unrestricted schemas
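To make rule 3 (redundant keyword removal) concrete, here is a minimal sketch; the keyword list is illustrative rather than exhaustive, and the function name is hypothetical:

```python
# Illustrative annotation-keyword list; not exhaustive, and drafts differ.
ANNOTATIONS = {"title", "description", "default", "examples", "$comment"}

def strip_annotations(schema):
    """Recursively remove annotation-only keywords that never affect validation.

    Simplified sketch: the dicts under "properties" (and similar keywords) map
    *property names* to subschemas, so those names must not be treated as
    keywords -- a property legitimately named "title" has to survive.
    """
    if isinstance(schema, list):
        return [strip_annotations(s) for s in schema]
    if not isinstance(schema, dict):
        return schema
    result = {}
    for keyword, value in schema.items():
        if keyword in ANNOTATIONS:
            continue
        if keyword in ("properties", "patternProperties", "$defs"):
            result[keyword] = {name: strip_annotations(sub) for name, sub in value.items()}
        else:
            result[keyword] = strip_annotations(value)
    return result
```

The "properties" special case is a good example of why blind tree rewriting is dangerous here: keyword position matters, not just keyword name.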

Implementation Approach

I would develop a Python library for this, but right now I'm still thinking through the approach.

I would love to get some constructive criticism of this idea and some good feedback.

@TheGreatCorrine

Hi @Illucious ! Your exploration is a good starting point! However, I do have some other thoughts:

  1. Non-validation Keywords:
    JSON Schema is not only used for validation; it is also used for many other tasks, including API documentation generation and UI generation. So it is important to keep these keywords (description, etc.).
  2. Empty Schema Equivalence:
    While {} ≡ {"additionalProperties": true} is correct, {} ≡ {"type": ["null", "string", "number", "integer", "object", "array", "boolean"]} is not entirely accurate: an empty schema accepts any valid JSON value, but listing all types doesn't fully capture the semantic meaning of an empty schema, especially in nested contexts.

Additionally, I hope to get some clarification if possible. @Julian, what does "better" mean? Historical precedence? Code readability? Or something else? Is the second schema better just because it improves readability? :) Any insight would be highly appreciated!

@idanidan29

idanidan29 commented Feb 28, 2025

Hey @Julian and the team👋

I’m Idan Levi, a software engineering undergrad with a passion for JSON Schema and open-source. I’ve worked with both front-end and back-end tech like Next.js, Flask, and MongoDB, and I’ve used JSON Schema in some of my projects as well.

I really enjoy diving into technologies, understanding how they work, and contributing to improve them, and I would love to apply to this project for GSoC!

Additionally, I have a small question about the project.

Could we clarify what "better" means in this context? Are we optimizing for code readability, historical precedence, performance in validation engines, or something else? For example, { "enum": ["foo", "bar"] } is more concise than { "oneOf": [{ "const": "foo" }, { "const": "bar" }] }, but does that alone make it the preferred form? Defining the exact criteria for "better" would help refine the normalization rules.

Thanks in advance for your help! 😊

@TheGreatCorrine

Hi everyone, I have a question about handling the empty object {}. In JSON Schema, an empty object accepts any valid JSON value. I see several possible normalization approaches:

  1. Keep as is: Maintain {} as {}
  2. Expand to explicit form: Convert {} to {"type": ["null", "boolean", "object", "array", "string", "number", "integer"]}
  3. Use other equivalent forms: For example {"additionalProperties": true}

I'd like to know which approach you think is most reasonable, and what factors might be considered in different contexts?
Thanks in advance for your insights!

@cbum-dev

cbum-dev commented Mar 3, 2025

Hi everyone, I have a question about handling the empty object {}. In JSON Schema, an empty object accepts any valid JSON value. I see several possible normalization approaches:

1. Keep as is: Maintain `{}` as `{}`

2. Expand to explicit form: Convert `{}` to `{"type": ["null", "boolean", "object", "array", "string", "number", "integer"]}`

3. Use other equivalent forms: For example `{"additionalProperties": true}`

I'd like to know which approach you think is most reasonable, and what factors might be considered in different contexts? Thanks in advance for your insights!

I think keeping it {} is most appropriate because it follows the JSON Schema specification and keeps the schema as minimal as possible.

@TheGreatCorrine

Hi everyone, I have a question about handling the empty object {}. In JSON Schema, an empty object accepts any valid JSON value. I see several possible normalization approaches:

1. Keep as is: Maintain `{}` as `{}`

2. Expand to explicit form: Convert `{}` to `{"type": ["null", "boolean", "object", "array", "string", "number", "integer"]}`

3. Use other equivalent forms: For example `{"additionalProperties": true}`

I'd like to know which approach you think is most reasonable, and what factors might be considered in different contexts? Thanks in advance for your insights!

I think keeping it {} is most appropriate because it follows the JSON Schema specification and keeps the schema as minimal as possible.

Yeah, I think it makes sense; that's why I listed it as the first option.

@KajalMishra-29

KajalMishra-29 commented Mar 4, 2025

hi @Julian ,
I recently came across the "Investigating Schema Normalization" project and found it really interesting. I have some experience working with the MERN stack (MongoDB, Express, React, Node.js), where I worked with JSON data in API requests and stored JSON-like data in MongoDB. This experience gave me a strong connection to the concepts of JSON schema, which is why I find this project particularly appealing.
After going through the project details, I’d love to share my understanding of the project and how I plan to approach its implementation.

My Understanding of JSON Schema Normalization :

To me, schema normalization means transforming different JSON Schemas that mean the same thing into a standard (canonical) form.

For example, normalization may involve:

  1. Merging anyOf into a single type definition
    {"anyOf": [{"type": "string"}, {"type": "number"}]} -> {"type": ["string", "number"]}
  2. Sorting required fields alphabetically
    {"required": ["name", "age"]} → {"required": ["age", "name"]}
  3. Removing redundant properties
    {"type": "string", "enum": ["apple"]} -> {"const": "apple"}
  4. Rewriting negated types as the complementary type list
    {"not": {"type": "null"}} -> {"type": ["boolean", "object", "array", "string", "number", "integer"]}

Implementation Approach :

I am thinking a Python-based library could be built for this: identify scenarios where normalization could be applied, and apply a normalization rule to each scenario independently. For example, replacing a single-value enum with const:

def remove_redundant_properties(schema):
    if "enum" in schema and len(schema["enum"]) == 1:
        return {"const": schema["enum"][0]}
    return schema

{ "type": "string", "enum": ["apple"]} -> { "const": "apple" }

Is my understanding of schema normalization and the approach correct?
What tools or Python libraries should I familiarize myself with for this project?
I’d love to get your insights on how I can further refine my approach. Looking forward to your guidance.

@Kashika23

@Julian
Hi, I am interested in contributing to the "Investigating Schema Normalization" project for GSoC 2025. JSON Schema provides multiple ways to define equivalent constraints, and this project aims to create a self-contained normalization tool that transforms schemas into a canonical form.
As a beginner in Python and JSON Schema, I want to improve my skills in schema validation, algorithm design, and software testing. I am eager to study existing normalization approaches, define rules, and implement a Python library for normalization. This project will deepen my understanding of structured data validation and improve my Python development skills.
I am looking forward to your guidance! I am eager to learn, collaborate, and deliver a tool that simplifies schema exploration for developers worldwide.

@Julian
Member Author

Julian commented Mar 6, 2025

Hi all! Thanks for your interest, glad to hear some are excited!

Here's a qualification task for this idea:

Review the Hypothesis normalizer that I linked and document as many of the transformation rules as you can see it performing.

If you manage that, another good thing to review is this egg tutorial which we'll likely review more carefully during the project itself, as I'd like to understand whether it can help with either the normalization process itself or else with doing optimizations on top of whatever we define.

As for those of you asking for precise definitions of "better": that is up to us to define as part of the project and it is likely there are multiple different normalizations and rulesets we can and will define. "Fastest for boolean validation" is certainly the most obvious one, and "Fastest for validation while preserving equivalent annotation results" is likely another.

Good luck to all of you and thanks again for your interest.

@TheGreatCorrine

TheGreatCorrine commented Mar 6, 2025

Hi all! Thanks for your interest, glad to hear some are excited!

Here's a qualification task for this idea:

Review the Hypothesis normalizer that I linked and document as many of the transformation rules as you can see it performing.

If you manage that, another good thing to review is this egg tutorial which we'll likely review more carefully during the project itself, as I'd like to understand whether it can help with either the normalization process itself or else with doing optimizations on top of whatever we define.

As for those of you asking for precise definitions of "better": that is up to us to define as part of the project and it is likely there are multiple different normalizations and rulesets we can and will define. "Fastest for boolean validation" is certainly the most obvious one, and "Fastest for validation while preserving equivalent annotation results" is likely another.

Good luck to all of you and thanks again for your interest.

Hi @Julian ! I can't wait to start the qualification task! Would you prefer submissions directly in this issue thread (For example, submit a gist link, or directly attach a PDF file?) or through another channel?

@Julian
Member Author

Julian commented Mar 6, 2025

The easiest is probably if you respond with a gist or repository, yep!

@tthijm

tthijm commented Mar 6, 2025

This project seems very interesting. Here are the normalisation rules I found in the linked repository: https://gist.github.com/tthijm/a2c2d16db2242753ae940d0a763aef6c.

@cbum-dev

cbum-dev commented Mar 7, 2025

This project seems very interesting. Here are the normalisation rules I found in the linked repository: https://gist.github.com/tthijm/a2c2d16db2242753ae940d0a763aef6c.

Do we have to share the gist in public?

@TheGreatCorrine

This project seems very interesting. Here are the normalisation rules I found in the linked repository: https://gist.github.com/tthijm/a2c2d16db2242753ae940d0a763aef6c.

Do we have to share the gist in public?

I suppose yep, as this is an open-source project

@techmannih

Hey @Julian !
I'm really excited about contributing to this project! The idea of normalizing JSON schemas to a canonical form is both practical and fascinating. I already have a strong understanding of JSON Schema and its keywords, and I’m eager to apply this knowledge to identify and transform schemas into more consistent forms. I’m also excited to dive into testing the normalization process and ensuring it works reliably.

@tthijm

tthijm commented Mar 12, 2025

@cbum-dev it should not really matter as long as you share the gist link here.

@thornxyz

Hi all,
I found the problem statement of converting JSON schema into a canonical form very interesting, and I hope I can contribute to this task.

Here is my gist for the first task: Normalization Rules

For the second task, I reviewed the egg crate and I think that an e-graph-based approach could be beneficial for the normalization process. E-graphs are widely used in similar problems such as compiler optimization.

  • E-graphs can store and unify logically identical JSON Schemas, eliminating redundancy.
  • Equality saturation explores all possible transformations efficiently.
  • Structured rewrites improve performance and clarity.

Useful Links I found for reference to e-graphs: Link 1, Link 2

@TheGreatCorrine

Hi @Julian and all, here is my gist link for the qualification tasks: here

It discusses the current normalization rules in the _canonicalise.py file, proposes some potential configuration rulesets, and mentions egg's application to JSON Schema (I'm still working on it).

@cbum-dev

Hi @Julian and all,
Here is my gist for the qualification task gist
For the second task I am still reviewing the egg tutorial and will discuss with you in the coming days.

@MadhavDhatrak

Hello @Julian

I have completed the Qualification task by reviewing the Hypothesis normalizer and documenting its transformation rules. Now, I am exploring how Egg can be used to apply or optimize these rules.

@idanidan29

Hi @Julian, I’ve been working on the proposal for this project and wanted to ask how you’d prefer the qualification task to be presented. Would you like a dedicated section summarizing my findings and approach, or would a brief mention with a link to the gist work better? Also, are there any specific aspects you’d like me to highlight? Thanks!

@Julian
Member Author

Julian commented Mar 27, 2025

Either is fine! If you already shared a link with me, you can just put the link there; no need to repeat what you found (but if not, putting it in the proposal is also fine with me).

@Illucious

@Julian
I have read the qualification task and drafted some normalization rules and tests referencing the Hypothesis JSON Schema normalizer you provided; now I will try to build a small sample Python library as required for the project.
