-
I think there's a point to be made in favor of a TOML-based schema for TOML. I'm personally also in favor of #792 over #116 in terms of how the schema is defined, as 792 is rather simple and concise, which also makes it more easily parseable by humans.
To me, this would seem like a dealbreaker: if you can't validate valid TOML with it, it's not fit for the job, imo. I'd also like to re-raise a concern from 792 regarding optional keys and defaults: injecting defaults would mean the output for valid TOML matching the schema differs between schema-aware and schema-unaware parsers!
-
For a lot of TOML use cases, "data read from the TOML file can also be serialised as JSON" is a design requirement. Even for those cases where it isn't, the file format frequently won't contain any floating point fields at all, and when it does, "NaN" and "inf" are often going to be invalid values in the floating point fields anyway. In either of those situations, the fact JSON schema intrinsically disallows passing NaN and Inf values through number fields ends up not being a problem. Compared to spinning up an entire parallel schema validation ecosystem, defining a mapping from TOML floats to a dual number/string JSON format that can handle NaN and Inf is a much smaller task.
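A minimal sketch of what such a mapping could look like in Python (the `encode_float`/`decode_float` names are made up for illustration, not an existing library API):

```python
import math

def encode_float(value: float) -> float | str:
    """Map a TOML float to a JSON-safe value for a ["number", "string"] field."""
    if math.isnan(value):
        return "nan"
    if math.isinf(value):
        return "inf" if value > 0 else "-inf"
    return value

def decode_float(value: float | str) -> float:
    """Reverse the mapping; float() accepts "nan", "inf", and "-inf" directly."""
    return float(value) if isinstance(value, str) else value
```

On the schema side, the corresponding field would then declare `"type": ["number", "string"]`, optionally with an `enum` or `pattern` constraining the string form to just the special values.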
-
"data read from the TOML file can also be serialised as JSON" does not, imo, mean "data from TOML maps 1:1 with JSON without processing." I agree the JSON approach seems like less work, but either the TOML and the JSON Schema features allowed with validation would have to be more limited, or either format would have to be changed to accommodate the other. And since I doubt JSON Schema is going to be changed to accommodate non-JSON formats, and the original linked issues establish that TOML will not be changed over validation challenges, that means forking either one. Don't get me wrong, for where it currently stands I think Taplo is doing an amazing job. I just think that to truly validate all valid TOML a specific validation format (perhaps based on JSON Schema with some different definitions to accomodate TOML features) would be better. Though maybe the people over at JSON Schema could be convinced to broaden their scope slightly to encompass other data formats a bit more. And on that note, as much as I like the idea of 792, I think some features like the default injection should be a for a custom parser to support, and not an mainline TOML (parser) problem or a validator's task. |
-
(Inspired by #792, but opened as a discussion rather than an issue, since I don't think this should even become a documentation proposal until there's initial agreement that it's a good path to take)
Defining comprehensive data schemas is difficult (especially if they can reference each other), so using JSON Schema to validate TOML documents seems like a more pragmatic path forward than attempting to build a separate TOML-specific schema validation ecosystem.
A version of this idea is already implemented in `taplo`, which uses `#:schema ./foo-schema.json` comments to reference JSON schema documents: https://taplo.tamasfe.dev/configuration/directives.html#the-schema-directive

(While the Python standard library's `tomllib` module doesn't provide access to TOML comments, the feature is available by iterating over the `body` attribute of a `tomlkit.TOMLDocument` instance, allowing scanning for schema references using the same format as `taplo`.)

Given a JSON schema reference, validating a TOML document against a JSON schema specification at runtime is going to be fairly straightforward: load the data from the TOML file, load the schema file into your preferred JSON schema validation library, and then check the data matches the schema.
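A sketch of how those pieces might fit together, using `tomllib` for the data and `tomlkit` only for comment scanning (the `find_schema_directive`/`validate_toml` helpers are hypothetical, and reading comment text back via `as_string()` is an assumption about how `tomlkit` exposes whole-line comments):

```python
import json
import tomllib  # Python 3.11+ standard library

import jsonschema  # pip install jsonschema
import tomlkit     # pip install tomlkit

def find_schema_directive(toml_text: str) -> str | None:
    """Scan a document's top-level comments for a taplo-style '#:schema <ref>' directive."""
    # TOMLDocument.body is a list of (key, item) pairs; whole-line
    # comments appear as items with a key of None.
    for _key, item in tomlkit.parse(toml_text).body:
        text = item.as_string().strip()
        if text.startswith("#:schema "):
            return text.removeprefix("#:schema ").strip()
    return None

def validate_toml(toml_path: str) -> None:
    with open(toml_path, encoding="utf-8") as f:
        toml_text = f.read()
    schema_ref = find_schema_directive(toml_text)
    if schema_ref is None:
        return  # nothing to validate against
    with open(schema_ref, encoding="utf-8") as f:
        schema = json.load(f)
    # Note: parsed date/time values would still need the serialisation
    # step discussed below before they'll pass JSON Schema validation.
    jsonschema.validate(instance=tomllib.loads(toml_text), schema=schema)
```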
What's missing is a clear explanation of how the different pieces of a TOML document map to different concepts in JSON schema, since the two specifications sometimes use different terminology for the same things, and there are some features of TOML that need to be skipped if you want the data read from the document to validate as JSON at all (let alone against a specific schema).
The TOML mapping for the basic JSON Schema types is straightforward (TOML type -> JSON type):

- string -> string
- integer -> integer
- float -> number
- boolean -> boolean
- array -> array
- table -> object
All of the regular JSON schema features for these types can be applied to TOML documents, remembering that they apply to the parsed values, not the exact text as written into the TOML file (so things like the string quoting format or whether a table is inline or not don't matter).
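For instance (a small illustration of the "parsed values" point, not from the original post), a literal string and a basic string validate identically:

```python
import tomllib
import jsonschema

schema = {
    "type": "object",
    "properties": {"name": {"type": "string", "minLength": 1}},
    "required": ["name"],
}

# The quoting style only affects the source text, not the parsed value,
# so both documents pass the same schema check.
for document in ("name = 'ferris'", 'name = "ferris"'):
    jsonschema.validate(instance=tomllib.loads(document), schema=schema)
```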
Notable caveats and limitations for the basic types:

- TOML `float` values include `nan` and `inf`, which can't be represented in a plain `number` JSON schema field. To pass schema validation, `nan` and `inf` (and their positive and negative variants) need to be encoded as dual type `["number", "string"]` fields rather than using the native float representations of the special values.
- TOML has no `null` values. The closest TOML has to a representation of `null` is "omit that key", which only applies to tables and the top level keys of a document.

The final case to consider is how dates, times, and their optional timezone offsets should be matched to the JSON schema RFC 3339 guidelines in https://json-schema.org/draft/2020-12/json-schema-validation#name-defined-formats
This last part isn't actually a TOML question; it's a question of how the structured date/time objects emitted by a compliant TOML parser are serialised to strings before being passed to the chosen JSON Schema validator (passing the structured date/time objects directly will always fail, since they're not a valid JSON type).
For Python, for example, making `jsonschema` happy with serialised `datetime` values requires ensuring that they're converted to strings which comply with RFC 3339 as JSON Schema specifies (the ISO 8601 based `isoformat()` methods are sufficient for this, since they include the separators that RFC 3339 requires).
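A minimal sketch of that pre-validation conversion step (the recursive `jsonify` helper is hypothetical, not part of `tomllib` or `jsonschema`):

```python
import datetime

def jsonify(value):
    """Recursively replace parsed TOML date/time objects with their
    RFC 3339 compliant isoformat() string representations."""
    if isinstance(value, (datetime.datetime, datetime.date, datetime.time)):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: jsonify(item) for key, item in value.items()}
    if isinstance(value, list):
        return [jsonify(item) for item in value]
    return value
```

Validating `jsonify(data)` instead of the raw parsed data then lets `format: "date-time"` and related assertions see the string form they expect.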