Skip to content

Conversation

@juankx-bodo
Copy link
Contributor

@juankx-bodo juankx-bodo commented Oct 31, 2025

Update metadata with AI enriched version based on latest metadata generation tool version
Sample values and all definitions are based on the actual BIRD full dataset.

@juankx-bodo juankx-bodo force-pushed the jkx/Update_synthea_and_WDI_metadata branch from 6b1b7a8 to e028c1d Compare November 12, 2025 14:15
Copy link
Contributor

@john-sanchez31 john-sanchez31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we changing this? For #457 I'm using synthea metadata from s3 so it will be no longer needed in the repo (actually this metadata will be no longer in the repo). Also I think we should wait for the new [run custom] flag which runs custom tests so this can be tested properly.

"properties": [
{
"name": "ITEM",
"name": "item",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why all properties are being change from uppercase to lowecase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

names generated by the metadata tool are lowercase. No manual changes were done to htese files.

"name": "year_",
"type": "table column",
"column name": "Year",
"column name": "\"Year\"",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is quoted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metadata generation tool have Python, Pydough and SQL keyword lists. If a table path or column name value matches that list or have special characters then the value is double-quote enclosed and any double-quote character in the name is escaped. With the quoting fix in Pydough this should not be an issue.

@hadia206
Copy link
Contributor

I agree with John, do we need that after the changes he made in this PR?

@juankx-bodo
Copy link
Contributor Author

Why are we changing this? For #457 I'm using synthea metadata from s3 so it will be no longer needed in the repo (actually this metadata will be no longer in the repo). Also I think we should wait for the new [run custom] flag which runs custom tests so this can be tested properly.

This is actually true. With reserved words dataset and tests from s3 we no longer need synthea and WDI datasets on our repo anymore. We can talk about this. Meanwhile, the purpose of the PR was to update the "original" json files with the latest version used by LLM team, generated by the tool and enriched with help of the BIRD csv files.

@juankx-bodo
Copy link
Contributor Author

I agree with John, do we need that after the changes he made in this PR?

The intention of the PR is to use the latest version of the json used also by the LLM team. Also agree that we should not have Synthea and WDI datasets in our repo, each of them are there only for 1 bug that also are covered with reserved words dataset. In case we want to keep those tests we could move them to use the s3 version of the datasets.

@knassre-bodo
Copy link
Contributor

@juankx-bodo I'm still confused why we need this PR at all since we shouldn't be using the JSON files in the repo, they get downloaded from S3.

@hadia206 hadia206 marked this pull request as draft December 23, 2025 01:00
@juankx-bodo juankx-bodo closed this Jan 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants