Conversation

@john-sanchez31 john-sanchez31 commented Nov 7, 2025

Includes the infrastructure for testing custom datasets, both defined locally and stored in the s3 bucket llm-fixtures. Adding the name of a dataset to S3_DATASETS sets up the needed files. S3_DATASETS_SCRIPTS defines the s3 datasets that use a .sql file to create the .db file; s3 datasets not included in S3_DATASETS_SCRIPTS will be downloaded from s3.

This setup allows local and CI testing using the tests defined in test_pipeline_custom_datasets.py and test_pipeline_s3_datasets.py.
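
A minimal sketch of the flow described above, assuming boto3 and sqlite3; the helper name, the example dataset names, and the folder layout are illustrative (only S3_DATASETS, S3_DATASETS_SCRIPTS, the llm-fixtures bucket, and the data/<dataset>.db key come from this PR):

import os
import sqlite3

import boto3

BUCKET = "llm-fixtures"
S3_DATASETS = ["synthea", "wdi"]    # example names, not the real list
S3_DATASETS_SCRIPTS = {"synthea"}   # subset built locally from a .sql script

def setup_datasets(data_folder: str) -> None:
    # Hypothetical helper: build or download each dataset's .db file.
    s3 = boto3.client("s3")
    for dataset in S3_DATASETS:
        db_file = os.path.join(data_folder, f"{dataset}.db")
        if dataset in S3_DATASETS_SCRIPTS:
            # Create the .db by executing the checked-in .sql script.
            with open(os.path.join(data_folder, f"{dataset}.sql")) as f:
                script = f.read()
            conn = sqlite3.connect(db_file)
            try:
                conn.executescript(script)
            finally:
                conn.close()
        else:
            # No script: download the prebuilt .db from the bucket.
            s3.download_file(BUCKET, f"data/{dataset}.db", db_file)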

@john-sanchez31 john-sanchez31 marked this pull request as ready for review November 13, 2025 15:38
@hadia206 hadia206 left a comment

Great work John.
I have some minor comments/questions

Comment on lines 709 to 711

key_data: str = f"data/{dataset}.db"

key_metadata: str = f"metadata/{dataset}.json"
Contributor

nit: move newline to be after key_metadata definition

Suggested change:
- key_data: str = f"data/{dataset}.db"
-
- key_metadata: str = f"metadata/{dataset}.json"
+ key_data: str = f"data/{dataset}.db"
+ key_metadata: str = f"metadata/{dataset}.json"
+

Comment on lines 748 to 754
database_path = f"{db_folder}/{dataset}.db"
try:
os.remove(database_path)
except FileNotFoundError:
print(f"Error: File '{database_path}' not found.")
except Exception as e:
print(f"An error occurred: {e}")
Contributor

Is that for both local datasets and the ones downloaded from S3?

Contributor Author

Yes, it's for both. The .db files are not really tracked, so I'm just deleting everything. The only exception is local metadata (we need those).

Contributor

Should we delete the local version or keep it? 🤷‍♀️

Contributor Author

I think it's better if we delete the .db files, so we don't have a lot of .db files here and there. Every time we run these tests, they are going to be downloaded from s3 or created from the .sql file.

Contributor

I think we should keep db files from "local" datasets, like those created from CUSTOM_DATASETS_SCRIPTS. It helps to run pdunit once and then you can use those db files with jupyter when you need to reproduce a new bug, for example.

Contributor

> I think it's better if we delete the .db files, so we don't have a lot of .db files here and there. Every time we run these tests, they are going to be downloaded from s3 or created from the .sql file.

But in this case we'll be downloading and deleting the file every time we run the test. That's not good. It's okay to keep them locally. And on CI, they'll be deleted anyway after the testing is done.
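
A minimal sketch of the teardown this thread converges on, assuming the fixture knows which datasets were built from scripts; the helper name and parameters are hypothetical:

import os

def cleanup_datasets(db_folder: str, datasets: list[str], scripts: set[str]) -> None:
    # Keep .db files built locally from scripts (e.g. CUSTOM_DATASETS_SCRIPTS)
    # so they can be reused in jupyter; remove only the downloaded ones.
    for dataset in datasets:
        if dataset in scripts:
            continue  # locally built: keep for reuse
        try:
            os.remove(f"{db_folder}/{dataset}.db")
        except FileNotFoundError:
            pass  # already absent, nothing to clean up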

    CUSTOM_DATASETS,
    CUSTOM_DATASETS_SCRIPTS,
)
print("Datasetes downloaded")
Contributor

Suggested change:
- print("Datasetes downloaded")
+ print("Datasets downloaded")

)
print("Datasetes downloaded")
yield
print("\nRemoving datasetes")
Contributor

Suggested change:
- print("\nRemoving datasetes")
+ print("\nRemoving datasets")

lambda: pd.DataFrame(
    {
-       "condition_description": ["Normal pregnancy"],
+       "condition_description": ["Viral sinusitis (disorder)"],
Contributor

Why this change?

Contributor Author

Now that we're using synthea from S3, the answer changed since it has different data (I don't really know the details of that data).

pyproject.toml Outdated
mysql = ["mysql-connector-python"]
postgres = ["psycopg2-binary"]
server = ["fastapi", "httpx", "uvicorn"]
boto3 = ["boto3"]
Contributor

This should be part of the dev-dependencies.
It's only used in testing, not related to any PyDough functionality.
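
A hypothetical sketch of that move; the exact dev-dependency table depends on the project's tooling (the [dependency-groups] form shown here is an assumption, not the repo's actual layout):

[project.optional-dependencies]
mysql = ["mysql-connector-python"]
postgres = ["psycopg2-binary"]
server = ["fastapi", "httpx", "uvicorn"]

[dependency-groups]
dev = [
    "boto3",  # only needed by the S3 testing fixtures
]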

hadia206
hadia206 previously approved these changes Nov 17, 2025

@hadia206 hadia206 left a comment

Thanks John!

pytest.ini Outdated
postgres: marks tests that require PostgresSQL credentials
server: marks tests that require api mock server
sf_masked: marks tests that require Snowflake Masked credentials
custom: marks tests that require custom datasets from s3

@juankx-bodo juankx-bodo Nov 18, 2025

Just a context question: should these tests be marked s3_datasets instead of custom? Custom makes me think of something like reserved words, datasets from our repo created by us and customized for edge cases, instead of public datasets stored in AWS.

Contributor

I'm okay with that. I believe he just used the existing test file name.

db_file: str = f"{data_folder}/{dataset}.db"

if dataset in scripts:
    # setting up with script
Contributor

Is there a scenario where we use scripts other than CUSTOM_DATASETS_SCRIPTS?

Contributor

You just download from S3 directly; you don't need a script to generate the data.

Contributor Author

Not sure if I understand the question. As mentioned before, this variable is used to determine which datasets are built from the local .sql files and which ones are downloaded from s3.


@pytest.fixture(scope="session")
- def sqlite_custom_datasets_connection() -> DatabaseContext:
+ def custom_datasets_setup():
Contributor

Should we split local custom datasets and s3 datasets into different test_pipeline files and env setups? I think we should always run the tests for the local datasets and use the pytest mark only for the s3 ones.

Contributor

My understanding was that custom datasets were variant versions of the s3 datasets, so both are the same.
@john-sanchez31 can you clarify?

Contributor Author

Yes, the s3 datasets and the custom local datasets both fall under what we call custom datasets; those tests run with [run all] or [run custom].

@hadia206 hadia206 dismissed their stale review November 18, 2025 18:30

Dismissing approval for now till Juan's questions are resolved.

return request.param


@pytest.mark.custom
@juankx-bodo juankx-bodo Nov 18, 2025

We should use @pytest.mark.custom only for s3 datasets; that's why I was thinking of it as @pytest.mark.s3_datasets instead. Local datasets like keywords should always be tested the same way defog and tpch are.

Contributor Author

I think both keywords and s3_datasets are in the same category. Neither is part of defog or tpch, and both are used for edge cases or bug testing; that's why there are custom datasets. I thought the new flag was created just for that, so we don't have to run those every time.

Contributor

Oh I see. So let's just remove the (custom) marker from keywords, since it doesn't really rely on a database in S3.

Contributor Author

But in that case, what type of test or category does keywords fall into? What about wdi? It's using the local version, not the s3 version (which is more than 2 GB).

@hadia206 hadia206 Nov 18, 2025

> But in that case, what type of test or category does keywords fall into?

The general testing category (the execute marker). Tests that are okay to run all the time to make sure there's no regression.

> What about wdi? It's using the local version, not the s3 version (which is more than 2 GB).

This can still be part of the custom marker: it's using some variation of the S3 data, and keywords is already doing the same bug check.
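
To illustrate the split being settled on (the test names below are hypothetical; only the custom marker and the execute category come from this thread):

import pytest

# keywords only needs a locally built database, so it runs with the
# general suite (the execute marker mentioned above).
@pytest.mark.execute
def test_keywords_pipeline(sqlite_custom_datasets_connection):
    ...

# wdi is a variation of the S3 data, so it keeps the custom marker and
# only runs when that marker is selected, e.g. with `pytest -m custom`.
@pytest.mark.custom
def test_wdi_pipeline(sqlite_custom_datasets_connection):
    ...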

@juankx-bodo juankx-bodo left a comment

LGTM

@john-sanchez31 john-sanchez31 changed the title from "Adding custom tests infrastructure" to "Adding custom & s3 tests infrastructure" Nov 20, 2025
Comment on lines 204 to 207
@pytest.fixture(scope="session")
- def get_test_graph_by_name() -> graph_fetcher:
+ def get_s3_datasets_graph(s3_datasets_setup) -> graph_fetcher:
Contributor

NIT: if this fixture depends on s3_datasets_setup, then place its definition right below s3_datasets_setup
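
A sketch of the ordering being suggested (fixture bodies elided; the names come from the snippet above):

@pytest.fixture(scope="session")
def s3_datasets_setup():
    ...

# Defined right below the fixture it depends on, per the NIT above.
@pytest.fixture(scope="session")
def get_s3_datasets_graph(s3_datasets_setup) -> graph_fetcher:
    ...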

"""
Returns the SQLITE database connection with all the custom datasets attached.
This fixture is used to connect to the sqlite database of the custom datasets.
Returns a DatabaseContext for the MySQL TPCH database.
Contributor

MySQL?

Contributor

Don't forget to fix that^

Comment on lines 828 to 830
print("Datasets downloaded")
yield
print("\nRemoving datasets")
Contributor

Should we ditch the prints?


@hadia206 hadia206 left a comment

Thanks John

"""
Returns the SQLITE database connection with all the custom datasets attached.
This fixture is used to connect to the sqlite database of the custom datasets.
Returns a DatabaseContext for the MySQL TPCH database.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to fix that^

@john-sanchez31 john-sanchez31 merged commit 4768e18 into main Nov 20, 2025
22 checks passed
@john-sanchez31 john-sanchez31 deleted the John/s3_testing branch November 20, 2025 20:19