Conversation

@john-sanchez31 john-sanchez31 commented Nov 7, 2025

Includes the infrastructure for testing custom datasets, both defined locally and stored in the s3 bucket llm-fixtures. Adding the name of a dataset to S3_DATASETS sets up the needed files. S3_DATASETS_SCRIPTS defines the s3 datasets that use a .sql file to create the .db file; s3 datasets not included in S3_DATASETS_SCRIPTS will be downloaded from s3.

This setup allows local and CI testing using the tests defined in test_pipeline_custom_datasets.py and test_pipeline_s3_datasets.py.
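
A minimal sketch of the flow described above, assuming boto3 and sqlite3; the helper name, the example dataset names, and the folder layout are illustrative (only S3_DATASETS, S3_DATASETS_SCRIPTS, the llm-fixtures bucket, and the data/<dataset>.db key come from this PR):

import os
import sqlite3

import boto3

BUCKET = "llm-fixtures"
S3_DATASETS = ["synthea", "wdi"]    # example names, not the real list
S3_DATASETS_SCRIPTS = {"synthea"}   # subset built locally from a .sql script

def setup_datasets(data_folder: str) -> None:
    # Hypothetical helper: build or download each dataset's .db file.
    s3 = boto3.client("s3")
    for dataset in S3_DATASETS:
        db_file = os.path.join(data_folder, f"{dataset}.db")
        if dataset in S3_DATASETS_SCRIPTS:
            # Create the .db by executing the checked-in .sql script.
            with open(os.path.join(data_folder, f"{dataset}.sql")) as f:
                script = f.read()
            conn = sqlite3.connect(db_file)
            try:
                conn.executescript(script)
            finally:
                conn.close()
        else:
            # No script: download the prebuilt .db from the bucket.
            s3.download_file(BUCKET, f"data/{dataset}.db", db_file)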

@john-sanchez31 john-sanchez31 marked this pull request as ready for review November 13, 2025 15:38
@hadia206 hadia206 left a comment

Great work John.
I have some minor comments/questions

Comment on lines 709 to 711

key_data: str = f"data/{dataset}.db"

key_metadata: str = f"metadata/{dataset}.json"
Contributor

nit: move newline to be after key_metadata definition

Suggested change:
- key_data: str = f"data/{dataset}.db"
-
- key_metadata: str = f"metadata/{dataset}.json"
+ key_data: str = f"data/{dataset}.db"
+ key_metadata: str = f"metadata/{dataset}.json"
+

Comment on lines 748 to 754
database_path = f"{db_folder}/{dataset}.db"
try:
os.remove(database_path)
except FileNotFoundError:
print(f"Error: File '{database_path}' not found.")
except Exception as e:
print(f"An error occurred: {e}")
Contributor

Is that for both local datasets and the ones downloaded from S3?

Contributor Author

Yes, it's for both. The .db files are not really tracked, so I'm just deleting everything. The only exception is local metadata (we need those).

Contributor

Should we delete the local version or keep it? 🤷‍♀️

Contributor Author

I think it's better if we delete the .db files, so we don't have a lot of .db files here and there. Every time we run these tests, they are going to be downloaded from s3 or created from the .sql file.

Contributor

I think we should keep db files from "local" datasets, like those created from CUSTOM_DATASETS_SCRIPTS. It helps to run pdunit once and then you can use those db files with jupyter when you need to reproduce a new bug, for example.

Contributor

> I think it's better if we delete the .db files, so we don't have a lot of .db files here and there. Every time we run these tests, they are going to be downloaded from s3 or created from the .sql file.

But in this case we'll be downloading and deleting the file every time we run the test. That's not good. It's okay to keep them locally. And on CI, they'll be deleted anyway after the testing is done.
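
A minimal sketch of the teardown this thread converges on, assuming the fixture knows which datasets were built from scripts; the helper name and parameters are hypothetical:

import os

def cleanup_datasets(db_folder: str, datasets: list[str], scripts: set[str]) -> None:
    # Keep .db files built locally from scripts (e.g. CUSTOM_DATASETS_SCRIPTS)
    # so they can be reused in jupyter; remove only the downloaded ones.
    for dataset in datasets:
        if dataset in scripts:
            continue  # locally built: keep for reuse
        try:
            os.remove(f"{db_folder}/{dataset}.db")
        except FileNotFoundError:
            pass  # already absent, nothing to clean up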

    CUSTOM_DATASETS,
    CUSTOM_DATASETS_SCRIPTS,
)
print("Datasetes downloaded")
Contributor

Suggested change:
- print("Datasetes downloaded")
+ print("Datasets downloaded")

)
print("Datasetes downloaded")
yield
print("\nRemoving datasetes")
Contributor

Suggested change:
- print("\nRemoving datasetes")
+ print("\nRemoving datasets")

lambda: pd.DataFrame(
    {
-       "condition_description": ["Normal pregnancy"],
+       "condition_description": ["Viral sinusitis (disorder)"],
Contributor

Why this change?

Contributor Author

Now that we're using synthea from S3, the answer changed since it has different data (I don't really know the details of that data).

pyproject.toml Outdated
mysql = ["mysql-connector-python"]
postgres = ["psycopg2-binary"]
server = ["fastapi", "httpx", "uvicorn"]
boto3 = ["boto3"]
Contributor

This should be part of the dev-dependencies.
It's only used in testing, not related to any PyDough functionality.
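
A hypothetical sketch of that move; the exact dev-dependency table depends on the project's tooling (the [dependency-groups] form shown here is an assumption, not the repo's actual layout):

[project.optional-dependencies]
mysql = ["mysql-connector-python"]
postgres = ["psycopg2-binary"]
server = ["fastapi", "httpx", "uvicorn"]

[dependency-groups]
dev = [
    "boto3",  # only needed by the S3 testing fixtures
]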

hadia206
hadia206 previously approved these changes Nov 17, 2025

@hadia206 hadia206 left a comment

Thanks John!

pytest.ini Outdated
postgres: marks tests that require PostgresSQL credentials
server: marks tests that require api mock server
sf_masked: marks tests that require Snowflake Masked credentials
custom: marks tests that require custom datasets from s3

@juankx-bodo juankx-bodo Nov 18, 2025

Just a context question: should these tests be marked s3_datasets instead of custom? Custom makes me think of something like reserved words, datasets from our repo created by us and customized for edge cases, instead of public datasets stored in AWS.

Contributor

I'm okay with that. I believe he just used the existing test file name.

db_file: str = f"{data_folder}/{dataset}.db"

if dataset in scripts:
    # setting up with script
Contributor

Is there a scenario where we use scripts other than CUSTOM_DATASETS_SCRIPTS?

Contributor

You just download from S3 directly; you don't need a script to generate the data.

Contributor Author

Not sure if I understand the question. As mentioned before, this variable is used to determine which datasets are built from the local .sql files and which ones are downloaded from s3.


@pytest.fixture(scope="session")
- def sqlite_custom_datasets_connection() -> DatabaseContext:
+ def custom_datasets_setup():
Contributor

Should we split local custom datasets and s3 datasets into different test_pipeline files and env setups? I think we should always run the tests for the local datasets and use the pytest mark only for the s3 ones.

Contributor

My understanding was that custom datasets were variant versions of the s3 datasets, so both are the same.
@john-sanchez31 can you clarify?

Contributor Author

Yes, the s3 datasets and the custom local datasets both fall under what we call custom datasets; those tests run with [run all] or [run custom].

@hadia206 hadia206 dismissed their stale review November 18, 2025 18:30

Dismissing approval for now till Juan's questions are resolved.

return request.param


@pytest.mark.custom
@juankx-bodo juankx-bodo Nov 18, 2025

We should use @pytest.mark.custom only for s3 datasets; that's why I was thinking of it as @pytest.mark.s3_datasets instead. Local datasets like keywords should always be tested the same way defog and tpch are.

Contributor Author

I think both keywords and s3_datasets are in the same category. Neither is part of defog or tpch, and both are used for edge cases or bug testing; that's why there are custom datasets. I thought the new flag was created just for that, so we don't have to run those every time.

Contributor

Oh I see. So let's just remove the (custom) marker from keywords, since it doesn't really rely on a database in S3.

Contributor Author

But in that case, what type of test or category does keywords fall into? What about wdi? It's using the local version, not the s3 version (which is more than 2 GB).

@hadia206 hadia206 Nov 18, 2025

> But in that case, what type of test or category does keywords fall into?

The general testing category (the execute marker). Tests that are okay to run all the time to make sure there's no regression.

> What about wdi? It's using the local version, not the s3 version (which is more than 2 GB).

This can still be part of the custom marker: it's using some variation of the S3 data, and keywords is already doing the same bug check.
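
To illustrate the split being settled on (the test names below are hypothetical; only the custom marker and the execute category come from this thread):

import pytest

# keywords only needs a locally built database, so it runs with the
# general suite (the execute marker mentioned above).
@pytest.mark.execute
def test_keywords_pipeline(sqlite_custom_datasets_connection):
    ...

# wdi is a variation of the S3 data, so it keeps the custom marker and
# only runs when that marker is selected, e.g. with `pytest -m custom`.
@pytest.mark.custom
def test_wdi_pipeline(sqlite_custom_datasets_connection):
    ...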

@juankx-bodo juankx-bodo left a comment

LGTM

@john-sanchez31 john-sanchez31 changed the title from "Adding custom tests infrastructure" to "Adding custom & s3 tests infrastructure" Nov 20, 2025
Comment on lines 204 to 207
@pytest.fixture(scope="session")
- def get_test_graph_by_name() -> graph_fetcher:
+ def get_s3_datasets_graph(s3_datasets_setup) -> graph_fetcher:
Contributor

NIT: if this fixture depends on s3_datasets_setup, then place its definition right below s3_datasets_setup
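
A sketch of the ordering being suggested (fixture bodies elided; the names come from the snippet above):

@pytest.fixture(scope="session")
def s3_datasets_setup():
    ...

# Defined right below the fixture it depends on, per the NIT above.
@pytest.fixture(scope="session")
def get_s3_datasets_graph(s3_datasets_setup) -> graph_fetcher:
    ...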

"""
Returns the SQLITE database connection with all the custom datasets attached.
This fixture is used to connect to the sqlite database of the custom datasets.
Returns a DatabaseContext for the MySQL TPCH database.
Contributor

MySQL?

Contributor

Don't forget to fix that^

Comment on lines 828 to 830
print("Datasets downloaded")
yield
print("\nRemoving datasets")
Contributor

Should we ditch the prints?


@hadia206 hadia206 left a comment

Thanks John

"""
Returns the SQLITE database connection with all the custom datasets attached.
This fixture is used to connect to the sqlite database of the custom datasets.
Returns a DatabaseContext for the MySQL TPCH database.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to fix that^

@john-sanchez31 john-sanchez31 merged commit 4768e18 into main Nov 20, 2025
22 checks passed
@john-sanchez31 john-sanchez31 deleted the John/s3_testing branch November 20, 2025 20:19