Adding custom & s3 tests infrastructure #457
Conversation
hadia206 left a comment
Great work, John. I have some minor comments/questions.
tests/conftest.py
Outdated
    key_data: str = f"data/{dataset}.db"
    key_metadata: str = f"metadata/{dataset}.json"
nit: move newline to be after key_metadata definition
Suggested change:

    key_data: str = f"data/{dataset}.db"
    key_metadata: str = f"metadata/{dataset}.json"
tests/conftest.py
Outdated
    database_path = f"{db_folder}/{dataset}.db"
    try:
        os.remove(database_path)
    except FileNotFoundError:
        print(f"Error: File '{database_path}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")
Is that for both local datasets and the ones downloaded from S3?
Yes, it's for both. The .db files are not really tracked, so I'm just deleting everything. The only exception is local metadata (we need those).
Should we delete the local version or keep it? 🤷‍♀️
I think it's better if we delete the .db files, so we don't have a lot of .db files here and there. Every time we run these tests, they are going to be downloaded from S3 or created from the .sql file.
I think we should keep the .db files from "local" datasets, like those created from CUSTOM_DATASETS_SCRIPTS. It helps to run pdunit once, and then you can use those .db files with Jupyter when you need to reproduce a new bug, for example.
> I think it's better if we delete the .db files, so we don't have a lot of .db files here and there. Every time we run these tests, they are going to be downloaded from S3 or created from the .sql file.

But in this case we'll be downloading and deleting the file every time we run the test. That's not good. It's okay to keep them locally. And on CI, they'll be deleted anyway after the testing is done.
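Where the thread landed, as a minimal sketch: delete only the .db files that came from S3, and keep the script-built ones around for ad-hoc debugging. The set names and the db_folder argument are hypothetical stand-ins for what actually lives in tests/conftest.py.

    import os

    # Hypothetical example values; the real sets live in tests/conftest.py.
    DATASETS = {"synthea", "keywords"}
    CUSTOM_DATASETS_SCRIPTS = {"keywords"}  # built locally from tracked .sql files


    def remove_downloaded_db_files(db_folder: str) -> None:
        """Delete only the .db files downloaded from S3; keep the ones
        created from CUSTOM_DATASETS_SCRIPTS for use in e.g. Jupyter."""
        for dataset in DATASETS - CUSTOM_DATASETS_SCRIPTS:
            database_path = f"{db_folder}/{dataset}.db"
            try:
                os.remove(database_path)
            except FileNotFoundError:
                pass  # nothing was downloaded, e.g. the test was skipped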
tests/conftest.py
Outdated
        CUSTOM_DATASETS,
        CUSTOM_DATASETS_SCRIPTS,
    )
    print("Datasetes downloaded")
| print("Datasetes downloaded") | |
| print("Datasets downloaded") |
tests/conftest.py
Outdated
    )
    print("Datasetes downloaded")
    yield
    print("\nRemoving datasetes")
| print("\nRemoving datasetes") | |
| print("\nRemoving datasets") |
      lambda: pd.DataFrame(
          {
    -         "condition_description": ["Normal pregnancy"],
    +         "condition_description": ["Viral sinusitis (disorder)"],
Why this change?
Now that we're using synthea from S3, the answer changed, since it has other data (I don't really know the details of that data).
pyproject.toml
Outdated
    mysql = ["mysql-connector-python"]
    postgres = ["psycopg2-binary"]
    server = ["fastapi", "httpx", "uvicorn"]
    boto3 = ["boto3"]
This should be part of the dev-dependencies.
It's only used in testing, not related to any PyDough functionality.
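For illustration, a sketch of what that move could look like in pyproject.toml, assuming the project declares a dev extra; the exact table and group name are assumptions, not this repo's actual layout.

    [project.optional-dependencies]
    mysql = ["mysql-connector-python"]
    postgres = ["psycopg2-binary"]
    server = ["fastapi", "httpx", "uvicorn"]

    # boto3 is only used by the test suite, so it moves out of the
    # user-facing extras and into the dev dependencies (assumed name).
    dev = ["boto3"]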
hadia206 left a comment
Thanks John!
pytest.ini
Outdated
    postgres: marks tests that require PostgreSQL credentials
    server: marks tests that require api mock server
    sf_masked: marks tests that require Snowflake Masked credentials
    custom: marks tests that require custom datasets from s3
Just a context question: should these tests be marked s3_datasets instead of custom? "Custom" makes me think more of something like reserved words (datasets from our repo, created by us and customized for edge cases) instead of public datasets stored in AWS.
I'm okay with that. I believe he just used the existing test file's name.
tests/conftest.py
Outdated
    db_file: str = f"{data_folder}/{dataset}.db"

    if dataset in scripts:
        # setting up with script
Is there a scenario where we use scripts other than CUSTOM_DATASETS_SCRIPTS?
You just download from S3 directly; you don't need a script to generate the data.
Not sure if I understand the question. As mentioned before, this variable is used to determine which datasets are built from the local .sql files and which ones are downloaded from S3.
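To make the answer concrete, here is a rough sketch of that branch. The function is an illustrative reconstruction, not the exact conftest.py code; the data/{dataset}.db key and the llm-fixtures bucket name come from this PR, while setup_dataset itself is hypothetical.

    import sqlite3

    import boto3


    def setup_dataset(dataset: str, data_folder: str, scripts: set[str]) -> str:
        """Build the .db file from a tracked .sql script when one exists;
        otherwise download the prebuilt .db file from S3."""
        db_file = f"{data_folder}/{dataset}.db"
        if dataset in scripts:
            # setting up with script
            conn = sqlite3.connect(db_file)
            with open(f"{data_folder}/{dataset}.sql") as f:
                conn.executescript(f.read())
            conn.close()
        else:
            # No script: fetch the prebuilt file from the llm-fixtures bucket.
            s3 = boto3.client("s3")
            s3.download_file("llm-fixtures", f"data/{dataset}.db", db_file)
        return db_file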
tests/conftest.py
Outdated
      @pytest.fixture(scope="session")
    - def sqlite_custom_datasets_connection() -> DatabaseContext:
    + def custom_datasets_setup():
Should we split into different test_pipelines and env set-ups for local custom datasets and S3 datasets? I think we should always run the tests from local datasets and use the pytest mark only for the S3 ones.
My understanding was that custom datasets were variant versions of the S3 datasets, so both are the same.
@john-sanchez31 can you clarify?
Yes, S3 datasets and custom local datasets both fall under what we call custom datasets; those tests run with [run all] or [run custom].

Dismissing approval for now until Juan's questions are resolved.
        return request.param


    @pytest.mark.custom
We should use @pytest.mark.custom only for S3 datasets; that's why I was thinking of it as @pytest.mark.s3_datasets instead. Local datasets like keywords should always be tested, the same way defog and tpch are.
I think both keywords and s3_datasets are in the same category. Neither of them is part of defog or tpch, and both are used for edge cases or bug testing; that's why the custom datasets exist. I thought the new flag was created just for that, so we don't have to run those every time.
Oh I see. So let's just remove the marker (custom) from keywords, since it doesn't really rely on the database in S3.
But in that case, what type of tests or category does keywords fall into? And what about wdi, since it's using the local version of it, not the S3 version (which is more than 2 GB)?
> But in that case, what type of tests or category does keywords fall into?

The general testing category (the execute marker). Tests that are okay to run all the time to make sure there's no regression.

> And what about wdi, since it's using the local version of it, not the S3 version (which is more than 2 GB)?

This can still be part of the custom marker, since it's using some variation of the S3 data, and keywords is already doing the same bug check.
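To illustrate the resolution: keyword-style tests run with the general suite, while S3-backed tests carry the marker and are only selected explicitly. Test names, bodies, and the placeholder fixture below are made up.

    import pytest


    @pytest.fixture(scope="session")
    def custom_datasets_setup():
        yield  # placeholder for the real dataset setup/teardown


    def test_keywords_edge_case(custom_datasets_setup):
        # Local dataset built from a tracked .sql file: runs with the
        # general suite (the execute category) on every test run.
        assert True  # placeholder body


    @pytest.mark.custom
    def test_wdi_query(custom_datasets_setup):
        # Uses a variation of the S3 data, so it is selected only when
        # the marker is requested, e.g. `pytest -m custom`.
        assert True  # placeholder body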
juankx-bodo left a comment
LGTM
tests/conftest.py
Outdated
      @pytest.fixture(scope="session")
    - def get_test_graph_by_name() -> graph_fetcher:
    + def get_s3_datasets_graph(s3_datasets_setup) -> graph_fetcher:
NIT: if this fixture depends on s3_datasets_setup, then place its definition right below s3_datasets_setup
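A minimal sketch of the dependency the nit refers to: pytest resolves s3_datasets_setup first, so keeping the two definitions adjacent makes that ordering easy to see. Bodies are placeholders.

    import pytest


    @pytest.fixture(scope="session")
    def s3_datasets_setup():
        # Download/build the datasets here, then clean up after the yield.
        yield


    # Defined right below the fixture it depends on, per the nit: pytest
    # resolves s3_datasets_setup before building the graph fetcher.
    @pytest.fixture(scope="session")
    def get_s3_datasets_graph(s3_datasets_setup):
        # Return the graph fetcher once the datasets are in place.
        yield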
tests/conftest.py
Outdated
| """ | ||
| Returns the SQLITE database connection with all the custom datasets attached. | ||
| This fixture is used to connect to the sqlite database of the custom datasets. | ||
| Returns a DatabaseContext for the MySQL TPCH database. |
MySQL?
Don't forget to fix that^
tests/conftest.py
Outdated
| print("Datasets downloaded") | ||
| yield | ||
| print("\nRemoving datasets") |
Should we ditch the prints?
hadia206 left a comment
Thanks John
Includes the infrastructure for testing custom datasets, both defined locally and stored in the S3 bucket llm-fixtures. Adding the name of a dataset to S3_DATASETS sets up the needed files (creating the .db file from the .sql file). S3_DATASETS_SCRIPTS defines the S3 datasets that use a .sql file to create the .db file; S3 datasets not included in S3_DATASETS_SCRIPTS will be downloaded from S3. This setup allows local and CI testing using the tests defined in test_pipeline_custom_datasets.py and test_pipeline_s3_datasets.py.
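Putting the pieces together, a condensed sketch of the session fixture this flow describes. The S3_DATASETS / S3_DATASETS_SCRIPTS names and the llm-fixtures bucket come from the description above; the folder layout and example dataset names are assumptions.

    import os
    import sqlite3

    import boto3
    import pytest

    DATA_FOLDER = "tests/data"  # assumed layout
    S3_DATASETS: set[str] = {"synthea", "keywords"}  # example names
    S3_DATASETS_SCRIPTS: set[str] = {"keywords"}  # built from local .sql files


    @pytest.fixture(scope="session")
    def s3_datasets_setup():
        # Setup: build script-based datasets locally, download the rest.
        s3 = boto3.client("s3")
        for dataset in S3_DATASETS:
            db_file = f"{DATA_FOLDER}/{dataset}.db"
            if dataset in S3_DATASETS_SCRIPTS:
                conn = sqlite3.connect(db_file)
                with open(f"{DATA_FOLDER}/{dataset}.sql") as f:
                    conn.executescript(f.read())
                conn.close()
            else:
                s3.download_file("llm-fixtures", f"data/{dataset}.db", db_file)
        yield
        # Teardown: remove only the downloaded .db files (see review thread).
        for dataset in S3_DATASETS - S3_DATASETS_SCRIPTS:
            try:
                os.remove(f"{DATA_FOLDER}/{dataset}.db")
            except FileNotFoundError:
                pass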