Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CONTRIB] Normalize schema and table names. #10763

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

stejin
Copy link
Contributor

@stejin stejin commented Dec 11, 2024

https://support.greatexpectations.io/hc/en-us/requests/210

Background: I am building tests against SQL data assets (MSSQL). Schema names and tables are in all caps. Table names start with a number and need to be wrapped in square brackets.

Problem 1:

When adding a SQL data asset, the schema name is automatically changed to lowercase. Later, a connection test fails because the lowercase schema name is compared to the original schema name as defined in the database.

Solution: Apply the function “to_lower_if_not_quoted” to the list of schemas returned by the database.

Problem 2:

Since my table names start with a number, I need to wrap them in square brackets when adding data assets in Great Expectations. This causes a problem with the retrieval of column names which expects the actual table name without the brackets.

Solution: Strip square brackets and double quotes from table names before retrieving column metadata via SqlAlchemy.

  • Description of PR changes above includes a link to an existing GitHub issue
  • PR title is prefixed with one of: [BUGFIX], [FEATURE], [DOCS], [MAINTENANCE], [CONTRIB]
  • Code is linted - run invoke lint (uses ruff format + ruff check)
  • Appropriate tests and docs have been updated

For more information about contributing, visit our community resources.

After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!

Copy link

netlify bot commented Dec 11, 2024

👷 Deploy request for niobium-lead-7998 pending review.

Visit the deploys page to approve it

Name Link
🔨 Latest commit ab9b4c7

Copy link
Contributor

@billdirks billdirks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing these changes! I've left a few comments inline.

@@ -995,7 +995,9 @@ def test_connection(self) -> None:
engine: sqlalchemy.Engine = datasource.get_engine()
inspector: sqlalchemy.Inspector = sa.inspect(engine)

if self.schema_name and self.schema_name not in inspector.get_schema_names():
if self.schema_name and self.schema_name not in map(
to_lower_if_not_quoted, inspector.get_schema_names()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also call to_lower_if_not_quoted on the self.schema_name used in the map lookup. Right now, this may fail for snowflake since I think their sqlalchemy connector changes names to uppercase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can add, but this only failed because to_lower_if_not_quoted was already applied to self.schema_name. Applying it again should be redundant.

@@ -87,6 +87,10 @@ def _spark(
return spark_column_metadata


def _get_normalized_table_name(table_name: str):
return table_name.strip("[]").strip('"')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems MSSQL specific. I would rename this method to _get_mssql_normalized_table_name or something similar and check the dialect before calling via execution_engine.dialect_name == GXSqlDialect.MSSQL.

I am not familiar with MSSQL. Can square brackets or double quotes appear in table names if preceded by an escape character? That is, are there table names that will be broken with these strip calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. Square brackets are allowed in MSSQL but then need to be escaped when running queries against the table. A better solution would be to let GX add the square brackets to the db object names when needed. Currently, I have to add them when adding a data asset, e.g. asset = datasource.add_table_asset(name="ClientCodes", schema_name="CR", table_name="[247_CLIENT_CODES]").

@@ -106,7 +110,7 @@ def _get_sqlalchemy_column_metadata(

return get_sqlalchemy_column_metadata(
execution_engine=execution_engine,
table_selectable=table_selectable, # type: ignore[arg-type]
table_selectable=_get_normalized_table_name(table_selectable), # type: ignore[arg-type]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above about only calling this normalization for MSSQL execution engines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants