Check on missing_percent assumes 0.999984 == 1 #2090

Open
lewis-anderson53 opened this issue May 24, 2024 · 3 comments

@lewis-anderson53

Here is my check:

checks for my_table_name [daily]:
  - missing_percent(my_column_name) < 100%:
      name: Column my_column_name contains values that are not NULL

My data is nearly all NULL, but not entirely.

AssertionError: Check results failed: 
[missing_percent(my_column_name) < 100%] FAIL (check_value: 100.0, row_count: 58335240, missing_count: 58334353)

I have a very small percentage of non-NULL values, so my check should pass. However, it fails, and I assume that is because the percentage is rounded before the comparison.

0.999984 != 1.
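
To illustrate the suspected rounding, here is a minimal Python sketch using the counts from the failure message above. It assumes the percentage is rounded to two decimal places before it is compared to the threshold; that is a guess about the behavior, not confirmed Soda internals, and the variable names are only illustrative.

```python
# Counts copied from the failure message in this issue.
row_count = 58335240
missing_count = 58334353

# Exact percentage of missing values (just under 100).
exact_percent = missing_count / row_count * 100

# ASSUMPTION: the check value is rounded to 2 decimal places before comparison.
rounded_percent = round(exact_percent, 2)

print(exact_percent < 100)    # comparison on the unrounded value
print(rounded_percent)        # value comparable to the reported check_value
print(rounded_percent < 100)  # comparison on the rounded value
```

With these counts the unrounded comparison passes, while the rounded value becomes 100.0 and the comparison fails, which matches the reported check_value.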

@tools-soda

SAS-3526

@benjamin-pirotte

benjamin-pirotte commented May 24, 2024

Hi,

Looking at your check, I am wondering whether this is the expected behavior.
missing_percent(my_column_name) < 100% means you are expecting all values to be null. In this case they are not, so it fails.

Aren't you looking for the following check: missing_percent(my_column_name) < 0% ?

@lewis-anderson53
Author

missing_percent(my_column_name) < 0% would ensure that I had no missing data at all, i.e. that every single row was not NULL.
That's not my intention with this check. I want to make sure that at least some rows are non-NULL.

To demonstrate, I lowered my threshold and looked at another column. Here I will say that less than 65% NULL is acceptable; anything higher is bad.

checks for my_table_name [daily]:
  - missing_percent(my_other_column_name) < 65%:
      name: Feature column my_other_column_name contains values that are not NULL

Which shows the following results:

AssertionError: Check results failed: 
[missing_percent(my_other_column_name) < 65%] FAIL (check_value: 73.92, row_count: 58335240, missing_count: 43121299)

My check_value here is rounded to 2 decimal places, which suggests that rounding is also what causes my first check to fail.
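
As a rough sketch of how the comparison could avoid rounding altogether, the Python below compares the raw counts using integer arithmetic. The helper `missing_percent_below` is hypothetical and is not part of Soda; it only illustrates the idea with the counts quoted in this thread.

```python
def missing_percent_below(missing_count: int, row_count: int, threshold_percent: int) -> bool:
    """Return True if missing_count / row_count * 100 < threshold_percent.

    Hypothetical helper: cross-multiplies the integers so that no
    floating-point rounding happens before the comparison.
    """
    return missing_count * 100 < threshold_percent * row_count


row_count = 58335240

# Second column: rounding the percentage to 2 decimals gives the reported check_value.
reported = round(43121299 / row_count * 100, 2)
print(reported)

# First column: almost all NULL, but not entirely, so "< 100%" holds on raw counts.
print(missing_percent_below(58334353, row_count, 100))

# Second column: clearly above the 65% threshold, so "< 65%" does not hold.
print(missing_percent_below(43121299, row_count, 65))
```

Comparing counts this way keeps the pass/fail decision exact, while the rounded percentage can still be used for display.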
