-
Notifications
You must be signed in to change notification settings - Fork 15.6k
fix(utils): Suppress pandas date parsing warnings in normalize_dttm_col #35042
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(utils): Suppress pandas date parsing warnings in normalize_dttm_col #35042
Conversation
Fixes high-volume pandas warnings in production logs: "Could not infer format, so each element will be parsed individually, falling back to `dateutil`" - Added detect_datetime_format() to detect common date formats from data samples - When format is detected, use it explicitly (prevents warning, ~5x faster) - When format can't be detected (mixed formats), suppress the warning - Refactored into _process_datetime_column() helper to reduce complexity This approach aligns with pandas 2.0+ default behavior and industry best practices for handling datetime parsing at scale.
…me-warning-suppression
Based on your review schedule, I'll hold off on reviewing this PR until it's marked as ready for review. If you'd like me to take a look now, comment
|
Bito Automatic Review Skipped - Draft PR |
superset/utils/core.py
Outdated
@@ -1858,6 +1859,112 @@ def get_legacy_time_column( | |||
) | |||
|
|||
|
|||
def detect_datetime_format(series: pd.Series, sample_size: int = 100) -> str | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love the idea of factoring this out. Given that superset/utils/core.py
is massive already, I'm wondering where it would best fit, maybe somewhere under superset/utils/
, I see there's a lot of pandas-helpers in superset/utils/pandas_postprocessing
, maybe could go there or maybe a new superset/utils/pandas.py
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #35042 +/- ##
===========================================
+ Coverage 0 72.16% +72.16%
===========================================
Files 0 587 +587
Lines 0 42947 +42947
Branches 0 4550 +4550
===========================================
+ Hits 0 30991 +30991
- Misses 0 10749 +10749
- Partials 0 1207 +1207
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Add format detection to avoid pandas warning spam in production logs - Extract detect_datetime_format to new superset/utils/pandas.py module - Suppress warnings when format cannot be detected (mixed formats) - Improve performance by 5x for consistent date formats - Add comprehensive tests for warning suppression and format detection
- Reduce test file from 400+ to 150 lines - Remove verbose helper classes and overly complex test scenarios - Use pytest parametrize for cleaner test organization - Remove performance comparison tests (not essential) - Simplify test names and documentation - Keep only essential test coverage
- Add test for invalid epoch values that trigger ValueError - Covers the logger.warning path in _process_datetime_column - Improves patch coverage to address Codecov report
Bito Automatic Review Skipped - Draft PR |
- Add test for empty series detection in detect_datetime_format - Add test for ValueError handling in datetime conversion - Improves coverage for lines 50-51 in pandas.py and 1887-88 in core.py
Bito Automatic Review Skipped - Draft PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
Category | Issue | Status |
---|---|---|
Sample-Based Format Detection May Miss Variations ▹ view | 🧠 Not in standard | |
Inefficient Row-by-Row Timestamp Processing ▹ view | 🧠 Not in scope | |
Hardcoded datetime formats violate Open-Closed Principle ▹ view | 🧠 Not in standard |
Files scanned
File Path | Reviewed |
---|---|
superset/utils/pandas.py | ✅ |
superset/utils/core.py | ✅ |
Explore our documentation to understand the languages and file types we support and the files we ignore.
Check out our docs on how you can make Korbit work best for you and your team.
] | ||
|
||
# Get non-null sample | ||
sample = series.dropna().head(sample_size) |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
common_formats = [ | ||
"%Y-%m-%d %H:%M:%S", | ||
"%Y-%m-%d", | ||
"%Y-%m-%dT%H:%M:%S", | ||
"%Y-%m-%dT%H:%M:%SZ", | ||
"%Y-%m-%dT%H:%M:%S.%f", | ||
"%Y-%m-%dT%H:%M:%S.%fZ", | ||
"%m/%d/%Y", | ||
"%d/%m/%Y", | ||
"%Y/%m/%d", | ||
"%m/%d/%Y %H:%M:%S", | ||
"%d/%m/%Y %H:%M:%S", | ||
"%m-%d-%Y", | ||
"%d-%m-%Y", | ||
"%Y%m%d", | ||
] |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
df[col.col_label] = dttm_series.apply( | ||
lambda x: pd.Timestamp(x) if pd.notna(x) else pd.NaT | ||
) |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review Agent Run #5caaa0
Actionable Suggestions - 1
-
superset/utils/core.py - 1
- Inconsistent datetime parsing · Line 1911-1916
Additional Suggestions - 1
-
superset/utils/pandas.py - 1
-
Performance optimization needed · Line 60-60The current implementation performs redundant double parsing validation. After confirming a format works on a 10-row test sample, it unnecessarily re-parses the entire sample (up to 100 rows) with the same format. This creates computational overhead in `detect_datetime_format` which is called by `superset.utils.core` during DataFrame processing. The optimization removes the redundant second parsing while maintaining the same validation accuracy.
Code suggestion
@@ -57,12 +57,8 @@ - # Try each format - for fmt in common_formats: - try: - # Test on small sample first - test_sample = sample.head(10) - pd.to_datetime(test_sample, format=fmt, errors="raise") - # If successful, verify on larger sample - pd.to_datetime(sample, format=fmt, errors="raise") - return fmt - except (ValueError, TypeError): - continue + # Try each format + for fmt in common_formats: + try: + pd.to_datetime(sample, format=fmt, errors="raise") + return fmt + except (ValueError, TypeError): + continue
-
Review Details
-
Files reviewed - 3 · Commit Range:
afc6bc4..37fe60f
- superset/utils/core.py
- superset/utils/pandas.py
- tests/unit_tests/utils/test_date_parsing.py
-
Files skipped - 0
-
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful
Bito Usage Guide
Commands
Type the following command in the pull request comment and save the comment.
-
/review
- Manually triggers a full AI review. -
/pause
- Pauses automatic reviews on this pull request. -
/resume
- Resumes automatic reviews. -
/resolve
- Marks all Bito-posted review comments as resolved. -
/abort
- Cancels all in-progress reviews.
Refer to the documentation for additional commands.
Configuration
This repository uses Default Agent
You can customize the agent settings here or contact your Bito workspace admin at [email protected].
Documentation & Help
df[col.col_label], | ||
utc=False, | ||
format=None, | ||
errors="coerce", | ||
exact=False, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The datetime parsing fallback behavior when no format is detected uses exact=False
which can lead to inconsistent parsing behavior and potential data corruption. When detect_datetime_format
returns None (indicating no consistent format was found), the current implementation allows pandas to infer formats flexibly, which can result in different parsing outcomes for the same data across different contexts. This affects downstream consumers like normalize_dttm_col
-> _process_datetime_column
-> pandas.to_datetime. Change exact=False
to exact=True
to ensure consistent parsing behavior when no format is specified.
Code suggestion
Check the AI-generated fix before applying
df[col.col_label], | |
utc=False, | |
format=None, | |
errors="coerce", | |
exact=False, | |
) | |
df[col.col_label], | |
utc=False, | |
format=None, | |
errors="coerce", | |
exact=True, | |
) |
Code Review Run #5caaa0
Should Bito avoid suggestions like this for future reviews? (Manage Rules)
- Yes, avoid them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - a bit hard to review because this is so edge-casy, and kind of making the point that pandas is more intended for REPL-type use cases than in-production use cases. Approving to help stop the bleeding short term in logs while minimizing impact on current behavior.
Longer term, seems date-format-detection should be limited to DatasetEditor
sync use cases: when people add a dataset or "sync" it's columns/metadata, we'd run detection once on a sample, and carry that forward when dealing with result sets, and probably disable auto-detection at query result-time parsing.
SUMMARY
This PR fixes high-volume pandas warnings (graph) that appear in production logs when parsing datetime columns without explicit formats. The warning
"Could not infer format, so each element will be parsed individually, falling back to dateutil"
was flooding our monitoring systems (500k+ instances in 15 minutes) and masking other important issues.Root Cause:
When
normalize_dttm_col()
processes datetime columns without an explicit format, pandas attempts format inference. When this fails (due to mixed formats or ambiguous data), it falls back to element-by-element parsing using dateutil, triggering a warning for each operation.Solution:
detect_datetime_format()
function that samples 100 rows to detect common date formats (ISO, US, EU, etc.)detect_datetime_format()
to newsuperset/utils/pandas.py
module (per reviewer feedback aboutcore.py
being massive)_process_datetime_column()
helper to reduce complexityPerformance Impact:
This approach aligns with pandas 2.0+ default behavior and industry best practices for datetime parsing at scale.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
Before (Datadog logs):
After:
Performance Comparison (10k rows):
TESTING INSTRUCTIONS
Run the comprehensive test suite:
This includes:
Manual testing with sample data:
Verify in a running Superset instance:
Check existing functionality:
DateColumn(timestamp_format="epoch_s")
DateColumn(timestamp_format="%Y-%m-%d")
ADDITIONAL INFORMATION
Notes: