fix(utils): Suppress pandas date parsing warnings in normalize_dttm_col #35042

aminghadersohi · 2025-09-07T00:16:07Z

SUMMARY

This PR fixes high-volume pandas warnings (graph) that appear in production logs when parsing datetime columns without explicit formats. The warning "Could not infer format, so each element will be parsed individually, falling back to dateutil" was flooding our monitoring systems (500k+ instances in 15 minutes) and masking other important issues.

Root Cause:
When normalize_dttm_col() processes datetime columns without an explicit format, pandas attempts format inference. When this fails (due to mixed formats or ambiguous data), it falls back to element-by-element parsing using dateutil, triggering a warning for each operation.

Solution:

- Format Detection: Added detect_datetime_format() function that samples 100 rows to detect common date formats (ISO, US, EU, etc.)
- Vectorized Parsing: When format is detected, use it explicitly for ~5x faster vectorized parsing
- Warning Suppression: When formats are mixed/ambiguous, suppress the warning while maintaining functionality
- Code Organization: Moved detect_datetime_format() to new superset/utils/pandas.py module (per reviewer feedback about core.py being massive)
- Code Refactoring: Extracted logic into _process_datetime_column() helper to reduce complexity

Performance Impact:

- Consistent date formats: ~5x faster due to vectorized parsing
- Mixed formats: Same speed but no warning spam
- Detection overhead: Negligible (only samples 100 rows)

This approach aligns with pandas 2.0+ default behavior and industry best practices for datetime parsing at scale.
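The warning-suppression step can be illustrated with Python's standard `warnings` module alone. This is a minimal sketch, not the PR's code: `noisy_parse` and `quiet_parse` are hypothetical names, and `noisy_parse` is a stand-in that emits the same message text as the quoted log line so the example runs without pandas. It shows a filter scoped to one message inside `catch_warnings()`, which leaves unrelated warnings visible.

```python
import re
import warnings

# Message text copied from the log line quoted in this PR.
INFER_MSG = "Could not infer format, so each element will be parsed individually"


def noisy_parse(values):
    """Hypothetical stand-in for a parser that warns when it falls back."""
    warnings.warn(INFER_MSG + ", falling back to `dateutil`.", UserWarning)
    return list(values)


def quiet_parse(values):
    """Call the parser with only the format-inference warning silenced."""
    with warnings.catch_warnings():
        # `message` is a regex matched against the start of the warning text.
        warnings.filterwarnings(
            "ignore", message=re.escape(INFER_MSG), category=UserWarning
        )
        return noisy_parse(values)


# Record everything that escapes the scoped filter.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = quiet_parse(["2023-01-01", "01/02/2023"])

print(len(caught))  # number of warnings that escaped the filter
```

Filtering on the message text rather than on `UserWarning` as a whole keeps the suppression narrow, which matters when the goal is to stop one warning from masking others.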

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before (Datadog logs):

WARNING | superset.utils.core:1698 | UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
[Repeated hundreds of times per hour]

After:

[No warnings - clean logs]

Performance Comparison (10k rows):

Before: 4.9ms (element-by-element parsing)
After:  0.9ms (vectorized parsing with format detection)
Speedup: 5.4x

TESTING INSTRUCTIONS

Run the comprehensive test suite:
```
pytest tests/unit_tests/utils/test_date_parsing.py -v
```
This includes:
- Format detection tests
- Warning suppression verification
- Performance comparisons
- Edge case handling

Manual testing with sample data:

import pandas as pd
from superset.utils.core import normalize_dttm_col, DateColumn

# Test with consistent format (should be fast, no warnings)
df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-03"]
})
normalize_dttm_col(df, (DateColumn(col_label="date"),))

# Test with mixed formats (should suppress warnings)
df = pd.DataFrame({
    "date": ["2023-01-01", "01/02/2023", "March 3, 2023"]
})
normalize_dttm_col(df, (DateColumn(col_label="date"),))

Verify in a running Superset instance:
- Create a chart with datetime columns
- Check logs for absence of "Could not infer format" warnings
- Verify dates are parsed correctly

Check existing functionality:
- Epoch timestamps still work: DateColumn(timestamp_format="epoch_s")
- Explicit formats still work: DateColumn(timestamp_format="%Y-%m-%d")
- Timezone offsets still applied correctly

ADDITIONAL INFORMATION

Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

Notes:

- No breaking changes - all existing functionality preserved
- Follows similar approach to pandas 2.0+ built-in behavior
- Aligns with Superset's SIP-15A proposal for datetime format inference
- All pre-commit hooks pass (mypy, ruff, pylint)

Fixes high-volume pandas warnings in production logs: "Could not infer format, so each element will be parsed individually, falling back to `dateutil`"

- Added detect_datetime_format() to detect common date formats from data samples
- When format is detected, use it explicitly (prevents warning, ~5x faster)
- When format can't be detected (mixed formats), suppress the warning
- Refactored into _process_datetime_column() helper to reduce complexity

This approach aligns with pandas 2.0+ default behavior and industry best practices for handling datetime parsing at scale.


korbit-ai · 2025-09-07T00:16:12Z

Based on your review schedule, I'll hold off on reviewing this PR until it's marked as ready for review. If you'd like me to take a look now, comment /korbit-review.

Your admin can change your review schedule in the Korbit Console

bito-code-review · 2025-09-07T00:16:30Z

Bito Automatic Review Skipped - Draft PR

Bito didn't auto-review because this pull request is in draft status.
No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.
You can change draft PR review settings here, or contact your Bito workspace admin at [email protected].

mistercrunch · 2025-09-07T00:26:53Z

superset/utils/core.py

@@ -1858,6 +1859,112 @@ def get_legacy_time_column(
        )


+def detect_datetime_format(series: pd.Series, sample_size: int = 100) -> str | None:


Love the idea of factoring this out. Given that superset/utils/core.py is already massive, I'm wondering where it would best fit. Maybe somewhere under superset/utils/: there are a lot of pandas helpers in superset/utils/pandas_postprocessing, so it could go there, or maybe in a new superset/utils/pandas.py.

codecov · 2025-09-07T00:30:41Z

Codecov Report

❌ Patch coverage is 88.88889% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.16%. Comparing base (0fce5ec) to head (37fe60f).
⚠️ Report is 5 commits behind head on master.

Files with missing lines	Patch %	Lines
superset/utils/core.py	89.47%	2 Missing ⚠️
superset/utils/pandas.py	88.23%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           master   #35042       +/-   ##
===========================================
+ Coverage        0   72.16%   +72.16%     
===========================================
  Files           0      587      +587     
  Lines           0    42947    +42947     
  Branches        0     4550     +4550     
===========================================
+ Hits            0    30991    +30991     
- Misses          0    10749    +10749     
- Partials        0     1207     +1207

Flag	Coverage Δ
hive	`46.60% <13.88%> (?)`
mysql	`71.17% <69.44%> (?)`
postgres	`71.22% <69.44%> (?)`
presto	`50.30% <69.44%> (?)`
python	`72.12% <88.88%> (?)`
sqlite	`70.81% <88.88%> (?)`
unit	`100.00% <ø> (?)`


- Add format detection to avoid pandas warning spam in production logs
- Extract detect_datetime_format to new superset/utils/pandas.py module
- Suppress warnings when format cannot be detected (mixed formats)
- Improve performance by 5x for consistent date formats
- Add comprehensive tests for warning suppression and format detection

- Reduce test file from 400+ to 150 lines
- Remove verbose helper classes and overly complex test scenarios
- Use pytest parametrize for cleaner test organization
- Remove performance comparison tests (not essential)
- Simplify test names and documentation
- Keep only essential test coverage

- Add test for invalid epoch values that trigger ValueError
- Covers the logger.warning path in _process_datetime_column
- Improves patch coverage to address Codecov report


- Add test for empty series detection in detect_datetime_format
- Add test for ValueError handling in datetime conversion
- Improves coverage for lines 50-51 in pandas.py and 1887-88 in core.py


korbit-ai

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.

Category	Issue	Status
	Sample-Based Format Detection May Miss Variations	🧠 Not in standard
	Inefficient Row-by-Row Timestamp Processing	🧠 Not in scope
	Hardcoded datetime formats violate Open-Closed Principle	🧠 Not in standard

Files scanned

File Path	Reviewed
superset/utils/pandas.py	✅
superset/utils/core.py	✅


superset/utils/pandas.py

+    ]
+
+    # Get non-null sample
+    sample = series.dropna().head(sample_size)


superset/utils/pandas.py

+    common_formats = [
+        "%Y-%m-%d %H:%M:%S",
+        "%Y-%m-%d",
+        "%Y-%m-%dT%H:%M:%S",
+        "%Y-%m-%dT%H:%M:%SZ",
+        "%Y-%m-%dT%H:%M:%S.%f",
+        "%Y-%m-%dT%H:%M:%S.%fZ",
+        "%m/%d/%Y",
+        "%d/%m/%Y",
+        "%Y/%m/%d",
+        "%m/%d/%Y %H:%M:%S",
+        "%d/%m/%Y %H:%M:%S",
+        "%m-%d-%Y",
+        "%d-%m-%Y",
+        "%Y%m%d",
+    ]
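The candidate list above is tried in order, so the first format that parses the whole sample wins. The following is a minimal stdlib sketch of that idea, not the PR's function: `detect_format` and `CANDIDATE_FORMATS` are hypothetical names, and `datetime.strptime` stands in for `pd.to_datetime(..., format=fmt, errors="raise")`. It uses a subset of the formats, kept in the same relative order.

```python
from datetime import datetime

# Subset of the candidate list above, in the same relative order.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%m/%d/%Y",
    "%d/%m/%Y",
]


def detect_format(sample, formats=CANDIDATE_FORMATS):
    """Return the first format that parses every value in `sample`, else None."""
    if not sample:
        return None
    for fmt in formats:
        try:
            for value in sample:
                datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            continue  # at least one value did not fit; try the next format
        else:
            return fmt
    return None


print(detect_format(["2023-01-01", "2023-01-02"]))
print(detect_format(["01/02/2023", "03/04/2023"]))  # ambiguous: first match wins
print(detect_format(["13/02/2023", "01/02/2023"]))  # day > 12 rules out month-first
print(detect_format(["2023-01-01", "March 3, 2023"]))  # mixed: no single format
```

Because `%m/%d/%Y` precedes `%d/%m/%Y`, a sample in which every day value is 12 or lower is read as month-first. That ordering is a design choice with real consequences for EU-style data.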


superset/utils/core.py

+                df[col.col_label] = dttm_series.apply(
+                    lambda x: pd.Timestamp(x) if pd.notna(x) else pd.NaT
+                )


bito-code-review

Code Review Agent Run #5caaa0

Actionable Suggestions - 1

superset/utils/core.py - 1
- Inconsistent datetime parsing · Line 1911-1916

Additional Suggestions - 1

superset/utils/pandas.py - 1

Performance optimization needed · Line 60-60

The current implementation performs redundant double parsing validation. After confirming a format works on a 10-row test sample, it unnecessarily re-parses the entire sample (up to 100 rows) with the same format. This creates computational overhead in `detect_datetime_format` which is called by `superset.utils.core` during DataFrame processing. The optimization removes the redundant second parsing while maintaining the same validation accuracy.

Code suggestion

 @@ -57,12 +57,8 @@
-    # Try each format
-    for fmt in common_formats:
-        try:
-            # Test on small sample first
-            test_sample = sample.head(10)
-            pd.to_datetime(test_sample, format=fmt, errors="raise")
-            # If successful, verify on larger sample
-            pd.to_datetime(sample, format=fmt, errors="raise")
-            return fmt
-        except (ValueError, TypeError):
-            continue
+    # Try each format
+    for fmt in common_formats:
+        try:
+            pd.to_datetime(sample, format=fmt, errors="raise")
+            return fmt
+        except (ValueError, TypeError):
+            continue

Review Details

Files reviewed - 3 · Commit Range: afc6bc4..37fe60f
- superset/utils/core.py
- superset/utils/pandas.py
- tests/unit_tests/utils/test_date_parsing.py
Files skipped - 0
Tools
- Whispers (Secret Scanner) - ✔︎ Successful
- Detect-secrets (Secret Scanner) - ✔︎ Successful


bito-code-review · 2025-09-08T16:59:17Z

superset/utils/core.py

+                    df[col.col_label],
+                    utc=False,
+                    format=None,
+                    errors="coerce",
+                    exact=False,
+                )


Inconsistent datetime parsing

The datetime parsing fallback behavior when no format is detected uses exact=False which can lead to inconsistent parsing behavior and potential data corruption. When detect_datetime_format returns None (indicating no consistent format was found), the current implementation allows pandas to infer formats flexibly, which can result in different parsing outcomes for the same data across different contexts. This affects downstream consumers like normalize_dttm_col -> _process_datetime_column -> pandas.to_datetime. Change exact=False to exact=True to ensure consistent parsing behavior when no format is specified.

Code suggestion

Check the AI-generated fix before applying

Suggested change

     df[col.col_label],
     utc=False,
     format=None,
     errors="coerce",
-    exact=False,
+    exact=True,
 )

Code Review Run #5caaa0


mistercrunch

LGTM - a bit hard to review because this is so edge-casy, and kind of making the point that pandas is more intended for REPL-type use cases than in-production use cases. Approving to help stop the bleeding short term in logs while minimizing impact on current behavior.

Longer term, it seems date-format detection should be limited to DatasetEditor sync use cases: when people add a dataset or "sync" its columns/metadata, we'd run detection once on a sample and carry that forward when dealing with result sets, and probably disable auto-detection at query-result parsing time.
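The longer-term idea in this comment, detecting once at sync time and reusing the result at query time, could look roughly like the sketch below. Everything here is hypothetical and stdlib-only: `column_formats`, `sync_column`, and `parse_results` are illustrative names, not Superset APIs, and a plain dict stands in for wherever dataset metadata would be stored.

```python
from datetime import datetime

# Hypothetical per-column store, filled once when a dataset is added or synced.
column_formats = {}


def _parses(values, fmt):
    """Return True if every value parses with `fmt`."""
    try:
        for value in values:
            datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


def sync_column(name, sample, candidates=("%Y-%m-%d", "%m/%d/%Y")):
    """Run detection once on a sample and remember the result."""
    detected = next((fmt for fmt in candidates if _parses(sample, fmt)), None)
    column_formats[name] = detected
    return detected


def parse_results(name, values):
    """Parse query results with the stored format; no detection at query time."""
    fmt = column_formats.get(name)
    if fmt is None:
        raise LookupError(f"no stored datetime format for column {name!r}")
    return [datetime.strptime(value, fmt) for value in values]


sync_column("order_date", ["2023-01-01", "2023-01-02"])
parsed = parse_results("order_date", ["2023-03-05"])
print(parsed[0].isoformat())
```

Under this arrangement, result-set parsing always has an explicit format, so the inference path that produces the warning would not be reached for synced columns.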

aminghadersohi added 2 commits September 6, 2025 20:08

Merge remote-tracking branch 'upstream/master' into fix/pandas-dateti…

81201ca

…me-warning-suppression

pull-request-size bot added the size/XL label Sep 7, 2025

mistercrunch reviewed Sep 7, 2025


aminghadersohi added 2 commits September 7, 2025 15:10

pull-request-size bot added size/L and removed size/XL labels Sep 7, 2025

test: Add coverage for epoch format ValueError handling

d64dba4

- Add test for invalid epoch values that trigger ValueError
- Covers the logger.warning path in _process_datetime_column
- Improves patch coverage to address Codecov report

test: Add test coverage for uncovered pandas datetime parsing lines

37fe60f

- Add test for empty series detection in detect_datetime_format
- Add test for ValueError handling in datetime conversion
- Improves coverage for lines 50-51 in pandas.py and 1887-88 in core.py

aminghadersohi marked this pull request as ready for review September 8, 2025 15:47

dosubot bot added the change:backend Requires changing the backend label Sep 8, 2025

korbit-ai bot reviewed Sep 8, 2025


bito-code-review bot suggested changes Sep 8, 2025


mistercrunch approved these changes Sep 9, 2025


mistercrunch merged commit 15e4e8d into apache:master Sep 9, 2025
66 of 107 checks passed
