Conversation

aminghadersohi
Contributor

@aminghadersohi aminghadersohi commented Sep 7, 2025

SUMMARY

This PR fixes high-volume pandas warnings that appear in production logs when datetime columns are parsed without an explicit format. The warning "Could not infer format, so each element will be parsed individually, falling back to dateutil" was flooding our monitoring systems (500k+ instances in 15 minutes) and masking other important issues.

Root Cause:
When normalize_dttm_col() processes a datetime column without an explicit format, pandas attempts to infer one. If inference fails (because of mixed formats or ambiguous data), pandas falls back to element-by-element parsing with dateutil and emits the warning on each such call.

Solution:

  1. Format Detection: Added a detect_datetime_format() function that samples up to 100 non-null rows to detect common date formats (ISO, US, EU, etc.)
  2. Vectorized Parsing: When format is detected, use it explicitly for ~5x faster vectorized parsing
  3. Warning Suppression: When formats are mixed/ambiguous, suppress the warning while maintaining functionality
  4. Code Organization: Moved detect_datetime_format() to new superset/utils/pandas.py module (per reviewer feedback about core.py being massive)
  5. Code Refactoring: Extracted logic into _process_datetime_column() helper to reduce complexity
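
The detection step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it uses the standard library's `datetime.strptime` as a stand-in for `pd.to_datetime`, and the name `detect_format_sketch` and the shortened `CANDIDATE_FORMATS` list are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical, shortened candidate list; the PR's list is longer.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%m/%d/%Y",
    "%d/%m/%Y",
]


def detect_format_sketch(values, sample_size=100):
    """Return the first candidate format matching every sampled value, else None."""
    # Drop missing values, then sample from the head of the column.
    sample = [v for v in values if v is not None][:sample_size]
    if not sample:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in sample:
                datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            continue  # at least one value did not match; try the next format
        return fmt
    return None


print(detect_format_sketch(["2023-01-01", "2023-01-02", None]))
print(detect_format_sketch(["2023-01-01", "01/02/2023", "March 3, 2023"]))
```

When a format is returned, the caller can pass it explicitly to the parser; when `None` is returned, the caller falls back to flexible parsing with the warning suppressed.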

Performance Impact:

  • Consistent date formats: ~5x faster due to vectorized parsing
  • Mixed formats: Same speed but no warning spam
  • Detection overhead: Negligible (only samples 100 rows)
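
For the mixed-format case, the "no warning spam" behavior could look something like the sketch below. It relies only on the standard `warnings` module; `noisy_parse` is a hypothetical stand-in for a parser that emits the pandas warning, and `parse_quietly` is an illustrative name, not a function from the PR.

```python
import warnings


def noisy_parse(values):
    """Hypothetical stand-in for a parser that warns when it cannot infer a format."""
    warnings.warn(
        "Could not infer format, so each element will be parsed individually, "
        "falling back to `dateutil`.",
        UserWarning,
    )
    return list(values)


def parse_quietly(values):
    """Run the parser with only the 'Could not infer format' warning filtered out."""
    with warnings.catch_warnings():
        # The filter is scoped to this block; other warnings still surface.
        warnings.filterwarnings(
            "ignore",
            message=".*Could not infer format.*",
            category=UserWarning,
        )
        return noisy_parse(values)
```

Filtering on the message text, rather than silencing every `UserWarning`, keeps unrelated warnings visible in the logs.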

This approach aligns with pandas 2.0+ default behavior and industry best practices for datetime parsing at scale.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before (Datadog logs):

WARNING | superset.utils.core:1698 | UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
[Repeated hundreds of times per hour]

After:

[No warnings - clean logs]

Performance Comparison (10k rows):

Before: 4.9ms (element-by-element parsing)
After:  0.9ms (vectorized parsing with format detection)
Speedup: 5.4x

TESTING INSTRUCTIONS

  1. Run the comprehensive test suite:

    pytest tests/unit_tests/utils/test_date_parsing.py -v

    This includes:

    • Format detection tests
    • Warning suppression verification
    • Performance comparisons
    • Edge case handling
  2. Manual testing with sample data:

    import pandas as pd
    from superset.utils.core import normalize_dttm_col, DateColumn
    
    # Test with consistent format (should be fast, no warnings)
    df = pd.DataFrame({
        "date": ["2023-01-01", "2023-01-02", "2023-01-03"]
    })
    normalize_dttm_col(df, (DateColumn(col_label="date"),))
    
    # Test with mixed formats (should suppress warnings)
    df = pd.DataFrame({
        "date": ["2023-01-01", "01/02/2023", "March 3, 2023"]
    })
    normalize_dttm_col(df, (DateColumn(col_label="date"),))
  3. Verify in a running Superset instance:

    • Create a chart with datetime columns
    • Check logs for absence of "Could not infer format" warnings
    • Verify dates are parsed correctly
  4. Check existing functionality:

    • Epoch timestamps still work: DateColumn(timestamp_format="epoch_s")
    • Explicit formats still work: DateColumn(timestamp_format="%Y-%m-%d")
    • Timezone offsets still applied correctly
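
As a rough mental model for the checks in step 4, the three paths (epoch seconds, explicit format, hour offset) behave like the standard-library sketch below. The function `parse_value` and its `offset_hours` parameter are hypothetical and only illustrate the expected results; Superset's own handling goes through normalize_dttm_col() and pandas.

```python
from datetime import datetime, timedelta, timezone


def parse_value(value, timestamp_format, offset_hours=0):
    """Parse one value from epoch seconds or an explicit strptime format."""
    if timestamp_format == "epoch_s":
        # Epoch seconds are interpreted as UTC.
        parsed = datetime.fromtimestamp(float(value), tz=timezone.utc)
    else:
        parsed = datetime.strptime(value, timestamp_format)
    # Hypothetical hour offset applied after parsing.
    return parsed + timedelta(hours=offset_hours)


print(parse_value(1672531200, "epoch_s"))
print(parse_value("2023-01-01", "%Y-%m-%d"))
print(parse_value("2023-01-01", "%Y-%m-%d", offset_hours=2))
```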

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

Notes:

  • No breaking changes - all existing functionality preserved
  • Follows similar approach to pandas 2.0+ built-in behavior
  • Aligns with Superset's SIP-15A proposal for datetime format inference
  • All pre-commit hooks pass (mypy, ruff, pylint)

Fixes high-volume pandas warnings in production logs: "Could not infer format,
so each element will be parsed individually, falling back to `dateutil`"

- Added detect_datetime_format() to detect common date formats from data samples
- When format is detected, use it explicitly (prevents warning, ~5x faster)
- When format can't be detected (mixed formats), suppress the warning
- Refactored into _process_datetime_column() helper to reduce complexity

This approach aligns with pandas 2.0+ default behavior and industry best practices
for handling datetime parsing at scale.

korbit-ai bot commented Sep 7, 2025

Based on your review schedule, I'll hold off on reviewing this PR until it's marked as ready for review. If you'd like me to take a look now, comment /korbit-review.

Your admin can change your review schedule in the Korbit Console

Contributor

Bito Automatic Review Skipped - Draft PR

Bito didn't auto-review because this pull request is in draft status.
No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.
You can change draft PR review settings here, or contact your Bito workspace admin at [email protected].

@@ -1858,6 +1859,112 @@ def get_legacy_time_column(
)


def detect_datetime_format(series: pd.Series, sample_size: int = 100) -> str | None:
Member

@mistercrunch mistercrunch Sep 7, 2025

Love the idea of factoring this out. Given that superset/utils/core.py is massive already, I'm wondering where it would best fit. Maybe somewhere under superset/utils/: I see there are a lot of pandas helpers in superset/utils/pandas_postprocessing, so it could go there, or maybe in a new superset/utils/pandas.py.

codecov bot commented Sep 7, 2025

Codecov Report

❌ Patch coverage is 88.88889% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.16%. Comparing base (0fce5ec) to head (37fe60f).
⚠️ Report is 5 commits behind head on master.

Files with missing lines    Patch %   Lines
superset/utils/core.py      89.47%    2 Missing ⚠️
superset/utils/pandas.py    88.23%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #35042       +/-   ##
===========================================
+ Coverage        0   72.16%   +72.16%     
===========================================
  Files           0      587      +587     
  Lines           0    42947    +42947     
  Branches        0     4550     +4550     
===========================================
+ Hits            0    30991    +30991     
- Misses          0    10749    +10749     
- Partials        0     1207     +1207     
Flag       Coverage Δ
hive       46.60% <13.88%> (?)
mysql      71.17% <69.44%> (?)
postgres   71.22% <69.44%> (?)
presto     50.30% <69.44%> (?)
python     72.12% <88.88%> (?)
sqlite     70.81% <88.88%> (?)
unit       100.00% <ø> (?)


- Add format detection to avoid pandas warning spam in production logs
- Extract detect_datetime_format to new superset/utils/pandas.py module
- Suppress warnings when format cannot be detected (mixed formats)
- Improve performance by 5x for consistent date formats
- Add comprehensive tests for warning suppression and format detection
- Reduce test file from 400+ to 150 lines
- Remove verbose helper classes and overly complex test scenarios
- Use pytest parametrize for cleaner test organization
- Remove performance comparison tests (not essential)
- Simplify test names and documentation
- Keep only essential test coverage
@pull-request-size pull-request-size bot added size/L and removed size/XL labels Sep 7, 2025
- Add test for invalid epoch values that trigger ValueError
- Covers the logger.warning path in _process_datetime_column
- Improves patch coverage to address Codecov report

- Add test for empty series detection in detect_datetime_format
- Add test for ValueError handling in datetime conversion
- Improves coverage for lines 50-51 in pandas.py and 1887-88 in core.py

@aminghadersohi aminghadersohi marked this pull request as ready for review September 8, 2025 15:47
@dosubot dosubot bot added the change:backend Requires changing the backend label Sep 8, 2025
@korbit-ai korbit-ai bot left a comment

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category        Issue                                                      Status
Functionality   Sample-Based Format Detection May Miss Variations          Not in standard
Performance     Inefficient Row-by-Row Timestamp Processing                Not in scope
Design          Hardcoded datetime formats violate Open-Closed Principle   Not in standard
Files scanned
File Path Reviewed
superset/utils/pandas.py
superset/utils/core.py


]

# Get non-null sample
sample = series.dropna().head(sample_size)

This comment was marked as resolved.
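
The Korbit finding about sample-based detection can be illustrated with a small standard-library sketch: a format that fits the first 100 rows may still fail on a later row. The helper `matches_all` is hypothetical and uses `datetime.strptime` in place of `pd.to_datetime`.

```python
from datetime import datetime


def matches_all(values, fmt):
    """Return True if every value parses with the given strptime format."""
    try:
        for value in values:
            datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


# 100 ISO dates followed by one US-style date that lies outside the sample window.
column = ["2023-01-01"] * 100 + ["01/02/2023"]
sample = column[:100]

print(matches_all(sample, "%Y-%m-%d"))  # the sample looks uniformly ISO
print(matches_all(column, "%Y-%m-%d"))  # the full column does not
```

This is why a detected format is best applied with an error policy that tolerates stragglers instead of assuming the whole column matches.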

Comment on lines +31 to +46
common_formats = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S.%fZ",
    "%m/%d/%Y",
    "%d/%m/%Y",
    "%Y/%m/%d",
    "%m/%d/%Y %H:%M:%S",
    "%d/%m/%Y %H:%M:%S",
    "%m-%d-%Y",
    "%d-%m-%Y",
    "%Y%m%d",
]

This comment was marked as resolved.
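
One consequence of an ordered candidate list is that ambiguous values resolve to whichever format comes first. The sketch below, which uses `datetime.strptime` and a hypothetical helper `matching_formats`, shows that a value such as "01/02/2023" fits both the month-first and day-first patterns, while "13/02/2023" fits only the day-first one.

```python
from datetime import datetime


def matching_formats(value, candidates):
    """Return every candidate format that parses the value."""
    matches = []
    for fmt in candidates:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        matches.append(fmt)
    return matches


candidates = ["%m/%d/%Y", "%d/%m/%Y"]
print(matching_formats("01/02/2023", candidates))  # both formats match
print(matching_formats("13/02/2023", candidates))  # only day-first matches
```

With the list order shown in the diff, a column whose sampled days are all 12 or lower would be read as month-first, even if the data is actually day-first.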

Comment on lines +1884 to +1886
df[col.col_label] = dttm_series.apply(
    lambda x: pd.Timestamp(x) if pd.notna(x) else pd.NaT
)

This comment was marked as resolved.

Contributor

@bito-code-review bito-code-review bot left a comment

Code Review Agent Run #5caaa0

Actionable Suggestions - 1
  • superset/utils/core.py - 1
Additional Suggestions - 1
  • superset/utils/pandas.py - 1
    • Performance optimization needed · Line 60-60
      The current implementation performs redundant double parsing validation. After confirming a format works on a 10-row test sample, it unnecessarily re-parses the entire sample (up to 100 rows) with the same format. This creates computational overhead in `detect_datetime_format` which is called by `superset.utils.core` during DataFrame processing. The optimization removes the redundant second parsing while maintaining the same validation accuracy.
      Code suggestion
       @@ -57,12 +57,8 @@
      -    # Try each format
      -    for fmt in common_formats:
      -        try:
      -            # Test on small sample first
      -            test_sample = sample.head(10)
      -            pd.to_datetime(test_sample, format=fmt, errors="raise")
      -            # If successful, verify on larger sample
      -            pd.to_datetime(sample, format=fmt, errors="raise")
      -            return fmt
      -        except (ValueError, TypeError):
      -            continue
      +    # Try each format
      +    for fmt in common_formats:
      +        try:
      +            pd.to_datetime(sample, format=fmt, errors="raise")
      +            return fmt
      +        except (ValueError, TypeError):
      +            continue
Review Details
  • Files reviewed - 3 · Commit Range: afc6bc4..37fe60f
    • superset/utils/core.py
    • superset/utils/pandas.py
    • tests/unit_tests/utils/test_date_parsing.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful

Comment on lines +1911 to +1916
df[col.col_label],
utc=False,
format=None,
errors="coerce",
exact=False,
)
Contributor

Inconsistent datetime parsing

The datetime parsing fallback behavior when no format is detected uses exact=False which can lead to inconsistent parsing behavior and potential data corruption. When detect_datetime_format returns None (indicating no consistent format was found), the current implementation allows pandas to infer formats flexibly, which can result in different parsing outcomes for the same data across different contexts. This affects downstream consumers like normalize_dttm_col -> _process_datetime_column -> pandas.to_datetime. Change exact=False to exact=True to ensure consistent parsing behavior when no format is specified.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- df[col.col_label],
- utc=False,
- format=None,
- errors="coerce",
- exact=False,
- )
+ df[col.col_label],
+ utc=False,
+ format=None,
+ errors="coerce",
+ exact=True,
+ )

Code Review Run #5caaa0

Member

@mistercrunch mistercrunch left a comment

LGTM - a bit hard to review because this is so edge-casy, and kind of making the point that pandas is more intended for REPL-type use cases than in-production use cases. Approving to help stop the bleeding short term in logs while minimizing impact on current behavior.

Longer term, it seems date-format detection should be limited to DatasetEditor sync use cases: when people add a dataset or "sync" its columns/metadata, we'd run detection once on a sample, carry that forward when dealing with result sets, and probably disable auto-detection at query result-time parsing.
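
A rough sketch of that "detect once, carry forward" idea is shown below. Everything here is hypothetical: the class `ColumnFormatCache`, its methods, and the use of `datetime.strptime` are illustrative only and do not correspond to existing Superset code.

```python
from datetime import datetime


class ColumnFormatCache:
    """Hypothetical per-column format store, filled once at dataset sync time."""

    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.formats = {}

    def sync(self, column, sample):
        """Detect a format from a sample and remember it (None if nothing fits)."""
        self.formats[column] = None
        if not sample:
            return None
        for fmt in self.candidates:
            try:
                for value in sample:
                    datetime.strptime(value, fmt)
            except ValueError:
                continue
            self.formats[column] = fmt
            break
        return self.formats[column]

    def parse(self, column, values):
        """Parse query results with the stored format; no detection at this stage."""
        fmt = self.formats.get(column)
        if fmt is None:
            raise LookupError(f"no stored format for column {column!r}")
        return [datetime.strptime(value, fmt) for value in values]


cache = ColumnFormatCache(["%Y-%m-%d", "%m/%d/%Y"])
cache.sync("ds", ["2023-01-01", "2023-01-02"])
print(cache.parse("ds", ["2023-03-05"]))
```

Detection cost and ambiguity are then paid once per dataset sync instead of on every query result.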

@mistercrunch mistercrunch merged commit 15e4e8d into apache:master Sep 9, 2025
66 of 107 checks passed
Labels
change:backend Requires changing the backend size/L

2 participants