Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
When pd.to_datetime is called with an explicit format parameter, it requires every value in the input series/column to match that exact format. If even a single value deviates (e.g., due to dirty or corrupted data), the function either raises an error (with errors='raise', the default) or converts the problematic values to NaT (with errors='coerce'). This often forces users to clean or validate the data manually before calling to_datetime, which is cumbersome and interrupts the data-processing workflow, especially when the user believes the majority of the data follows the specified format.
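For illustration, a minimal example of the current behavior (sample data is made up):

```python
import pandas as pd

s = pd.Series(["2024-01-01", "2024-01-02", "not-a-date"])

# Default errors='raise': a single malformed entry aborts the whole conversion
try:
    pd.to_datetime(s, format="%Y-%m-%d")
except ValueError as exc:
    print(exc)

# errors='coerce': the malformed entry silently becomes NaT
print(pd.to_datetime(s, format="%Y-%m-%d", errors="coerce"))
```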
Feature Description
I propose modifying the internal behavior of pd.to_datetime when a format parameter is explicitly provided. The goal is to make the function more robust to a small proportion of dirty data that does not match the specified format, while still catching cases where the format is fundamentally incorrect for the data. This change would be internal and would not modify the errors parameter's existing behavior.
The suggested internal logic would be:
- Check for Explicit Format: If the `format` argument is provided (not `None` or `'mixed'`), proceed to the next steps. Otherwise, maintain the current behavior.
- Initial Parsing Attempt: Call the internal datetime parsing logic using the provided `format` and `errors='coerce'`. This produces an intermediate result series, let's call it `temp_result`.
- Calculate Success Rate: Calculate the proportion of non-NA/NaT values in `temp_result` compared to the total number of input values.
- Apply Threshold Logic:
  - If the success rate (the proportion of non-NaT values in `temp_result`) is greater than a defined threshold (e.g., 50%): return `temp_result`. The specified `format` is considered appropriate for the majority of the data, and the few entries that failed to parse are marked as `NaT`.
  - If the success rate is less than or equal to the threshold: revert to the current behavior, which is equivalent to `errors='raise'`. An error should be raised, indicating that the data does not match the specified format (or only a very small minority does). This ensures strictness when the format is likely incorrect.
Pseudocode for the core logic:
```python
def to_datetime(..., format=None, errors='raise', ...):
    # ... (existing validation and setup code) ...

    if format is not None and format != 'mixed':
        # --- New robust logic starts here ---

        # Step 1: Parse with the provided format using errors='coerce'
        temp_result = _internal_parse_function(input_arg, format=format, errors='coerce')

        # Step 2: Calculate the success rate
        total_values = len(temp_result)
        successful_parses = temp_result.notna().sum()  # count non-NaT values
        success_rate = successful_parses / total_values

        # Step 3: Apply the threshold (e.g., 50%)
        THRESHOLD = 0.5  # this value might need discussion/configurability later
        if success_rate > THRESHOLD:
            # Most data matched the format; return with NaT for the failures
            return temp_result
        else:
            # Too little data matched; raise an error like errors='raise'.
            # This would likely involve re-parsing with errors='raise' or
            # raising a specific error derived from temp_result's NaT locations.
            return _internal_parse_function(input_arg, format=format, errors='raise')  # or equivalent error logic

        # --- New robust logic ends here ---
    else:
        # ... (existing logic for format=None, format='mixed', etc.) ...
        return _internal_parse_function(input_arg, format=format, errors=errors)
```

This approach aims to make `to_datetime` more robust to small amounts of dirty data when a specific format is expected, reducing the need for pre-processing, while still catching cases where the format is fundamentally wrong.
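For concreteness, this is how a call might behave under the proposal (hypothetical behavior and made-up data, not current pandas):

```python
import pandas as pd

s = pd.Series(["2024-01-01", "2024-01-02", "oops"])

# Two of three values parse (about 67%), above the 50% threshold, so under
# the proposal this call would succeed and the failing entry would become NaT.
pd.to_datetime(s, format="%Y-%m-%d")

# If the success rate were at or below the threshold, the call would still
# raise, just as errors='raise' does today.
```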
Alternative Solutions
Currently, users must handle this scenario manually:
- Pre-process the data to clean or identify/replace problematic entries before calling `pd.to_datetime`.
- Call `pd.to_datetime` with `errors='coerce'`, check the number of `NaT` values generated, and decide if the result is acceptable or if further action is needed (sketched below).
- Use `pd.to_datetime` without a `format` (relying on the default flexible parser), which might be slower and less precise if the user knows the intended format.

The proposed solution would encapsulate this common pattern of "try strict format, check success, decide action" into the function itself, based only on the user providing an explicit format.
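A rough sketch of that manual pattern as it can be written today (the helper name and default threshold are illustrative, not part of any pandas API):

```python
import pandas as pd

def parse_with_expected_format(values: pd.Series, fmt: str, threshold: float = 0.5):
    """Parse with a strict format; keep the result only if enough values matched,
    otherwise fall back to the strict (raising) behavior."""
    coerced = pd.to_datetime(values, format=fmt, errors="coerce")
    success_rate = coerced.notna().mean()
    if success_rate > threshold:
        return coerced  # a small number of bad entries become NaT
    # Too few values matched: re-parse strictly so the usual error surfaces
    return pd.to_datetime(values, format=fmt, errors="raise")
```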
Additional Context
This enhancement would primarily benefit data analysts and engineers working with potentially messy datasets where a datetime column is expected to follow a specific format, but a few entries might be incorrect. It streamlines the common workflow of parsing datetimes with an expected format while gracefully handling minor inconsistencies. The change is localized to the scenario where format is specified, minimizing the impact on other existing behaviors.
A potential future optimization could involve an initial pass to assess the overall "date-likeness" or format compliance of the data before attempting the main parsing logic. Such a pre-check could potentially offer even better performance or more precise error reporting in cases where the data is largely non-conforming. However, this initial implementation focuses on the simpler threshold-based approach described above, which provides immediate benefits with a more contained change. The pre-check idea could be explored in a follow-up enhancement if deemed necessary.
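Purely as an illustration of that pre-check idea (the function name and regex are hypothetical; a real implementation would need to derive the pattern from the format string):

```python
import pandas as pd

# Hypothetical pre-check: count how many raw values even loosely resemble the
# expected layout before running the full parser. The pattern below assumes
# format="%Y-%m-%d".
def looks_mostly_date_like(values: pd.Series, threshold: float = 0.5) -> bool:
    matches = values.astype(str).str.fullmatch(r"\d{4}-\d{2}-\d{2}")
    return bool(matches.mean() > threshold)
```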