Column type coercion is applied before DataFrame-wide parser #1903

c-pletinckx · 2025-01-31T09:34:45Z

Describe the bug
Hello,

I discovered pandera a couple of days ago and have run into what looks like a bug in my implementation. A full working example is provided below.

I use pandera to validate a DataFrame that I receive from an external data source. The size of the received DataFrame may vary. In my example, we are interested in the SINSGA column. In the DataFrame I receive (as a CSV file), empty values can be denoted as zero-length strings (""), whitespace-only strings (" "), or string representations of NaN (i.e. "nan").

To standardize the representation of empty values, I define my DataFrameModel with a DataFrame-wide parser that replaces these representations with pd.NA. As I understand the documentation, DataFrame-wide parsers are applied first, before column parsers and checks, so my parser should run before the type coercion of the SINSGA column.

However, it seems the coercion is applied before my parser: the line "In parser" is never written to standard output, and the error shown under "Additional context" is thrown.

[V] I have checked that this issue has not already been reported.
[V] I have confirmed this bug exists on the latest version of pandera.
[V] (optional) I have confirmed this bug exists on the main branch of pandera.


Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame


class RawSINTableSchema(pa.DataFrameModel):
    SINCDE: int = pa.Field(ge=0, nullable=False, unique=True)
    SINGAR: int = pa.Field(isin=[1, 2, 3, 4, 5], nullable=False)
    SINSGA: pd.Int64Dtype = pa.Field(isin=[100, 101, 102], nullable=True, coerce=True)

    @pa.dataframe_parser
    @classmethod
    def replace_empty_with_na(cls, df: pd.DataFrame) -> pd.DataFrame:
        print("In parser")
        return df.replace(["", " ", "nan"], pd.NA)


DataFrame[RawSINTableSchema](
    {
        "SINCDE": [1, 2, 3, 4, 5],
        "SINGAR": [1, 2, 3, 3, 2],
        "SINSGA": ["", " ", "nan", 100, ""],
    }
)
print("END")

Expected behavior

I expect the DataFrame-wide parser to be executed before column-level parsing, including data type coercion, as specified here:

You can specify both dataframe- and column-level parsers, where dataframe-level parsers are performed before column-level parsers. Assuming that a schema contains parsers and checks, the validation process consists of the following steps:

1. dataframe-level parsing
2. column-level parsing
3. dataframe-level checks
4. column-level and index-level checks

Did I miss anything?

Desktop (please complete the following information):

OS: macOS
Python: 3.10.12 (within a poetry environment created with poetry version 1.7.1)
Pandera: 0.22.1
Pandas: 2.2.3

Additional context

Error thrown:

Traceback (most recent call last):
  File "/Users/cpletinckx/Documents/ML-project/test.py", line 18, in <module>
    DataFrame[RawSINTableSchema](
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/typing/common.py", line 129, in __patched_generic_alias_call
    result.__orig_class__ = self
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/typing/common.py", line 181, in __setattr__
    self.__dict__ = schema_model.validate(self).__dict__
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/dataframe/model.py", line 289, in validate
    cls.to_schema().validate(
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/pandas/container.py", line 126, in validate
    return self._validate(
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/pandas/container.py", line 147, in _validate
    return self.get_backend(check_obj).validate(
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/backends/pandas/container.py", line 89, in validate
    error_handler.collect_errors(exc.schema_errors)
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/base/error_handler.py", line 99, in collect_errors
    self.collect_error(
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/base/error_handler.py", line 54, in collect_error
    raise schema_error from original_exc
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/backends/pandas/container.py", line 646, in _try_coercion
    return coerce_fn(obj)
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/dataframe/components.py", line 118, in coerce_dtype
    return self.get_backend(check_obj).coerce_dtype(check_obj, schema=self)
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/backends/pandas/components.py", line 217, in coerce_dtype
    return super(ColumnBackend, self).coerce_dtype(
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/backends/pandas/array.py", line 169, in coerce_dtype
    raise SchemaError(
pandera.errors.SchemaError: Error while coercing 'SINSGA' to type Int64: Could not coerce <class 'pandas.core.series.Series'> data_container into type Int64:
   index failure_case
0      0             
1      1             
2      2          nan
3      4
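
The failure cases listed above are the raw empty-value markers, which suggests the coercion saw the data before the parser could replace them. A minimal sketch using pandas alone (no pandera involved) illustrates why the order matters: casting the raw column to Int64 fails, while the same cast succeeds once the markers have been replaced with pd.NA. The names EMPTY_MARKERS and can_coerce_to_int64 are hypothetical helpers introduced only for this illustration.

```python
import pandas as pd

# Hypothetical constant: the empty-value markers described in this issue.
EMPTY_MARKERS = ["", " ", "nan"]


def can_coerce_to_int64(series: pd.Series) -> bool:
    """Return True if the series can be cast to the nullable Int64 dtype."""
    try:
        series.astype("Int64")
    except (TypeError, ValueError):
        return False
    return True


# Same SINSGA values as in the code sample above.
raw = pd.DataFrame({"SINSGA": ["", " ", "nan", 100, ""]})

# The replacement the DataFrame-wide parser is meant to perform.
cleaned = raw.replace(EMPTY_MARKERS, pd.NA)

raw_ok = can_coerce_to_int64(raw["SINSGA"])          # cast attempted on raw strings
cleaned_ok = can_coerce_to_int64(cleaned["SINSGA"])  # cast attempted after cleaning
coerced = cleaned["SINSGA"].astype("Int64")

print("raw coercible:", raw_ok)
print("cleaned coercible:", cleaned_ok)
print(coerced)
```

As a possible interim workaround (an assumption, not a confirmed fix), the replacement could be applied to the DataFrame before it is handed to the schema, for example by passing the cleaned frame to RawSINTableSchema.validate(...), so that coercion only ever sees pd.NA in place of the markers.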


cosmicBboy · 2025-01-31T21:07:02Z

weird, looking

c-pletinckx added the "bug" (Something isn't working) label on Jan 31, 2025
