Column type coercion is applied before DataFrame-wide parser #1903

Open
c-pletinckx opened this issue Jan 31, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@c-pletinckx

Describe the bug
Hello,

I discovered pandera a couple of days ago and ran into the following bug in my implementation. A complete working example is provided below.

I use pandera to validate a DataFrame that I receive from an external data source. The size of the received DataFrame may vary. In my example, we are interested in the SINSGA column. In the DataFrame I receive (as a CSV file), empty values can be denoted as zero-length strings (""), whitespace-only strings (" "), or string representations of NaN (i.e. "nan").

To standardize the representation of empty values, I first define my DataFrameModel with a DataFrame-wide parser that replaces these representations with pd.NA. As I understand from the documentation, DataFrame-wide parsers are applied first, before column parsers and checks, so my parser should run before the type coercion of the SINSGA column.
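
As a standalone illustration of the intended order, here is a minimal sketch in plain pandas (no pandera involved). It only shows that normalizing the empty markers first makes the later conversion to the nullable Int64 dtype possible; the variable names are illustrative and not part of the schema.

```python
import pandas as pd

# The same SINSGA values as in the report: three kinds of "empty" plus one integer.
raw = pd.DataFrame({"SINSGA": ["", " ", "nan", 100, ""]})

# Step 1: normalize every empty marker to pd.NA (what the parser is meant to do).
cleaned = raw.replace(["", " ", "nan"], pd.NA)

# Step 2: only after normalization, convert to the nullable integer dtype.
coerced = cleaned["SINSGA"].astype("Int64")

print(coerced)
```

If the two steps run in this order, the column ends up with a single valid value and four missing values, which is what the schema's nullable=True field is meant to accept.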

However, it seems the coercion is applied before my parser: the line "In parser" is never written to standard output, and the error shown under "Additional context" is thrown.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • [x] (optional) I have confirmed this bug exists on the main branch of pandera.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame


class RawSINTableSchema(pa.DataFrameModel):
    SINCDE: int = pa.Field(ge=0, nullable=False, unique=True)
    SINGAR: int = pa.Field(isin=[1, 2, 3, 4, 5], nullable=False)
    SINSGA: pd.Int64Dtype = pa.Field(isin=[100, 101, 102], nullable=True, coerce=True)

    @pa.dataframe_parser
    @classmethod
    def replace_empty_with_na(cls, df: pd.DataFrame) -> pd.DataFrame:
        print("In parser")
        return df.replace(["", " ", "nan"], pd.NA)


DataFrame[RawSINTableSchema](
    {
        "SINCDE": [1, 2, 3, 4, 5],
        "SINGAR": [1, 2, 3, 3, 2],
        "SINSGA": ["", " ", "nan", 100, ""],
    }
)
print("END")

Expected behavior

I expect the DataFrame-wide parser to be executed before column-level parsing, including data type coercion, as specified in the documentation:

You can specify both dataframe- and column-level parsers, where dataframe-level parsers are performed before column-level parsers. Assuming that a schema contains parsers and checks, the validation process consists of the following steps:

  1. dataframe-level parsing
  2. column-level parsing
  3. dataframe-level checks
  4. column-level and index-level checks

Did I miss anything?
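
To make the difference between the documented order and the observed order concrete, here is a toy model in plain Python. It is not pandera's implementation; the functions frame_parser, coerce_int, and run are hypothetical stand-ins for a dataframe-level parser, a column-level coercion, and the validation pipeline.

```python
EMPTY_MARKERS = {"", " ", "nan"}


def frame_parser(rows):
    """Table-wide step: replace every empty marker with None."""
    return [
        {key: (None if value in EMPTY_MARKERS else value) for key, value in row.items()}
        for row in rows
    ]


def coerce_int(rows, column):
    """Column-level step: convert a column to int, leaving None untouched."""
    return [
        {**row, column: None if row[column] is None else int(row[column])}
        for row in rows
    ]


def run(rows, steps):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        rows = step(rows)
    return rows


rows = [{"SINSGA": value} for value in ["", " ", "nan", 100, ""]]

# Documented order: table-wide parsing first, then column-level coercion.
documented = run(rows, [frame_parser, lambda r: coerce_int(r, "SINSGA")])
print([row["SINSGA"] for row in documented])

# Observed order: coercion runs first and fails before the parser is reached.
try:
    run(rows, [lambda r: coerce_int(r, "SINSGA"), frame_parser])
    observed_error = None
except ValueError as exc:
    observed_error = str(exc)
print(observed_error)
```

In this toy model the second ordering never reaches frame_parser, which matches the symptom in the report: "In parser" is never printed and the failure is raised from the coercion step.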

Desktop (please complete the following information):

  • OS: iOS
  • Python: 3.10.12 (within a Poetry environment created with Poetry version 1.7.1)
  • Pandera: 0.22.1
  • Pandas: 2.2.3

Additional context

Error thrown:

Traceback (most recent call last):
  File "/Users/cpletinckx/Documents/ML-project/test.py", line 18, in <module>
    DataFrame[RawSINTableSchema](
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/typing/common.py", line 129, in __patched_generic_alias_call
    result.__orig_class__ = self
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/typing/common.py", line 181, in __setattr__
    self.__dict__ = schema_model.validate(self).__dict__
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/dataframe/model.py", line 289, in validate
    cls.to_schema().validate(
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/pandas/container.py", line 126, in validate
    return self._validate(
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/pandas/container.py", line 147, in _validate
    return self.get_backend(check_obj).validate(
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/backends/pandas/container.py", line 89, in validate
    error_handler.collect_errors(exc.schema_errors)
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/base/error_handler.py", line 99, in collect_errors
    self.collect_error(
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/base/error_handler.py", line 54, in collect_error
    raise schema_error from original_exc
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/backends/pandas/container.py", line 646, in _try_coercion
    return coerce_fn(obj)
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/api/dataframe/components.py", line 118, in coerce_dtype
    return self.get_backend(check_obj).coerce_dtype(check_obj, schema=self)
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/backends/pandas/components.py", line 217, in coerce_dtype
    return super(ColumnBackend, self).coerce_dtype(
  File "/Users/cpletinckx/Documents/ML-project/.venv/lib/python3.10/site-packages/pandera/backends/pandas/array.py", line 169, in coerce_dtype
    raise SchemaError(
pandera.errors.SchemaError: Error while coercing 'SINSGA' to type Int64: Could not coerce <class 'pandas.core.series.Series'> data_container into type Int64:
   index failure_case
0      0             
1      1             
2      2          nan
3      4             
@c-pletinckx c-pletinckx added the bug Something isn't working label Jan 31, 2025
@cosmicBboy
Collaborator

weird, looking
