Pivoted CSV export fix when CSV_EXPORT values are not default ones #30961

Open
wants to merge 2 commits into
base: master

@frlm frlm commented Nov 18, 2024

Title: fix(csv_export): use custom CSV_EXPORT parameters in pd.read_csv

Bug description

Function: apply_post_process

pd.read_csv is called with the pandas defaults instead of the parameters defined in CSV_EXPORT in superset_config. The problem goes unnoticed when the separator is "," and the decimal mark is ".", because those are also the pandas defaults. With a configuration such as CSV_EXPORT = {"encoding": "utf-8", "sep": ";", "decimal": ","}, however, the exported CSV is no longer parsed correctly. This change makes pd.read_csv use the parameters defined in CSV_EXPORT.

Steps to reproduce error:

  • Configure CSV_EXPORT with the following parameters:
    CSV_EXPORT = {
        "encoding": "utf-8",
        "sep": ";",
        "decimal": ","
    }
  • Open a default chart in Superset of the Pivot Table type. In this example, we are using Pivot Table v2 within the USA Births Names dashboard:

image

  • Click on Download > Export to Pivoted .CSV
    image

  • Download is blocked by an error.

Cause: The error comes from a malformed input DataFrame df. Because the data is split on commas rather than semicolons, df ends up with a single column in which all fields are still joined by the semicolon separator:

,state;name;sum__num
0,other;Michael;1047996
1,other;Christopher;803607
2,other;James;749686

Fix: Read the data with the settings defined in CSV_EXPORT (separator, encoding and decimal mark) instead of the pandas defaults.

Code Changes:

        elif query["result_format"] == ChartDataResultFormat.CSV:
            # Fall back to the pandas defaults when a key is missing from CSV_EXPORT.
            df = pd.read_csv(
                StringIO(data),
                delimiter=superset_config.CSV_EXPORT.get("sep", ","),
                encoding=superset_config.CSV_EXPORT.get("encoding", "utf-8"),
                decimal=superset_config.CSV_EXPORT.get("decimal", "."),
            )
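One way to keep the mapping from CSV_EXPORT to pd.read_csv in a single place is a small helper. The sketch below is not part of this PR: `read_csv_kwargs` is a hypothetical name, and the fallback values are assumed to be the pandas defaults, so that a CSV_EXPORT without "sep" or "decimal" keys never passes None to pd.read_csv.

```python
from io import StringIO
from typing import Any

import pandas as pd


def read_csv_kwargs(csv_export: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: derive pd.read_csv options from a CSV_EXPORT dict."""
    return {
        # Assumed fallbacks: the pandas defaults for both options.
        "sep": csv_export.get("sep", ","),
        "decimal": csv_export.get("decimal", "."),
    }


csv_export = {"encoding": "utf-8", "sep": ";", "decimal": ","}

# "0,5" is parsed as the float 0.5 because decimal="," is passed through.
df = pd.read_csv(StringIO("name;ratio\nMichael;0,5\n"), **read_csv_kwargs(csv_export))
print(df.dtypes)
```

The encoding entry is left out on purpose in this sketch: the data is already a decoded str wrapped in StringIO, so only the separator and decimal mark affect parsing.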

Complete Code

def apply_post_process(
    result: dict[Any, Any],
    form_data: Optional[dict[str, Any]] = None,
    datasource: Optional[Union["BaseDatasource", "Query"]] = None,
) -> dict[Any, Any]:
    form_data = form_data or {}

    viz_type = form_data.get("viz_type")
    if viz_type not in post_processors:
        return result

    post_processor = post_processors[viz_type]

    for query in result["queries"]:
        if query["result_format"] not in (rf.value for rf in ChartDataResultFormat):
            raise Exception(  # pylint: disable=broad-exception-raised
                f"Result format {query['result_format']} not supported"
            )

        data = query["data"]

        if isinstance(data, str):
            data = data.strip()

        if not data:
            # do not try to process empty data
            continue

        if query["result_format"] == ChartDataResultFormat.JSON:
            df = pd.DataFrame.from_dict(data)
        elif query["result_format"] == ChartDataResultFormat.CSV:
            # Fall back to the pandas defaults when a key is missing from CSV_EXPORT.
            df = pd.read_csv(
                StringIO(data),
                delimiter=superset_config.CSV_EXPORT.get("sep", ","),
                encoding=superset_config.CSV_EXPORT.get("encoding", "utf-8"),
                decimal=superset_config.CSV_EXPORT.get("decimal", "."),
            )

        # convert all columns to verbose (label) name
        if datasource:
            df.rename(columns=datasource.data["verbose_map"], inplace=True)

        processed_df = post_processor(df, form_data, datasource)

        query["colnames"] = list(processed_df.columns)
        query["indexnames"] = list(processed_df.index)
        query["coltypes"] = extract_dataframe_dtypes(processed_df, datasource)
        query["rowcount"] = len(processed_df.index)

        # Flatten hierarchical columns/index since they are represented as
        # `Tuple[str]`. Otherwise encoding to JSON later will fail because
        # maps cannot have tuples as their keys in JSON.
        processed_df.columns = [
            " ".join(str(name) for name in column).strip()
            if isinstance(column, tuple)
            else column
            for column in processed_df.columns
        ]
        processed_df.index = [
            " ".join(str(name) for name in index).strip()
            if isinstance(index, tuple)
            else index
            for index in processed_df.index
        ]

        if query["result_format"] == ChartDataResultFormat.JSON:
            query["data"] = processed_df.to_dict()
        elif query["result_format"] == ChartDataResultFormat.CSV:
            buf = StringIO()
            processed_df.to_csv(buf)
            buf.seek(0)
            query["data"] = buf.getvalue()

    return result
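For readers unfamiliar with the flattening step in the function above, here is a standalone sketch of what it does to a pivoted frame. The sample labels are hypothetical, and `flatten` is an illustrative name that wraps the same list comprehension used in apply_post_process.

```python
import pandas as pd

# Hypothetical pivoted frame with hierarchical (tuple) row and column labels.
df = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=pd.MultiIndex.from_tuples([("CA", "girl"), ("CA", "boy")]),
    columns=pd.MultiIndex.from_tuples([("sum__num", "2007"), ("sum__num", "2008")]),
)


def flatten(labels):
    """Join tuple labels into one space-separated string; keep plain labels as-is."""
    return [
        " ".join(str(name) for name in label).strip()
        if isinstance(label, tuple)
        else label
        for label in labels
    ]


df.columns = flatten(df.columns)
df.index = flatten(df.index)
print(list(df.columns), list(df.index))
```

After flattening, every label is a plain string, which is what allows the result to be encoded as JSON keys later on.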

@dosubot dosubot bot added the data:csv Related to import/export of CSVs label Nov 18, 2024
@frlm frlm mentioned this pull request Nov 18, 2024
Labels
data:csv Related to import/export of CSVs size/XS