Pivoted CSV export fix when CSV_EXPORT values are not default ones #30961

frlm · 2024-11-18T13:49:36Z

Title: fix(csv_export): use custom CSV_EXPORT parameters in pd.read_csv

Bug description

Function: apply_post_process

The issue is that pd.read_csv uses the default values of pandas instead of the parameters defined in CSV_EXPORT in superset_config. This problem is rarely noticeable when using the separator , and the decimal .. However, with the configuration CSV_EXPORT='{"encoding": "utf-8", "sep": ";", "decimal": ","}', the issue becomes evident. This change ensures that pd.read_csv uses the parameters defined in CSV_EXPORT.

Steps to reproduce error:

Configure CSV_EXPORT with the following parameters:

CSV_EXPORT = {
    "encoding": "utf-8",
    "sep": ";",
    "decimal": ","
}

Open a default chart in Superset of the Pivot Table type. In this example, we are using Pivot Table v2 within the USA Births Names dashboard:

Click on Download > Export to Pivoted .CSV
Download is blocked by an error.

Cause: The error is generated by an anomaly in the input DataFrame df, which has the following format (a single column with all distinct fields separated by a semicolon separator):

,state;name;sum__num
0,other;Michael;1047996
1,other;Christopher;803607
2,other;James;749686

Fix: Added a bug fix to read data with right CSV_EXPORT settings

Code Changes:

        elif query["result_format"] == ChartDataResultFormat.CSV:
            df = pd.read_csv(StringIO(data), 
                             delimiter=superset_config.CSV_EXPORT.get('sep'),
                             encoding=superset_config.CSV_EXPORT.get('encoding'),
                             decimal=superset_config.CSV_EXPORT.get('decimal'))

Complete Code

def apply_post_process(
    result: dict[Any, Any],
    form_data: Optional[dict[str, Any]] = None,
    datasource: Optional[Union["BaseDatasource", "Query"]] = None,
) -> dict[Any, Any]:
    form_data = form_data or {}

    viz_type = form_data.get("viz_type")
    if viz_type not in post_processors:
        return result

    post_processor = post_processors[viz_type]

    for query in result["queries"]:
        if query["result_format"] not in (rf.value for rf in ChartDataResultFormat):
            raise Exception(  # pylint: disable=broad-exception-raised
                f"Result format {query['result_format']} not supported"
            )

        data = query["data"]

        if isinstance(data, str):
            data = data.strip()

        if not data:
            # do not try to process empty data
            continue

        if query["result_format"] == ChartDataResultFormat.JSON:
            df = pd.DataFrame.from_dict(data)
        elif query["result_format"] == ChartDataResultFormat.CSV:
            df = pd.read_csv(StringIO(data), 
                             delimiter=superset_config.CSV_EXPORT.get('sep'),
                             encoding=superset_config.CSV_EXPORT.get('encoding'),
                             decimal=superset_config.CSV_EXPORT.get('decimal'))
            
        # convert all columns to verbose (label) name
        if datasource:
            df.rename(columns=datasource.data["verbose_map"], inplace=True)

        processed_df = post_processor(df, form_data, datasource)

        query["colnames"] = list(processed_df.columns)
        query["indexnames"] = list(processed_df.index)
        query["coltypes"] = extract_dataframe_dtypes(processed_df, datasource)
        query["rowcount"] = len(processed_df.index)

        # Flatten hierarchical columns/index since they are represented as
        # `Tuple[str]`. Otherwise encoding to JSON later will fail because
        # maps cannot have tuples as their keys in JSON.
        processed_df.columns = [
            " ".join(str(name) for name in column).strip()
            if isinstance(column, tuple)
            else column
            for column in processed_df.columns
        ]
        processed_df.index = [
            " ".join(str(name) for name in index).strip()
            if isinstance(index, tuple)
            else index
            for index in processed_df.index
        ]

        if query["result_format"] == ChartDataResultFormat.JSON:
            query["data"] = processed_df.to_dict()
        elif query["result_format"] == ChartDataResultFormat.CSV:
            buf = StringIO()
            processed_df.to_csv(buf)
            buf.seek(0)
            query["data"] = buf.getvalue()

    return result

Francesco Lombardo added 2 commits November 18, 2024 11:59

fix export pivoted csv

55b389f

fix csv export updating pd.read_csv arguments

5732122

pull-request-size bot added the size/XS label Nov 18, 2024

dosubot bot added the data:csv Related to import/export of CSVs label Nov 18, 2024

frlm mentioned this pull request Nov 18, 2024

Export pivoted table into csv #30658

Open

3 tasks

rusackas requested a review from kgabryje November 18, 2024 18:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pivoted CSV export fix when CSV_EXPORT values are not default ones #30961

Pivoted CSV export fix when CSV_EXPORT values are not default ones #30961

frlm commented Nov 18, 2024

Pivoted CSV export fix when CSV_EXPORT values are not default ones #30961

Are you sure you want to change the base?

Pivoted CSV export fix when CSV_EXPORT values are not default ones #30961

Conversation

frlm commented Nov 18, 2024

Bug description