CSVDataset behaves unexpectedly if src is a DataFrame with a non-default index #8201

@ashgillman

Describe the bug
CSVDataset accepts pandas DataFrames as input for src, but it implicitly assumes that the index labels match the row positions (0, 1, 2, ...).

This is because convert_tables_to_dicts uses .loc instead of .iloc. It generates positional (ordinal) indices to subset on, but then treats them as index labels.

data_ = df.loc[rows] if col_names is None else df.loc[rows, col_names]

To Reproduce

import numpy
import pandas
import monai

df = pandas.DataFrame(numpy.random.random((50, 3)))
df_subset = df.iloc[numpy.arange(0, 50, 5)]
print(df_subset.shape)  # (10, 3)

ds = monai.data.CSVDataset(df_subset)
print(len(ds))  # 3

Expected behavior
print(len(ds)) should return 10.
It returns 3 because it looks up slice(10) by label with .loc, and label slices include the end point, so it matches only the index labels 0, 5 and 10 from the subset.
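
The mismatch can be seen with pandas alone, without MONAI. This is a minimal sketch that assumes rows is slice(10), as described above; the variable names by_label and by_position are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((50, 3)))
# Keep every 5th row: the index labels are now 0, 5, 10, ..., 45.
df_subset = df.iloc[np.arange(0, 50, 5)]

rows = slice(10)  # intended meaning: "the first 10 rows"

# .loc slices by label and includes the end label, so only 0, 5, 10 match.
by_label = df_subset.loc[rows]
# .iloc slices by position, so all 10 rows are selected.
by_position = df_subset.iloc[rows]

print(list(by_label.index), len(by_label))
print(len(by_position))
```

With .loc only three rows come back, which matches the len(ds) == 3 seen in the reproduction.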

Environment
Shouldn't be relevant?

Additional context
Simple fix:

data_ = df.loc[rows] if col_names is None else df.loc[rows, col_names]

The first .loc should be .iloc, and the second should be df.iloc[rows][col_names], so that rows are selected by position and columns by name.
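
A sketch of the proposed change in isolation, to check it against a DataFrame with a non-default index. The helper name select_rows and the column names "a", "b", "c" are hypothetical and only for illustration; in MONAI the line lives inside convert_tables_to_dicts, and this has not been run against the MONAI test suite.

```python
import numpy as np
import pandas as pd


def select_rows(df, rows, col_names=None):
    """Hypothetical helper: pick rows by position, then columns by name."""
    data_ = df.iloc[rows]  # positional, independent of the index labels
    return data_ if col_names is None else data_[col_names]


df = pd.DataFrame(np.random.random((50, 3)), columns=["a", "b", "c"])
df_subset = df.iloc[np.arange(0, 50, 5)]  # index labels 0, 5, ..., 45

all_rows = select_rows(df_subset, slice(df_subset.shape[0]))
some = select_rows(df_subset, [0, 2], col_names=["a", "c"])

print(all_rows.shape)
print(list(some.index), list(some.columns))
```

Selecting rows with .iloc first and then indexing columns by name avoids mixing positional and label-based lookups in a single .loc call.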
