nrows in benchmark_dataframe.py might skew results for test_Read_CSV #3

Open

VibhuJawa opened this issue Aug 19, 2019 · 0 comments

We currently use `nrows` to parameterize dataframe reads, which might skew results for smaller files in `test_Read_CSV`:

```python
compute_func = lambda data: m.read_csv(data["path"], nrows=data["nrows"])
```

In the current cudf implementation, I think passing `nrows` still parses the entire file into GPU memory to find line terminators (and quote characters).

This might skew our results for reads that use `nrows`, so we might want to change it.

See comments rapidsai/cudf#1643 (comment) and rapidsai/cudf#1643 (comment) on issue rapidsai/cudf#1643.

In the tests below I found a sizable performance delta (72.8 ms vs 1.6 s).

Take the head of the file for reading:

```python
import cudf
!head -n 100001 '/datasets/nyc_taxi/2015/yellow_tripdata_2015-01.csv' > 'yellow_tripdata_2015-01_head_100k.csv'
```

Timing on reading from the small file:

```python
%timeit df = cudf.read_csv('yellow_tripdata_2015-01_head_100k.csv', nrows=100_000)
```

```
72.8 ms ± 28.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Timing on reading the whole file:

```python
%timeit df = cudf.read_csv('/datasets/nyc_taxi/2015/yellow_tripdata_2015-01.csv', nrows=100_000)
```

```
1.6 s ± 45.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
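
One possible fix would be to trim each input file to the target row count once during benchmark setup and then time a plain read of the trimmed file, so `read_csv` never has to scan past the rows we actually want. A minimal sketch of that idea (the `make_head_file` helper and the `trimmed_path` key are hypothetical names, not anything currently in benchmark_dataframe.py):

```python
# Sketch of a possible workaround (assumption, not the benchmark's current
# code): trim each CSV to the desired row count once during setup, then time
# read_csv on the trimmed file without passing nrows at all.
import itertools

def make_head_file(src_path, dst_path, nrows):
    """Copy the header line plus the first `nrows` data rows to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(itertools.islice(src, nrows + 1))

# Hypothetical use in benchmark setup:
# make_head_file(data["path"], data["trimmed_path"], data["nrows"])
# compute_func = lambda data: m.read_csv(data["trimmed_path"])
```

This would keep the parameterization by row count while making the timed work proportional to the rows actually read.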