nrows in benchmark_dataframe.py might skew results for test_Read_CSV #3

Open

VibhuJawa opened this issue Aug 19, 2019 · 0 comments

We currently use `nrows` to parameterize dataframe reads, which might skew results for smaller files in `test_Read_CSV`:

```python
compute_func = lambda data: m.read_csv(data["path"], nrows=data["nrows"])
```

In the current cudf implementation, I think passing `nrows` still parses the entire file into GPU memory to find line terminators (and quote characters).

This might skew our results for reads that use `nrows`, so we might want to change it.

See comments rapidsai/cudf#1643 (comment) and rapidsai/cudf#1643 (comment) on issue rapidsai/cudf#1643.

In the tests below I found a sizable performance delta (72.8 ms vs 1.6 s).

Take the head of the file for reading:

```python
import cudf
!head -n 100001 '/datasets/nyc_taxi/2015/yellow_tripdata_2015-01.csv' > 'yellow_tripdata_2015-01_head_100k.csv'
```

Timing on reading from the small file:

```python
%timeit df = cudf.read_csv('yellow_tripdata_2015-01_head_100k.csv', nrows=100_000)
```

```
72.8 ms ± 28.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Timing on reading the whole file:

```python
%timeit df = cudf.read_csv('/datasets/nyc_taxi/2015/yellow_tripdata_2015-01.csv', nrows=100_000)
```

```
1.6 s ± 45.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
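
One possible fix would be to trim each input file to the target row count once during benchmark setup and then time a plain read of the trimmed file, so `read_csv` never has to scan past the rows we actually want. A minimal sketch of that idea (the `make_head_file` helper and the `trimmed_path` key are hypothetical names, not anything currently in benchmark_dataframe.py):

```python
# Sketch of a possible workaround (assumption, not the benchmark's current
# code): trim each CSV to the desired row count once during setup, then time
# read_csv on the trimmed file without passing nrows at all.
import itertools

def make_head_file(src_path, dst_path, nrows):
    """Copy the header line plus the first `nrows` data rows to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(itertools.islice(src, nrows + 1))

# Hypothetical use in benchmark setup:
# make_head_file(data["path"], data["trimmed_path"], data["nrows"])
# compute_func = lambda data: m.read_csv(data["trimmed_path"])
```

This would keep the parameterization by row count while making the timed work proportional to the rows actually read.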