
Naive implementation of chunking reader #288

Open

jpsamaroo wants to merge 5 commits into master from jps/chunking-reader

Conversation

@jpsamaroo (Collaborator) commented Jul 2, 2019

Replaces #129

TODO:

  • Wire up chunked reading to loadtable
  • Split blocks across multiple workers
  • Don't scale block size by file size
  • Write blocks to disk as they're read

@shashi (Collaborator) commented Jul 26, 2019

Possible steps to solve this problem:

  1. Add a method csvread(io::IO, ...) to TextParse.
  2. Add a method csvread(ios::Vector{<:IO}, ...) to TextParse (see the sketch after this list).
  3. Use these methods in loadtable_serial. Get tests to pass (this will bring us back to the current master state, but using IO objects in place of files).
  4. Use BlockIO chunking with some heuristics to make loadtable chunk its input.
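
A minimal sketch of what step 2 might look like, assuming the single-IO method from step 1 exists and that csvread returns a tuple of column vectors plus the column names (as TextParse's file-based csvread does); the concatenation strategy here is illustrative, not this PR's implementation:

import TextParse: csvread

# Hypothetical step-2 method: parse each IO with the (assumed) step-1
# single-IO method, then vertically concatenate the resulting chunks.
function csvread(ios::Vector{<:IO}; kwargs...)
    results = [csvread(io; kwargs...) for io in ios]
    cols, colnames = results[1]
    for (morecols, _) in results[2:end]
        cols = map(vcat, cols, morecols)  # append each chunk's columns
    end
    return cols, colnames
end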

@jpsamaroo force-pushed the jps/chunking-reader branch from 90f3b5d to 08bc67b on July 27, 2019 at 18:59
@jpsamaroo (Collaborator, Author) commented Jul 27, 2019

Per the hackathon discussion, step 1 is already handled; step 2 would probably be a good idea too. Steps 3 and 4 to follow. Additionally, for step 4, we should probably provide a utility function (or just a slightly different kwarg to loadtable) to ensure that the size of each block derived from the nblocks argument doesn't increase with the size of the file, as it does right now. A second kwarg like blockmax might be in order to limit how large any individual block can be (a sketch follows below).
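
For illustration, such a heuristic could derive the block count from a size cap rather than the other way around. This is a sketch only: blockmax is the kwarg proposed above (not an existing loadtable option), nblocks_for is a hypothetical helper, and the 64 MiB default is arbitrary.

# Hypothetical helper: cap each block at `blockmax` bytes, so the block
# *count* grows with the file while the block *size* stays bounded.
function nblocks_for(file::AbstractString; blockmax::Integer = 64 * 2^20)
    fsize = filesize(file)
    return max(cld(fsize, blockmax), 1)  # ceiling division; at least one block
end

Under this scheme a 1 GiB file yields 16 blocks at the default cap, and doubling the file size doubles the block count instead of the block size.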

@jpsamaroo force-pushed the jps/chunking-reader branch from 08bc67b to d748ea5 on December 4, 2019 at 03:19
Import BlockIO/ChunkIter from Dagger
Wire blocking into loadtable
@jpsamaroo force-pushed the jps/chunking-reader branch from d748ea5 to b3dffb1 on December 4, 2019 at 03:21
@jpsamaroo changed the title from "[WIP] Naive implementation of chunking reader" to "Naive implementation of chunking reader" on Dec 4, 2019
@jpsamaroo changed the title from "Naive implementation of chunking reader" to "[WIP] Naive implementation of chunking reader" on Dec 4, 2019
@jpsamaroo (Collaborator, Author)

I almost forgot: I still need to actually implement incremental saving of read blocks to the output file when specified; otherwise we'll still read the whole CSV's data into memory before serializing back out.

@jpsamaroo (Collaborator, Author)

Quick update for onlookers: the latest commit attempts to split individual files into blocks before calling _loadtable_serial so that each block can be saved to disk (and thus removed from memory) when output !== nothing, before moving to the next block. This was the main reason I picked up this work: to allow loading enormous single CSVs without having to "buy more RAM". Once this part is working, then this PR will be ready for review.
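
A minimal sketch of that flow, with the caveat that blocks, _loadtable_serial, and save stand in for the PR's internals; their exact names and signatures here are assumptions, not code from this PR:

# Hypothetical per-block load-and-spill loop: parse one block, write it
# out, and drop it from memory before reading the next block, keeping
# peak memory near a single block's size.
function loadtable_chunked(file, nblocks; output = nothing)
    tables = Any[]
    for (i, bio) in enumerate(blocks(file, '\n', nblocks))
        tbl = _loadtable_serial([bio])               # parse one block
        if output !== nothing
            save(tbl, joinpath(output, "chunk_$i"))  # spill to disk
        else
            push!(tables, tbl)                       # no output: keep in memory
        end
    end
    return tables
end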

@jpsamaroo force-pushed the jps/chunking-reader branch from f9c3bfb to 9c6d5dd on December 6, 2019 at 13:40
@jpsamaroo changed the title from "[WIP] Naive implementation of chunking reader" to "Naive implementation of chunking reader" on Dec 6, 2019
@jpsamaroo (Collaborator, Author)

@tanmaykm @shashi done and ready for review!

@jpsamaroo (Collaborator, Author)

Bump, anyone up for reviewing this?

Review thread on src/io.jl (outdated):
# Break the file into blocks of roughly `blocksize` bytes or less,
# splitting only at newline boundaries.
fsize = filesize(file)
nblocks = max(cld(fsize, blocksize), 1)  # ceiling division so blocks don't exceed `blocksize`
bios = blocks(file, '\n', nblocks)
A Collaborator commented:
nice, I love that '\n' feature.

Further review threads on src/util.jl and src/io.jl were marked resolved (outdated).
@jpsamaroo (Collaborator, Author) commented Feb 4, 2020

Looks like some change in TextParse 1.0 is breaking the ability to pass nrows=1 during header parsing (since this passes locally with a pre-1.0 TextParse).

EDIT: nrows was renamed so that the kwarg stays available for what we actually need from TextParse (the previous nrows didn't do what I expected; it's just an optimization mechanism).
