
Naive implementation of chunking reader #288

Open

jpsamaroo wants to merge 5 commits into master from jps/chunking-reader

Conversation

@jpsamaroo (Collaborator) commented Jul 2, 2019

Replaces #129

TODO:

  • Wire up chunked reading to loadtable
  • Split blocks across multiple workers
  • Don't scale block size by file size
  • Write blocks to disk as they're read

@shashi (Collaborator) commented Jul 26, 2019

Possible steps to solve this problem:

  1. Add a method csvread(io::IO, ...) to TextParse.
  2. Add a method csvread(ios::Vector{<:IO}, ...) to TextParse (see the sketch after this list).
  3. Use these methods in loadtable_serial. Get tests to pass (this will bring us back to the current master state, but using IO objects in place of files).
  4. Use BlockIO chunking with some heuristics to make loadtable chunk its input.
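
A minimal sketch of what step 2 might look like, assuming the single-IO method from step 1 exists and that csvread returns a tuple of column vectors plus the column names (as TextParse's file-based csvread does); the concatenation strategy here is illustrative, not this PR's implementation:

import TextParse: csvread

# Hypothetical step-2 method: parse each IO with the (assumed) step-1
# single-IO method, then vertically concatenate the resulting chunks.
function csvread(ios::Vector{<:IO}; kwargs...)
    results = [csvread(io; kwargs...) for io in ios]
    cols, colnames = results[1]
    for (morecols, _) in results[2:end]
        cols = map(vcat, cols, morecols)  # append each chunk's columns
    end
    return cols, colnames
end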

@jpsamaroo force-pushed the jps/chunking-reader branch from 90f3b5d to 08bc67b on July 27, 2019 at 18:59
@jpsamaroo (Collaborator, Author) commented Jul 27, 2019

Per the hackathon discussion, step 1 is already handled; step 2 would probably be a good idea too. Steps 3 and 4 to follow. Additionally, for step 4, we should probably provide a utility function (or just a slightly different kwarg to loadtable) to ensure that the size of each block derived from the nblocks argument doesn't increase with the size of the file, as it does right now. A second kwarg like blockmax might be in order to limit how large any individual block can be (a sketch follows below).
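
For illustration, such a heuristic could derive the block count from a size cap rather than the other way around. This is a sketch only: blockmax is the kwarg proposed above (not an existing loadtable option), nblocks_for is a hypothetical helper, and the 64 MiB default is arbitrary.

# Hypothetical helper: cap each block at `blockmax` bytes, so the block
# *count* grows with the file while the block *size* stays bounded.
function nblocks_for(file::AbstractString; blockmax::Integer = 64 * 2^20)
    fsize = filesize(file)
    return max(cld(fsize, blockmax), 1)  # ceiling division; at least one block
end

Under this scheme a 1 GiB file yields 16 blocks at the default cap, and doubling the file size doubles the block count instead of the block size.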

@jpsamaroo force-pushed the jps/chunking-reader branch from 08bc67b to d748ea5 on December 4, 2019 at 03:19
Import BlockIO/ChunkIter from Dagger
Wire blocking into loadtable
@jpsamaroo force-pushed the jps/chunking-reader branch from d748ea5 to b3dffb1 on December 4, 2019 at 03:21
@jpsamaroo changed the title from "[WIP] Naive implementation of chunking reader" to "Naive implementation of chunking reader" on Dec 4, 2019
@jpsamaroo changed the title from "Naive implementation of chunking reader" to "[WIP] Naive implementation of chunking reader" on Dec 4, 2019
@jpsamaroo (Collaborator, Author)

I almost forgot: I still need to actually implement incremental saving of read blocks to the output file when specified; otherwise we'll still read the whole CSV's data into memory before serializing back out.

@jpsamaroo (Collaborator, Author)

Quick update for onlookers: the latest commit attempts to split individual files into blocks before calling _loadtable_serial so that each block can be saved to disk (and thus removed from memory) when output !== nothing, before moving to the next block. This was the main reason I picked up this work: to allow loading enormous single CSVs without having to "buy more RAM". Once this part is working, then this PR will be ready for review.
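
A minimal sketch of that flow, with the caveat that blocks, _loadtable_serial, and save stand in for the PR's internals; their exact names and signatures here are assumptions, not code from this PR:

# Hypothetical per-block load-and-spill loop: parse one block, write it
# out, and drop it from memory before reading the next block, keeping
# peak memory near a single block's size.
function loadtable_chunked(file, nblocks; output = nothing)
    tables = Any[]
    for (i, bio) in enumerate(blocks(file, '\n', nblocks))
        tbl = _loadtable_serial([bio])               # parse one block
        if output !== nothing
            save(tbl, joinpath(output, "chunk_$i"))  # spill to disk
        else
            push!(tables, tbl)                       # no output: keep in memory
        end
    end
    return tables
end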

@jpsamaroo force-pushed the jps/chunking-reader branch from f9c3bfb to 9c6d5dd on December 6, 2019 at 13:40
@jpsamaroo changed the title from "[WIP] Naive implementation of chunking reader" to "Naive implementation of chunking reader" on Dec 6, 2019
@jpsamaroo (Collaborator, Author)

@tanmaykm @shashi done and ready for review!

@jpsamaroo (Collaborator, Author)

Bump, anyone up for reviewing this?

Review thread on src/io.jl (outdated):
# Break the file into blocks of roughly `blocksize` bytes or less,
# splitting only at newline boundaries.
fsize = filesize(file)
nblocks = max(cld(fsize, blocksize), 1)  # ceiling division so blocks don't exceed `blocksize`
bios = blocks(file, '\n', nblocks)
A Collaborator commented:
nice, I love that '\n' feature.

Further review threads on src/util.jl and src/io.jl were marked resolved (outdated).
@jpsamaroo (Collaborator, Author) commented Feb 4, 2020

Looks like some change in TextParse 1.0 is breaking the ability to pass nrows=1 during header parsing (since this passes locally with a pre-1.0 TextParse).

EDIT: nrows was renamed so that the kwarg stays available for what we actually need from TextParse (the previous nrows didn't do what I expected; it's just an optimization mechanism).
