Implement multi-threading data generation #10
Comments
Yes, I am planning both (multiple tables in parallel and support for multiple generators feeding writers).
I would love to help.
I have been thinking about how we would use multiple threads to generate line item data. What I was thinking about so far is:

Then we could generate output chunks in parallel, like:

    let iter = LineItemGenerator::new(...).iter();
    // make first 1000 rows into a buffer
    let chunk_1 = generate_lines(iter.clone().take(1000));
    // make next 1000 rows ...
    let iter = iter.skip_num(1000); // call special skip which just advances the rngs
    let chunk_2 = generate_lines(iter.clone().take(1000));
    ...
    let chunk_N = generate_lines(iter.clone().take(1000));
    println!("{}", chunk_1);
    println!("{}", chunk_2);
    ...
    println!("{}", chunk_N);

So this solution requires advancing each random number generator twice, but I think it could put all the cores to use pretty effectively.
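To make the chunking idea concrete, here is a minimal, self-contained sketch. It does not use the project's real generators: `RowGen`, `skip_rows`, and `generate_parallel` are hypothetical names, and the "row" is just a counter plus one value from a toy linear congruential generator. The point it illustrates is that the main thread only advances the RNG state (the cheap skip), while each worker thread pays for the formatting.

```rust
use std::thread;

/// Hypothetical stand-in for a table generator: a toy LCG plus a row counter.
#[derive(Clone)]
struct RowGen {
    state: u64,
    row: u64,
}

impl RowGen {
    fn new(seed: u64) -> Self {
        Self { state: seed, row: 0 }
    }

    /// Advance the RNG by one step and return a value.
    fn next_value(&mut self) -> u64 {
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.state >> 33
    }

    /// The "special skip": advance the RNG state without formatting any rows.
    fn skip_rows(&mut self, n: usize) {
        for _ in 0..n {
            self.next_value();
        }
        self.row += n as u64;
    }
}

impl Iterator for RowGen {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        let value = self.next_value();
        self.row += 1;
        Some(format!("{}|{}", self.row, value))
    }
}

/// Format a run of rows into one output buffer.
fn generate_lines(rows: impl Iterator<Item = String>) -> String {
    let mut buf = String::new();
    for row in rows {
        buf.push_str(&row);
        buf.push('\n');
    }
    buf
}

/// Generate `num_chunks` chunks of `chunk_rows` rows each, one thread per chunk.
/// Chunks are returned in order, so concatenating them matches serial output.
fn generate_parallel(seed: u64, chunk_rows: usize, num_chunks: usize) -> Vec<String> {
    let mut generator = RowGen::new(seed);
    thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..num_chunks {
            // Each worker gets a clone positioned at the start of its chunk.
            let chunk_gen = generator.clone();
            handles.push(s.spawn(move || generate_lines(chunk_gen.take(chunk_rows))));
            // The main thread only advances the RNG past that chunk.
            generator.skip_rows(chunk_rows);
        }
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Calling `generate_parallel(42, 1000, 4)` and printing the returned buffers in order yields the same text as a single-threaded pass over 4000 rows. A real implementation would cap the number of threads rather than spawn one per chunk.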
I kind of reached the same conclusion, although I am more inclined to hide
Perhaps we can start just using the normal iterator
I actually played around with this this morning -- I think we can use the existing "part count" functionality to do this. I'll monkey with it.
I am quite pleased: I have a proof of concept showing that this approach can be made to work (and it generates data crazy fast).
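As a rough illustration of how a "part count" can drive parallel generation, the sketch below splits a row range into numbered parts and formats each part on its own thread. This is an assumption-laden toy: `part_range` and `generate_all_parts` are hypothetical helpers, parts are taken to be 1-based, and the project's actual partitioning rules may differ.

```rust
use std::ops::Range;
use std::thread;

/// Hypothetical helper: the row range covered by `part` (1-based) out of
/// `part_count` parts. The first `total_rows % part_count` parts get one extra row.
fn part_range(total_rows: u64, part: u64, part_count: u64) -> Range<u64> {
    assert!(part >= 1 && part <= part_count, "part must be in 1..=part_count");
    let base = total_rows / part_count;
    let extra = total_rows % part_count;
    let start = (part - 1) * base + (part - 1).min(extra);
    let len = base + if part <= extra { 1 } else { 0 };
    start..start + len
}

/// Format every part on its own thread and return the buffers in part order.
fn generate_all_parts(total_rows: u64, part_count: u64) -> Vec<String> {
    thread::scope(|s| {
        let handles: Vec<_> = (1..=part_count)
            .map(|part| {
                s.spawn(move || {
                    let mut buf = String::new();
                    // Placeholder row content; a real generator would emit table columns.
                    for row in part_range(total_rows, part, part_count) {
                        buf.push_str(&format!("row {row}\n"));
                    }
                    buf
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Because each part is defined only by `(part, part_count)`, no state has to be shared between threads, which is what makes this approach attractive compared with cloning and skipping a single iterator.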
- Part of #10

This PR removes some more

1. Consolidates format strings in the cli and tests to use the `Display` impl
2. Avoids some more String copies by postponing formatting

Performance of running

```shell
time target/release/tpchgen-cli -s 1 --output-dir=/tmp/tpchdbgen-rs
```

| branch | time |
|--------|--------|
| main | 0m6.637s |
| this PR | 0m6.265s |
Another thing that would also likely be helpful for generation is to make the code multi-threaded -- so one thread creates the output stream (of `Arc<String>` or whatever) and another thread actually writes them out to CSV files.

It would also be amazing to generate all the tables in parallel (rather than doing it one at a time).
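The generator/writer split described above could be sketched with a bounded channel from the standard library. Everything here is illustrative: `generate_and_write` is a hypothetical function, the rows are placeholders, and plain `String` buffers are sent instead of `Arc<String>` for simplicity.

```rust
use std::io::{self, Write};
use std::sync::mpsc;
use std::thread;

/// Generate `num_chunks` buffers on the calling thread while a second thread
/// writes them to `out`. Returns the writer so the caller can inspect or close it.
fn generate_and_write<W: Write + Send>(
    mut out: W,
    num_chunks: usize,
    rows_per_chunk: usize,
) -> io::Result<W> {
    // Bounded channel: the generator blocks if the writer falls 4 chunks behind.
    let (tx, rx) = mpsc::sync_channel::<String>(4);
    thread::scope(|s| {
        // Writer thread: drains the channel until the sender is dropped.
        let writer = s.spawn(move || -> io::Result<W> {
            for chunk in rx {
                out.write_all(chunk.as_bytes())?;
            }
            out.flush()?;
            Ok(out)
        });

        // Generator side: format rows into buffers and hand them off.
        for c in 0..num_chunks {
            let mut buf = String::new();
            for r in 0..rows_per_chunk {
                // Placeholder row; a real generator would emit table columns.
                buf.push_str(&format!("{}|placeholder\n", c * rows_per_chunk + r));
            }
            // A send error means the writer stopped early (e.g. an I/O error).
            if tx.send(buf).is_err() {
                break;
            }
        }
        drop(tx); // Close the channel so the writer loop ends.
        writer.join().unwrap()
    })
}
```

Passing a `std::fs::File` (or a `BufWriter` around one) as `out` would write a CSV-style file. Running one such pair per table is one possible way to generate all tables in parallel.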