Implement multi-threading data generation #10

Closed
alamb opened this issue Mar 10, 2025 · 7 comments · Fixed by #58

@alamb
Collaborator

alamb commented Mar 10, 2025

Another thing that would also likely be helpful for generation is to make the code multi-threaded -- so one thread creates the output stream (of Arc<String> or whatever) and another thread actually writes them out to CSV files.

It would also be amazing to generate all the tables in parallel (rather than doing it one at a time)
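As a rough sketch of the two ideas above (not the project's actual code): one thread produces rows and sends them over a bounded channel to a second thread that writes them out, and a scoped thread per table runs several such pipelines at once. `generate_rows`, `generate_and_write`, `generate_all_tables`, and the row format are hypothetical stand-ins, and an in-memory `Vec<u8>` plays the role of a CSV file.

```rust
use std::io::{self, Write};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Hypothetical row source standing in for a real table generator.
fn generate_rows(n: usize) -> impl Iterator<Item = Arc<String>> {
    (0..n).map(|i| Arc::new(format!("{},item-{}\n", i, i)))
}

/// Generate `n` rows on one thread and write them to `out` on another.
/// Returns the writer once every row has been written and flushed.
fn generate_and_write<W: Write + Send + 'static>(n: usize, mut out: W) -> io::Result<W> {
    // Bounded channel: the generator blocks instead of running far ahead of the writer.
    let (tx, rx) = mpsc::sync_channel::<Arc<String>>(1024);

    let producer = thread::spawn(move || {
        for row in generate_rows(n) {
            // Stop early if the writer side has hung up (for example after an I/O error).
            if tx.send(row).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the receive loop in the writer thread.
    });

    let writer = thread::spawn(move || -> io::Result<W> {
        for row in rx {
            out.write_all(row.as_bytes())?;
        }
        out.flush()?;
        Ok(out)
    });

    producer.join().expect("generator thread panicked");
    writer.join().expect("writer thread panicked")
}

/// Generate several (hypothetical) tables at once, one scoped thread per table.
/// Each entry is `(table name, row count)`; results come back in input order.
fn generate_all_tables(tables: &[(&str, usize)]) -> Vec<(String, Vec<u8>)> {
    thread::scope(|s| {
        // Spawn every table first, then join, so the tables really run concurrently.
        let handles: Vec<_> = tables
            .iter()
            .map(|&(name, rows)| {
                s.spawn(move || {
                    let bytes = generate_and_write(rows, Vec::new()).expect("write failed");
                    (name.to_string(), bytes)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("table thread panicked"))
            .collect()
    })
}
```

Passing a `std::fs::File` (or a `BufWriter` around one) instead of `Vec::new()` would write to disk; the channel capacity of 1024 is an arbitrary choice.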

@clflushopt clflushopt self-assigned this Mar 11, 2025
@clflushopt clflushopt added this to the v0.1.0 milestone Mar 11, 2025
@clflushopt clflushopt added the enhancement New feature or request label Mar 11, 2025
@clflushopt
Owner

Yes I am planning both (multiple tables in parallel and supporting multiple generators to writers).

@alamb
Collaborator Author

alamb commented Mar 14, 2025

> Yes I am planning both (multiple tables in parallel and supporting multiple generators to writers).

I would love to help

@alamb
Collaborator Author

alamb commented Mar 17, 2025

I have been thinking about "how would we use multiple threads to generate line item data"

What I was thinking about so far is:

  1. Make sure all generator / iterators are Clone (so you can .clone() to get a copy)
  2. Add a skip_num(num_records: usize) method to each one

Then we could generate output chunks in parallel like

let iter = LineItemGenerator::new(...).iter();

// generate the first 1000 rows into a buffer
let chunk_1 = generate_lines(iter.clone().take(1000));
// generate the next 1000 rows ...
let iter = iter.skip_num(1000); // special skip that only advances the RNGs
let chunk_2 = generate_lines(iter.clone().take(1000));
...
let chunk_N = generate_lines(iter.clone().take(1000));

// print the chunks in order
println!("{}", chunk_1);
println!("{}", chunk_2);
...
println!("{}", chunk_N);

I think skip_num would be pretty fast as it just advances the random number generators (and could skip all actual row generation logic)

So this solution requires advancing each random number generator 2x (once to skip, once to generate), but I think it could put all the cores to use pretty effectively
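To make the clone-and-skip idea concrete, here is a minimal, self-contained sketch. Everything in it is an assumption for illustration: `Lcg`, `RowIter`, the two random draws per row, and the row format are made up and are not the library's generators. The key property is that `skip_num` jumps the RNG ahead without producing any rows, so each thread can start at its own offset and the concatenated chunks match sequential output.

```rust
use std::thread;

// Constants of a common 64-bit linear congruential generator (illustrative choice).
const MUL: u64 = 6364136223846793005;
const INC: u64 = 1442695040888963407;

#[derive(Clone)]
struct Lcg {
    state: u64,
}

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_mul(MUL).wrapping_add(INC);
        self.state >> 33
    }

    /// Jump ahead by `n` draws in O(log n) by repeatedly squaring the affine step.
    fn advance(&mut self, mut n: u64) {
        let (mut acc_mul, mut acc_add) = (1u64, 0u64);
        let (mut cur_mul, mut cur_add) = (MUL, INC);
        while n > 0 {
            if n & 1 == 1 {
                acc_mul = acc_mul.wrapping_mul(cur_mul);
                acc_add = acc_add.wrapping_mul(cur_mul).wrapping_add(cur_add);
            }
            // Compose the current step with itself: f(f(x)) = m^2 x + (m + 1) a.
            cur_add = cur_mul.wrapping_add(1).wrapping_mul(cur_add);
            cur_mul = cur_mul.wrapping_mul(cur_mul);
            n >>= 1;
        }
        self.state = acc_mul.wrapping_mul(self.state).wrapping_add(acc_add);
    }
}

/// Hypothetical row iterator: every row consumes exactly two random draws.
#[derive(Clone)]
struct RowIter {
    rng: Lcg,
    next_key: u64,
}

impl RowIter {
    fn new(seed: u64) -> Self {
        RowIter { rng: Lcg { state: seed }, next_key: 1 }
    }

    /// Skip `num_records` rows by advancing the RNG only; no rows are built.
    fn skip_num(mut self, num_records: usize) -> Self {
        self.rng.advance(2 * num_records as u64);
        self.next_key += num_records as u64;
        self
    }
}

impl Iterator for RowIter {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        let quantity = self.rng.next_u64() % 50 + 1;
        let price = self.rng.next_u64() % 100_000;
        let line = format!("{}|{}|{}\n", self.next_key, quantity, price);
        self.next_key += 1;
        Some(line)
    }
}

fn generate_lines(rows: impl Iterator<Item = String>) -> String {
    rows.collect()
}

/// Generate `num_chunks` chunks of `chunk_rows` rows, one scoped thread per chunk,
/// and concatenate them in order.
fn generate_parallel(seed: u64, chunk_rows: usize, num_chunks: usize) -> String {
    let base = RowIter::new(seed);
    thread::scope(|s| {
        let handles: Vec<_> = (0..num_chunks)
            .map(|i| {
                let iter = base.clone().skip_num(i * chunk_rows);
                s.spawn(move || generate_lines(iter.take(chunk_rows)))
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("chunk thread panicked"))
            .collect()
    })
}
```

The fast skip only works here because each row uses a fixed number of draws; a generator whose draw count varies per row would need a different scheme.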

@alamb alamb changed the title from "Implement multi-threading" to "Implement multi-threading data generation" Mar 17, 2025
@clflushopt
Owner

I kind of reached the same conclusion; although I am more inclined to hide skip_num logic here. My line of thought is that library users should transparently be able to have parallel and non-parallel iterators (like how rayon exposes par_iter). But I'll have to tinker about it.

@alamb
Collaborator Author

alamb commented Mar 18, 2025

> I kind of reached the same conclusion; although I am more inclined to hide skip_num logic here. My line of thought is that library users should transparently be able to have parallel and non-parallel iterators (like how rayon exposes par_iter). But I'll have to tinker about it.

Perhaps we can start just using the normal iterator skip (which will do a bunch of work that is not strictly necessary). If that turns out to be too slow we can look into a skip_num optimization -- it could well be a premature optimization
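A minimal sketch of that simpler starting point, using only the standard `Iterator::skip`. The `Rows` iterator and its row format are hypothetical placeholders for a real generator. Because `Rows` implements only `next`, `skip(n)` calls `next` for each skipped row, so every thread pays the full cost of building and discarding the rows before its chunk.

```rust
use std::thread;

/// Hypothetical deterministic row iterator; only `next` is implemented.
#[derive(Clone)]
struct Rows {
    next_key: u64,
}

impl Rows {
    fn new() -> Self {
        Rows { next_key: 1 }
    }
}

impl Iterator for Rows {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        let line = format!("{}|row-{}\n", self.next_key, self.next_key);
        self.next_key += 1;
        Some(line)
    }
}

/// Build `num_chunks` chunks of `chunk_rows` rows each, one scoped thread per chunk.
/// Chunk `i` clones the iterator and uses the ordinary `skip` to reach its offset.
fn generate_chunks(chunk_rows: usize, num_chunks: usize) -> Vec<String> {
    let rows = Rows::new();
    thread::scope(|s| {
        let handles: Vec<_> = (0..num_chunks)
            .map(|i| {
                let chunk = rows.clone().skip(i * chunk_rows).take(chunk_rows);
                // The skipped rows are produced and dropped inside this thread.
                s.spawn(move || chunk.collect::<String>())
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("chunk thread panicked"))
            .collect()
    })
}
```

With this approach the last chunk does about as much generation work as a full sequential run, so the wall-clock gain comes only from overlapping that work across cores; measuring it would show whether a dedicated fast skip is worth adding.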

@alamb
Collaborator Author

alamb commented Mar 18, 2025

I actually played around with this this morning -- I think we can use the existing "part count" functionality to do this. I'll monkey with it
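For illustration only, one way a "part count" split can drive parallel generation: divide the row range into `part_count` contiguous parts and give each part to its own thread. The helper names `part_range` and `generate_in_parts`, the 1-based part numbering, and the way leftover rows are distributed are all assumptions of this sketch, not a description of how the library's part count actually works.

```rust
use std::ops::Range;
use std::thread;

/// Zero-based row range covered by `part` (1-based) out of `part_count` parts.
/// When the rows do not divide evenly, the first parts get one extra row each.
fn part_range(total_rows: u64, part: u64, part_count: u64) -> Range<u64> {
    assert!(part_count > 0 && (1..=part_count).contains(&part));
    let base = total_rows / part_count;
    let extra = total_rows % part_count;
    let idx = part - 1;
    let start = idx * base + idx.min(extra);
    let len = base + u64::from(idx < extra);
    start..start + len
}

/// Generate every part on its own scoped thread and concatenate them in part order.
/// The row format is a placeholder for real row generation.
fn generate_in_parts(total_rows: u64, part_count: u64) -> String {
    thread::scope(|s| {
        let handles: Vec<_> = (1..=part_count)
            .map(|part| {
                s.spawn(move || {
                    part_range(total_rows, part, part_count)
                        .map(|row| format!("{}|row\n", row + 1))
                        .collect::<String>()
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("part thread panicked"))
            .collect()
    })
}
```

Since each part depends only on `(total_rows, part, part_count)`, no state is shared between threads and the output order is fixed by the part number.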

@alamb
Collaborator Author

alamb commented Mar 19, 2025

I am quite pleased: I have a proof of concept showing that this approach can be made to work (and it generates data crazy fast)

alamb added a commit that referenced this issue Mar 19, 2025
- Part of #10

This PR removes some more
1. Consolidates format strings in the cli and tests to use the `Display` impl
2. Avoids some more String copies by postponing formatting

Performance of running
```shell
time target/release/tpchgen-cli -s 1 --output-dir=/tmp/tpchdbgen-rs
```
| branch | time |
|--------|--------|
| main | 0m6.637s |
| this PR | 0m6.265s |
@alamb alamb closed this as completed in #58 Mar 25, 2025
@alamb alamb closed this as completed in ab720a7 Mar 25, 2025