copy-from performance with lots of chunks #127
Comments
Hello, thanks for the report. I need to spend some time on this. Did you try to optimize the calls through the _writev mechanism? From my vague understanding, this is a use case where _writev would be useful to batch several rows, and that is why I added the _writev implementation in the first place. I never used it myself, but according to https://stackoverflow.com/questions/41333617/correct-usage-of-writev-in-node-js, you would need to cork the stream, write your rows, and then uncork it.
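A rough sketch of that cork/write/uncork pattern, assuming client is a connected pg client and my_table / rows are placeholders (copyStream is the writable returned by pg-copy-streams' from()):

  const { from: copyFrom } = require('pg-copy-streams')

  // copyStream is the writable side of the COPY operation
  const copyStream = client.query(copyFrom('COPY my_table FROM STDIN'))

  copyStream.cork()                 // queue writes instead of flushing them one by one
  for (const row of rows) {
    copyStream.write(row + '\n')    // each row is buffered while the stream is corked
  }
  // defer the uncork so the queued rows can reach _writev as a single batch
  process.nextTick(() => copyStream.uncork())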
Just to keep as a reference: https://youknowfordevs.com/2018/10/29/using__writev-to-create-a-fast-writable-stream-for-elasticsearch.html
I appreciate it's not a simple change, thanks for looking.
I looked at when _writev was being called (using the code on master) and tried various values of highWaterMark. It only gets called when writes have queued up behind backpressure, which barely happens here. That said, copy-from's _writev still sends each chunk to the socket separately (one copyData message per chunk).
Corking is hard to make work from the application code when piping, and there does not seem to be a way to do it inside copy-from either: once corked, you don't get any writes through until the stream is uncorked.
Had another look at the blog post. The reason _writev gets called in that elasticsearch stream is that its _write waits for elasticsearch to finish each request before calling next():

  async _write(body, enc, next) {
    try {
      await this.send() // I made this just sleep for 5s
      next() // this gets called when es has finished the query
    } catch (err) {
      next(err)
    }
  }

Because of that, the elasticsearch stream falls behind very quickly and writes pile up in its internal buffer, so _writev gets called with a whole batch. In copy-from, we never wait on anything from postgres before acknowledging a write, so the buffer never builds up.
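For contrast, a minimal sketch (not copy-from's actual code) of a Writable whose _write acknowledges immediately; because the callback fires synchronously, chunks almost never queue up, so _writev is rarely chosen over _write:

  const { Writable } = require('stream')

  class ImmediateWritable extends Writable {
    _write(chunk, enc, next) {
      // pretend the chunk was forwarded somewhere (e.g. a socket) and
      // acknowledge straight away -- nothing accumulates behind this write
      next()
    }
    _writev(chunks, next) {
      // only runs when several chunks were already buffered; with the
      // synchronous _write above that almost never happens
      console.log('writev batch size:', chunks.length)
      next()
    }
  }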
Hum, so this could be the explanation.
We never receive ACKs from postgres, so the elasticsearch solution for batching could not work here, except maybe if we were doing mini-COPY operations (some people in the past have thought about batching the COPYs). That does not sound easy though, because we would always need to meet the row boundaries. I'll look at your benchmarks to see if I can reproduce the problem on my laptop. At this time I don't know what the best choice is between implementing the batching mechanism internally or documenting an official way to batch the rows externally for those who need to push it to eleven ;-) - https://www.youtube.com/watch?v=hW008FcKr3Q
Maybe batching externally could bring both optimizations (batching when backpressure is never met, and _writev when backpressure is met). Could it be that you hit this problem because you have a very high quality postgres architecture that never triggers backpressure?
Absolutely, your call. If you want I'll default to batches of 1 (backward compatible) and add the tests to make sure it's working correctly when batch > 1. If not, well, it's been fun figuring it out.
I am testing with a local postgres on a laptop, so it's a unix domain socket rather than a tcp connection. I reckon the network latency is very low but the postgres is below par.

With the same copy-from.js as on master:
existing benchmark:
new benchmark:

Using the batched one:
existing benchmark:
new benchmark:

From this run it would seem that batching improves the existing benchmark as well. Instead of 11, it's like we made 10 a bit louder.
OK, I need to grasp the 2 scenarios you mention in your benchmark. The initial benchmark (wrapping seq in a stream) was supposed to mimic a real world scenario using streams, but you may be right that it does not mimic a scenario with an in-js-loop stream that never stops generating data. Is that the scenario you are having? In my frame of mind, data usually comes from an external source (a db, a file, ..) and there is breathing room from time to time.

Do you know of a tool that would draw/debug the dynamics of a stream pipeline? I have been looking for one for a long time; it would be handy to pinpoint the bottlenecks. I'll do some tests to try to understand what happens in the scenario of the new benchmark.
The real code reads an mbtiles file (that's a sqlite database full of gzipped protocol buffers), passes each row through a gunzip stream, then we do some custom processing to generate the csv rows to pass to copy-from. The data does come from an external source but... each compressed pbf generates multiple rows of csv, so it reads a lot less data than it sends. The benchmark is the result of refining that process down so I could find the bottlenecks and then construct a test case we could focus on. It's not a precise repro of what we have, more an extreme example of the behaviour (what benchmark isn't?).

There are a bunch of factors, and cycle time is no doubt one of them, but I think the main one is the way copy-from writes to the socket. Sending many small chunks just has a high overhead: if I write 1,000,000 characters to a socket in chunks of different sizes, the smallest chunk size makes 1,000,000 calls to socket.write and the largest makes 10, yet they send the same amount of data in total. The test script is here: https://gist.github.com/byrney/f4df10bd85292368f4b7a91bc3705c8b

If _writev made one call to the socket per batch it would cut most of that overhead.

ps. I have not found any easy way to view this stuff, I wish I had. The chrome profiler is just a mess of tiny calls. In the end I added instrumentation manually to count how many calls were being made to write, writev, flush etc.
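A rough sketch of the shape of that experiment (not the linked gist): push the same 1,000,000 characters through a local socket, first one character per write, then 100,000 characters per write, and time both. It ignores backpressure for simplicity:

  const net = require('net')

  const server = net.createServer((sock) => sock.resume()).listen(0, () => {
    const client = net.connect(server.address().port, async () => {
      const run = (chunkSize) => new Promise((resolve) => {
        const chunk = 'x'.repeat(chunkSize)
        const calls = 1000000 / chunkSize
        const start = process.hrtime.bigint()
        for (let i = 0; i < calls; i++) client.write(chunk)
        // the callback of this last write fires once everything before it is flushed
        client.write('\n', () => {
          const ms = Number(process.hrtime.bigint() - start) / 1e6
          console.log(`${calls} calls of ${chunkSize} chars: ${ms.toFixed(1)} ms`)
          resolve()
        })
      })
      await run(1)        // 1,000,000 calls to socket.write
      await run(100000)   // 10 calls to socket.write
      client.end()
      server.close()
    })
  })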
I looked a bit into the benchmarks. I tested with the stream-chunker module and it seems to be more or less on par, time-wise, with your implementation. The internals of that module could be improved because it does a systematic row-by-row buffer concat, whereas your implementation is lazier about the concat, which is better memory-wise and probably avoids memory fragmentation for large payloads.

I usually prefer stream composition over adding more features to an implementation, because I am not totally sure that the feature we would add would meet all use cases. For instance, some people could prefer some kind of time-based chunker: aggregate and produce size-based chunks, but still produce smaller chunks every ms-interval if needed. I am not sure that would make sense, because the data is not visible inside postgres before commit, but I have the impression that we have not yet seen all the use cases and that it is a bit too early to integrate the chunking mechanism inside the copy-from implementation. So at this stage I would probably document the 2 use cases in the README + give the name of a good module for chunk aggregation that can be used in the row-by-row use case.

What is your opinion on this? During your tests, did you write a custom aggregator or did you use an existing module?
Other options for the aggregator
pipe into psql COPY x 6.16 ops/sec ±15.45% (37 runs sampled)
All libs set up with size = 512 * 1024.
For a comparison of the 2: I can see that the batched version does not try to force the exact size of the output buffer, and it is integrated in the overall Writable mechanism, so it avoids the overhead of the piping mechanism. block-stream2 maybe benefits from using the ... I would have liked to compare to something using ...
I did some more tests on this. It looks like I had lowered the number of generated lines and that gave wrong results on the benchmarks (only 9999 lines were generated). If I increase the number of lines to 99999, the best result I have so far is with a custom chunker, which is pretty close to what you integrated in your modified copy-from.js. On my machine the performance peaks at a chunk size of 128 * 1024.

  const stream = require('stream')

  class MyChunker extends stream.Transform {
    constructor(options) {
      super(options)
      this.buf = []
      this.bufLength = 0
      this.size = (options && options.size) || 1024 * 128
    }
    _transform(chunk, encoding, callback) {
      // accumulate chunks without concatenating them yet
      this.buf.push(chunk)
      this.bufLength += Buffer.byteLength(chunk)
      if (this.bufLength >= this.size) {
        // once the threshold is reached, concat everything in one go and emit it
        const buf = Buffer.concat(this.buf, this.bufLength)
        this.buf.length = 0
        this.bufLength = 0
        this.push(buf)
      }
      callback()
    }
    _flush(callback) {
      // emit whatever is left when the upstream ends
      if (this.bufLength > 0) {
        const buf = Buffer.concat(this.buf, this.bufLength)
        this.buf = []
        this.bufLength = 0
        this.push(buf)
      }
      callback()
    }
  }
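A usage sketch of composing this chunker in front of copy-from (client is assumed to be a connected pg client; rowSource and my_table are placeholders):

  const { pipeline } = require('stream')
  const { from: copyFrom } = require('pg-copy-streams')

  const chunker = new MyChunker({ size: 128 * 1024 })
  const copyStream = client.query(copyFrom('COPY my_table FROM STDIN'))

  // rowSource emits one csv row per chunk; the chunker aggregates them into
  // ~128 KiB buffers before they reach copy-from
  pipeline(rowSource, chunker, copyStream, (err) => {
    if (err) console.error('copy failed', err)
    else console.log('copy done')
  })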
Thanks for the thorough investigation! I did write a batching transform initially, but it didn't concat the chunks so it didn't have much impact. That's when I started looking at the internals of copy-from, and the changes in my branch were the result of that investigation. If you prefer not to incorporate the chunking into copy-from (totally understandable), then I will do a chunker transform like yours. Anyone else with a similar problem might find this issue and get a good answer too. The only downside I can see is that it won't be incorporated into the copy-from benchmarks, but that's a minor thing. Thanks again.
I agree that there are advantages to having it incorporated into copy-from. It is not so easy for new users of the module to know if and when adding the interim chunker brings a benefit.

Do you have ideas about how we could ease things even more for the network layer? Here we build chunks with a minimum size of 512 x 1024 or 128 x 1024 bytes, which seems to boost things in your case. This is fast because we just concat the buffers without rebuffering to a specific size. Do you observe the ~6x speedup that we observe in the benchmarks in your real world scenario? But we could try to make things really easy on the network layer by exactly matching the size that the network layer expects for its optimum path. Is there such a thing, and can we reach it through node.js?
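Not a definitive answer, but node.js does expose a couple of values on a net.Socket that hint at what the write path is doing (the host/port here are placeholders):

  const net = require('net')

  const socket = net.connect(5432, '127.0.0.1', () => {
    // buffering threshold of the writable side, in bytes
    console.log('writableHighWaterMark:', socket.writableHighWaterMark)
    // bytes currently queued in user land, waiting to be flushed to the kernel
    console.log('bufferSize:', socket.bufferSize)
    socket.end()
  })
  socket.on('error', (err) => console.error(err.message))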
For future reference: in case we need an integrated implementation that exactly respects the chunk size, the approach in https://github.com/ItalyPaleAle/node-stream-chunkify/blob/master/index.js could be adapted. The idea is to create buffers of the expected size as a placeholder for copies of the received chunked data. We could probably adapt it to integrate the 5-byte copyData header (type + length) inside the requested size and thus have 1 call to socket.write instead of 2 per sent copyData.
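A rough sketch of that idea (not the module's code): build each copyData message in a single preallocated buffer so the 5-byte header and the payload go out in one socket.write:

  // CopyData message: Byte1('d') + Int32 length + payload
  function buildCopyData(payload) {
    const msg = Buffer.allocUnsafe(5 + payload.length)
    msg.write('d', 0)                         // message type
    msg.writeUInt32BE(4 + payload.length, 1)  // length counts itself, not the type byte
    payload.copy(msg, 5)
    return msg
  }

  // one socket.write per message instead of two (header, then payload);
  // `socket` here stands for the connection's stream
  socket.write(buildCopyData(Buffer.from('1,foo\n')))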
We have a node process that generates a lot of data [1] and we need to insert it into postgres. Doing an insert per item was terrible (no surprise there); creating big inserts with many rows is better but still very bad. copy-from is better still, but not as much of an improvement as I thought it would be. None of them come close to piping the data into a psql child process.
I created a bench to show the problem here: byrney@1086f50#diff-1b32c28cb38c05480eccc1bd60ff97029b57a05c96718b96dad7e9d84894f549
Like our process, this calls stream.write(...) once per row and then pipes that to copy-from. Our real process sends much wider rows, but we see the same effect.

As a workaround we can pipe through a stream which batches up the chunks before sending them to copy-from. Making a very small number of calls to copy-from.write (at the cost of memory) helps a lot with speed.

A potential improvement to copy-from is in that branch too: https://github.com/byrney/node-pg-copy-streams/tree/private/copy-from-performance. It moves the batching into copy-from itself.
It needs a bit of a tidy-up [2], but if you think the approach is sound and you're open to a PR, I can do that and submit it.
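Roughly the shape of that benchmark (not the code in the linked commit; client, my_table and the row contents are placeholders):

  const { Readable } = require('stream')
  const { from: copyFrom } = require('pg-copy-streams')

  // produce one small csv row per chunk, so the pipe calls copy-from's write once per row
  function rowSource(rows) {
    let i = 0
    return new Readable({
      read() {
        this.push(i < rows ? `${i++},some value\n` : null)
      }
    })
  }

  // client is assumed to be a connected pg Client
  const copyStream = client.query(copyFrom('COPY my_table (id, val) FROM STDIN (FORMAT csv)'))
  rowSource(1000000).pipe(copyStream)
  copyStream.on('finish', () => console.log('done'))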
Footnotes
1. This is a gross simplification but that's the gist of it
2. There is still some debugging logic in there to count the number of calls