Performance analysis #63
A simple test using a ramdisk as the temporary directory. Dockerfile:
Tests with the Docker daemon using a ramdisk as its temporary directory, as well as the squash tool, indicated that the build slowed down(!!!) by about 1s. Interesting.
Docker daemon disk statistics while doing squashing.
And the squashing process:
Docker daemon CPU statistics while doing squashing.
And the squashing process itself:
Now, a bit of CPU stats for the squashing tool itself:
Execution:
And disk usage stats:
Execution:
A lot of reading at the beginning - Docker prepares the tar archive with the image, so it reads all layers. It could be at least twice as fast as it is, but it seems there is a lot to compute (see next chart). In this case the CPU is the bottleneck - we would need to look at the Docker code to see whether the storage plugin could be optimized. At the end a lot of writing - the new squashed image is saved. Lightning fast, so it probably skips already existing layers and only unpacks one tar. Good.

As said above - creating the tar archive uses a lot of CPU. Same at the end. The squash tool writes the exported tar archive (a single file) to the temporary directory very fast. kB_ccwr/s means:

This needs investigation to see if this is what it should be.

Whoa, this is interesting - we keep the CPU busy while squashing. On one hand - good, on the other hand - maybe there is some room for improvement. This is a combined chart that shows CPU usage for two processes: Docker and the squashing tool. These are separate processes, so ideally we should see usage of up to 200% of CPU. This is not the case and most probably never will be, because we use these processes heavily at different times.

Combined disk usage for the Docker daemon and the squashing tool. It seems that the disk is not used that heavily.
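A minimal sketch of how such a combined per-process CPU chart could be collected with psutil. The PIDs, sample count and 1-second interval below are placeholders, not the setup used for the graphs above:

```python
import time
import psutil

docker_daemon = psutil.Process(1234)   # placeholder PID of dockerd
squash_tool = psutil.Process(5678)     # placeholder PID of the squash tool

# Prime the counters; the first cpu_percent() call always returns 0.0.
docker_daemon.cpu_percent()
squash_tool.cpu_percent()

for _ in range(60):
    time.sleep(1)
    d = docker_daemon.cpu_percent()   # percent since the previous call;
    s = squash_tool.cpu_percent()     # can exceed 100% on multi-core hosts
    print(f"docker={d:6.1f}%  squash={s:6.1f}%  combined={d + s:6.1f}%")
```

Since both values are per-process, their sum is what the combined chart plots, and 200% would only show up if both processes kept a full core busy at the same time.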
Is there any way to get a graph showing iowait time during these periods and whether the iowait is because of network or storage?
If I'm reading the overlap of these graphs correctly, then pretty much the entire CPU chart for the represented builds is consumed by the docker pull, squash and docker push jobs. It also looks to me like the Docker operations are definitely limited to a single core for the tarsum calculations, and I suspect a similar issue is happening with the squash tool.
Small note: my time will be limited over the next few days, but I'll update this ticket asap, including new graphs and replies. Please give me a few days.
Am I right that there is no
@twaugh is correct - this is local testing that does not involve any
No. Switching to Docker 1.9 or even 1.10 will not change anything here, since there was no change made to the way the tar is generated (well, besides skipping empty layers in 1.10+, but this should not affect the time much).
I'll prepare one today.
I think that's not the case. If you look at the time when the CPU usage is high, you'll see that this is a period where the Docker daemon is not involved at all in the image loading/exporting. High CPU usage in this case is due to:
In this specific image we copy a lot of files to the new squashed layer (an in-memory tar archive). This is a probable place where we could improve performance a bit, although it will be very hard. I was looking at making this parallel in #64. Feel free to take a look at that issue.
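As a rough sketch of the idea behind such a parallelisation: read members from each layer tar in parallel (each worker with its own file handle), but keep all writes to the squashed tar on a single thread. Everything here - function names, in-memory buffering, the worker count - is illustrative and not the tool's actual internals; whiteout and conflict handling is omitted entirely.

```python
import io
import tarfile
from concurrent.futures import ThreadPoolExecutor

def read_layer(layer_path, wanted_names):
    """Open the layer tar with its own handle and read the members to keep."""
    kept = []
    with tarfile.open(layer_path) as layer:
        for member in layer.getmembers():
            if member.name not in wanted_names:
                continue
            fileobj = layer.extractfile(member)  # None for dirs, links, etc.
            kept.append((member, fileobj.read() if fileobj else None))
    return kept

def squash_layers(layer_paths, wanted_names, out_path):
    # Reads run in parallel, one thread per layer; writes stay serialized
    # because TarFile objects are not safe for concurrent writes.
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_layer = pool.map(lambda path: read_layer(path, wanted_names),
                             layer_paths)
        with tarfile.open(out_path, "w") as squashed:
            for members in per_layer:
                for member, data in members:
                    squashed.addfile(member,
                                     io.BytesIO(data) if data is not None else None)
```

The obvious trade-off is memory: buffering member contents per layer keeps the writer simple but can get expensive for large layers.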
Here is a report that shows iowait too:
And the run:
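A minimal sketch of how iowait could be sampled alongside a run, assuming psutil on a Linux host; the interval and sample count are arbitrary:

```python
import psutil

for _ in range(30):
    cpu = psutil.cpu_times_percent(interval=1)
    # iowait only says the CPU sat idle waiting on I/O; telling network and
    # storage apart needs per-device counters, e.g. psutil.disk_io_counters().
    print(f"user={cpu.user:5.1f}%  system={cpu.system:5.1f}%  iowait={cpu.iowait:5.1f}%")
```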
A few words on the environment I test on:
Now looking at it from the perspective of the Docker storage driver used:

- xfs
- ext4
- btrfs
- direct-lvm
- overlay
It seems that btrfs and overlay perform the best in this case.
It seems that PR #68 can increase performance by up to 40%, depending on the case. This is huge and already available in master.
Here is an annotated version of the CPU usage graph that shows what's going on. Orange blocks are CPU cycles used by the Docker daemon, which is very hard to improve directly in the code.

- First orange field: saving the tar archive from the Docker daemon to local disk. Over this time Docker prepares the tar archive on the fly; it is sent to the squashing tool and saved to disk. This operation takes about 8 seconds. Hard to improve.
- Dark blue field: unpacking the tar archive. It takes a bit over 1 second. Hard to improve.
- Removing the tar archive takes about 100ms.
- Brown part: this is what we can focus on - the actual squashing of layers. This task is extracted into #64.
- Loading the image - the last orange field - covers generating the final tar archive (600ms) and uploading it to the Docker daemon (1.5s). Again, hard to improve.
- Last light blue field: cleanup of the temporary directory (100ms).
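A tiny helper like the one below could be used to reproduce such per-phase timings; the phase names mirror the annotated graph, while the helper itself and the commented-out step functions are purely hypothetical:

```python
import time
from contextlib import contextmanager

@contextmanager
def phase(name):
    """Print how long the wrapped block of work took."""
    start = time.monotonic()
    try:
        yield
    finally:
        print(f"{name}: {time.monotonic() - start:.3f}s")

# Hypothetical usage mirroring the phases above:
# with phase("saving image"):
#     save_image()
# with phase("unpacking tar archive"):
#     unpack_tar_archive()
# with phase("squashing layers"):
#     squash_layers()
# with phase("loading image"):
#     load_image()
```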
### Saving image

This focuses on the "saving image" process highlighted in the graph above (first orange field).

#### Logs

This is the log from the Docker daemon. It shows what happens when you execute the /get call.
And here is the log from the squashing tool:
#### Investigation

Very important is the debug log from the requests module. It shows us that the call returned at
This means that preparing the tar archive is a blocking call. The tar archive is not streamed to the client; instead the whole archive is prepared and afterwards sent to the client. During that time the client is just idling. Preparing the tar archive on the daemon side took almost exactly 4 seconds in this case. Later in the same log we see:
This means that transferring the tar archive to the client took about 650ms.

#### Conclusion

It's not possible to make this part perform better, because most of the time is taken by the Docker daemon preparing the tar archive. We do not have any control over it. I tried multiple ways of fetching the image (using the API) but it did not make any significant difference.

#### Recommendation

To make this part faster you need a faster CPU and faster disks. Focus on the CPU.

Comment: the issue is that this call is single threaded and the layers are prepared serially. This doesn't make sense, since the data is read-only and could easily be processed with multiple threads. I estimate that this simple change could cut the tar preparation time by 50%, which would make the squashing shorter by at least 2 seconds. If the daemon additionally streamed the tar archive on the fly, we could make it even shorter.
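For context, a minimal sketch of the "saving image" step using the Docker SDK for Python; the image name and output path are placeholders. Even though save() yields the exported archive in chunks, the first chunk only shows up once the daemon has finished preparing the tar, which is consistent with the blocking behaviour described above.

```python
import docker

client = docker.from_env()
image = client.images.get("example/image:latest")   # placeholder image name

with open("/tmp/image.tar", "wb") as out:
    # Write the exported image tar to the temporary location chunk by chunk.
    for chunk in image.save():
        out.write(chunk)
```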
### Unpacking tar archive

Investigation of the dark blue part of the graph.

#### Logs
#### Investigation

In this case unpacking took about 500ms. This is not much, considering that the unpacked image size is 565MB. The code looks clean and I do not see any room for improvement here.
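For reference, a minimal sketch of unpacking an exported image tar with the standard library; the paths are placeholders and this is not the tool's actual code:

```python
import tarfile

with tarfile.open("/tmp/image.tar") as archive:
    archive.extractall("/tmp/unpacked-image")
```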
$ python -m cProfile -o foo.cprof docker-squash ...
$ pyprof2calltree -k -i foo.cprof

is an easy way to get some hard data. E.g. I found out that 18% of runtime is spent in the
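The same profile can also be inspected without kcachegrind, using the standard-library pstats module; the file name matches the cProfile invocation above:

```python
import pstats

stats = pstats.Stats("foo.cprof")
stats.sort_stats("cumulative").print_stats(20)   # top 20 by cumulative time
```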
@rindeal I'll take a closer look at this at some point. I cannot really promise any dates though. Performance is an important thing for this tool, and I think it does its job pretty well even now. The tool has already gone through many rounds of performance improvements - I'm happy to do another one :)
I love this project @goldmann, however I agree with @rindeal's analysis. That's a good bit of low hanging fruit. I hope you can find time to do another round, as you put it. I've noted between 10 and 20% utilization in

I'd suggest changing

It would be simpler if that directory lookup part wasn't there, and to be honest I'm not entirely sure why it is needed. The other bit of low hanging fruit, I suspect, is doing parallel comparisons within a layer when looking for changed files. Each layer must still be handled independently of course (to do otherwise would be complex).
In this ticket I want to gather all performance analysis that could make squashing even faster.
All performance related issues in one place: https://github.com/goldmann/docker-scripts/issues?q=is%3Aissue+is%3Aopen+label%3Aperformance