speed improvement to generateNoiseImage #122
Open
Hi,
the following pull request modifies the `generateNoiseImage` function, replacing the `apply` call with a new implementation that leverages vectorization for computing the dot product. Vectorization computes the matrices in a single optimized pass, whereas `apply` processes them sequentially and is thus much slower. The impact is high, since `generateNoiseImage` is used for computing both CIs and z-maps. Below is a comparison of the outputs of the two implementation variants, as well as a benchmark comparing their computation speed. The linked RData file has been used for both scripts and was extracted from a genuine dataset. This pull request should be a drop-in replacement, since it does not require any additional packages.

The script below evaluates the output of the original `apply`-based implementation against that of the newer one; its output on the linked file is summarized further down.
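As a rough illustration of the comparison (with hypothetical object names `patches`, `weights`, `img_size`, and `n_layers` standing in for the package's actual internals, and `rowSums` over a reshaped matrix standing in for the vectorized dot product), the check is essentially:

```r
## Minimal sketch (hypothetical names, not the code from this PR): both
## variants compute, per pixel, the dot product between noise patches and
## their weights -- once via apply(), once via a single vectorized pass.
set.seed(1)
img_size <- 512
n_layers <- 60                      # assumed number of sinusoid layers
patches  <- array(rnorm(img_size^2 * n_layers), c(img_size, img_size, n_layers))
weights  <- array(rnorm(img_size^2 * n_layers), c(img_size, img_size, n_layers))

## Original style: apply() loops over every pixel position.
noise_apply <- apply(patches * weights, 1:2, sum)

## Vectorized style: flatten to a (pixels x layers) matrix and sum once.
noise_vec <- matrix(rowSums(matrix(patches * weights, nrow = img_size^2)),
                    nrow = img_size, ncol = img_size)

## Compare the two outputs.
diffs <- abs(noise_apply - noise_vec)
cat("elements differing:", sum(diffs > 0),
    "\nlargest deviation: ", max(diffs),
    "\nsum of deviations: ", sum(diffs), "\n")
```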
Differences occur for only 0.003% (955 elements) of the noise matrix, with the largest deviation being $\approx 1.38\times10^{-17}$. The results differ only on a marginal part of the test data, and the aggregate sum of errors is also negligible ($\approx 1.25\times10^{-16}$). The deviation is presumably due to differences in how the last bits are rounded between the two summation orders.
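As a quick illustration of why two algebraically identical summations can disagree in the last bits, floating-point addition is not associative:

```r
## Changing the order of additions changes the final rounding.
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)     # FALSE
((0.1 + 0.2) + 0.3) - (0.1 + (0.2 + 0.3))  # ~1.1e-16
```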
Below is a benchmark evaluating the old and new implementations on the RData file linked above; the results are summarized afterwards.
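A sketch of such a benchmark, reusing the hypothetical objects from the comparison above (`microbenchmark` is just one convenient way to time the two variants; the actual benchmark script may have used something else):

```r
library(microbenchmark)  # assumed timing helper, not a dependency of the PR itself

microbenchmark(
  apply      = apply(patches * weights, 1:2, sum),
  vectorized = matrix(rowSums(matrix(patches * weights, nrow = img_size^2)),
                      nrow = img_size, ncol = img_size),
  times = 20
)
```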
The newer implementation yields around a 9x speed increase. Realistically, since this method is (almost) always called in parallel from multiple threads on multiple different matrices, the effective speed-up is closer to 6x due to CPU bottlenecks. In terms of memory usage, the two implementations are roughly equal.