Skip to content

CS Performance Optimization research

Josh Gough edited this page May 18, 2017 · 3 revisions

CommitStream Optimization research findings:

We've discovered a number of areas for improving performance in CommitStream. We quickly saw improvements, after we did two patches for the first issue we found, eventStoreClient.js was firing exceptions off into the ether as a result of how we were calling statusCodeValidator. But, after we looked deeper and spoke with Prashanth Madi of Microsoft, we realized there was more we can and should do right now.

These discoveries are not in order of discovery, but in order of ease of implementation and incremental benefit, with emphasis on mitigating risk and maximizing opportunity to observe under real-world conditions before applying successive updates.

csErrorHandler.js causing slow down due to util.inspect

In csErrorHandler.js we have this code:

exception: util.inspect(err, {
  showHidden: true,
  depth: null
}).substr(0, 5000),

Unfortunately, since err is not a simple, limited-depth object, this is a very expensive call that builds up a gigantic string, hence the .substr(0, 5000) to take just a small portion of it.

Comparison results for time ./runner.sh <N> "./post-bitbucket-pull-request.sh <M>"

This test runs N processes that posts a Bitbucket pull-request M times. Note that Bitbucket pull-request generates a thrown exception, since we do not process such payloads. So, this is testing the ability to process errors, which has been the source of latency and occassional process terminiation in production.

N M With util.inspect(err, ...)? Elapsed Time
1 50 Y 0m13.832s
5 50 Y 10m36.091s
1 50 N 0m3.744s
5 50 N 0m6.321s
15 50 N 0m13.459s
40 50 N 0m33.717s
80 50 N 1m6.550s
20 150 N 0m53.930s

Compound test results for variant operations

Elapsed time results for running different types of commands in parallel:

time ./runner.sh 20 "./query-real-instance.sh 150" -> 1m27.087s

time ./runner.sh 20 "./post-bitbucket-pull-request.sh 150" -> 1m40.065s

time ./runner.sh 20 "./query-non-existent-instance.sh 150" -> 8m16.150s

time ./runner.sh 20 "./post-new-digest.sh 150" -> 3m29.306s

However, a little counter in csErrorHandler revealed that it was only called 5947 times, when we would have expected an even 6000 times because of the pull-request errors and the non-existent instance errors.

With csErrorHandler change plus catchAsyncErrors around the middleware validators and commitsCreate and digestCreate, without domain-middleware, without Promise.onPossiblyUnresolveRejection

We didn't discover the util.inspect bottleneck until 5/16. Prior to that, we thought the other changes in this heading were what we needed. (TODO: explain these other changes above or below)

So, combining all of those plus the util.inspect change produces this:

time ./runner.sh 20 "./query-real-instance.sh 150" -> 2m20.092s

time ./runner.sh 20 "./post-bitbucket-pull-request.sh 150" -> 2m30.510s

time ./runner.sh 20 "./query-non-existent-instance.sh 150" -> 2m22.575s

time ./runner.sh 20 "./post-new-digest.sh 150" -> 3m56.215s

While some of those took longer, there is no inordinate slow down for the query-non-existent-instance.sh script like above. Most importantly, the counter in csErrorHandler showed an even 6000 calls!

A second trial:

time ./runner.sh 20 "./query-real-instance.sh 150" -> 2m15.562s time ./runner.sh 20 "./post-bitbucket-pull-request.sh 150" -> 2m27.745s time ./runner.sh 20 "./query-non-existent-instance.sh 150" -> 2m17.538s time ./runner.sh 20 "./post-new-digest.sh 150" -> 3m53.232s

This shows consistency, and as before, the csErrorHandler was called exactly 6000 times.

NOTES

  • Use catchAsyncErrors around all validator middleware called in server.js
  • Return the eventStoreClient promise from each of the validator middlewares
  • Use catchAsyncErrors around digestCreate module in digestsController.js

Stories / Order of Stories

Quick wins

These can be done at any time, independent of the deeper technical work listed below. They are in order of most simple to least simple.

Deeper Stories Prereqs

  • S-62662: Remove the offensive util.inspect line from csErrorHandler -- This gives us the biggest bang for the least work, dramatically reducing the amount of time the Node process needs to create logs in csErrorHandler
  • S-62662: Make instanceAuthenticator, instanceToDigestValidator, and instanceToInboxValidator return Promises, and wrap them in server.js with catchAsyncErrors
  • TODO (in S-62662): Make commitsCreate return its Promise and utilize catchAsyncErrors around it, since this is where the vast majority of production-errors take place due to unrecognized payloads, such as Bitbucket Pull-Request
  • Upgrade Node and NPM version together -- may require some fancy manual removal of node_modules folder at the Kudu Console level in order for Azure to re-fetch the modules for the appropriate Node and NPM levels. Crazy, I know.
    • TODO: combine the two existing stories into one

Follow-up Discovery and Storification

  • Discovery: Will csErrorHandler get called when exceptions thrown in non-async code from route handlers, or must we wrap the entire handler function in a Promise?
    • Can determine by forcing exceptions in code prior to making calls to eventStoreClient.blah().then(...)
  • Discovery: Which handlers in api/instances can/should return Promises?
    • Create Stories for returning Promises and wrapping with catchAsyncErrors for each
  • Discovery: Which handlers in api/digests can/should return Promises?
    • Create Stories for returning Promises and wrapping with catchAsyncErrors for each
  • Discovery: Which additional handlers in api/inboxes can/should return Promises?
    • Create Stories for returning Promises and wrapping with catchAsyncErrors for each
  • Discovery: Which handlers in api/health can/should return Promises?
    • Create Stories for returning Promises and wrapping with catchAsyncErrors for each
  • Discovery: Which modules in api/helpers can/should return Promises?
    • Create Stories for returning Promises and wrapping with catchAsyncErrors for each
  • Discovery: Which modules in api/translators can/should return Promises?
    • Create Stories for returning Promises and wrapping with catchAsyncErrors for each
  • Discovery: Is there anywhere else in the app that would benefit from Promisizing or not?
    • You know the drill by now...

Promisizing and wrapping

Once we've discovered each of the appropriately Promisizable handlers and modules, each one can be played in isolation of other changes and released as a patch after appropriate regression testing and performance testing.

Each of these could be done as a Tech Chore item or just done in the line of ordinary delivery one by one. Each one should be small enough that it does not take a whole day to complete. (But, I'll defer estimation promises here!)

Removal of express-domain-middleware and Promise.onPossiblyUnhandledRejection

All of the changes above can be done without the removal of express-domain-middleware. Indeed the first table of results where the only change was removing util.inspect(err, ...) was calculated with express-domain-middleware still in place. It's important to do it this way so that we do not have to do a big-bang-all-at-once upgrade of all the route handlers and modules. Thus, if we do a push that contains Promise improvements and catchAsyncErrors for routes in /api/instances only, then errors generated on those routes will be caught by the wrapper, and will be delivered directly to csErrorModule without going through express-domain-middleware.

Once we've thoroughly discovered and Promisized / wrapped al the appropriate handlers and modules, then removing this will lay it to rest safely.