
build+deploy daily perf test #3107

Closed
warner opened this issue May 17, 2021 · 5 comments
Labels: enhancement (New feature or request), performance (Performance related issues), telemetry, tooling (repo-wide infrastructure)

warner (Member) commented May 17, 2021

What is the Problem Being Solved?

I want to know how the high-level performance of our system changes over time. I'd like an otherwise-quiet dedicated machine running the following sequence in a continuous 24-hour loop:

  • git clone the current version of agoric-sdk, run yarn install, yarn build, make -C packages/cosmic-swingset
  • git clone the current version of testnet-load-generator
  • create a single-node chain (using agoric start local-chain)
    • wait until it is ready for business, i.e. wait until it reaches block height 2
  • create an ag-solo client process (using agoric start local-solo 8000)
    • wait until it is ready (after the wallet is deployed)
  • run yarn loadgen and wait until it finishes preparing all the load tasks (after it installs the fungible-faucet contract)
  • run the "faucet" loadgen task once every 30s
  • after six hours, stop loadgen, stop the client, stop the chain
  • restart the chain, wait for it to start producing new blocks
  • restart the solo, wait for it to be ready
  • restart the loadgen
  • resume the once-per-30s loadgen tasks
  • after six hours, repeat
  • on the 4th restart (24h after initial startup), don't bother restarting the solo or the loadgen
  • at that point, stop everything, restart from a fresh git clone at the top of the loop
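The loop above could be orchestrated with a shell script along these lines. This is only a sketch, not the actual harness: the helper structure, `WORKDIR`, and the `RUN_PERF_LOOP` guard are assumptions of mine, and only the commands quoted in the list (git clone, yarn install/build, make, agoric start, yarn loadgen) come from the issue text.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the 24-hour perf loop; helper bodies are stubs.
set -euo pipefail
WORKDIR=${WORKDIR:-$HOME/perf-run}

build_sdk() {
  git clone https://github.com/Agoric/agoric-sdk "$WORKDIR/agoric-sdk"
  (cd "$WORKDIR/agoric-sdk" && yarn install && yarn build \
     && make -C packages/cosmic-swingset)
  git clone https://github.com/Agoric/testnet-load-generator \
     "$WORKDIR/testnet-load-generator"
}

run_period() {  # one 6-hour period: chain + solo + loadgen, then teardown
  agoric start local-chain & chain_pid=$!
  # (poll the chain until it reaches block height 2 before continuing)
  agoric start local-solo 8000 & solo_pid=$!
  # (wait for the wallet deploy to finish before starting load)
  (cd "$WORKDIR/testnet-load-generator" && yarn loadgen) & loadgen_pid=$!
  end=$(( $(date +%s) + 6*3600 ))
  while [ "$(date +%s)" -lt "$end" ]; do
    # trigger one "faucet" task per iteration (mechanism left abstract)
    sleep 30
  done
  kill "$loadgen_pid" "$solo_pid" "$chain_pid"
}

main() {
  build_sdk
  for period in 1 2 3 4; do run_period; done
  # 4th restart: chain only; record time-to-first-block, then tear down
  # and begin again from a fresh git clone
}

if [ "${RUN_PERF_LOOP:-0}" = 1 ]; then main; fi
```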

So we'll have 4 periods of 6 hours each. We'll start the chain process once, and restart it 4 times. The final restart will not be used for very long, as we're only interested in how long it takes.

I want to collect the following data as this runs:

  • a slogfile for each 6hr runtime of the chain node (compress these afterwards, of course)
  • a slogfile for the final (4th) restart
  • the time it takes to start the chain node the first time, and to restart it the other four times
    • this is from process start to first new block produced
  • the time it takes for the client startup process to deploy the wallet
  • samples every 5 minutes of:
    • the VmSize, RSS, and cumulative CPU time of the kernel process and each xsnap worker process
  • the used size of the kernelDB (du --si _agstate/agoric-servers/.../data/ag-cosmos-chain-state/data.mdb: the file size is a constant 2 GiB, but the file is sparse, and we care about how much space is actually consumed)
    • the total used size of the chain node's state directory (including both swingset state and cosmos state)
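The 5-minute samples can be captured with standard tools. A minimal sketch, assuming the caller supplies the PID and the DB path (the helper names are illustrative, not part of any existing tooling):

```shell
# Illustrative sampling helpers; run every 5 minutes from cron or a loop.
sample_process() {  # usage: sample_process PID
  local pid=$1
  # VmSize and VmRSS (in kB) straight from the kernel's status file
  grep -E '^Vm(Size|RSS)' "/proc/$pid/status"
  # cumulative CPU time in [DD-]HH:MM:SS form (POSIX ps)
  ps -o time= -p "$pid"
}
sample_disk() {  # usage: sample_disk PATH
  # du reports blocks actually allocated, so it sees through the sparse
  # 2 GiB data.mdb file rather than reporting its apparent size
  du --si "$1"
}
```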

There are other metrics we'll add in the future, or which we can derive by parsing the slogfile afterwards.

This data should be used to compute a simple regression for each of the 6hr periods, to derive:

  • change in (VmSize, RSS, and CPU time) vs (number of faucet task cycles) for the kernel process and each of the relevant vats: vattp, zoe, faucet contract. (it would be nice to separate out comms, but it doesn't run under xsnap, so it'll be merged in with the kernel process)
  • growth in kernelDB size and overall cosmos state vs faucet cycles

From the four restart events, I want to compute a linear regression of the restart time vs the number of faucet cycles that have taken place so far.
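Both of these are plain ordinary least-squares fits over (x, y) samples. One way to sketch the slope computation (the `slope` helper is hypothetical; the caller picks the columns, e.g. faucet cycles vs RSS, or cycles-so-far vs restart seconds):

```shell
# Least-squares slope of y against x, reading whitespace-separated
# "x y" pairs from stdin.
slope() {
  awk '{ n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2 }
       END { print (n * sxy - sx * sy) / (n * sxx - sx * sx) }'
}
```

For example, `printf '0 100\n1000 130\n2000 160\n' | slope` prints 0.03, the per-cycle growth for those made-up samples.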

Then I want once-per-day graphs of all those slopes, plus:

  • initial chain startup time
  • client startup (wallet deploy) time
  • final restart time
  • yarn install/build time
  • size of agoric-sdk tree

The goal is to have a well-grounded model that says "if we run a chain and put X amount of traffic into it, it will consume Y bytes of disk, Z bytes of RAM, and take W seconds to restart".
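Given a fitted slope and intercept, the model reduces to a one-line projection (the coefficients below are invented for illustration):

```shell
# Project resource use at X cycles from a fitted line: b + m*X.
project() {  # usage: project SLOPE INTERCEPT X
  awk -v m="$1" -v b="$2" -v x="$3" 'BEGIN { print b + m * x }'
}
```

E.g. with a hypothetical slope of 0.03 MB/cycle and a 100 MB intercept, `project 0.03 100 5000` prints 250, the projected MB of RSS after 5000 faucet cycles.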

We should have a dashboard somewhere where all developers can look at these graphs from the last several months, and see clear evidence of performance improvements or regressions that we make.

@warner warner added enhancement New feature or request performance Performance related issues labels May 17, 2021
@Tartuffo Tartuffo added the MN-1 label Feb 2, 2022
Tartuffo (Contributor) commented Feb 3, 2022

@mhofman @JimLarson This does not have an area label that is covered by our weekly tech/planning meetings. Can you assign the proper label? We cover: agd, agoric-cosmos, amm, core economy, cosmic-swingset, endo, getrun, governance, installation-bundling, metering, run-protocol, staking, swingset, swingset-runner, token economy, wallet, zoe contract. Or, if this is accurately labeled by an area label we should be covering in one of our weekly meetings, please let me know. Adding @warner for his opinion.

@Tartuffo Tartuffo added tooling repo-wide infrastructure and removed MN-1 labels Feb 5, 2022
mhofman (Member) commented Feb 9, 2022

From @Tartuffo in #2630 (comment)

Questions we need to answer:

  1. Is the kernel object table growing? If so, we are leaking objects.
  2. Is the memory footprint of the XSNAP runners growing over time (a slow leak)?
  3. Is the kernel process's memory size growing over time?
  4. Is the amount of time taken to run a loadgen task growing over time?
  5. Is the amount of time spent in swingset growing over time (currently yes)?
  6. How much load can we handle (and at what point does time per transaction spike with the number of transactions per second submitted and/or the number of concurrent submitters)?
  7. Is block time growing, and is it past the point we are comfortable with?

mhofman (Member) commented Feb 9, 2022

I believe the original requirements have all been satisfied, except for the visualization aspect (Agoric/testnet-load-generator#63), and the following:

  • yarn install/build time
  • size of agoric-sdk tree

I believe the first one may not be fully representative because of caching (the first revision with a new dependency will take longer). The size of the SDK tree can probably be captured, but I'm not convinced of its usefulness.

warner (Member, Author) commented Feb 10, 2022

I've opened #4525 to perform a single analysis of data from this service, so we can close this ticket once I've gotten instructions from @mhofman on how to get at the raw data.

mhofman (Member) commented Feb 13, 2022

The instructions are now available in the internal wiki, please let me know if anything is missing.

Once Agoric/testnet-load-generator#67 lands to close a couple of issues, I believe this can be closed.

@mhofman mhofman closed this as completed Feb 16, 2022
@Tartuffo Tartuffo added this to the Mainnet 1 milestone Mar 23, 2022
@Tartuffo Tartuffo modified the milestones: Mainnet 1, RUN Protocol RC0 Apr 5, 2022