Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cosmic-swingset): Use vat cleanup budget values to allow slow cleanup #10165

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

gibson042
Copy link
Member

@gibson042 gibson042 commented Sep 27, 2024

Fixes #8928

Description

Adds new cosmos parameters for controlling terminated vat cleanup and uses them in cosmic-swingset to allow budget-limited cleanup in otherwise empty blocks. Also includes a fair amount of refactoring in support of the same, so best reviewed as individual commits.

Security Considerations

Too-high budget values are a denial of service risk. All values are currently defaulted to zero for no minimal cleanup, but we may instead want to use small values.

Scaling Considerations

See above.

Documentation Considerations

Hmm, do we have documentation of cosmos-sdk parameters anywhere?

Testing Considerations

Help wanted; the fully-threaded pathway is end-to-end but we may be able to do something more tractable. The policy logic is somewhat intricate and should be covered by tests.

Upgrade Considerations

Upgrade will set the new parameters to their default values as defined in default-params.go. If we set non-zero defaults, then work could commence as soon as this code is live, which means it requires a chain-halting upgrade.

Copy link
Member

@warner warner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup-budget passing code needs to change.. the policy language is funky, I have to admit, and the fact that it's easy to accidentally enable unlimited cleanups is a problem. Let's brainstorm tomorrow to see if there's a way to write a test around this part, maybe a unit test which imports just the policy creator and then pokes it in a couple of ways, without a kernel or cosmos or anything else.

NewQueueSize(QueueInbound, DefaultInboundQueueMax),
}

// FIXME: Settle on default values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd like our initial values to be { default: 5, kv: 50 } (swingset will use .default for everything else), so deletion happens slowly as soon as upgrade18 is deployed. That will let us trigger deletion of some vat (probably v112-scaledPriceAuthority-stkATOM, being the smallest one) as part of upgrade18, instead of a three-vote sequence of:

  • upgrade18
  • governance proposal to change parameters to { default: 5, kv: 50 }
  • governance proposal to trigger deletion of v112

The important values are that exports and imports use no more than 5 (so the BOYDs remain a reasonable size, and we don't clobber the chain with large cleanup work), and that kv uses more like 50 (so the deletion takes weeks instead of months or years).

The total number of transcripts and snapshots aren't large enough for 5 vs 50 to make a big difference. Deleting both are pretty quick, so it would probably be safe to use 50, but I'd rather measure the load of using "5" before changing that rate.

I don't think I have a strong opinion on whether the default specifies values for all the fields, or just default and kv. My original intention was to allow the governance proposal to be minimal, and only supply the ones that needed to be overridden, but I'm not clear on how much the governance parameter messages are deltas vs absolutes. If you have to supply all the parameters (not just vat cleanup, but also beans-per-computrons and queue sizes) in each message, then it's not a big lift to supply all of default, kv, imports, exports, snapshots, transcripts in each one.

The important thing is legibility: the voters inspecting the change-parameter proposal should be able to tell what exactly is changing. If a proposal that omits kv and really means "kv reverts to default", that might be surprising.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's important that one of these defaults being set to 0 doesn't override the safely-rate-limited behavior we get from { default: 5 }. I'm thinking that DefaultVatCleanupBudget should have one or two entries in it, and not provide a value for e.g. exports and imports and snapshots and transcripts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • governance proposal to change parameters to { default: 5, kv: 50 }

That seems unnecessary, if we have non-0 defaults already set as part of the software upgrade.

If a proposal that omits kv and really means "kv reverts to default", that might be surprising.

I believe that is almost how the current "fill-in missing values" logic would work. It would revert the value to the "kv default" set below. An alternative is to merge with existing values, but I'm not sure we have a place to stand to do that. Also in that case we now have the problem of how to express "please delete/unset this previously explicitly set value, and return it to default", particularly if we decide to not persist default values.

golang/cosmos/x/swingset/types/params.go Outdated Show resolved Hide resolved
packages/cosmic-swingset/src/launch-chain.js Outdated Show resolved Hide resolved
// https://github.com/Agoric/agoric-sdk/issues/8928#issuecomment-2053357870
const shouldRun = () =>
!cleanupStarted && (ignoreBlockLimit || totalBeans < blockComputeLimit);
const allowCleanup = () => cleanupStarted && !cleanupDone;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't get us the right budget.. allowCleanup() === true means an unlimited budget, and the kernel only calls didCleanup at the end of the crank. So a non-finite budget will let the kernel perform unlimited cleanup, and then it will report a huge didCleanup(cleanups), and our decrementer will stop it, but by then it will be too late.

We need allowCleanup to return the remaining budget each time (if we're in the cleanup window). We also need to let the kernel perform GC actions and BOYD. The kernel will ask about one vat with allowCleanup(), do an amount of cleanup work limited by the returned budget, notify the policy with a didCleanup(), then it returns to the "decide what work to do next" function, which will query the policy again. We need the didCleanup() to return true (meaning "keep going, don't stop controller.run() yet"), then that second call to allowCleanup() must say "no more cleanups", to allow it to move on to the gc-action queue, or the dirty-vat-do-BOYD check. We still count computrons, so if e.g. the cleanup resulted in multiple vats becoming dirty and eligible for BOYD, we might perform one BOYD, decide the computrons it spent were too much, and then end the block before we get to the second BOYD. The next block's crankerPhase = 'leftover' run will finish off those BOYDs.

This only really works because we only do cleanup on empty blocks, so we know the run-queue is empty before clean is allowed to proceed. The run-policy language doesn't have a way to say "do the gcActions and BOYDs triggered by cleanup, but don't allow any normal cranks to be delivered, because we're only doing cleanup-related stuff now".

(I wish there were a good way to unit test this)

So something more like:

let inCleanup = false;
const startCleanup = () => if (totalBeans > 0n) { return false; } else { inCleanup = true; return true; };
const allowCleanup = () => inCleanup && remainingCleanups; // returns false or budget
const didCleanup = details => { .. decrement remainingCleanups; return true; };
const shouldRun = () => ignoreBlockLimit || totalBeans < blockComputeLimit;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, yeah. Updated to something that actually has the right return type for allowCleanup.

@@ -88,6 +90,24 @@ const parseUpgradePlanInfo = (upgradePlan, prefix = '') => {
* `chainStorageEntries: [ ['c.o', '"top"'], ['c.o.i'], ['c.o.i.n', '42'], ['c.o.w', '"moo"'] ]`).
*/

/**
* @typedef {'leftover' | 'forced' | 'high-priority' | 'intermission' | 'queued' | 'cleanup'} CrankerPhase
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this a lot, it makes the flow much easier to think about.

packages/cosmic-swingset/test/scenario2.js Show resolved Hide resolved
packages/cosmic-swingset/src/sim-params.js Outdated Show resolved Hide resolved
Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Help wanted; the fully-threaded pathway is end-to-end but we may be able to do something more tractable. The policy logic is somewhat intricate and should be covered by tests.

Can the policy (possibly with the cranker) be factored out of launch-chain and exercised in a test, maybe with a mock swingset controller?

Upgrade Considerations

I believe it's worth noting that the upgrade will set the default parameters for the new param.

DefaultVatCleanupKv = sdk.NewUint(0) // 50
DefaultVatCleanupSnapshots = sdk.NewUint(0) // 50
DefaultVatCleanupTranscripts = sdk.NewUint(0) // 50
DefaultVatCleanupBudget = []UintMapEntry{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question keeps coming up of where the defaults should be defined, and whether defaults are persistent, aka what should happen with values if the default changes and the values were not modified. I can't shake the feeling that the way we've implemented params does not allow the flexibility to update unchanged defaults.

This is particularly true when we have a param value whose name is default.

I believe here it might be better to just have a single entry in DefaultVatCleanupBudget: the VatCleanupDefault one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of the latest push, cosmic-swingset has a default budget of { default: Infinity } to preserve current behavior (and requires an explicit default value when any other field is present) and golang/cosmos/x/swingset/types/default-params.go defines DefaultVatCleanupBudget as { default: 5, kv: 50 } per @warner's instruction.

NewQueueSize(QueueInbound, DefaultInboundQueueMax),
}

// FIXME: Settle on default values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • governance proposal to change parameters to { default: 5, kv: 50 }

That seems unnecessary, if we have non-0 defaults already set as part of the software upgrade.

If a proposal that omits kv and really means "kv reverts to default", that might be surprising.

I believe that is almost how the current "fill-in missing values" logic would work. It would revert the value to the "kv default" set below. An alternative is to merge with existing values, but I'm not sure we have a place to stand to do that. Also in that case we now have the problem of how to express "please delete/unset this previously explicitly set value, and return it to default", particularly if we decide to not persist default values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not specific to this PR, but I didn't realize #8931 resulted in a copy of the proto definition files to be checked into the repo. @turadg any reason why these are not excluded from git?

@@ -80,7 +80,7 @@ func TestAddStorageBeanCost(t *testing.T) {
},
} {
t.Run(tt.name, func(t *testing.T) {
got, err := appendMissingDefaultBeansPerUnit(tt.in, []StringBeans{defaultStorageCost})
got, err := appendMissingDefaults(tt.in, []StringBeans{defaultStorageCost})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably to be consistent we should fill in all missing defaults here, not just the beans per unit ones.

packages/cosmic-swingset/test/scenario2.js Show resolved Hide resolved
*/

/**
* @typedef {(phase: CrankerPhase) => Promise<boolean>} Cranker runs the kernel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how I feel about the Cranker name. It controls a macro "swingset run", which itself is composed of many cranks. Does that make a vat delivery a "turner"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yeah, the existing term that this relates to is a "run" or a "swingset run", meaning one invocation of controller.run(). Each cosmos block triggers a certain number of runs (one catchup, then maybe some high-priority, then maybe one timer, then maybe some low-priority, then maybe one cleanup).

I think Runner is disqualified (too generic), but I'm ok with Cranker as a second choice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really needed a name, and "Runner" is just too generic. I suppose "Turner" would therefore accurately describe a function that performs a single vat delivery, although if we ever need to name that we'll probably find something better.

packages/cosmic-swingset/src/launch-chain.js Outdated Show resolved Hide resolved
packages/cosmic-swingset/src/launch-chain.js Outdated Show resolved Hide resolved
Comment on lines +514 to +523
const allowCleanup = runPolicy.startCleanup();
if (!allowCleanup) return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep wondering if we have the right layering here. I know @warner would like for controller.run() to automatically include cleanup for "simple hosts", but it feels like we're jumping through hoops just to enable non cosmic-swingset embedder.

An imperative controller.performCleanup() would be so much cleaner. We could simply call it after we've returned from processBlockActions: has the policy done anything? no, cool, then start cleanup with a specific cleanup policy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true, I don't want to add API burden to hosts that don't need rate limiting, but I agree that there are some significant hoops here.

Adding controller.performCleanup(budget) wouldn't be too hard, it might be a wrapper around controller.run() that uses an internally-built runPolicy. If we added that, then the cosmic-swingset runPolicy could have a allowCleanup: () => false, which is a cheap way to say "I'm going to drive cleanup manually, thanks very much".

OTOH, that API assumes that we're ok with arbitrarily large computron usage during the cleanup runs. In practice, that's probably ok: the GC action deliveries are generally trivial (they just change some refcounts), and the BOYD is usually the last thing that runs so it's not like a computron limit would actually stop any deliveries from happening.

But if we wanted to be consistent with normal runs, we'd need a way to provide a limit, which would mean performCleanup() would need to get a host-provided RunPolicy, and merging some automatically-built one with the host-provided one sounds more messy than having the host manage everything in the first place.

Copy link

cloudflare-workers-and-pages bot commented Oct 2, 2024

Deploying agoric-sdk with  Cloudflare Pages  Cloudflare Pages

Latest commit: bb28c19
Status: ✅  Deploy successful!
Preview URL: https://353b4b2f.agoric-sdk.pages.dev
Branch Preview URL: https://gibson-8928-parameterized-al.agoric-sdk.pages.dev

View logs

@gibson042
Copy link
Member Author

OK, I'm reasonably happy with this but still need to brainstorm on how to verify expected behavior through testing. sim-chain.js looks like the right foundation, since it's already wrapping the parts of launch-chain.js affected by this PR with a mostly-mocked harness.

remainingCleanups[phase] -= count;
if (remainingCleanups[phase] <= 0) cleanupDone = true;
}
return !cleanupDone && shouldRun();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better, but we need didCleanup() to return true when the cleanup budget is exhausted, otherwise the controller.run() will exit and we won't deliver the GC actions or run a BOYD until the the leftover phase next block (which might have more important work to do).

Basically, having allowCleanup return false is how you prevent cleanup from happening. The return value of didCleanup controls whether we do anything after it is called, and GC actions / BOYD don't qualify as cleanup (they happen all the time, even without vats being terminated).

So I think this should be replaced with just return true. We could have it return shouldRun(), which would return false when we've hit the computron limit, but that's kind of of pointless because cleanup actions don't spend any computrons or perform any deliveries, they just delete vatKeeper data and enqueue GC actions. The computron-spending activity (GC actions, BOYD) will get reported to the policy with crankComplete as usual, and that has the opportunity to stop the run because of a computron limit.

I'm sorry this stuff is so opaque. I'll make a note to update SwingSet/docs/run-policy.md.. the section on didCleanup is accurate, and the examples all have it return true, but the docs ought to make that recommendation explicit, and explain why it is usually important to have it return true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that makes sense. Updated with an explanatory comment.

@warner warner self-requested a review October 4, 2024 20:38
Copy link
Member

@warner warner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think that looks good, at least the policy parts. I'll rely upon @mhofman for the cosmos/golang side, but it at least seems consistent with the rest of the code.

I do worry about proposals that accidentally revert parameters to their default values, but I believe the cleanup-budget defaults established by this PR are safe. So I believe the worst case is such a proposal would just slow down termination even more than necessary, which is much better than reverting to "allow full-speed deletion".

if err != nil {
return params, err
}
newVcb, err := appendMissingDefaults(params.VatCleanupBudget, DefaultVatCleanupBudget)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so now my current beliefs are:

  • we get one transaction type to influence all swingset-related parameters (four sets: beansPerUnit, powerFlagFees, queueMax, vatCleanupBudget)
  • if any field of any set is not included in the transaction/proposal, and is included in e.g. DefaultBeansPerUnit or DefaultPowerFlagFees or DefaultQueueMax or DefaultVatCleanupBudget, then that field is populated here
  • so if someone makes a governance parameter proposal to change something in params.QueueMax, and they forget to look at the chain and copy in the current fields from VatCleanupBudget, they'll be accidentally resetting the cleanup budget back to the default
  • also, because both .default and .kv are in DefaultVatCleanupBudget, there is no way to say "please stop overriding kv and just let it use .default like everything else does"
    • the closest you could achieve is by saying { default: 5, kv: 5 }

That works, it's weird but I suspect it's the best we can get from a one-method-per-module parameter-setting scheme.

}
// We return true to allow processing of any BOYD/GC prompted by cleanup,
// even if cleanup as such is now done.
return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, looks good

@warner
Copy link
Member

warner commented Oct 12, 2024

BTW, I'm adding a new promises phase to cleanup, to fix #10261. It should use the same default of 5 cleanups per run, like imports and exports (and not like kv). I'll be making a branch on top of this one to add the new fields if this doesn't land by the time the swingset -side changes are done.

@warner
Copy link
Member

warner commented Oct 12, 2024

#10268 is the PR which introduces budget.promises and work.promises on the swingset side. Given your careful work to make the integration code tolerate new fields, we don't strictly need to update anything in cosmic-swingset, but it might be nice to be able to control this budget item like we do the others.

mergify bot added a commit that referenced this pull request Oct 17, 2024
…ke stores (#10290)

(split from #10165)

## Description
chain-main.js passes a [non-map `mailboxStorage`](https://github.com/Agoric/agoric-sdk/blob/94aaebcd255840daf13514ab1050282b985684d2/packages/cosmic-swingset/src/chain-main.js#L351-L359) to `launch`, which [threads it into mailbox code](https://github.com/Agoric/agoric-sdk/blob/94aaebcd255840daf13514ab1050282b985684d2/packages/cosmic-swingset/src/launch-chain.js#L126) that returns [methods expecting a map-like `entries` method and `size` property](https://github.com/Agoric/agoric-sdk/blob/94aaebcd255840daf13514ab1050282b985684d2/packages/SwingSet/src/devices/mailbox/mailbox.js#L121-L146). This PR adds type data to reject such issues and refactors mailbox code accordingly by converting those methods into independent functions that accept appropriate backing maps.

### Security Considerations
None known.

### Scaling Considerations
n/a

### Documentation Considerations
n/a

### Testing Considerations
Existing tests should continue to pass.

### Upgrade Considerations
This code does not affect any vat and is safe to include in the next release.
@gibson042 gibson042 force-pushed the gibson-8928-parameterized-allowcleanup branch from 9fe3b50 to bb28c19 Compare October 22, 2024 01:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

terminate vats slowly
3 participants