-
Notifications
You must be signed in to change notification settings - Fork 206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(cosmic-swingset): Use vat cleanup budget values to allow slow cleanup #10165
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cleanup-budget passing code needs to change.. the policy language is funky, I have to admit, and the fact that it's easy to accidentally enable unlimited cleanups is a problem. Let's brainstorm tomorrow to see if there's a way to write a test around this part, maybe a unit test which imports just the policy creator and then pokes it in a couple of ways, without a kernel or cosmos or anything else.
NewQueueSize(QueueInbound, DefaultInboundQueueMax), | ||
} | ||
|
||
// FIXME: Settle on default values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd like our initial values to be { default: 5, kv: 50 }
(swingset will use .default
for everything else), so deletion happens slowly as soon as upgrade18 is deployed. That will let us trigger deletion of some vat (probably v112-scaledPriceAuthority-stkATOM, being the smallest one) as part of upgrade18, instead of a three-vote sequence of:
- upgrade18
- governance proposal to change parameters to
{ default: 5, kv: 50 }
- governance proposal to trigger deletion of v112
The important values are that exports
and imports
use no more than 5 (so the BOYDs remain a reasonable size, and we don't clobber the chain with large cleanup work), and that kv
uses more like 50 (so the deletion takes weeks instead of months or years).
The total number of transcripts and snapshots aren't large enough for 5 vs 50 to make a big difference. Deleting both are pretty quick, so it would probably be safe to use 50, but I'd rather measure the load of using "5" before changing that rate.
I don't think I have a strong opinion on whether the default specifies values for all the fields, or just default
and kv
. My original intention was to allow the governance proposal to be minimal, and only supply the ones that needed to be overridden, but I'm not clear on how much the governance parameter messages are deltas vs absolutes. If you have to supply all the parameters (not just vat cleanup, but also beans-per-computrons and queue sizes) in each message, then it's not a big lift to supply all of default, kv, imports, exports, snapshots, transcripts
in each one.
The important thing is legibility: the voters inspecting the change-parameter proposal should be able to tell what exactly is changing. If a proposal that omits kv
and really means "kv reverts to default
", that might be surprising.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it's important that one of these defaults being set to 0 doesn't override the safely-rate-limited behavior we get from { default: 5 }
. I'm thinking that DefaultVatCleanupBudget
should have one or two entries in it, and not provide a value for e.g. exports
and imports
and snapshots
and transcripts
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- governance proposal to change parameters to
{ default: 5, kv: 50 }
That seems unnecessary, if we have non-0 defaults already set as part of the software upgrade.
If a proposal that omits
kv
and really means "kv reverts todefault
", that might be surprising.
I believe that is almost how the current "fill-in missing values" logic would work. It would revert the value to the "kv default" set below. An alternative is to merge with existing values, but I'm not sure we have a place to stand to do that. Also in that case we now have the problem of how to express "please delete/unset this previously explicitly set value, and return it to default", particularly if we decide to not persist default values.
// https://github.com/Agoric/agoric-sdk/issues/8928#issuecomment-2053357870 | ||
const shouldRun = () => | ||
!cleanupStarted && (ignoreBlockLimit || totalBeans < blockComputeLimit); | ||
const allowCleanup = () => cleanupStarted && !cleanupDone; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't get us the right budget.. allowCleanup() === true
means an unlimited budget, and the kernel only calls didCleanup
at the end of the crank. So a non-finite budget will let the kernel perform unlimited cleanup, and then it will report a huge didCleanup(cleanups)
, and our decrementer will stop it, but by then it will be too late.
We need allowCleanup
to return the remaining budget each time (if we're in the cleanup window). We also need to let the kernel perform GC actions and BOYD. The kernel will ask about one vat with allowCleanup()
, do an amount of cleanup work limited by the returned budget, notify the policy with a didCleanup()
, then it returns to the "decide what work to do next" function, which will query the policy again. We need the didCleanup()
to return true
(meaning "keep going, don't stop controller.run()
yet"), then that second call to allowCleanup()
must say "no more cleanups", to allow it to move on to the gc-action queue, or the dirty-vat-do-BOYD check. We still count computrons, so if e.g. the cleanup resulted in multiple vats becoming dirty and eligible for BOYD, we might perform one BOYD, decide the computrons it spent were too much, and then end the block before we get to the second BOYD. The next block's crankerPhase = 'leftover'
run will finish off those BOYDs.
This only really works because we only do cleanup on empty blocks, so we know the run-queue is empty before clean is allowed to proceed. The run-policy language doesn't have a way to say "do the gcActions and BOYDs triggered by cleanup, but don't allow any normal cranks to be delivered, because we're only doing cleanup-related stuff now".
(I wish there were a good way to unit test this)
So something more like:
let inCleanup = false;
const startCleanup = () => if (totalBeans > 0n) { return false; } else { inCleanup = true; return true; };
const allowCleanup = () => inCleanup && remainingCleanups; // returns false or budget
const didCleanup = details => { .. decrement remainingCleanups; return true; };
const shouldRun = () => ignoreBlockLimit || totalBeans < blockComputeLimit;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof, yeah. Updated to something that actually has the right return type for allowCleanup
.
@@ -88,6 +90,24 @@ const parseUpgradePlanInfo = (upgradePlan, prefix = '') => { | |||
* `chainStorageEntries: [ ['c.o', '"top"'], ['c.o.i'], ['c.o.i.n', '42'], ['c.o.w', '"moo"'] ]`). | |||
*/ | |||
|
|||
/** | |||
* @typedef {'leftover' | 'forced' | 'high-priority' | 'intermission' | 'queued' | 'cleanup'} CrankerPhase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this a lot, it makes the flow much easier to think about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Help wanted; the fully-threaded pathway is end-to-end but we may be able to do something more tractable. The policy logic is somewhat intricate and should be covered by tests.
Can the policy (possibly with the cranker) be factored out of launch-chain
and exercised in a test, maybe with a mock swingset controller?
Upgrade Considerations
I believe it's worth noting that the upgrade will set the default parameters for the new param.
DefaultVatCleanupKv = sdk.NewUint(0) // 50 | ||
DefaultVatCleanupSnapshots = sdk.NewUint(0) // 50 | ||
DefaultVatCleanupTranscripts = sdk.NewUint(0) // 50 | ||
DefaultVatCleanupBudget = []UintMapEntry{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question keeps coming up of where the defaults should be defined, and whether defaults are persistent, aka what should happen with values if the default changes and the values were not modified. I can't shake the feeling that the way we've implemented params does not allow the flexibility to update unchanged defaults.
This is particularly true when we have a param value whose name is default.
I believe here it might be better to just have a single entry in DefaultVatCleanupBudget
: the VatCleanupDefault
one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of the latest push, cosmic-swingset has a default budget of { default: Infinity }
to preserve current behavior (and requires an explicit default
value when any other field is present) and golang/cosmos/x/swingset/types/default-params.go defines DefaultVatCleanupBudget as { default: 5, kv: 50 }
per @warner's instruction.
NewQueueSize(QueueInbound, DefaultInboundQueueMax), | ||
} | ||
|
||
// FIXME: Settle on default values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- governance proposal to change parameters to
{ default: 5, kv: 50 }
That seems unnecessary, if we have non-0 defaults already set as part of the software upgrade.
If a proposal that omits
kv
and really means "kv reverts todefault
", that might be surprising.
I believe that is almost how the current "fill-in missing values" logic would work. It would revert the value to the "kv default" set below. An alternative is to merge with existing values, but I'm not sure we have a place to stand to do that. Also in that case we now have the problem of how to express "please delete/unset this previously explicitly set value, and return it to default", particularly if we decide to not persist default values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -80,7 +80,7 @@ func TestAddStorageBeanCost(t *testing.T) { | |||
}, | |||
} { | |||
t.Run(tt.name, func(t *testing.T) { | |||
got, err := appendMissingDefaultBeansPerUnit(tt.in, []StringBeans{defaultStorageCost}) | |||
got, err := appendMissingDefaults(tt.in, []StringBeans{defaultStorageCost}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguably to be consistent we should fill in all missing defaults here, not just the beans per unit ones.
*/ | ||
|
||
/** | ||
* @typedef {(phase: CrankerPhase) => Promise<boolean>} Cranker runs the kernel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how I feel about the Cranker
name. It controls a macro "swingset run", which itself is composed of many cranks. Does that make a vat delivery a "turner"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, yeah, the existing term that this relates to is a "run" or a "swingset run", meaning one invocation of controller.run()
. Each cosmos block triggers a certain number of runs (one catchup, then maybe some high-priority, then maybe one timer, then maybe some low-priority, then maybe one cleanup).
I think Runner
is disqualified (too generic), but I'm ok with Cranker
as a second choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really needed a name, and "Runner" is just too generic. I suppose "Turner" would therefore accurately describe a function that performs a single vat delivery, although if we ever need to name that we'll probably find something better.
const allowCleanup = runPolicy.startCleanup(); | ||
if (!allowCleanup) return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I keep wondering if we have the right layering here. I know @warner would like for controller.run()
to automatically include cleanup for "simple hosts", but it feels like we're jumping through hoops just to enable non cosmic-swingset embedder.
An imperative controller.performCleanup()
would be so much cleaner. We could simply call it after we've returned from processBlockActions
: has the policy done anything? no, cool, then start cleanup with a specific cleanup policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true, I don't want to add API burden to hosts that don't need rate limiting, but I agree that there are some significant hoops here.
Adding controller.performCleanup(budget)
wouldn't be too hard, it might be a wrapper around controller.run()
that uses an internally-built runPolicy
. If we added that, then the cosmic-swingset runPolicy could have a allowCleanup: () => false
, which is a cheap way to say "I'm going to drive cleanup manually, thanks very much".
OTOH, that API assumes that we're ok with arbitrarily large computron usage during the cleanup runs. In practice, that's probably ok: the GC action deliveries are generally trivial (they just change some refcounts), and the BOYD is usually the last thing that runs so it's not like a computron limit would actually stop any deliveries from happening.
But if we wanted to be consistent with normal runs, we'd need a way to provide a limit, which would mean performCleanup()
would need to get a host-provided RunPolicy, and merging some automatically-built one with the host-provided one sounds more messy than having the host manage everything in the first place.
Deploying agoric-sdk with Cloudflare Pages
|
OK, I'm reasonably happy with this but still need to brainstorm on how to verify expected behavior through testing. sim-chain.js looks like the right foundation, since it's already wrapping the parts of launch-chain.js affected by this PR with a mostly-mocked harness. |
remainingCleanups[phase] -= count; | ||
if (remainingCleanups[phase] <= 0) cleanupDone = true; | ||
} | ||
return !cleanupDone && shouldRun(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better, but we need didCleanup()
to return true
when the cleanup budget is exhausted, otherwise the controller.run()
will exit and we won't deliver the GC actions or run a BOYD until the the leftover
phase next block (which might have more important work to do).
Basically, having allowCleanup
return false is how you prevent cleanup from happening. The return value of didCleanup
controls whether we do anything after it is called, and GC actions / BOYD don't qualify as cleanup (they happen all the time, even without vats being terminated).
So I think this should be replaced with just return true
. We could have it return shouldRun()
, which would return false
when we've hit the computron limit, but that's kind of of pointless because cleanup actions don't spend any computrons or perform any deliveries, they just delete vatKeeper data and enqueue GC actions. The computron-spending activity (GC actions, BOYD) will get reported to the policy with crankComplete
as usual, and that has the opportunity to stop the run because of a computron limit.
I'm sorry this stuff is so opaque. I'll make a note to update SwingSet/docs/run-policy.md
.. the section on didCleanup
is accurate, and the examples all have it return true, but the docs ought to make that recommendation explicit, and explain why it is usually important to have it return true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that makes sense. Updated with an explanatory comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I think that looks good, at least the policy parts. I'll rely upon @mhofman for the cosmos/golang side, but it at least seems consistent with the rest of the code.
I do worry about proposals that accidentally revert parameters to their default values, but I believe the cleanup-budget defaults established by this PR are safe. So I believe the worst case is such a proposal would just slow down termination even more than necessary, which is much better than reverting to "allow full-speed deletion".
if err != nil { | ||
return params, err | ||
} | ||
newVcb, err := appendMissingDefaults(params.VatCleanupBudget, DefaultVatCleanupBudget) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so now my current beliefs are:
- we get one transaction type to influence all swingset-related parameters (four sets: beansPerUnit, powerFlagFees, queueMax, vatCleanupBudget)
- if any field of any set is not included in the transaction/proposal, and is included in e.g.
DefaultBeansPerUnit
orDefaultPowerFlagFees
orDefaultQueueMax
orDefaultVatCleanupBudget
, then that field is populated here - so if someone makes a governance parameter proposal to change something in
params.QueueMax
, and they forget to look at the chain and copy in the current fields fromVatCleanupBudget
, they'll be accidentally resetting the cleanup budget back to the default - also, because both
.default
and.kv
are inDefaultVatCleanupBudget
, there is no way to say "please stop overridingkv
and just let it use.default
like everything else does"- the closest you could achieve is by saying
{ default: 5, kv: 5 }
- the closest you could achieve is by saying
That works, it's weird but I suspect it's the best we can get from a one-method-per-module parameter-setting scheme.
} | ||
// We return true to allow processing of any BOYD/GC prompted by cleanup, | ||
// even if cleanup as such is now done. | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, looks good
BTW, I'm adding a new |
#10268 is the PR which introduces |
…ke stores (#10290) (split from #10165) ## Description chain-main.js passes a [non-map `mailboxStorage`](https://github.com/Agoric/agoric-sdk/blob/94aaebcd255840daf13514ab1050282b985684d2/packages/cosmic-swingset/src/chain-main.js#L351-L359) to `launch`, which [threads it into mailbox code](https://github.com/Agoric/agoric-sdk/blob/94aaebcd255840daf13514ab1050282b985684d2/packages/cosmic-swingset/src/launch-chain.js#L126) that returns [methods expecting a map-like `entries` method and `size` property](https://github.com/Agoric/agoric-sdk/blob/94aaebcd255840daf13514ab1050282b985684d2/packages/SwingSet/src/devices/mailbox/mailbox.js#L121-L146). This PR adds type data to reject such issues and refactors mailbox code accordingly by converting those methods into independent functions that accept appropriate backing maps. ### Security Considerations None known. ### Scaling Considerations n/a ### Documentation Considerations n/a ### Testing Considerations Existing tests should continue to pass. ### Upgrade Considerations This code does not affect any vat and is safe to include in the next release.
9fe3b50
to
bb28c19
Compare
Fixes #8928
Description
Adds new cosmos parameters for controlling terminated vat cleanup and uses them in cosmic-swingset to allow budget-limited cleanup in otherwise empty blocks. Also includes a fair amount of refactoring in support of the same, so best reviewed as individual commits.
Security Considerations
Too-high budget values are a denial of service risk. All values are currently defaulted to zero for
nominimal cleanup, but we may instead want to use small values.Scaling Considerations
See above.
Documentation Considerations
Hmm, do we have documentation of cosmos-sdk parameters anywhere?
Testing Considerations
Help wanted; the fully-threaded pathway is end-to-end but we may be able to do something more tractable. The policy logic is somewhat intricate and should be covered by tests.
Upgrade Considerations
Upgrade will set the new parameters to their default values as defined in default-params.go. If we set non-zero defaults, then work could commence as soon as this code is live, which means it requires a chain-halting upgrade.