feat(cosmic-swingset): Use vat cleanup budget values to allow slow cleanup #10165

gibson042 · 2024-09-27T06:24:47Z

Description

Adds new cosmos parameters for controlling terminated vat cleanup and uses them in cosmic-swingset to allow budget-limited cleanup in otherwise empty blocks. Also includes a fair amount of refactoring in support of the same, so best reviewed as individual commits.

Security Considerations

Too-high budget values are a denial of service risk. All values are currently defaulted to zero for no minimal cleanup, but we may instead want to use small values.

Scaling Considerations

See above.

Documentation Considerations

Hmm, do we have documentation of cosmos-sdk parameters anywhere?

Testing Considerations

Help wanted; the fully-threaded pathway is end-to-end but we may be able to do something more tractable. The policy logic is somewhat intricate and should be covered by tests.

Upgrade Considerations

Upgrade will set the new parameters to their default values as defined in default-params.go. If we set non-zero defaults, then work could commence as soon as this code is live, which means it requires a chain-halting upgrade.

warner

The cleanup-budget passing code needs to change.. the policy language is funky, I have to admit, and the fact that it's easy to accidentally enable unlimited cleanups is a problem. Let's brainstorm tomorrow to see if there's a way to write a test around this part, maybe a unit test which imports just the policy creator and then pokes it in a couple of ways, without a kernel or cosmos or anything else.

packages/cosmic-proto/src/codegen/agoric/swingset/swingset.ts

warner · 2024-10-01T00:46:26Z

golang/cosmos/x/swingset/types/default-params.go

 		NewQueueSize(QueueInbound, DefaultInboundQueueMax),
 	}
+
+	// FIXME: Settle on default values


Yeah, I'd like our initial values to be { default: 5, kv: 50 } (swingset will use .default for everything else), so deletion happens slowly as soon as upgrade18 is deployed. That will let us trigger deletion of some vat (probably v112-scaledPriceAuthority-stkATOM, being the smallest one) as part of upgrade18, instead of a three-vote sequence of:

upgrade18

governance proposal to change parameters to { default: 5, kv: 50 }

governance proposal to trigger deletion of v112

The important values are that exports and imports use no more than 5 (so the BOYDs remain a reasonable size, and we don't clobber the chain with large cleanup work), and that kv uses more like 50 (so the deletion takes weeks instead of months or years).

The total number of transcripts and snapshots aren't large enough for 5 vs 50 to make a big difference. Deleting both are pretty quick, so it would probably be safe to use 50, but I'd rather measure the load of using "5" before changing that rate.

I don't think I have a strong opinion on whether the default specifies values for all the fields, or just default and kv. My original intention was to allow the governance proposal to be minimal, and only supply the ones that needed to be overridden, but I'm not clear on how much the governance parameter messages are deltas vs absolutes. If you have to supply all the parameters (not just vat cleanup, but also beans-per-computrons and queue sizes) in each message, then it's not a big lift to supply all of default, kv, imports, exports, snapshots, transcripts in each one.

The important thing is legibility: the voters inspecting the change-parameter proposal should be able to tell what exactly is changing. If a proposal that omits kv and really means "kv reverts to default", that might be surprising.

Also, it's important that one of these defaults being set to 0 doesn't override the safely-rate-limited behavior we get from { default: 5 }. I'm thinking that DefaultVatCleanupBudget should have one or two entries in it, and not provide a value for e.g. exports and imports and snapshots and transcripts.

governance proposal to change parameters to { default: 5, kv: 50 }

That seems unnecessary, if we have non-0 defaults already set as part of the software upgrade.

If a proposal that omits kv and really means "kv reverts to default", that might be surprising.

I believe that is almost how the current "fill-in missing values" logic would work. It would revert the value to the "kv default" set below. An alternative is to merge with existing values, but I'm not sure we have a place to stand to do that. Also in that case we now have the problem of how to express "please delete/unset this previously explicitly set value, and return it to default", particularly if we decide to not persist default values.

golang/cosmos/x/swingset/types/params.go

packages/cosmic-swingset/src/launch-chain.js

warner · 2024-10-02T01:38:49Z

packages/cosmic-swingset/src/launch-chain.js

+  // https://github.com/Agoric/agoric-sdk/issues/8928#issuecomment-2053357870
+  const shouldRun = () =>
+    !cleanupStarted && (ignoreBlockLimit || totalBeans < blockComputeLimit);
+  const allowCleanup = () => cleanupStarted && !cleanupDone;


This won't get us the right budget.. allowCleanup() === true means an unlimited budget, and the kernel only calls didCleanup at the end of the crank. So a non-finite budget will let the kernel perform unlimited cleanup, and then it will report a huge didCleanup(cleanups), and our decrementer will stop it, but by then it will be too late.

We need allowCleanup to return the remaining budget each time (if we're in the cleanup window). We also need to let the kernel perform GC actions and BOYD. The kernel will ask about one vat with allowCleanup(), do an amount of cleanup work limited by the returned budget, notify the policy with a didCleanup(), then it returns to the "decide what work to do next" function, which will query the policy again. We need the didCleanup() to return true (meaning "keep going, don't stop controller.run() yet"), then that second call to allowCleanup() must say "no more cleanups", to allow it to move on to the gc-action queue, or the dirty-vat-do-BOYD check. We still count computrons, so if e.g. the cleanup resulted in multiple vats becoming dirty and eligible for BOYD, we might perform one BOYD, decide the computrons it spent were too much, and then end the block before we get to the second BOYD. The next block's crankerPhase = 'leftover' run will finish off those BOYDs.

This only really works because we only do cleanup on empty blocks, so we know the run-queue is empty before clean is allowed to proceed. The run-policy language doesn't have a way to say "do the gcActions and BOYDs triggered by cleanup, but don't allow any normal cranks to be delivered, because we're only doing cleanup-related stuff now".

(I wish there were a good way to unit test this)

So something more like:

let inCleanup = false; const startCleanup = () => if (totalBeans > 0n) { return false; } else { inCleanup = true; return true; }; const allowCleanup = () => inCleanup && remainingCleanups; // returns false or budget const didCleanup = details => { .. decrement remainingCleanups; return true; }; const shouldRun = () => ignoreBlockLimit || totalBeans < blockComputeLimit;

Oof, yeah. Updated to something that actually has the right return type for allowCleanup.

warner · 2024-10-02T01:41:04Z

packages/cosmic-swingset/src/launch-chain.js

@@ -88,6 +90,24 @@ const parseUpgradePlanInfo = (upgradePlan, prefix = '') => {
 *   `chainStorageEntries: [ ['c.o', '"top"'], ['c.o.i'], ['c.o.i.n', '42'], ['c.o.w', '"moo"'] ]`).
 */

+/**
+ * @typedef {'leftover' | 'forced' | 'high-priority' | 'intermission' | 'queued' | 'cleanup'} CrankerPhase


I like this a lot, it makes the flow much easier to think about.

packages/cosmic-swingset/test/scenario2.js

packages/cosmic-swingset/src/sim-params.js

mhofman

Help wanted; the fully-threaded pathway is end-to-end but we may be able to do something more tractable. The policy logic is somewhat intricate and should be covered by tests.

Can the policy (possibly with the cranker) be factored out of launch-chain and exercised in a test, maybe with a mock swingset controller?

Upgrade Considerations

I believe it's worth noting that the upgrade will set the default parameters for the new param.

mhofman · 2024-10-02T03:41:20Z

golang/cosmos/x/swingset/types/default-params.go

+	DefaultVatCleanupKv          = sdk.NewUint(0) // 50
+	DefaultVatCleanupSnapshots   = sdk.NewUint(0) // 50
+	DefaultVatCleanupTranscripts = sdk.NewUint(0) // 50
+	DefaultVatCleanupBudget      = []UintMapEntry{


The question keeps coming up of where the defaults should be defined, and whether defaults are persistent, aka what should happen with values if the default changes and the values were not modified. I can't shake the feeling that the way we've implemented params does not allow the flexibility to update unchanged defaults.

This is particularly true when we have a param value whose name is default.

I believe here it might be better to just have a single entry in DefaultVatCleanupBudget: the VatCleanupDefault one.

As of the latest push, cosmic-swingset has a default budget of { default: Infinity } to preserve current behavior (and requires an explicit default value when any other field is present) and golang/cosmos/x/swingset/types/default-params.go defines DefaultVatCleanupBudget as { default: 5, kv: 50 } per @warner's instruction.

mhofman · 2024-10-02T03:43:52Z

golang/cosmos/x/swingset/types/default-params.go

 		NewQueueSize(QueueInbound, DefaultInboundQueueMax),
 	}
+
+	// FIXME: Settle on default values


governance proposal to change parameters to { default: 5, kv: 50 }

That seems unnecessary, if we have non-0 defaults already set as part of the software upgrade.

If a proposal that omits kv and really means "kv reverts to default", that might be surprising.

I believe that is almost how the current "fill-in missing values" logic would work. It would revert the value to the "kv default" set below. An alternative is to merge with existing values, but I'm not sure we have a place to stand to do that. Also in that case we now have the problem of how to express "please delete/unset this previously explicitly set value, and return it to default", particularly if we decide to not persist default values.

mhofman · 2024-10-02T03:56:09Z

packages/cosmic-proto/proto/agoric/swingset/swingset.proto

Not specific to this PR, but I didn't realize #8931 resulted in a copy of the proto definition files to be checked into the repo. @turadg any reason why these are not excluded from git?

packages/cosmic-proto/src/codegen/agoric/swingset/swingset.ts

mhofman · 2024-10-02T03:58:30Z

golang/cosmos/x/swingset/types/params_test.go

@@ -80,7 +80,7 @@ func TestAddStorageBeanCost(t *testing.T) {
 		},
 	} {
 		t.Run(tt.name, func(t *testing.T) {
-			got, err := appendMissingDefaultBeansPerUnit(tt.in, []StringBeans{defaultStorageCost})
+			got, err := appendMissingDefaults(tt.in, []StringBeans{defaultStorageCost})


Arguably to be consistent we should fill in all missing defaults here, not just the beans per unit ones.

packages/cosmic-swingset/test/scenario2.js

mhofman · 2024-10-02T04:13:14Z

packages/cosmic-swingset/src/launch-chain.js

+ */
+
+/**
+ * @typedef {(phase: CrankerPhase) => Promise<boolean>} Cranker runs the kernel


I'm not sure how I feel about the Cranker name. It controls a macro "swingset run", which itself is composed of many cranks. Does that make a vat delivery a "turner"?

Hm, yeah, the existing term that this relates to is a "run" or a "swingset run", meaning one invocation of controller.run(). Each cosmos block triggers a certain number of runs (one catchup, then maybe some high-priority, then maybe one timer, then maybe some low-priority, then maybe one cleanup).

I think Runner is disqualified (too generic), but I'm ok with Cranker as a second choice.

This really needed a name, and "Runner" is just too generic. I suppose "Turner" would therefore accurately describe a function that performs a single vat delivery, although if we ever need to name that we'll probably find something better.

packages/cosmic-swingset/src/launch-chain.js

mhofman · 2024-10-02T04:31:21Z

packages/cosmic-swingset/src/launch-chain.js

+        const allowCleanup = runPolicy.startCleanup();
+        if (!allowCleanup) return false;


I keep wondering if we have the right layering here. I know @warner would like for controller.run() to automatically include cleanup for "simple hosts", but it feels like we're jumping through hoops just to enable non cosmic-swingset embedder.

An imperative controller.performCleanup() would be so much cleaner. We could simply call it after we've returned from processBlockActions: has the policy done anything? no, cool, then start cleanup with a specific cleanup policy.

It's true, I don't want to add API burden to hosts that don't need rate limiting, but I agree that there are some significant hoops here.

Adding controller.performCleanup(budget) wouldn't be too hard, it might be a wrapper around controller.run() that uses an internally-built runPolicy. If we added that, then the cosmic-swingset runPolicy could have a allowCleanup: () => false, which is a cheap way to say "I'm going to drive cleanup manually, thanks very much".

OTOH, that API assumes that we're ok with arbitrarily large computron usage during the cleanup runs. In practice, that's probably ok: the GC action deliveries are generally trivial (they just change some refcounts), and the BOYD is usually the last thing that runs so it's not like a computron limit would actually stop any deliveries from happening.

But if we wanted to be consistent with normal runs, we'd need a way to provide a limit, which would mean performCleanup() would need to get a host-provided RunPolicy, and merging some automatically-built one with the host-provided one sounds more messy than having the host manage everything in the first place.

cloudflare-workers-and-pages · 2024-10-02T19:12:25Z

Deploying agoric-sdk with Cloudflare Pages

Latest commit:	`bb28c19`
Status:	✅ Deploy successful!
Preview URL:	https://353b4b2f.agoric-sdk.pages.dev
Branch Preview URL:	https://gibson-8928-parameterized-al.agoric-sdk.pages.dev

View logs

gibson042 · 2024-10-02T20:27:31Z

OK, I'm reasonably happy with this but still need to brainstorm on how to verify expected behavior through testing. sim-chain.js looks like the right foundation, since it's already wrapping the parts of launch-chain.js affected by this PR with a mostly-mocked harness.

warner · 2024-10-03T19:24:57Z

packages/cosmic-swingset/src/launch-chain.js

+      remainingCleanups[phase] -= count;
+      if (remainingCleanups[phase] <= 0) cleanupDone = true;
+    }
+    return !cleanupDone && shouldRun();


Better, but we need didCleanup() to return true when the cleanup budget is exhausted, otherwise the controller.run() will exit and we won't deliver the GC actions or run a BOYD until the the leftover phase next block (which might have more important work to do).

Basically, having allowCleanup return false is how you prevent cleanup from happening. The return value of didCleanup controls whether we do anything after it is called, and GC actions / BOYD don't qualify as cleanup (they happen all the time, even without vats being terminated).

So I think this should be replaced with just return true. We could have it return shouldRun(), which would return false when we've hit the computron limit, but that's kind of of pointless because cleanup actions don't spend any computrons or perform any deliveries, they just delete vatKeeper data and enqueue GC actions. The computron-spending activity (GC actions, BOYD) will get reported to the policy with crankComplete as usual, and that has the opportunity to stop the run because of a computron limit.

I'm sorry this stuff is so opaque. I'll make a note to update SwingSet/docs/run-policy.md.. the section on didCleanup is accurate, and the examples all have it return true, but the docs ought to make that recommendation explicit, and explain why it is usually important to have it return true.

Ah, that makes sense. Updated with an explanatory comment.

warner

Ok I think that looks good, at least the policy parts. I'll rely upon @mhofman for the cosmos/golang side, but it at least seems consistent with the rest of the code.

I do worry about proposals that accidentally revert parameters to their default values, but I believe the cleanup-budget defaults established by this PR are safe. So I believe the worst case is such a proposal would just slow down termination even more than necessary, which is much better than reverting to "allow full-speed deletion".

warner · 2024-10-04T20:45:19Z

golang/cosmos/x/swingset/types/params.go

+	if err != nil {
+		return params, err
+	}
+	newVcb, err := appendMissingDefaults(params.VatCleanupBudget, DefaultVatCleanupBudget)


ok so now my current beliefs are:

we get one transaction type to influence all swingset-related parameters (four sets: beansPerUnit, powerFlagFees, queueMax, vatCleanupBudget)

if any field of any set is not included in the transaction/proposal, and is included in e.g. DefaultBeansPerUnit or DefaultPowerFlagFees or DefaultQueueMax or DefaultVatCleanupBudget, then that field is populated here

so if someone makes a governance parameter proposal to change something in params.QueueMax, and they forget to look at the chain and copy in the current fields from VatCleanupBudget, they'll be accidentally resetting the cleanup budget back to the default

also, because both .default and .kv are in DefaultVatCleanupBudget, there is no way to say "please stop overriding kv and just let it use .default like everything else does"

the closest you could achieve is by saying { default: 5, kv: 5 }

That works, it's weird but I suspect it's the best we can get from a one-method-per-module parameter-setting scheme.

warner · 2024-10-04T20:46:03Z

packages/cosmic-swingset/src/launch-chain.js

+    }
+    // We return true to allow processing of any BOYD/GC prompted by cleanup,
+    // even if cleanup as such is now done.
+    return true;


great, looks good

warner · 2024-10-12T00:10:21Z

BTW, I'm adding a new promises phase to cleanup, to fix #10261. It should use the same default of 5 cleanups per run, like imports and exports (and not like kv). I'll be making a branch on top of this one to add the new fields if this doesn't land by the time the swingset -side changes are done.

warner · 2024-10-12T01:50:31Z

#10268 is the PR which introduces budget.promises and work.promises on the swingset side. Given your careful work to make the integration code tolerate new fields, we don't strictly need to update anything in cosmic-swingset, but it might be nice to be able to control this budget item like we do the others.

…ke stores (#10290) (split from #10165) ## Description chain-main.js passes a [non-map `mailboxStorage`](https://github.com/Agoric/agoric-sdk/blob/94aaebcd255840daf13514ab1050282b985684d2/packages/cosmic-swingset/src/chain-main.js#L351-L359) to `launch`, which [threads it into mailbox code](https://github.com/Agoric/agoric-sdk/blob/94aaebcd255840daf13514ab1050282b985684d2/packages/cosmic-swingset/src/launch-chain.js#L126) that returns [methods expecting a map-like `entries` method and `size` property](https://github.com/Agoric/agoric-sdk/blob/94aaebcd255840daf13514ab1050282b985684d2/packages/SwingSet/src/devices/mailbox/mailbox.js#L121-L146). This PR adds type data to reject such issues and refactors mailbox code accordingly by converting those methods into independent functions that accept appropriate backing maps. ### Security Considerations None known. ### Scaling Considerations n/a ### Documentation Considerations n/a ### Testing Considerations Existing tests should continue to pass. ### Upgrade Considerations This code does not affect any vat and is safe to include in the next release.

Ref #8928

…`make` test fails

…eanup budget data Ref #8928

…nation

…eanup ...in otherwise empty blocks Fixes #8928

…kv: 50 }

…default`

gibson042 requested review from warner and mhofman September 27, 2024 06:24

gibson042 requested a review from a team as a code owner September 27, 2024 06:24

warner mentioned this pull request Oct 1, 2024

use mainfork to measure one-fell-swoop remediation cost #8644

Closed

warner requested changes Oct 2, 2024

View reviewed changes

mhofman reviewed Oct 2, 2024

View reviewed changes

gibson042 requested review from mhofman and warner October 2, 2024 20:13

warner requested changes Oct 3, 2024

View reviewed changes

warner self-requested a review October 4, 2024 20:38

warner approved these changes Oct 4, 2024

View reviewed changes

gibson042 mentioned this pull request Oct 16, 2024

refactor: Add mailbox types and differentiate use of maps from map-like stores #10290

Merged

gibson042 added 13 commits October 21, 2024 21:24

feat(x/swingset): Add parameters for controlling vat cleanup budget

35575e8

Ref #8928

build(golang/cosmos): Generate types with make proto-gen

1fa1830

build(cosmic-proto): Generate types with yarn build && yarn codegen

e682a33

refactor(x/swingset): DRY out UpdateParams

e07a20d

chore(cosmic-swingset): Add new parameters to sim-params.js

d3f7432

test(cosmic-swingset): Include the last three lines of stderr when a …

3281ff2

…`make` test fails

feat(cosmic-swingset): Update parseParams to read and validate vat cl…

c845907

…eanup budget data Ref #8928

refactor(cosmic-swingset): DRY out parseParams

2376a1c

refactor(cosmic-swingset): Prepare launch-chain.js for slow vat termi…

04271be

…nation

feat(cosmic-swingset): Use vat cleanup budget values to allow slow cl…

fb8a66a

…eanup ...in otherwise empty blocks Fixes #8928

refactor(cosmic-swingset): Improve CrankerPhase values

91daea3

feat(x/swingset): Define default vat cleanup budget as { default: 5, …

0976f30

…kv: 50 }

feat(x/swingset): Require a non-empty vat cleanup budget to include `…

bb28c19

…default`

gibson042 force-pushed the gibson-8928-parameterized-allowcleanup branch from 9fe3b50 to bb28c19 Compare October 22, 2024 01:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(cosmic-swingset): Use vat cleanup budget values to allow slow cleanup #10165

feat(cosmic-swingset): Use vat cleanup budget values to allow slow cleanup #10165

gibson042 commented Sep 27, 2024 •

edited

Loading

warner left a comment

warner Oct 1, 2024

warner Oct 2, 2024

mhofman Oct 2, 2024

warner Oct 2, 2024

gibson042 Oct 2, 2024

warner Oct 2, 2024

mhofman left a comment

mhofman Oct 2, 2024

gibson042 Oct 2, 2024

mhofman Oct 2, 2024

mhofman Oct 2, 2024

mhofman Oct 2, 2024

mhofman Oct 2, 2024

warner Oct 2, 2024

gibson042 Oct 2, 2024

mhofman Oct 2, 2024

warner Oct 2, 2024

cloudflare-workers-and-pages bot commented Oct 2, 2024 •

edited

Loading

gibson042 commented Oct 2, 2024

warner Oct 3, 2024

gibson042 Oct 3, 2024

warner left a comment

warner Oct 4, 2024

warner Oct 4, 2024

warner commented Oct 12, 2024

warner commented Oct 12, 2024 •

edited

Loading

		const allowCleanup = runPolicy.startCleanup();
		if (!allowCleanup) return false;

feat(cosmic-swingset): Use vat cleanup budget values to allow slow cleanup #10165

Are you sure you want to change the base?

feat(cosmic-swingset): Use vat cleanup budget values to allow slow cleanup #10165

Conversation

gibson042 commented Sep 27, 2024 • edited Loading

Description

Security Considerations

Scaling Considerations

Documentation Considerations

Testing Considerations

Upgrade Considerations

warner left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mhofman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cloudflare-workers-and-pages bot commented Oct 2, 2024 • edited Loading

Deploying agoric-sdk with Cloudflare Pages

gibson042 commented Oct 2, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

warner left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

warner commented Oct 12, 2024

warner commented Oct 12, 2024 • edited Loading

gibson042 commented Sep 27, 2024 •

edited

Loading

cloudflare-workers-and-pages bot commented Oct 2, 2024 •

edited

Loading

warner commented Oct 12, 2024 •

edited

Loading