Redesign the runner #14096

Draft · rkapka wants to merge 12 commits into develop from redesign-health-check

Conversation

@rkapka (Contributor) commented Jun 10, 2024

UPDATE

I made some changes since the original description. The main realization is that the main for loop should be as "pure" as possible: with every new case in the select statement, we increase the likelihood of duty handling not starting at the very beginning of the slot. Originally the main loop had only one purpose, which was handling duties. I wanted to go back to this, so I moved all other channels to goroutines dedicated just to them. That way there is no contention in the select statement and the slot ticker handler can fire immediately. I also removed close(eventsChannel) from the event stream, because reading from a closed channel always succeeds, resulting in a never-ending sequence of empty events when the event stream gets disconnected. https://rauljordan.com/no-sleep-until-we-build-the-perfect-library-in-go/ is a great resource showing what we should build eventually.
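To make that concrete, here is a minimal, self-contained Go snippet (illustration only, not code from this PR) showing why close(eventsChannel) had to go: once a channel is closed, every receive succeeds immediately with the zero value.

package main

import "fmt"

func main() {
	events := make(chan string, 1)
	events <- "head"
	close(events)

	for i := 0; i < 3; i++ {
		// The first receive yields the buffered event. Every receive after
		// that returns immediately with ("", false), so a consumer looping
		// on the channel would spin forever on empty events.
		ev, ok := <-events
		fmt.Printf("event=%q ok=%v\n", ev, ok)
	}
}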


Main functionality

The purpose of this PR is to make the validator client's runner more robust. The changes can be split into 3 categories.

1. Updating duties at proper times

This PR introduces an updateDutiesNeeded variable that is initially set to false and is checked every slot. We update duties once before the runner's main for loop, and then conditionally at the start of each slot when an update is needed, namely:

  • at the beginning of an epoch
  • after a switchover to a backup node
  • after account changes

Currently UpdateDuties exits early if the validator has any duties and we are not at the start of an epoch. This means that the runner itself has no control over when duties are updated, as it can't enforce an update. Because of this, we remove the condition from UpdateDuties and leave the decision to the caller.
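A minimal sketch of this flag-based flow (names such as slotTicker and isEpochStart are illustrative; this is not the actual runner code):

// Update once before the loop, then only when a trigger has set the flag.
if err := v.UpdateDuties(ctx, startSlot); err != nil {
	log.WithError(err).Error("Failed to update duties")
}
updateDutiesNeeded := false
for {
	select {
	case <-ctx.Done():
		return
	case slot := <-slotTicker.C():
		if isEpochStart(slot) {
			updateDutiesNeeded = true // also set on switchover and account changes
		}
		if updateDutiesNeeded {
			// Blocking on purpose: performing roles depends on fresh duties.
			if err := v.UpdateDuties(ctx, slot); err != nil {
				log.WithError(err).Error("Failed to update duties")
				continue
			}
			updateDutiesNeeded = false
		}
		// ... perform the roles for this slot ...
	}
}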

2. Pushing proposer settings at proper times

Similarly to updating duties, there are times when we want to push proposer settings:

  • at the beginning of an epoch
  • when the beacon node becomes healthy after a period of being unhealthy
  • after account changes

But while updating duties should be blocking, because performing roles depends on it, pushing proposer settings can be done asynchronously. If the validator client serves many keys, pushing proposer settings can take a non-trivial amount of time, and we want to keep attesting/proposing in the meantime.

This PR uses a goroutine+channel combination to signal that proposer settings should be pushed. Adding more triggers in the future is then a simple one-liner, and the pattern is idiomatic Go.
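A sketch of the combination (names are illustrative, though the diff below uses a very similar pushProposerSettingsChan; pushProposerSettings stands in for the real call):

// A dedicated goroutine owns the blocking work; the rest of the runner
// only ever sends a signal on the channel.
pushProposerSettingsChan := make(chan struct{}, 1)
go func() {
	for {
		select {
		case <-ctx.Done():
			return
		case <-pushProposerSettingsChan:
			if err := pushProposerSettings(ctx); err != nil {
				log.WithError(err).Error("Failed to push proposer settings")
			}
		}
	}
}()

// Each trigger (epoch start, node healthy again, account changes) is then
// the one-liner mentioned above:
pushProposerSettingsChan <- struct{}{}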

3. Switching over to a backup node

This PR simplifies how we track the health status of the beacon node inside the runner, switching to a polling solution instead of a channel. Polling the latest value is much simpler: the current channel-based implementation works only for a single subscriber, whereas polling works for any number of subscribers, so adding more subscribers in the future is trivial.

We move the health ticker into the tracker and update the health status internally. We also add a new ticker to the runner that fires at the last second of the slot, upon which we poll for health. One additional change is an initializationNeeded variable that signals we need to make sure the beacon node is synced before we try performing duties again. Without this, switching over to a beacon node that is not 100% ready (or the original node becoming healthy again) results in lots of errors in the validator client's output.
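A sketch of the polling tracker under assumed names (the HealthChecker interface and field layout are illustrative; params.BeaconConfig().SecondsPerSlot matches the diff shown later):

type HealthChecker interface {
	IsHealthy(ctx context.Context) bool
}

type NodeHealthTracker struct {
	mu      sync.RWMutex
	healthy bool
	node    HealthChecker
}

// Start owns the ticker and updates the status internally.
func (t *NodeHealthTracker) Start(ctx context.Context) {
	ticker := time.NewTicker(time.Duration(params.BeaconConfig().SecondsPerSlot) * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			h := t.node.IsHealthy(ctx)
			t.mu.Lock()
			t.healthy = h
			t.mu.Unlock()
		}
	}
}

// IsHealthy returns the latest polled value and is safe for any number of
// concurrent subscribers, unlike a single-receiver channel.
func (t *NodeHealthTracker) IsHealthy() bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return t.healthy
}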

Other

There are a few other things included in this PR, which are small improvements and bug fixes.

  • Always cancel the slot context to prevent context leaks.
  • Use the default HTTP client in JsonRestHandler and save the default timeout in the struct. A timeout set on an HTTP client caps every request at that value, even when the request's context allows more time; in our case, pushing proposer settings has a timeout of 5 minutes that would never be respected.
  • Use a lock when we update the pubkeyToValidatorIndex map (see the sketch after this list). Otherwise there is a possibility of a panic due to concurrent map writes (I ran into it when testing pushing proposer settings).
  • Use a lock in event-stream related functions. Otherwise the event stream can be started multiple times. This happens in practice when starting the validator client on a not-yet-synced node: as the node syncs to the head, a lot of things happen and the stream is usually started twice.
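As referenced in the third bullet, a minimal sketch of the locking fix (field and method names are illustrative, not the actual validator code):

type validator struct {
	pubkeyToValidatorIndexLock sync.RWMutex
	pubkeyToValidatorIndex     map[[48]byte]primitives.ValidatorIndex
}

// setIndex guards the write; without the lock, two goroutines writing the
// map concurrently panic with "concurrent map writes".
func (v *validator) setIndex(pubkey [48]byte, idx primitives.ValidatorIndex) {
	v.pubkeyToValidatorIndexLock.Lock()
	defer v.pubkeyToValidatorIndexLock.Unlock()
	v.pubkeyToValidatorIndex[pubkey] = idx
}

func (v *validator) index(pubkey [48]byte) (primitives.ValidatorIndex, bool) {
	v.pubkeyToValidatorIndexLock.RLock()
	defer v.pubkeyToValidatorIndexLock.RUnlock()
	idx, ok := v.pubkeyToValidatorIndex[pubkey]
	return idx, ok
}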

Follow-up questions

  • Can we make pushing proposer settings fast enough that we can just do it every slot? The 5-minute deadline is very long.
  • As mentioned earlier, a lot of things happen as the node syncs to the head, and the for loop execution becomes nondeterministic, resulting in things like proposer settings being pushed several times or the event stream being started several times. Although this could probably be fixed by changing how we execute each of these, similar issues might arise in the future wherever we rely on a specific code path being executed only once but the for loop behaves unexpectedly. Can we make it more predictable?

@rkapka rkapka requested a review from a team as a code owner June 10, 2024 11:31
@rkapka rkapka force-pushed the redesign-health-check branch 2 times, most recently from 9f030c5 to ca08db6 on June 10, 2024 11:43
@rkapka rkapka added the Ready For Review (a pull request ready for code review) and Validator Client labels Jun 10, 2024
@@ -125,18 +126,39 @@ func run(ctx context.Context, v iface.Validator) {
				continue
			}
			performRoles(slotCtx, allRoles, v, slot, &wg, span)
		case isHealthyAgain := <-healthTracker.HealthUpdates():
			if isHealthyAgain {
		case <-v.LastSecondOfSlot():
Contributor: I'm kinda worried about the contents of this delaying the normal processing of the next slot above. If it fires at the last second of the slot, wouldn't that mean we only have one second to do the work below? Otherwise this loop would be delayed.

Contributor Author: With the newest changes there is next to no work here.

func (n *NodeHealthTracker) HealthUpdates() <-chan bool {
	return n.healthChan
}

log.Infof("Starting health check routine. Health check will be performed every %d seconds", params.BeaconConfig().SecondsPerSlot)
ticker := time.NewTicker(time.Duration(params.BeaconConfig().SecondsPerSlot) * time.Second)
Contributor: Nice, this seems like an improvement.

@rkapka rkapka marked this pull request as draft June 10, 2024 16:23
@rkapka rkapka changed the title from Redesign health check to Redesign the runner on Jun 10, 2024
@@ -269,6 +274,7 @@ func (v *validator) WaitForChainStart(ctx context.Context) error {
	}

	v.genesisTime = chainStartRes.GenesisTime
	log.WithField("genesisTime", time.Unix(int64(v.genesisTime), 0)).Info("Beacon chain started")
Contributor Author: Moved from setTicker.

	v.dutiesLock.Unlock()

	v.logDuties(slot, resp.CurrentEpochDuties, resp.NextEpochDuties)
Contributor Author: No need to hold the lock.

Comment on lines 172 to 174
			v.ChangeHost()
			initializationNeeded = true
			updateDutiesNeeded = true
Contributor Author: One question that might come up is "why do we update duties but don't push proposer settings here?"

The answer is that proposer settings would have been pushed immediately, without waiting for the beacon node to become healthy, whereas duties will not be updated before the node comes back up.

@prestonvanloon prestonvanloon self-requested a review June 13, 2024 14:36
		}
	}()
	pushProposerSettingsChan <- headSlot
Contributor: Does this trigger make sense? You can get the canonical head slot from within that goroutine too, right? You can call v.CanonicalHeadSlot.

		case <-ctx.Done():
			return
		case slot := <-pushProposerSettingsChan:
			go func() {
Contributor: I'm afraid of spawning too many go channels.

Contributor Author: Do you mean too many goroutines? They are super lightweight; it's fine to have thousands of them.

Contributor: Right, too many goroutines 🤔

	} else {
		log.WithError(err).Fatal("Failed to update proposer settings") // allow fatal. skipcq

	pushProposerSettingsChan := make(chan primitives.Slot, 1)
Contributor: If you start the validator on the first slot of an epoch, would it block or error on this? (That case would send on this channel twice.)

Contributor Author: It would block, but only for a split second, because each receive from the channel spawns a goroutine.

		slotSpan             *trace.Span
		slotCtx              context.Context
		slotCancel           context.CancelFunc
		initializationNeeded = false
Contributor: Maybe we should just have some kind of validator client status; it feels hacky to use a variable like this.

Contributor Author: I agree it might be a good idea, but it would probably result in a much different design, further complicating an already non-trivial PR.

@@ -1,112 +0,0 @@
package beacon
Member: Deleting all tests for the health tracker? Shouldn't there be some tests?

Contributor Author: Two reasons:

  • there is barely any logic
  • the update takes place every 12 seconds, so the test would be super slow (we could make the interval a parameter, but it would exist only for testing, and I am not a big fan of that)

@@ -2,11 +2,8 @@ load("@prysm//tools/go:def.bzl", "go_library")

go_library(
name = "go_default_library",
Member:

Suggested change:

    name = "go_default_library",
    testonly = True,

Please add the testonly attribute so we are sure this doesn't link into a production binary.

Comment on lines +266 to +267
	c.eventStreamLock.Lock()
	defer c.eventStreamLock.Unlock()
Member: Is this called from multiple goroutines? If so, we may want to use RLock here.

Contributor Author: No, starting should be done from only one place.

Comment on lines +84 to +89
	_, hasDeadline := ctx.Deadline()
	if !hasDeadline {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, c.timeout)
		defer cancel()
	}
Member: Why only set a timeout if one isn't set already? I would expect c.timeout to set the timeout every time.

context.Context will cancel if the parent context has a shorter deadline anyway, so it should be safe to always set the timeout.

Contributor Author (@rkapka, Jun 21, 2024): The idea here is to not set the timeout to c.timeout when the context has a deadline longer than c.timeout; otherwise the request would finish too quickly. Currently we use a custom Client with a pre-set timeout, so even though we want to wait up to 5 minutes when pushing proposer settings, the client's timeout disallows this.
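To illustrate the interaction (a standalone sketch, not code from this PR; the URL is a placeholder):

func demo() error {
	// The client-level timeout always applies: the effective deadline is
	// the shorter of client.Timeout and the request context's deadline.
	client := &http.Client{Timeout: 10 * time.Second}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://example.com/slow", nil)
	if err != nil {
		return err
	}
	// Despite the 5-minute context, this fails after ~10s on a slow server,
	// which is why the PR uses the default client and applies c.timeout per
	// request only when the caller has not set a deadline.
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}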

Comment on lines +45 to +50
	_, hasDeadline := ctx.Deadline()
	if !hasDeadline {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, c.timeout)
		defer cancel()
	}
Member: Same question here, I think you should always set the timeout.

		log.WithError(err).Fatal("Failed to update proposer settings") // allow fatal. skipcq

	pushProposerSettingsChan := make(chan struct{}, 1)
	go func() {
Contributor: I wonder if breaking some of these goroutines out into separate functions would make this cleaner to review... there's a lot going on in this file and a lot of variables declared.

			}
		}
	}()
	pushProposerSettingsChan <- struct{}{}
Contributor: Is this supposed to push this empty struct? Why?

Contributor Author: You need to send something on the channel. The empty struct signals that the value itself is not important and should be discarded.
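A tiny standalone illustration of the idiom (not PR code):

// chan struct{} carries no data, only the fact that something happened.
// struct{} occupies zero bytes, and a buffer of 1 lets a trigger fire
// without blocking when no signal is pending yet.
signal := make(chan struct{}, 1)

go func() {
	<-signal // we only care that a send occurred, not what was sent
	// ... react to the trigger ...
}()

signal <- struct{}{} // the value itself is meaningless and is discarded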

Labels: Ready For Review (a pull request ready for code review), Validator Client
3 participants