Redesign the runner #14096

Draft · rkapka wants to merge 12 commits into develop from redesign-health-check

Conversation

@rkapka (Contributor) commented Jun 10, 2024

UPDATE

I made some changes since the original description. The main realization is that the main for loop should be as "pure" as possible: with every new case in the select statement, we increase the likelihood of duty handling not starting at the very beginning of the slot. Originally the main loop had only one purpose, which was handling duties. I wanted to go back to this, so I moved all other channels to goroutines dedicated just to them. That way there is no contention in the select statement and the slot ticker handler can fire immediately. I also removed close(eventsChannel) from the event stream, because reading from a closed channel always succeeds, resulting in a never-ending sequence of empty events when the event stream gets disconnected. https://rauljordan.com/no-sleep-until-we-build-the-perfect-library-in-go/ is a great resource showing what we should build eventually.
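To make that concrete, here is a minimal, self-contained Go snippet (illustration only, not code from this PR) showing why close(eventsChannel) had to go: once a channel is closed, every receive succeeds immediately with the zero value.

package main

import "fmt"

func main() {
	events := make(chan string, 1)
	events <- "head"
	close(events)

	for i := 0; i < 3; i++ {
		// The first receive yields the buffered event. Every receive after
		// that returns immediately with ("", false), so a consumer looping
		// on the channel would spin forever on empty events.
		ev, ok := <-events
		fmt.Printf("event=%q ok=%v\n", ev, ok)
	}
}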


Main functionality

The purpose of this PR is to make the validator client's runner more robust. The changes can be split into 3 categories.

1. Updating duties at proper times

This PR introduces an updateDutiesNeeded variable that is initially set to false and is checked every slot. We update duties once before the runner's main for loop, and then conditionally at the start of each slot when an update is needed, namely:

  • at the beginning of an epoch
  • after a switchover to a backup node
  • after account changes

Currently UpdateDuties exits early if the validator has any duties and we are not at the start of an epoch. This means that the runner itself has no control over when duties are updated, as it can't enforce an update. Because of this, we remove the condition from UpdateDuties and leave the decision to the caller.
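A minimal sketch of this flag-based flow (names such as slotTicker and isEpochStart are illustrative; this is not the actual runner code):

// Update once before the loop, then only when a trigger has set the flag.
if err := v.UpdateDuties(ctx, startSlot); err != nil {
	log.WithError(err).Error("Failed to update duties")
}
updateDutiesNeeded := false
for {
	select {
	case <-ctx.Done():
		return
	case slot := <-slotTicker.C():
		if isEpochStart(slot) {
			updateDutiesNeeded = true // also set on switchover and account changes
		}
		if updateDutiesNeeded {
			// Blocking on purpose: performing roles depends on fresh duties.
			if err := v.UpdateDuties(ctx, slot); err != nil {
				log.WithError(err).Error("Failed to update duties")
				continue
			}
			updateDutiesNeeded = false
		}
		// ... perform the roles for this slot ...
	}
}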

2. Pushing proposer settings at proper times

Similarly to updating duties, there are times when we want to push proposer settings:

  • at the beginning of an epoch
  • when the beacon node becomes healthy after a period of being unhealthy
  • after account changes

But while updating duties should be blocking, because performing roles depends on it, pushing proposer settings can be done asynchronously. If the validator client serves many keys, pushing proposer settings can take a non-trivial amount of time, and we want to keep attesting/proposing in the meantime.

This PR uses a goroutine+channel combination to signal that proposer settings should be pushed. Adding more triggers in the future is then a simple one-liner, and the pattern is idiomatic Go.
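A sketch of the combination (names are illustrative, though the diff below uses a very similar pushProposerSettingsChan; pushProposerSettings stands in for the real call):

// A dedicated goroutine owns the blocking work; the rest of the runner
// only ever sends a signal on the channel.
pushProposerSettingsChan := make(chan struct{}, 1)
go func() {
	for {
		select {
		case <-ctx.Done():
			return
		case <-pushProposerSettingsChan:
			if err := pushProposerSettings(ctx); err != nil {
				log.WithError(err).Error("Failed to push proposer settings")
			}
		}
	}
}()

// Each trigger (epoch start, node healthy again, account changes) is then
// the one-liner mentioned above:
pushProposerSettingsChan <- struct{}{}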

3. Switching over to a backup node

This PR simplifies how we track the health status of the beacon node inside the runner, switching to a polling solution instead of a channel. Polling the latest value is much simpler: the current channel-based implementation works only for a single subscriber, whereas polling works for any number of subscribers, so adding more subscribers in the future is trivial.

We move the health ticker into the tracker and update the health status internally. We also add a new ticker to the runner that fires at the last second of the slot, upon which we poll for health. One additional change is an initializationNeeded variable that signals we need to make sure the beacon node is synced before we try performing duties again. Without this, switching over to a beacon node that is not 100% ready (or the original node becoming healthy again) results in lots of errors in the validator client's output.
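A sketch of the polling tracker under assumed names (the HealthChecker interface and field layout are illustrative; params.BeaconConfig().SecondsPerSlot matches the diff shown later):

type HealthChecker interface {
	IsHealthy(ctx context.Context) bool
}

type NodeHealthTracker struct {
	mu      sync.RWMutex
	healthy bool
	node    HealthChecker
}

// Start owns the ticker and updates the status internally.
func (t *NodeHealthTracker) Start(ctx context.Context) {
	ticker := time.NewTicker(time.Duration(params.BeaconConfig().SecondsPerSlot) * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			h := t.node.IsHealthy(ctx)
			t.mu.Lock()
			t.healthy = h
			t.mu.Unlock()
		}
	}
}

// IsHealthy returns the latest polled value and is safe for any number of
// concurrent subscribers, unlike a single-receiver channel.
func (t *NodeHealthTracker) IsHealthy() bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return t.healthy
}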

Other

There are a few other things included in this PR, which are small improvements and bug fixes.

  • Always cancel the slot context to prevent context leaks.
  • Use the default HTTP client in JsonRestHandler and save the default timeout in the struct. A timeout set on an HTTP client caps every request at that value, even when the request's context allows more time; in our case, pushing proposer settings has a timeout of 5 minutes that would never be respected.
  • Use a lock when we update the pubkeyToValidatorIndex map (see the sketch after this list). Otherwise there is a possibility of a panic due to concurrent map writes (I ran into it when testing pushing proposer settings).
  • Use a lock in event-stream related functions. Otherwise the event stream can be started multiple times. This happens in practice when starting the validator client on a not-yet-synced node: as the node syncs to the head, a lot of things happen and the stream is usually started twice.
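As referenced in the third bullet, a minimal sketch of the locking fix (field and method names are illustrative, not the actual validator code):

type validator struct {
	pubkeyToValidatorIndexLock sync.RWMutex
	pubkeyToValidatorIndex     map[[48]byte]primitives.ValidatorIndex
}

// setIndex guards the write; without the lock, two goroutines writing the
// map concurrently panic with "concurrent map writes".
func (v *validator) setIndex(pubkey [48]byte, idx primitives.ValidatorIndex) {
	v.pubkeyToValidatorIndexLock.Lock()
	defer v.pubkeyToValidatorIndexLock.Unlock()
	v.pubkeyToValidatorIndex[pubkey] = idx
}

func (v *validator) index(pubkey [48]byte) (primitives.ValidatorIndex, bool) {
	v.pubkeyToValidatorIndexLock.RLock()
	defer v.pubkeyToValidatorIndexLock.RUnlock()
	idx, ok := v.pubkeyToValidatorIndex[pubkey]
	return idx, ok
}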

Follow-up questions

  • Can we make pushing proposer settings fast enough that we can just do it every slot? The 5-minute deadline is very long.
  • As mentioned earlier, a lot of things happen as the node syncs to the head, and the for loop execution becomes nondeterministic, resulting in things like proposer settings being pushed several times or the event stream being started several times. Although this could probably be fixed by changing how we execute each of these, similar issues might arise in the future wherever we rely on a specific code path being executed only once but the for loop behaves unexpectedly. Can we make it more predictable?

@rkapka rkapka requested a review from a team as a code owner June 10, 2024 11:31
@rkapka rkapka force-pushed the redesign-health-check branch 2 times, most recently from 9f030c5 to ca08db6 on June 10, 2024 11:43
@rkapka rkapka added the Ready For Review (a pull request ready for code review) and Validator Client labels Jun 10, 2024
@@ -125,18 +126,39 @@ func run(ctx context.Context, v iface.Validator) {
				continue
			}
			performRoles(slotCtx, allRoles, v, slot, &wg, span)
		case isHealthyAgain := <-healthTracker.HealthUpdates():
			if isHealthyAgain {
		case <-v.LastSecondOfSlot():
Contributor: I'm kinda worried about the contents of this delaying the normal processing of the next slot above. If it fires at the last second of the slot, wouldn't that mean we only have one second to do the work below? Otherwise this loop would be delayed.

Contributor Author: With the newest changes there is next to no work here.

func (n *NodeHealthTracker) HealthUpdates() <-chan bool {
	return n.healthChan
}

log.Infof("Starting health check routine. Health check will be performed every %d seconds", params.BeaconConfig().SecondsPerSlot)
ticker := time.NewTicker(time.Duration(params.BeaconConfig().SecondsPerSlot) * time.Second)
Contributor: Nice, this seems like an improvement.

@rkapka rkapka marked this pull request as draft June 10, 2024 16:23
@rkapka rkapka changed the title from Redesign health check to Redesign the runner on Jun 10, 2024
@@ -269,6 +274,7 @@ func (v *validator) WaitForChainStart(ctx context.Context) error {
	}

	v.genesisTime = chainStartRes.GenesisTime
	log.WithField("genesisTime", time.Unix(int64(v.genesisTime), 0)).Info("Beacon chain started")
Contributor Author: Moved from setTicker.

	v.dutiesLock.Unlock()

	v.logDuties(slot, resp.CurrentEpochDuties, resp.NextEpochDuties)
Contributor Author: No need to hold the lock.

Comment on lines 172 to 174
			v.ChangeHost()
			initializationNeeded = true
			updateDutiesNeeded = true
Contributor Author: One question that might come up is "why do we update duties but don't push proposer settings here?"

The answer is that proposer settings would have been pushed immediately, without waiting for the beacon node to become healthy, whereas duties will not be updated before the node comes back up.

@prestonvanloon prestonvanloon self-requested a review June 13, 2024 14:36
		}
	}()
	pushProposerSettingsChan <- headSlot
Contributor: Does this trigger make sense? You can get the canonical head slot from within that goroutine too, right? You can call v.CanonicalHeadSlot.

		case <-ctx.Done():
			return
		case slot := <-pushProposerSettingsChan:
			go func() {
Contributor: I'm afraid of spawning too many go channels.

Contributor Author: Do you mean too many goroutines? They are super lightweight; it's fine to have thousands of them.

Contributor: Right, too many goroutines 🤔

	} else {
		log.WithError(err).Fatal("Failed to update proposer settings") // allow fatal. skipcq

	pushProposerSettingsChan := make(chan primitives.Slot, 1)
Contributor: If you start the validator on the first slot of an epoch, would it block or error on this? (That case would send on this channel twice.)

Contributor Author: It would block, but only for a split second, because each receive from the channel spawns a goroutine.

		slotSpan             *trace.Span
		slotCtx              context.Context
		slotCancel           context.CancelFunc
		initializationNeeded = false
Contributor: Maybe we should just have some kind of validator client status; it feels hacky to use a variable like this.

Contributor Author: I agree it might be a good idea, but it would probably result in a much different design, further complicating an already non-trivial PR.

@@ -1,112 +0,0 @@
package beacon
Member: Deleting all tests for the health tracker? Shouldn't there be some tests?

Contributor Author: Two reasons:

  • there is barely any logic
  • the update takes place every 12 seconds, so the test would be super slow (we could make the interval a parameter, but it would exist only for testing, and I am not a big fan of that)

@@ -2,11 +2,8 @@ load("@prysm//tools/go:def.bzl", "go_library")

go_library(
name = "go_default_library",
Member:

Suggested change:

    name = "go_default_library",
    testonly = True,

Please add the testonly attribute so we are sure this doesn't link into a production binary.

Comment on lines +266 to +267
	c.eventStreamLock.Lock()
	defer c.eventStreamLock.Unlock()
Member: Is this called from multiple goroutines? If so, we may want to use RLock here.

Contributor Author: No, starting should be done from only one place.

Comment on lines +84 to +89
	_, hasDeadline := ctx.Deadline()
	if !hasDeadline {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, c.timeout)
		defer cancel()
	}
Member: Why only set a timeout if one isn't set already? I would expect c.timeout to set the timeout every time.

context.Context will cancel if the parent context has a shorter deadline anyway, so it should be safe to always set the timeout.

Contributor Author (@rkapka, Jun 21, 2024): The idea here is to not set the timeout to c.timeout when the context has a deadline longer than c.timeout; otherwise the request would finish too quickly. Currently we use a custom Client with a pre-set timeout, so even though we want to wait up to 5 minutes when pushing proposer settings, the client's timeout disallows this.
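To illustrate the interaction (a standalone sketch, not code from this PR; the URL is a placeholder):

func demo() error {
	// The client-level timeout always applies: the effective deadline is
	// the shorter of client.Timeout and the request context's deadline.
	client := &http.Client{Timeout: 10 * time.Second}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://example.com/slow", nil)
	if err != nil {
		return err
	}
	// Despite the 5-minute context, this fails after ~10s on a slow server,
	// which is why the PR uses the default client and applies c.timeout per
	// request only when the caller has not set a deadline.
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}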

Comment on lines +45 to +50
	_, hasDeadline := ctx.Deadline()
	if !hasDeadline {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, c.timeout)
		defer cancel()
	}
Member: Same question here, I think you should always set the timeout.

		log.WithError(err).Fatal("Failed to update proposer settings") // allow fatal. skipcq

	pushProposerSettingsChan := make(chan struct{}, 1)
	go func() {
Contributor: I wonder if breaking some of these goroutines out into separate functions would make this cleaner to review... there's a lot going on in this file and a lot of variables declared.

			}
		}
	}()
	pushProposerSettingsChan <- struct{}{}
Contributor: Is this supposed to push this empty struct? Why?

Contributor Author: You need to send something on the channel. The empty struct signals that the value itself is not important and should be discarded.
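A tiny standalone illustration of the idiom (not PR code):

// chan struct{} carries no data, only the fact that something happened.
// struct{} occupies zero bytes, and a buffer of 1 lets a trigger fire
// without blocking when no signal is pending yet.
signal := make(chan struct{}, 1)

go func() {
	<-signal // we only care that a send occurred, not what was sent
	// ... react to the trigger ...
}()

signal <- struct{}{} // the value itself is meaningless and is discarded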

Labels: Ready For Review (a pull request ready for code review), Validator Client
3 participants