Late-start with no state #477

Closed
2 tasks done
Stebalien opened this issue Jul 16, 2024 · 13 comments · Fixed by #488
@Stebalien
Member

Stebalien commented Jul 16, 2024

@ZenGround0 ran into an issue where F3 failed to load state on start. @rvagg pointed out that this was likely because F3 reached back too far (after the splitstore deleted state).

This likely happened because calibnet failed to start F3 due to insufficient power. Restarting the node attempted to re-bootstrap F3 and failed because the power table had been garbage collected.

We need to solve this in a couple of ways:

  1. For all nodes participating in F3, we cannot GC state until we have a finality certificate for that Epoch. Alternatively, we could pre-compute power tables and save them to the certificate store once we reach EC finality. This means we'll need to store power tables and power table deltas before computing finality certificates, so it's a bit trickier.
  2. We need to handle the late-bootstrap case. I.e., a node joins later with a snapshot. Eventually, we'll be able to solve this by shipping the latest finality certificate with that snapshot, but we can't do that till F3 actually starts working... it would be kind of nice to have a way to free-start from some epoch. I.e., take the current epoch, go back 900 epochs, ask someone for a finality certificate including that specific tipset, validate that the finality certificate has the correct power table, then start there.
  3. Finally, we need to generally be robust against missing state.
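
The late-bootstrap idea in item 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Certificate` type, the `fetchCert` and `localPowerTableHash` callbacks, and the string-valued power table commitment are all hypothetical and do not mirror go-f3's actual types. It also assumes the node can recompute the power table commitment from its own state at the target epoch.

```go
package main

import (
	"errors"
	"fmt"
)

// lookback is the 900-epoch distance mentioned in the issue.
const lookback = 900

// Certificate is a hypothetical, simplified finality certificate.
type Certificate struct {
	Instance       uint64
	FirstEpoch     int64  // first epoch finalized by this certificate
	LastEpoch      int64  // last epoch finalized by this certificate
	PowerTableHash string // assumed commitment to the power table
}

// LateStart picks the epoch `lookback` epochs behind the current one, asks a
// peer (via the hypothetical fetchCert callback) for a certificate covering
// it, and accepts the certificate only if its power table commitment matches
// the one computed from local state.
func LateStart(
	currentEpoch int64,
	fetchCert func(epoch int64) (Certificate, error),
	localPowerTableHash func(epoch int64) (string, error),
) (Certificate, error) {
	target := currentEpoch - lookback
	if target < 0 {
		return Certificate{}, errors.New("chain is shorter than the lookback")
	}
	cert, err := fetchCert(target)
	if err != nil {
		return Certificate{}, fmt.Errorf("fetching certificate for epoch %d: %w", target, err)
	}
	if target < cert.FirstEpoch || target > cert.LastEpoch {
		return Certificate{}, fmt.Errorf("certificate %d does not cover epoch %d", cert.Instance, target)
	}
	want, err := localPowerTableHash(target)
	if err != nil {
		return Certificate{}, fmt.Errorf("loading power table at epoch %d: %w", target, err)
	}
	if cert.PowerTableHash != want {
		return Certificate{}, fmt.Errorf("power table mismatch at epoch %d", target)
	}
	return cert, nil
}
```

A node would then resume F3 from the returned certificate's instance instead of from the original bootstrap epoch.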

@Stebalien
Member Author

For 1.b. (store power tables), we could trigger this from splitstore GC. I.e., before the splitstore garbage collects, it would make sure to "inject" power table deltas. But again, we'd need to change the finality certificate store to store power deltas separate from finality certificate information for this to work.

@Stebalien
Member Author

For 2.b, we'd either need an index of epoch->finality cert, or we'd need to binary search for the right finality certificate. Honestly, I'm tempted to just binary search (maybe with an in-memory cache so we can easily jump to the correct range of finality certificates to search).
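
A minimal sketch of the binary search, assuming certificates are ordered by instance and cover strictly increasing, non-overlapping epoch ranges. The `CertRange` type is hypothetical, and a real store would presumably look certificates up by instance number rather than hold them all in a slice.

```go
package main

import "sort"

// CertRange is a hypothetical summary of the epochs one certificate finalizes.
type CertRange struct {
	Instance   uint64
	FirstEpoch int64
	LastEpoch  int64
}

// FindCert returns the index of the certificate whose epoch range contains
// epoch, or false if no certificate covers it. certs must be sorted by
// instance with increasing, non-overlapping epoch ranges.
func FindCert(certs []CertRange, epoch int64) (int, bool) {
	// Find the first certificate that ends at or after the epoch.
	i := sort.Search(len(certs), func(i int) bool {
		return certs[i].LastEpoch >= epoch
	})
	if i == len(certs) || certs[i].FirstEpoch > epoch {
		return 0, false
	}
	return i, true
}
```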

@Stebalien
Member Author

Alternative (and probably better) way to solve 2:

  1. Before the bootstrap epoch, do nothing.
  2. If F3 doesn't produce a finality certificate before the initial epoch (bootstrap-900) falls out of the snapshot, start including the power table from that epoch in the snapshot.
  3. Finally, once F3 starts working, ship the last finality certificate with the snapshot (or all of them if they're small enough...).

This doesn't require any new protocols.
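
The three phases above could be expressed as a small decision function. This is a sketch only: the `Extra` values, the function name, and its parameters are hypothetical, and it assumes the snapshot exporter knows the oldest epoch whose state it still includes.

```go
package main

// Extra describes what additional F3 data a snapshot should carry.
type Extra int

const (
	ExtraNone              Extra = iota // nothing beyond the usual snapshot
	ExtraBasePowerTable                 // power table from bootstrap-900
	ExtraLatestCertificate              // most recent finality certificate
)

// SnapshotExtra decides what to ship with a snapshot taken at head.
// oldestStateEpoch is the oldest epoch whose state the snapshot includes.
func SnapshotExtra(head, bootstrap, oldestStateEpoch int64, haveCertificate bool) Extra {
	base := bootstrap - 900 // the initial epoch from the issue
	switch {
	case head < bootstrap:
		return ExtraNone // phase 1: F3 has not started
	case haveCertificate:
		return ExtraLatestCertificate // phase 3: F3 is producing certificates
	case base < oldestStateEpoch:
		return ExtraBasePowerTable // phase 2: base state fell out of the snapshot
	default:
		return ExtraNone // base state is still inside the snapshot
	}
}
```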

@Kubuxu
Contributor

Kubuxu commented Jul 16, 2024

Now that this cropped up, I remember why I wanted to ship the initial power table in the manifest.
One alternative is to use a frozen initial power table and ship it in Manifest.

On the other hand, I don't think we need to store all of them; we need the initial one. After that, we should be able to use finality certificates (with their deltas) to move forward until we are within the range of the available state.
I want Lotus to be able to validate finality certificates from instance number zero, allowing for a trustless bootstrap from a snapshot.
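
To illustrate moving forward from an initial power table using only certificate deltas, here is a simplified sketch. The `PowerTable` and `Delta` types are hypothetical stand-ins (integer power, no signing keys) rather than go-f3's actual representation.

```go
package main

import "fmt"

// PowerTable maps a participant ID to its power (simplified).
type PowerTable map[uint64]int64

// Delta is a hypothetical per-participant power change.
type Delta struct {
	ParticipantID uint64
	PowerDelta    int64
}

// ApplyDeltas returns a new table with deltas applied; base is not modified.
// Participants whose power drops to zero are removed.
func ApplyDeltas(base PowerTable, deltas []Delta) (PowerTable, error) {
	next := make(PowerTable, len(base))
	for id, power := range base {
		next[id] = power
	}
	for _, d := range deltas {
		power := next[d.ParticipantID] + d.PowerDelta
		switch {
		case power < 0:
			return nil, fmt.Errorf("participant %d would have negative power", d.ParticipantID)
		case power == 0:
			delete(next, d.ParticipantID)
		default:
			next[d.ParticipantID] = power
		}
	}
	return next, nil
}

// Replay applies the deltas of consecutive certificates, in order, on top of
// the initial power table.
func Replay(initial PowerTable, certDeltas [][]Delta) (PowerTable, error) {
	current := initial
	for i, deltas := range certDeltas {
		next, err := ApplyDeltas(current, deltas)
		if err != nil {
			return nil, fmt.Errorf("certificate %d: %w", i, err)
		}
		current = next
	}
	return current, nil
}
```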

@Kubuxu
Contributor

Kubuxu commented Jul 16, 2024

> If F3 doesn't produce a finality certificate before the initial epoch (bootstrap-900) falls out of the snapshot, start including the power table from that epoch in the snapshot.

This doesn't solve the splitstore + bootstrap issue, we would have to tag the power table as "not to be removed", but also it gets more complex in the late start scenario.

@Stebalien
Member Author

> This doesn't solve the splitstore + bootstrap issue, we would have to tag the power table as "not to be removed", but also it gets more complex in the late start scenario.

Ah, yeah, we'd need to include all the "deltas" as well.

> On the other hand, I don't think we need to store all of them; we need the initial one. After that, we should be able to use finality certificates (with their deltas) to move forward until we are within the range of the available state.

The issue is phase 2. We won't have deltas/certificates in that case because F3 hasn't really "started".

> Now that this cropped up, I remember why I wanted to ship the initial power table in the manifest.
> One alternative is to use a frozen initial power table and ship it in Manifest.

Same as above. We need the deltas as well.

@Kubuxu
Contributor

Kubuxu commented Jul 16, 2024

We can fetch the deltas from cert-exchange, assuming we made progress initially while the state was available.
I'm not sure what to do if we don't make progress though.

@Kubuxu
Contributor

Kubuxu commented Jul 16, 2024

For bootstrap (if no finality certificates were formed), we could shift the bootstrap epoch every N epochs.
For example, after k*450 epochs are finalized past the bootstrap, use that point as the bootstrap.
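
The arithmetic of this proposal can be sketched as below, assuming the node can tell which epoch is the latest finalized one. The function name and the `shiftInterval` constant are hypothetical.

```go
package main

// shiftInterval is the 450-epoch step from the example above.
const shiftInterval = 450

// ShiftedBootstrap returns the bootstrap epoch moved forward by the largest
// whole multiple of shiftInterval that is already finalized past the
// original bootstrap epoch.
func ShiftedBootstrap(bootstrap, finalized int64) int64 {
	if finalized < bootstrap {
		return bootstrap
	}
	k := (finalized - bootstrap) / shiftInterval
	return bootstrap + k*shiftInterval
}
```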

@Stebalien
Member Author

> For bootstrap (if no finality certificates were formed), we could shift the bootstrap epoch every N epochs.
> For example, after k*450 epochs are finalized past the bootstrap, use that point as the bootstrap.

We have no good way to know that for sure.

> I'm not sure what to do if we don't make progress though.

We can:

  1. Save the power table from the bootstrap base epoch (bootstrap - 900).
  2. As the chain advances, continuously save power table deltas.
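
These two steps could look roughly like the following: keep the base power table once, then record only the difference between consecutive tables as the chain advances. All types and names here are hypothetical, power is simplified to an integer, and `tableAt` stands in for whatever loads a power table from chain state.

```go
package main

import "sort"

// PowerTable maps a participant ID to its power (simplified).
type PowerTable map[uint64]int64

// Delta is a hypothetical per-participant power change.
type Delta struct {
	ParticipantID uint64
	PowerDelta    int64
}

// Diff returns the deltas that turn prev into next, sorted by participant ID.
func Diff(prev, next PowerTable) []Delta {
	var out []Delta
	for id, power := range next {
		if d := power - prev[id]; d != 0 {
			out = append(out, Delta{id, d})
		}
	}
	for id, power := range prev {
		if _, ok := next[id]; !ok && power != 0 {
			out = append(out, Delta{id, -power}) // participant left the table
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ParticipantID < out[j].ParticipantID })
	return out
}

// Entry holds the deltas recorded for one epoch.
type Entry struct {
	Epoch  int64
	Deltas []Delta
}

// PowerLog stores the base table plus deltas, independent of certificates.
type PowerLog struct {
	BaseEpoch int64
	Base      PowerTable
	Entries   []Entry
	last      PowerTable
}

// NewPowerLog saves the power table from the bootstrap base epoch
// (bootstrap - 900).
func NewPowerLog(bootstrap int64, tableAt func(epoch int64) PowerTable) *PowerLog {
	base := bootstrap - 900
	table := tableAt(base)
	return &PowerLog{BaseEpoch: base, Base: table, last: table}
}

// Record appends the delta between the previously seen table and table.
// Callers must not mutate table after passing it in.
func (l *PowerLog) Record(epoch int64, table PowerTable) {
	l.Entries = append(l.Entries, Entry{Epoch: epoch, Deltas: Diff(l.last, table)})
	l.last = table
}
```

Because the log is kept separately from the certificate store, it stays usable even when no finality certificate has been produced yet.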

@Kubuxu
Contributor

Kubuxu commented Jul 16, 2024

I think this is the best idea. Do we have the full power table request thing in the finality cert exchange?

@Stebalien
Member Author

> Do we have the full power table request thing in the finality cert exchange?

Yes, but the certificate exchange doesn't have access to power tables after we finalize them. Furthermore, we store deltas along with certificates which we won't have yet.

Basically, we need to implement 1.b from the first comment: store power tables/deltas independently of finality certificates.

@jennijuju
Member

Why is this an epic? Are there more follow-up issues that need to be resolved on this topic?

@jennijuju jennijuju moved this to In progress in F3 Jul 17, 2024
@jennijuju jennijuju moved this from In progress to In review in F3 Jul 17, 2024
@Kubuxu
Contributor

Kubuxu commented Jul 17, 2024

It explains the issue but not the solution. There is a separate issue for the solution.
