Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validator historical state restoration #922

Open
wants to merge 45 commits into
base: master
Choose a base branch
from

Conversation

marcinczenko
Copy link
Contributor

@marcinczenko marcinczenko commented Sep 26, 2024

At the moment the validator has started, there may already be existing active storage requests containing slots with slot ids that should be monitored by the validator: these are slot ids that either belong to the group observed by the validator or basically any slot id if validator is to monitor the whole slot id space.

In this context we want the validator to find and monitor all those slots id associated with storage requests that started earlier and have not yet concluded.

For more details, please refer to the discussion in #458.

It is work in progress, but open to suggestions...

@marcinczenko marcinczenko added Marketplace See https://miro.com/app/board/uXjVNZ03E-c=/ for details WIP Work in Progress (but open to comments) labels Sep 26, 2024
@marcinczenko marcinczenko self-assigned this Sep 26, 2024
@marcinczenko marcinczenko marked this pull request as draft September 26, 2024 14:28
@marcinczenko marcinczenko removed the WIP Work in Progress (but open to comments) label Sep 26, 2024
@marcinczenko marcinczenko changed the base branch from master to validator-partitioning-slotid-space September 26, 2024 18:09
@marcinczenko marcinczenko force-pushed the validator-partitioning-slotid-space branch from 3b4a7fe to 89738ff Compare October 1, 2024 02:15
@marcinczenko marcinczenko force-pushed the validator-partitioning-slotid-space branch 3 times, most recently from fc34613 to 1459644 Compare October 2, 2024 20:20
Base automatically changed from validator-partitioning-slotid-space to master October 2, 2024 23:09
@marcinczenko marcinczenko force-pushed the validator-historical-state branch 3 times, most recently from 92b6e96 to 7b20c4b Compare October 9, 2024 01:58
@marcinczenko marcinczenko force-pushed the validator-historical-state branch 2 times, most recently from 77fcad2 to 4df6a21 Compare October 14, 2024 23:12
@marcinczenko marcinczenko marked this pull request as ready for review October 15, 2024 01:28
Comment on lines 202 to 207
let slotFreedSubscription =
await trackSlotsFreed(requestId, marketplace)

let expectedSlotsFreed = nodes - tolerance
check eventually((slotsFreed.len == expectedSlotsFreed),
timeout=(duration - expiry).int * 1000)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markspanbroek suggest that based on the observation that expectedSlotsFreed should be equal to tolerance + 1, it might be sufficient to simply wait for the storage request to fail instead of tracking slots freed. Make sense to me, I will check it out and push changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so I took a mental step back. The purpose of this test is to check that validators are doing their work. We can detect that by looking at failing request, but this is actually not the most direct way of sensing that. Tracking events is. We can sense that slots are being freed and that request fails and this would not happen without properly working validator. Sensing a failing request by polling the state of the purchase should bring us to the same results, yet, the sensing path is longer and more pone to an accidental errors which does not currently seem to be completely avoidable. Errors like:

Exception in blockexc read loop   topics="codex blockexcnetworkpeer" tid=13304 msg="Stream Closed!"

or:

Unhandled exception in async proc, aborting topics="codex" tid=13640 msg="C:\\msys64\\home\\code\\...\\Nim\\lib\\pure\\json.nim(824, 10) `node.kind == JArray` : items() can not iterate a JsonNode of kind JNull"

Especially that those problems mostly happen on slow and less reliable Windows runners.

Thus, while looking for failing purchase is elegant and strips some of the code from the test, it is not the shortest and the most reliable way of testing what is supposed to be tested here.

In this spirit, and to keep the integration tests as short as possible, I decided to track the marketplace events to confirm that validator correctly detected the missing proofs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first exception shouldn't really be affecting validation as it has to do with the block exchange. This could affect datasets from being downloaded, but that's doesn't seem to be an issue.

The second exception was related to when polled (http) subscriptions returned null instead of [] (possibly due to an expired filter on the client). This was fixed in codex-storage/nim-ethers#94.

"Sensing a failing request" can be done by:

  1. Checking the request state: await marketplace.getRequest(requestId.get) (although we should probably expose this on a REST API endpoint instead of querying the contract directly, for another PR)
  2. Checking the slotState for all slots in the request (longer, more verbose)

If the contract is showing that a request is failed, we can trust that it is failed for certain.

Comment on lines 119 to 129
let slotFreedSubscription =
await trackSlotsFreed(requestId, marketplace)

let expectedSlotsFreed = nodes - tolerance
check eventually((slotsFreed.len == expectedSlotsFreed),
timeout=(duration - expiry).int * 1000)

# Because of erasure coding, if e.g. 2 out of 3 nodes are freed, the last
# node will not be freed but marked as "Failed" because the whole request
# will fail. For this reason we need an extra check:
await checkSlotsFailed(slotsFilled, slotsFreed, marketplace)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marcinczenko marcinczenko force-pushed the validator-historical-state branch 4 times, most recently from 7e06bd8 to 165d514 Compare October 18, 2024 03:14
@marcinczenko
Copy link
Contributor Author

The reason for occasional integration tests failure on Windows is described in codex-storage/nim-ethers#79.

Copy link
Contributor

@emizzle emizzle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, Marcin, thank you. It took me quite a few hours to review this PR. In the future, any kind of split up of the work (into separate PRs) would be much appreciated!

# CodexConfigs.init(nodes=1)
# .withEthProvider("http://localhost:8545")
# .some,
# ...
template multinodesuite*(name: string, body: untyped) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a new template, maybe the provider url could be an optional parameter instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the fixes from #993, we do not really need this new template. It will be removed.

@@ -408,18 +407,168 @@ method subscribeProofSubmission*(market: OnChainMarket,
method unsubscribe*(subscription: OnChainMarketSubscription) {.async.} =
await subscription.eventSubscription.unsubscribe()

method queryPastEvents*[T: MarketplaceEvent](
proc blockNumberForBlocksEgo*(provider: Provider,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a spelling mistake. Ego => Ago

🟡 just a preference suggestion. Would be more clear to call this with something that includes "block tag" since that's what we're returning, eg pastBlockTag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, block numbers probably do not have much of an "ego". Thanks!

For the blocksAgo param: passing a BlockTag would mean I do know to which block number I want to go. This function is meant to find that block knowing only how many block I want go backward. That's also why int is natural choose. All did I misunderstand you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you misunderstood. I was referring to the return type, not the parameter type. Calling this function pastBlock(blocksAgo: int): BlockTag or pastBlockTag(blocksAgo: int): BlockTag just seems to be a bit easier to swallow to me.

await provider.blockNumberAndTimestamp(BlockTag.init(low))
let (_, highTimestamp) =
await provider.blockNumberAndTimestamp(BlockTag.init(high))
debug "[binarySearchFindClosestBlock]:", epochTime = epochTime,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The text in most of these logging statements are just the function name, which indicates to me that they are mostly useful for debugging during development. I'm not entirely sure these will be useful to have in the logs going forward. Wdyt? At the very least, they should be trace statements, but my opinion here is that they would create a lot of logging clutter. Imo, we should be able to rely on the tests to know the functionality is working correctly.
  • tip: with chronicles, if the name of the logging property matches the symbol name, you don't need the = symbol part, eg debug "[binarySearchFindClosestBlock]:", epochTime, lowTimestamp, highTimestamp, low, high

let head = await provider.getBlockNumber()
return BlockTag.init(head - blocksAgo.abs.u256)

proc blockNumberAndTimestamp*(provider: Provider, blockTag: BlockTag):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these functions introduced in the Market module, should use Market as the first parameter. Then, inside the proc, provider can be accessed by market.contract.provider. Otherwise, outside callers would be able to use these procs with a foreign provider which could cause undetermined outcomes.

Why are these procs exported?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I don't think these functions belong in the market abstraction and should probably go into ethers...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This proc is kind of taking the values returned from provider.getBlock and making assumptions about raising exceptions. In all the callsites, these errors are not handled. A few suggestions to improve this, assuming also these functions will be moved to ethers, would be:

  1. Annotate this proc with {.async: (raises: [ErrorType1, ...ErrorTypeN].} (start with raises:[] and the compiler will tell you what will be raised). Do the same for all the callsites that don't handle these exceptions.
  2. Do not export it, because it should be internal to the provider use. External callers can get blocks and handle exceptions as they need, without assumptions made for them.
  3. Assert that blockTag != BlockTag.pending at the beginning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a good point to move it into nim-ethers. Let me think shortly about it.

Comment on lines +512 to +523
if earliestBlockTimestamp == epochTimeUInt256:
return earliestBlockNumber
if latestBlockTimestamp == epochTimeUInt256:
return latestBlockNumber

if earliestBlockNumber > 0 and earliestBlockTimestamp > epochTimeUInt256:
let availableHistoryInDays =
(latestBlockTimestamp - earliestBlockTimestamp) div
initDuration(days = 1).inSeconds.u256
warn "Short block history detected.", earliestBlockTimestamp =
earliestBlockTimestamp, days = availableHistoryInDays
return earliestBlockNumber
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from getting the block number and timestamp for earliest/latest, this block is mostly validation for the binary search routine. I think it would be a bit cleaner if we moved the validation into the binary search routine as conditional returns and converted that routine into a recursive function.

cancelExpression = isRequestCancelled(requestId))

# extra check
await marketplace.checkSlotsFailed(requestId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need more than one check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By checking that the remaining slots, for which we did not capture slotFreed event, have status failed, we de-facto confirm that the request has failed.
This extra check documents the logic (that in the end the request should fail), and does not cost anything, so why not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason the extra slots would have a slotState of failed is because the request state is Failed. Having contract event emissions not aligned with request and slot states is not something we should be testing here, and should leave it to testing inside codex-contracts-eth and formal verification. We do not need to check contract events to be sure of this.

Also, keeping harness logic to a minimum means the less of a chance we have of introducing a bug in the harness itself.

# extra check
await marketplace.checkSlotsFailed(requestId)

await stopTrackingEvents()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the event tracking work, but I'm not convinced we really need it...

Comment on lines +265 to +273
without purchaseState =? client0.getPurchase(purchaseId).?state:
fail()
return

debug "validation suite", purchaseState = purchaseState

if purchaseState != "started":
fail()
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like extra if we're already checking this on line 261.

Comment on lines +306 to +311
check eventuallyS(checkSlotsFreed(requestId, expectedSlotsFreed),
timeout = secondsTillRequestEnd + 60, step = 5,
cancelExpression = isRequestCancelled(requestId))

# extra check
await marketplace.checkSlotsFailed(requestId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as previous test

Comment on lines +14 to +22
template marketplacesuite*(name: string,
body: untyped) =
marketplacesuiteWithProviderUrl name, "http://localhost:8545":
body

# we can't just overload the name and use marketplacesuite here
# see: https://github.com/nim-lang/Nim/issues/14827
template marketplacesuiteWithProviderUrl*(name: string,
jsonRpcProviderUrl: string, body: untyped) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, it would be helpful to the reviewers if we could try to keep indirectly related changes in a separate PR if possible, please 🙏

…sts as a workaround for dropped subscriptions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Marketplace See https://miro.com/app/board/uXjVNZ03E-c=/ for details
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants