Validator historical state restoration #922

marcinczenko · 2024-09-26T14:27:55Z

When the validator starts, there may already be active storage requests containing slots whose slot ids should be monitored by the validator: these are slot ids that either belong to the group observed by the validator or, if the validator is to monitor the whole slot id space, any slot id.

In this context we want the validator to find and monitor all those slot ids associated with storage requests that started earlier and have not yet concluded.

For more details, please refer to the discussion in #458.

It is work in progress, but open to suggestions...

marcinczenko · 2024-10-15T07:53:20Z

tests/integration/testvalidator.nim

+    let slotFreedSubscription = 
+      await trackSlotsFreed(requestId, marketplace)
+
+    let expectedSlotsFreed = nodes - tolerance
+    check eventually((slotsFreed.len == expectedSlotsFreed),
+      timeout=(duration - expiry).int * 1000)


@markspanbroek suggests that, based on the observation that expectedSlotsFreed should be equal to tolerance + 1, it might be sufficient to simply wait for the storage request to fail instead of tracking slots freed. Makes sense to me, I will check it out and push changes.

ok, so I took a mental step back. The purpose of this test is to check that validators are doing their work. We can detect that by looking at the failing request, but this is actually not the most direct way of sensing it. Tracking events is. We can sense that slots are being freed and that the request fails, and this would not happen without a properly working validator. Sensing a failing request by polling the state of the purchase should bring us to the same result, yet the sensing path is longer and more prone to accidental errors, which currently do not seem to be completely avoidable. Errors like:

Exception in blockexc read loop topics="codex blockexcnetworkpeer" tid=13304 msg="Stream Closed!"

or:

Unhandled exception in async proc, aborting topics="codex" tid=13640 msg="C:\\msys64\\home\\code\\...\\Nim\\lib\\pure\\json.nim(824, 10) `node.kind == JArray` : items() can not iterate a JsonNode of kind JNull"

Especially since those problems mostly happen on slow and less reliable Windows runners.

Thus, while looking for a failing purchase is elegant and strips some of the code from the test, it is not the shortest and most reliable way of testing what is supposed to be tested here.

In this spirit, and to keep the integration tests as short as possible, I decided to track the marketplace events to confirm that the validator correctly detected the missing proofs.

The first exception shouldn't really be affecting validation as it has to do with the block exchange. This could prevent datasets from being downloaded, but that doesn't seem to be an issue.

The second exception was related to when polled (http) subscriptions returned null instead of [] (possibly due to an expired filter on the client). This was fixed in codex-storage/nim-ethers#94.

"Sensing a failing request" can be done by:

Checking the request state: await marketplace.getRequest(requestId.get) (although we should probably expose this on a REST API endpoint instead of querying the contract directly, for another PR)

Checking the slotState for all slots in the request (longer, more verbose)

If the contract is showing that a request is failed, we can trust that it is failed for certain.
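As a rough illustration of the "check the request state" option, polling could be sketched as below. This is a minimal, self-contained sketch and not the test harness code: `waitForState`, `fakeGetState` and the local `RequestState` enum are hypothetical stand-ins, and the real test would call `marketplace.getRequest` asynchronously instead of the fake state source used here.

```nim
import std/os

type RequestState {.pure.} = enum
  # Assumed set of states, declared locally for illustration only.
  New, Started, Cancelled, Finished, Failed

proc waitForState(getState: proc (): RequestState,
                  expected: RequestState,
                  attempts: int, stepMs: int): bool =
  ## Polls `getState` up to `attempts` times, sleeping `stepMs`
  ## milliseconds after each unsuccessful poll.
  for _ in 1 .. attempts:
    if getState() == expected:
      return true
    sleep(stepMs)
  false

# Hypothetical state source: reports Started twice, then Failed.
var polls = 0
proc fakeGetState(): RequestState =
  inc polls
  if polls >= 3: RequestState.Failed else: RequestState.Started

doAssert waitForState(fakeGetState, RequestState.Failed, attempts = 5, stepMs = 1)
```

The trade-off discussed above still applies: the poll only observes the final state, while event tracking observes each freed slot along the way.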

marcinczenko · 2024-10-15T07:56:39Z

tests/integration/testvalidator.nim

+    let slotFreedSubscription = 
+      await trackSlotsFreed(requestId, marketplace)
+
+    let expectedSlotsFreed = nodes - tolerance
+    check eventually((slotsFreed.len == expectedSlotsFreed),
+      timeout=(duration - expiry).int * 1000)
+
+    # Because of erasure coding, if e.g. 2 out of 3 nodes are freed, the last
+    # node will not be freed but marked as "Failed" because the whole request
+    # will fail. For this reason we need an extra check:
+    await checkSlotsFailed(slotsFilled, slotsFreed, marketplace)


Same as https://github.com/codex-storage/nim-codex/pull/922/files#r1800638990.

marcinczenko · 2024-10-22T02:46:00Z

The reason for occasional integration tests failure on Windows is described in codex-storage/nim-ethers#79.

emizzle

Nice work, Marcin, thank you. It took me quite a few hours to review this PR. In the future, any kind of split up of the work (into separate PRs) would be much appreciated!

emizzle · 2024-11-25T00:48:34Z

tests/integration/multinodes.nim

+#       CodexConfigs.init(nodes=1)
+#         .withEthProvider("http://localhost:8545")
+#         .some,
+#     ...
 template multinodesuite*(name: string, body: untyped) =


Instead of a new template, maybe the provider url could be an optional parameter instead?

After the fixes from #993, we do not really need this new template. It will be removed.

emizzle · 2024-11-25T04:34:56Z

codex/contracts/market.nim

@@ -408,18 +407,168 @@ method subscribeProofSubmission*(market: OnChainMarket,
 method unsubscribe*(subscription: OnChainMarketSubscription) {.async.} =
  await subscription.eventSubscription.unsubscribe()

-method queryPastEvents*[T: MarketplaceEvent](
+proc blockNumberForBlocksEgo*(provider: Provider,


I think there is a spelling mistake. Ego => Ago

🟡 just a preference suggestion. Would be more clear to call this with something that includes "block tag" since that's what we're returning, eg pastBlockTag.

Indeed, block numbers probably do not have much of an "ego". Thanks!

For the blocksAgo param: passing a BlockTag would mean I already know which block number I want to go to. This function is meant to find that block knowing only how many blocks I want to go backward. That's also why int is the natural choice. Or did I misunderstand you?

Yes, I think you misunderstood. I was referring to the return type, not the parameter type. Calling this function pastBlock(blocksAgo: int): BlockTag or pastBlockTag(blocksAgo: int): BlockTag just seems to be a bit easier to swallow to me.
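To make the shape of the suggested signature concrete, here is a minimal sketch that takes `blocksAgo` as an int and returns the past block. It is an illustration only: `pastBlockNumber` is a hypothetical name, a plain `uint64` stands in for `BlockTag` so that the snippet is self-contained, and clamping at block 0 is an assumption that is not taken from the PR.

```nim
proc pastBlockNumber(head: uint64, blocksAgo: int): uint64 =
  ## Block number `blocksAgo` blocks behind `head`, clamped at block 0.
  ## The sign of `blocksAgo` is ignored, mirroring the `.abs` in the diff.
  let distance = blocksAgo.abs.uint64
  if distance >= head: 0'u64 else: head - distance

doAssert pastBlockNumber(100'u64, 10) == 90'u64
```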

emizzle · 2024-11-25T04:42:09Z

codex/contracts/market.nim

+    await provider.blockNumberAndTimestamp(BlockTag.init(low))
+  let (_, highTimestamp) =
+    await provider.blockNumberAndTimestamp(BlockTag.init(high))
+  debug "[binarySearchFindClosestBlock]:", epochTime = epochTime,


The text in most of these logging statements is just the function name, which indicates to me that they are mostly useful for debugging during development. I'm not entirely sure these will be useful to have in the logs going forward. Wdyt? At the very least, they should be trace statements, but my opinion here is that they would create a lot of logging clutter. Imo, we should be able to rely on the tests to know the functionality is working correctly.

tip: with chronicles, if the name of the logging property matches the symbol name, you don't need the = symbol part, eg debug "[binarySearchFindClosestBlock]:", epochTime, lowTimestamp, highTimestamp, low, high

emizzle · 2024-11-25T05:24:25Z

codex/contracts/market.nim

+  let head = await provider.getBlockNumber()
+  return BlockTag.init(head - blocksAgo.abs.u256)
+
+proc blockNumberAndTimestamp*(provider: Provider, blockTag: BlockTag):


All of these functions introduced in the Market module should use Market as the first parameter. Then, inside the proc, provider can be accessed by market.contract.provider. Otherwise, outside callers would be able to use these procs with a foreign provider, which could cause unpredictable outcomes.

Why are these procs exported?

On second thought, I don't think these functions belong in the market abstraction and should probably go into ethers...

This proc is kind of taking the values returned from provider.getBlock and making assumptions about raising exceptions. In all the callsites, these errors are not handled. A few suggestions to improve this, assuming also these functions will be moved to ethers, would be:

Annotate this proc with {.async: (raises: [ErrorType1, ...ErrorTypeN]).} (start with raises: [] and the compiler will tell you what will be raised). Do the same for all the callsites that don't handle these exceptions.

Do not export it, because it should be internal to the provider use. External callers can get blocks and handle exceptions as they need, without assumptions made for them.

Assert that blockTag != BlockTag.pending at the beginning.

Perhaps a good point to move it into nim-ethers. Let me think shortly about it.

emizzle · 2024-11-25T05:45:41Z

codex/contracts/market.nim

+  if earliestBlockTimestamp == epochTimeUInt256:
+    return earliestBlockNumber
+  if latestBlockTimestamp == epochTimeUInt256:
+    return latestBlockNumber
+
+  if earliestBlockNumber > 0 and earliestBlockTimestamp > epochTimeUInt256:
+    let availableHistoryInDays = 
+        (latestBlockTimestamp - earliestBlockTimestamp) div
+          initDuration(days = 1).inSeconds.u256
+    warn "Short block history detected.", earliestBlockTimestamp =  
+      earliestBlockTimestamp, days = availableHistoryInDays
+    return earliestBlockNumber


Apart from getting the block number and timestamp for earliest/latest, this block is mostly validation for the binary search routine. I think it would be a bit cleaner if we moved the validation into the binary search routine as conditional returns and converted that routine into a recursive function.
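A recursive search with the validation folded in as conditional returns could look roughly like the sketch below. It is a self-contained illustration under stated assumptions, not the PR's code: `Chain` and `closestBlock` are hypothetical names, a seq of timestamps indexed by block number stands in for calls to the provider, timestamps are assumed to be non-decreasing, and breaking ties toward the earlier block is an arbitrary choice.

```nim
type Chain = seq[int]  # index = block number, value = block timestamp

proc closestBlock(chain: Chain, epochTime: int, low, high: int): int =
  ## Number of the block in [low, high] whose timestamp is closest to
  ## `epochTime`. Assumes low <= high and non-decreasing timestamps.
  # Boundary checks replace the separate validation step.
  if epochTime <= chain[low]:
    return low    # target is at or before the earliest block in range
  if epochTime >= chain[high]:
    return high   # target is at or after the latest block in range
  if high - low == 1:
    # Target lies strictly between two adjacent blocks: pick the nearer.
    if epochTime - chain[low] <= chain[high] - epochTime:
      return low
    return high
  let mid = low + (high - low) div 2
  if chain[mid] <= epochTime:
    return closestBlock(chain, epochTime, mid, high)
  return closestBlock(chain, epochTime, low, mid)

let chain: Chain = @[100, 110, 120, 130, 140]
doAssert closestBlock(chain, 124, 0, chain.high) == 2
```

With this shape, a short block history (earliest timestamp later than the target) simply falls out of the first conditional return; the warning from the diff could be logged there.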

emizzle · 2024-11-26T07:23:42Z

tests/integration/testvalidator.nim

+      cancelExpression = isRequestCancelled(requestId))
+
+    # extra check
+    await marketplace.checkSlotsFailed(requestId)


Do we need more than one check?

By checking that the remaining slots, for which we did not capture a slotFreed event, have status failed, we de facto confirm that the request has failed.
This extra check documents the logic (that in the end the request should fail), and does not cost anything, so why not?

The reason the extra slots would have a slotState of failed is because the request state is Failed. Having contract event emissions not aligned with request and slot states is not something we should be testing here, and should leave it to testing inside codex-contracts-eth and formal verification. We do not need to check contract events to be sure of this.

Also, keeping harness logic to a minimum means less of a chance of introducing a bug in the harness itself.

emizzle · 2024-11-26T07:24:05Z

tests/integration/testvalidator.nim

+    # extra check
+    await marketplace.checkSlotsFailed(requestId)
+
+    await stopTrackingEvents()


I like the event tracking work, but I'm not convinced we really need it...

emizzle · 2024-11-26T07:26:24Z

tests/integration/testvalidator.nim

+    without purchaseState =? client0.getPurchase(purchaseId).?state:
+      fail()
+      return
+
+    debug "validation suite", purchaseState = purchaseState
+
+    if purchaseState != "started":
+      fail()
+      return


Seems redundant if we're already checking this on line 261.

emizzle · 2024-11-26T07:28:44Z

tests/integration/testvalidator.nim

+    check eventuallyS(checkSlotsFreed(requestId, expectedSlotsFreed),
+      timeout = secondsTillRequestEnd + 60, step = 5,
+      cancelExpression = isRequestCancelled(requestId))
+
+    # extra check
+    await marketplace.checkSlotsFailed(requestId)


Same comments as previous test

emizzle · 2024-11-26T07:30:38Z

tests/integration/marketplacesuite.nim

+template marketplacesuite*(name: string,
+    body: untyped) =
+  marketplacesuiteWithProviderUrl name, "http://localhost:8545":
+    body
+
+# we can't just overload the name and use marketplacesuite here
+# see: https://github.com/nim-lang/Nim/issues/14827
+template marketplacesuiteWithProviderUrl*(name: string,
+    jsonRpcProviderUrl: string, body: untyped) =


In the future, it would be helpful to the reviewers if we could try to keep indirectly related changes in a separate PR if possible, please 🙏


marcinczenko added the Marketplace (see https://miro.com/app/board/uXjVNZ03E-c=/ for details) and WIP (Work in Progress, but open to comments) labels Sep 26, 2024

marcinczenko self-assigned this Sep 26, 2024

marcinczenko marked this pull request as draft September 26, 2024 14:28

marcinczenko removed the WIP (Work in Progress, but open to comments) label Sep 26, 2024

marcinczenko changed the base branch from master to validator-partitioning-slotid-space September 26, 2024 18:09

marcinczenko force-pushed the validator-partitioning-slotid-space branch from 3b4a7fe to 89738ff Compare October 1, 2024 02:15

marcinczenko force-pushed the validator-historical-state branch from 9e9882b to 6879c27 Compare October 1, 2024 02:22

marcinczenko force-pushed the validator-partitioning-slotid-space branch 3 times, most recently from fc34613 to 1459644 Compare October 2, 2024 20:20

Base automatically changed from validator-partitioning-slotid-space to master October 2, 2024 23:09

marcinczenko force-pushed the validator-historical-state branch 3 times, most recently from 92b6e96 to 7b20c4b Compare October 9, 2024 01:58

marcinczenko force-pushed the validator-historical-state branch 2 times, most recently from 77fcad2 to 4df6a21 Compare October 14, 2024 23:12

marcinczenko marked this pull request as ready for review October 15, 2024 01:28

marcinczenko requested review from markspanbroek, emizzle and AuHau October 15, 2024 01:28


marcinczenko force-pushed the validator-historical-state branch 4 times, most recently from 7e06bd8 to 165d514 Compare October 18, 2024 03:14

marcinczenko force-pushed the validator-historical-state branch from 98e3378 to a91c76f Compare November 24, 2024 23:56

emizzle reviewed Nov 26, 2024


marcinczenko added 27 commits November 27, 2024 04:34

adds more logging to validator

98245ed

integration test: validator only looks 30 days back for historical state

fe477e2

adds logging of the slotState when removing slots during validation

61f146b

review and refactor validator integration tests

c3bd81f

adds validation to the set of integration tests

0f28269

Fixes mistyped name of the mock provider module in testMarket

656205f

Fixes a typo in the name of the validation suite in integration tests

54103b0

Makes validation unit test a bit easier to follow

f426962

better use of logScopes to reduce duplication

0c47e56

improves timing and clarifies the test conditions

934434b

uses http as default RPC provider for nodes running in integration tests as a workaround for dropped subscriptions

34182ee

simplifies the validation integration tests by waiting for failed request instead of tracking slots

7c4c215

adds config option allowing selectively to set different provider url

d0bd817

Brings back the default settings for RPC provider in integration tests

f3f2174

use http RPC provider for clients in validation integration tests

afd8eb4

fine-tune the tests

bfb82ad

Makes validator integration test more robust - adds extra tracking

cd077b2

brings tracking of marketplace event back to validator integration test

c6989ec

refactors integration tests

e6b51a1

deletes tmp file

77bbae8

adds <<return>> after forcing integration test to fail preliminarily

c5c2cdc

re-enables all integration tests and matrix

71afe0d

stops debug output in CI

9a3f87c

allows to choose a different RPC provider for a given integration test suite

3ab25f8

fixes signature of <<getBlock>> method in mockProvider

1f1c74f

adds missing import which seem to be braking integration tests on windows

c8e2eed

makes sure that clients, SPs, and validators use the same provider url

8c479cb

marcinczenko force-pushed the validator-historical-state branch from a9df4ce to 8c479cb Compare November 27, 2024 03:51

marcinczenko added 2 commits November 27, 2024 05:50

makes validator integration tests using http at 127.0.0.1:8545

22c0b62

testvalidator: stop resubscribing as we are now using http polling as rpc provider

8278758
