-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validator historical state restoration #922
base: master
Are you sure you want to change the base?
Conversation
3b4a7fe
to
89738ff
Compare
9e9882b
to
6879c27
Compare
fc34613
to
1459644
Compare
92b6e96
to
7b20c4b
Compare
77fcad2
to
4df6a21
Compare
tests/integration/testvalidator.nim
Outdated
let slotFreedSubscription = | ||
await trackSlotsFreed(requestId, marketplace) | ||
|
||
let expectedSlotsFreed = nodes - tolerance | ||
check eventually((slotsFreed.len == expectedSlotsFreed), | ||
timeout=(duration - expiry).int * 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markspanbroek suggest that based on the observation that expectedSlotsFreed
should be equal to tolerance + 1
, it might be sufficient to simply wait for the storage request to fail instead of tracking slots freed. Make sense to me, I will check it out and push changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so I took a mental step back. The purpose of this test is to check that validators are doing their work. We can detect that by looking at failing request, but this is actually not the most direct way of sensing that. Tracking events is. We can sense that slots are being freed and that request fails and this would not happen without properly working validator. Sensing a failing request by polling the state of the purchase should bring us to the same results, yet, the sensing path is longer and more pone to an accidental errors which does not currently seem to be completely avoidable. Errors like:
Exception in blockexc read loop topics="codex blockexcnetworkpeer" tid=13304 msg="Stream Closed!"
or:
Unhandled exception in async proc, aborting topics="codex" tid=13640 msg="C:\\msys64\\home\\code\\...\\Nim\\lib\\pure\\json.nim(824, 10) `node.kind == JArray` : items() can not iterate a JsonNode of kind JNull"
Especially that those problems mostly happen on slow and less reliable Windows runners.
Thus, while looking for failing purchase is elegant and strips some of the code from the test, it is not the shortest and the most reliable way of testing what is supposed to be tested here.
In this spirit, and to keep the integration tests as short as possible, I decided to track the marketplace events to confirm that validator correctly detected the missing proofs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first exception shouldn't really be affecting validation as it has to do with the block exchange. This could affect datasets from being downloaded, but that's doesn't seem to be an issue.
The second exception was related to when polled (http) subscriptions returned null
instead of []
(possibly due to an expired filter on the client). This was fixed in codex-storage/nim-ethers#94.
"Sensing a failing request" can be done by:
- Checking the request state:
await marketplace.getRequest(requestId.get)
(although we should probably expose this on a REST API endpoint instead of querying the contract directly, for another PR) - Checking the
slotState
for all slots in the request (longer, more verbose)
If the contract is showing that a request is failed, we can trust that it is failed for certain.
tests/integration/testvalidator.nim
Outdated
let slotFreedSubscription = | ||
await trackSlotsFreed(requestId, marketplace) | ||
|
||
let expectedSlotsFreed = nodes - tolerance | ||
check eventually((slotsFreed.len == expectedSlotsFreed), | ||
timeout=(duration - expiry).int * 1000) | ||
|
||
# Because of erasure coding, if e.g. 2 out of 3 nodes are freed, the last | ||
# node will not be freed but marked as "Failed" because the whole request | ||
# will fail. For this reason we need an extra check: | ||
await checkSlotsFailed(slotsFilled, slotsFreed, marketplace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7e06bd8
to
165d514
Compare
The reason for occasional integration tests failure on Windows is described in codex-storage/nim-ethers#79. |
98e3378
to
a91c76f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, Marcin, thank you. It took me quite a few hours to review this PR. In the future, any kind of split up of the work (into separate PRs) would be much appreciated!
# CodexConfigs.init(nodes=1) | ||
# .withEthProvider("http://localhost:8545") | ||
# .some, | ||
# ... | ||
template multinodesuite*(name: string, body: untyped) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a new template, maybe the provider url could be an optional parameter instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the fixes from #993, we do not really need this new template. It will be removed.
@@ -408,18 +407,168 @@ method subscribeProofSubmission*(market: OnChainMarket, | |||
method unsubscribe*(subscription: OnChainMarketSubscription) {.async.} = | |||
await subscription.eventSubscription.unsubscribe() | |||
|
|||
method queryPastEvents*[T: MarketplaceEvent]( | |||
proc blockNumberForBlocksEgo*(provider: Provider, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a spelling mistake. Ego
=> Ago
🟡 just a preference suggestion. Would be more clear to call this with something that includes "block tag" since that's what we're returning, eg pastBlockTag
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, block numbers probably do not have much of an "ego". Thanks!
For the blocksAgo
param: passing a BlockTag
would mean I do know to which block number I want to go. This function is meant to find that block knowing only how many block I want go backward. That's also why int
is natural choose. All did I misunderstand you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you misunderstood. I was referring to the return type, not the parameter type. Calling this function pastBlock(blocksAgo: int): BlockTag
or pastBlockTag(blocksAgo: int): BlockTag
just seems to be a bit easier to swallow to me.
await provider.blockNumberAndTimestamp(BlockTag.init(low)) | ||
let (_, highTimestamp) = | ||
await provider.blockNumberAndTimestamp(BlockTag.init(high)) | ||
debug "[binarySearchFindClosestBlock]:", epochTime = epochTime, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The text in most of these logging statements are just the function name, which indicates to me that they are mostly useful for debugging during development. I'm not entirely sure these will be useful to have in the logs going forward. Wdyt? At the very least, they should be
trace
statements, but my opinion here is that they would create a lot of logging clutter. Imo, we should be able to rely on the tests to know the functionality is working correctly. - tip: with
chronicles
, if the name of the logging property matches the symbol name, you don't need the= symbol
part, egdebug "[binarySearchFindClosestBlock]:", epochTime, lowTimestamp, highTimestamp, low, high
let head = await provider.getBlockNumber() | ||
return BlockTag.init(head - blocksAgo.abs.u256) | ||
|
||
proc blockNumberAndTimestamp*(provider: Provider, blockTag: BlockTag): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these functions introduced in the Market module, should use Market
as the first parameter. Then, inside the proc, provider
can be accessed by market.contract.provider
. Otherwise, outside callers would be able to use these procs with a foreign provider
which could cause undetermined outcomes.
Why are these procs exported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, I don't think these functions belong in the market abstraction and should probably go into ethers...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This proc is kind of taking the values returned from provider.getBlock
and making assumptions about raising exceptions. In all the callsites, these errors are not handled. A few suggestions to improve this, assuming also these functions will be moved to ethers, would be:
- Annotate this proc with
{.async: (raises: [ErrorType1, ...ErrorTypeN].}
(start withraises:[]
and the compiler will tell you what will be raised). Do the same for all the callsites that don't handle these exceptions. - Do not export it, because it should be internal to the provider use. External callers can get blocks and handle exceptions as they need, without assumptions made for them.
- Assert that
blockTag != BlockTag.pending
at the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a good point to move it into nim-ethers
. Let me think shortly about it.
if earliestBlockTimestamp == epochTimeUInt256: | ||
return earliestBlockNumber | ||
if latestBlockTimestamp == epochTimeUInt256: | ||
return latestBlockNumber | ||
|
||
if earliestBlockNumber > 0 and earliestBlockTimestamp > epochTimeUInt256: | ||
let availableHistoryInDays = | ||
(latestBlockTimestamp - earliestBlockTimestamp) div | ||
initDuration(days = 1).inSeconds.u256 | ||
warn "Short block history detected.", earliestBlockTimestamp = | ||
earliestBlockTimestamp, days = availableHistoryInDays | ||
return earliestBlockNumber |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from getting the block number and timestamp for earliest/latest, this block is mostly validation for the binary search routine. I think it would be a bit cleaner if we moved the validation into the binary search routine as conditional returns and converted that routine into a recursive function.
cancelExpression = isRequestCancelled(requestId)) | ||
|
||
# extra check | ||
await marketplace.checkSlotsFailed(requestId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need more than one check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By checking that the remaining slots, for which we did not capture slotFreed
event, have status failed, we de-facto confirm that the request has failed.
This extra check documents the logic (that in the end the request should fail), and does not cost anything, so why not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason the extra slots would have a slotState
of failed
is because the request state is Failed
. Having contract event emissions not aligned with request and slot states is not something we should be testing here, and should leave it to testing inside codex-contracts-eth and formal verification. We do not need to check contract events to be sure of this.
Also, keeping harness logic to a minimum means the less of a chance we have of introducing a bug in the harness itself.
# extra check | ||
await marketplace.checkSlotsFailed(requestId) | ||
|
||
await stopTrackingEvents() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the event tracking work, but I'm not convinced we really need it...
without purchaseState =? client0.getPurchase(purchaseId).?state: | ||
fail() | ||
return | ||
|
||
debug "validation suite", purchaseState = purchaseState | ||
|
||
if purchaseState != "started": | ||
fail() | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like extra if we're already checking this on line 261.
check eventuallyS(checkSlotsFreed(requestId, expectedSlotsFreed), | ||
timeout = secondsTillRequestEnd + 60, step = 5, | ||
cancelExpression = isRequestCancelled(requestId)) | ||
|
||
# extra check | ||
await marketplace.checkSlotsFailed(requestId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments as previous test
template marketplacesuite*(name: string, | ||
body: untyped) = | ||
marketplacesuiteWithProviderUrl name, "http://localhost:8545": | ||
body | ||
|
||
# we can't just overload the name and use marketplacesuite here | ||
# see: https://github.com/nim-lang/Nim/issues/14827 | ||
template marketplacesuiteWithProviderUrl*(name: string, | ||
jsonRpcProviderUrl: string, body: untyped) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, it would be helpful to the reviewers if we could try to keep indirectly related changes in a separate PR if possible, please 🙏
…sts as a workaround for dropped subscriptions
…uest instead of tracking slots
a9df4ce
to
8c479cb
Compare
At the moment the validator has started, there may already be existing active storage requests containing slots with slot ids that should be monitored by the validator: these are slot ids that either belong to the group observed by the validator or basically any slot id if validator is to monitor the whole slot id space.
In this context we want the validator to find and monitor all those slots id associated with storage requests that started earlier and have not yet concluded.
For more details, please refer to the discussion in #458.
It is work in progress, but open to suggestions...