limit amount of slot queue workers to 1 #988

Draft · wants to merge 17 commits into base: master
Conversation

markspanbroek (Member)

reason: proof generation does not happen in separate threads yet, so to avoid taking too long to provide initial storage proofs we only pick up one slot at a time
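
For context, the change itself boils down to capping how many slots the queue hands out at once. A minimal sketch of the idea, with an assumed constant name rather than the actual nim-codex identifier:

```nim
# Sketch only: proofs are still generated on the main thread, so picking up
# several slots at once delays the initial storage proofs. Cap the slot
# queue at a single worker until proving runs in separate threads.
const MaxSlotQueueWorkers* = 1
```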

@markspanbroek markspanbroek marked this pull request as ready for review November 7, 2024 13:16
@emizzle (Contributor) left a comment

Thanks Mark!

  collateral=200.u256,
- nodes = 5,
+ nodes = 3,
  tolerance = 2).get
Contributor

Out of curiosity, is the motivation for limiting the number of nodes to speed up this test? In the past, having 5 nodes has surfaced a few issues. That said, I will eventually finish #976, which will have 5 or 6 nodes, so at least we'll have one test with 5/6 nodes in the future.

@markspanbroek (Member Author) Nov 7, 2024

Yes, it now takes about 4 minutes to fill 3 slots. Something is still off for this to take so long (most likely the downloading).

@emizzle (Contributor) Nov 7, 2024

Hmm, even on the testnet I think downloading is quick... locally, as an example, a download takes 410ms in one of the integration tests (8 blocks).

Contributor

The proof generation might take some time. Also, I noticed on the testnet that we're doing local verification of the proofs, which is not needed and takes time on each node.

Member Author

Yes, you're right. I'm running this test with some more logging on my machine and it spends most of its time in the SaleInitialProving state.

@emizzle (Contributor) commented Nov 7, 2024

Another option to speed up tests: pass the number of workers as a CLI param (3 for tests, 1 for testnet). More work of course...
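
Codex already declares its CLI options with confutils, so such a flag could be declared roughly as below. This is purely a hypothetical sketch of the suggestion; no such option exists in the codebase:

```nim
import pkg/confutils

type
  ExampleConf = object
    # hypothetical flag for the number of slot queue workers
    slotQueueWorkers {.
      desc: "Number of slot queue workers picking up slots concurrently"
      defaultValue: 1
      name: "slot-queue-workers" .}: int

when isMainModule:
  # parses --slot-queue-workers=3 from the command line
  let conf = ExampleConf.load()
  echo "slot queue workers: ", conf.slotQueueWorkers
```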

@markspanbroek (Member Author)

Another option to speed up tests: pass the number of workers as a CLI param. More work of course...

This is just a temporary fix. We should increase the number of workers as soon as we have threading for creating proofs.

@@ -16,7 +16,7 @@ template ethersuite*(name, body) =
   var snapshot: JsonNode

   setup:
-    ethProvider = JsonRpcProvider.new("ws://localhost:8545")
+    ethProvider = JsonRpcProvider.new("http://localhost:8545")
Contributor

I think this is the right change, but we also might run into the poll future failing... (something to be fixed in ethers)

Member Author

You're right, the contract tests started failing with this change. Strange that it works in the integration tests but not in the contract tests. I'll revert the change on this line for now.

Contributor

Hard to know whether they are failing because of the poll() issue, though. That said, I'd like to get a fix in for the poll() failure. I'm trying to work on it now, but I'm unsure whether I'll get something in before my flight leaves in a few hours.

@markspanbroek (Member Author)

I can't seem to get the integration tests to run stably. I fixed all the obvious cases where things could go wrong, and still they fail in unexpected places (while submitting a request, while running unit tests). I guess we really need to fix the issue with poll in nim-ethers before we can continue with this PR ☹️

@markspanbroek markspanbroek marked this pull request as draft November 8, 2024 12:53
@AuHau (Member) commented Nov 9, 2024

I guess we really need to fix the issue with poll in nim-ethers before we can continue with this PR ☹️

Hmm, now I am not sure which one exactly. The one @emizzle found, related to the Future being cancelled?

@markspanbroek (Member Author)

Hmm, now I am not sure which one exactly. The one @emizzle found, related to the Future being cancelled?

Yes, I think so. It seems that subscription handlers not being called is what causes the unpredictable test failures. When using only websockets, the tests actually work quite reliably, up to the point where a test takes longer than 5 minutes. When using http and polling, the tests fail at random points. My suspicion is that this is caused by what Eric noticed: an exception is raised in the poll function that silently stops the polling.

@AuHau (Member) commented Nov 9, 2024

I see, there is #985, which might show the error. It could be worth rebasing this PR on top of that to test your hypothesis.
It is unfortunately not finished, but it should be working; it just needs some polishing before it can be merged. I will try to finish that tomorrow; I won't make it today, unfortunately.

@emizzle (Contributor) commented Nov 9, 2024

So I've gotten somewhere on a solution for this, but it's not complete. The first part of the solution is to use the then util to attach callbacks to poll, allowing us to catch exceptions and propagate them to the synchronous closure.

The second part, which I'm not sure about yet, is understanding whether we need to reconnect the RpcHttpClient when this exception is raised (this particular exception only, or all exceptions)?

The third part is trying to understand whether we need the reconnection logic from part 2 in any other areas of ethers (including websockets).

@emizzle (Contributor) commented Nov 9, 2024

@markspanbroek here's my work on keeping the polling loop alive: https://github.com/codex-storage/nim-ethers/tree/fix/keep-polling-alive.

NOTES

1. We cannot consume futures like we were doing without asyncSpawning or awaiting them. This is where .then/.catch comes in: it's effectively an asyncSpawn, but it does not raise a Defect and instead exposes the exceptions.
2. We only restart the polling loop when an RpcPostError is encountered (see the sketch after this list).
3. We raise a Defect on other errors, but I'm not so sure this is the right approach. We might need to handle all possible errors differently.
4. There may be other situations in ethers where this happens?
5. The then.nim addition was taken from codex and simplified a bit. It needs to be pulled into its own lib and used in both codex and ethers. The version in this branch is the most recent and will be used for the lib.
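
Not the branch itself, but a standalone sketch of the behaviour notes 1-3 describe, using plain chronos with a try/except inside the loop instead of the .then/.catch helper; RpcPostError is stood in by a local type, and all other names are made up for the example:

```nim
import pkg/chronos

# stand-in for the transport error raised by the JSON-RPC client
type RpcPostError = object of CatchableError

proc pollOnce() {.async.} =
  # placeholder for a single JSON-RPC polling step, which may raise
  await sleepAsync(10.milliseconds)

proc keepPollingAlive() {.async.} =
  while true:
    try:
      await pollOnce()
    except RpcPostError:
      discard # transient transport failure: keep the loop alive and retry
    except CatchableError as error:
      # anything else is unexpected; fail loudly instead of polling blindly
      raiseAssert "unexpected error while polling: " & error.msg
    await sleepAsync(100.milliseconds)

when isMainModule:
  # fire-and-forget; errors are handled inside the loop, which is what the
  # .then/.catch helper achieves for futures that cannot be awaited
  asyncSpawn keepPollingAlive()
  waitFor sleepAsync(1.seconds)
```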

@markspanbroek (Member Author)

here's my work on keeping the polling loop alive

Thanks @emizzle, I think I've now got a version of http polling that works based on these ideas. I also found 4 other bugs that made the tests unstable.

I'll try to get PRs for all of this in tomorrow.

@markspanbroek (Member Author)

On Linux and macOS the tests run stably now. The Windows tests are running so slowly that they even fail on the fairly simple contract tests that do not involve proofs.

I'm installing a Windows VM to see if I can reproduce this on my machine.

To work around this issue, where subscriptions become inactive after more than 5 minutes: NomicFoundation/hardhat#2053

Use 100 millisecond polling; the default polling interval of 4 seconds is too close to the 5 second timeout for `check eventually`.
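
In test setup this amounts to constructing the provider with a shorter interval, roughly as below; treat the pollingInterval parameter name as an assumption about the nim-ethers API (its 4-second default is what the note above refers to):

```nim
import pkg/chronos
import pkg/ethers

# poll every 100 ms so that state changes are observed well within the
# 5 second `check eventually` timeout used by the integration tests
let provider = JsonRpcProvider.new(
  "http://localhost:8545",
  pollingInterval = 100.milliseconds
)
```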

confirm(0) doesn't wait at all, confirm(1) waits for the transaction to be mined

includes fixes for http polling and .confirm()

allow for a bit more time to withdraw funds, to ensure that the transaction has been processed before continuing with the test

there were two logic errors in this test:
- a slot is freed anyway at the end of the contract
- when starting the request takes a long time, the first slot can already be freed because there were too many missing proofs

currentTime() doesn't always correctly reflect the time of the next transaction

otherwise the Windows runner in the CI won't be able to start the request before it expires

allow for a bit more time for a request to be submitted

the Windows CI is so slow that it can take up to 40 seconds just to submit a storage request to hardhat
@markspanbroek markspanbroek changed the base branch from master to fix-concurrency-issues November 14, 2024 14:11
@markspanbroek (Member Author) commented Nov 14, 2024

Moved most of the fixes into #993, and rebased this PR on top of it.

I'm not sure whether it still makes sense to limit the number of slot queue workers, now that I've done some measurements on proof generation. It doesn't take that much time (~3 seconds).

reason: with the increased period length of 90 seconds, it can take longer to wait for a stable challenge at the beginning of a period.

apparently it takes Windows 2-3 seconds to resolve "localhost" to 127.0.0.1 for every json-rpc connection that we make 🤦

reason: proof generation does not happen in separate threads yet, so to avoid taking too long to provide initial storage proofs we only pick up one slot at a time
Base automatically changed from fix-concurrency-issues to master November 25, 2024 12:44