limit amount of slot queue workers to 1 #988
base: master
Conversation
Thanks Mark!
  collateral=200.u256,
- nodes = 5,
+ nodes = 3,
  tolerance = 2).get
Out of curiosity, is the motivation for limiting the number of nodes to hasten the completion of this test? In the past, having 5 nodes has surfaced a few issues. That said, I will eventually finish this PR, which will have 5 or 6 nodes, so at least we'll have one test with 5/6 nodes in the future: #976
Yes, it now takes about 4 minutes to fill 3 slots. Something is still off for this to take so long (most likely the downloading).
Hmm, even on the testnet I think downloading is quick... locally, as an example, a download takes 410ms in one of the integration tests (8 blocks).
The proof generation might take some time. Also, I noticed on the testnet that we're doing local verification of the proofs, which is not needed and takes time on each node.
Yes, you're right. I'm running this test with some more logging on my machine, and it spends most of its time in the SaleInitialProving state.
Another option to speed up tests: pass the number of workers as a CLI param (3 for tests, 1 for testnet). More work, of course.
This is just a temporary fix. We should increase the number of workers as soon as we have threading for creating proofs.
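For illustration, a minimal sketch of the CLI-param idea above, assuming a confutils-style config object; the option name, description, and default here are assumptions, not the actual codex configuration:

```nim
import pkg/confutils/defs

type
  CodexConf* = object
    # hypothetical option; the real codex config may differ
    slotQueueWorkers* {.
      defaultValue: 1,
      desc: "Number of slot queue workers picking up slots concurrently",
      name: "slot-queue-workers" .}: int
```

Tests could then start nodes with --slot-queue-workers=3 while the testnet default stays at 1 until proof generation moves to separate threads.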
tests/ethertest.nim (outdated)
@@ -16,7 +16,7 @@ template ethersuite*(name, body) =
   var snapshot: JsonNode

   setup:
-    ethProvider = JsonRpcProvider.new("ws://localhost:8545")
+    ethProvider = JsonRpcProvider.new("http://localhost:8545")
I think this is the right change, but we also might run into the poll future failing... (something to be fixed in ethers)
You're right, the contract tests started failing with this change. Strange that it works in the integration tests, but not in the contract tests. I'll revert the change on this line for now.
Hard to know whether they are failing because of the poll failing issue, though. That said, I'd like to get a fix in for the poll() failure. I'm trying to work on it now, but unsure whether I'll get something in before my flight leaves in a few hours.
Force-pushed from a93eb6f to 703b943
I can't seem to get the integration tests to run stably. I fixed all the obvious cases where things could go wrong, and still they fail in unexpected places (while submitting a request, while running unit tests). I guess we really need to fix the issue with …
Hmm, now I am not sure which one exactly? The one @emizzle found, related to a Future being cancelled?
Yes, I think so. It seems that subscription handlers not being called is what causes the unpredictable test failures. When using only websockets, the tests actually work quite reliably, up to the point that a test takes longer than 5 minutes. When using http and polling, the tests fail at random points. My suspicion is that this is caused by what Eric noticed: that an exception is being raised in the …
I see, there is #985, which might show the error. It could be worth rebasing this PR on top of that to test your hypothesis.
So I've got somewhere on a solution for this, but it's not complete.
The first part of the solution is to use the …
The second part, which I'm not sure about yet, is understanding whether or not we need to reconnect the …
The third part is trying to understand if we need the reconnection logic from part 2 in any other areas of ethers (including websockets).
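A minimal sketch of the keep-the-polling-loop-alive idea, assuming chronos and using hypothetical names (the actual work is in the branch linked in the next comment): exceptions raised by a single poll are caught and logged instead of terminating the loop, so subscription handlers keep firing.

```nim
import pkg/chronos

proc keepPolling(poll: proc(): Future[void] {.gcsafe.},
                 interval: Duration) {.async.} =
  # Poll forever; one failing JSON-RPC call must not silently
  # stop all event subscriptions.
  while true:
    try:
      await poll()
    except CancelledError as e:
      raise e # propagate cancellation so the loop can be shut down
    except CatchableError as e:
      echo "poll failed, retrying: ", e.msg # log and keep going
    await sleepAsync(interval)
```

In ethers the interval would come from the provider's polling configuration; here it is just a parameter.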
@markspanbroek here's my work on keeping the polling loop alive: https://github.com/codex-storage/nim-ethers/tree/fix/keep-polling-alive. NOTES: …
Thanks @emizzle, I think I've now got a version of http polling that works based on these ideas. I also found 4 other bugs that made the tests unstable. I'll try to get PRs for all of this in tomorrow.
On Linux and macOS the tests run stably now. The Windows tests are running so slowly that they are even failing on the fairly simple contract tests that do not involve proofs. I'm installing a Windows VM to see if I can reproduce this on my machine.
Use 100 millisecond polling to work around this issue when subscriptions are inactive for more than 5 minutes: NomicFoundation/hardhat#2053. The default polling interval of 4 seconds is too close to the 5 second timeout for `check eventually` (see the timing sketch after this commit list).
confirm(0) doesn't wait at all; confirm(1) waits for the transaction to be mined (usage sketch after this commit list)
includes fixes for http polling and .confirm()
allow for a bit more time to withdraw funds
to ensure that the transaction has been processed before continuing with the test
there were two logic errors in this test:
- a slot is freed anyway at the end of the contract
- when starting the request takes a long time, the first slot can already be freed because there were too many missing proofs
currentTime() doesn't always correctly reflect the time of the next transaction
otherwise the windows runner in the CI won't be able to start the request before it expires
allow for a bit more time for a request to be submitted
windows ci is so slow, it can take up to 40 seconds just to submit a storage request to hardhat
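To make the timing in the 100-millisecond-polling commit above concrete: with the default 4 second polling interval, a state change can go unnoticed for up to 4 of the 5 seconds that `check eventually` waits, leaving under a second of margin; at 100 ms the worst-case lag is negligible. Here is a sketch of such an `eventually` helper, with names and defaults assumed rather than taken from the actual test code:

```nim
import pkg/chronos

template eventually*(condition: untyped, timeout = 5.seconds): bool =
  # Re-check `condition` every 100 ms until it holds or `timeout` expires.
  proc loop(): Future[bool] {.async.} =
    let deadline = Moment.now() + timeout
    while true:
      if condition:
        return true
      if Moment.now() > deadline:
        return false
      await sleepAsync(100.milliseconds)
  await loop()
```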
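And a hedged usage sketch for the confirm() commit: `token`, `spender`, and the module paths are assumptions based on nim-ethers conventions, not actual test code, but the confirm semantics match the commit message above.

```nim
import pkg/chronos
import pkg/stint
import pkg/ethers
import pkg/ethers/erc20

# `token` and `spender` are assumed test fixtures
proc approveFunds(token: Erc20Token, spender: Address) {.async.} =
  # confirm(1) resolves only once the approval transaction is mined;
  # confirm(0) would resolve immediately, without waiting.
  discard await token.approve(spender, 100.u256).confirm(1)
```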
Force-pushed from bf183f9 to eb1e5e2
Moved most of the fixes into #993 and rebased this PR on top of it. I'm not sure whether it still makes sense to limit the number of slot queue workers now that I've done some measurements on proof generation. It doesn't take that much time (~3 seconds).
reason: with the increased period length of 90 seconds, it can take longer to wait for a stable challenge at the beginning of a period.
Force-pushed from eb1e5e2 to 15991d8
apparently it takes windows 2-3 seconds to resolve "localhost" to 127.0.0.1 for every json-rpc connection that we make 🤦
reason: proof generation does not happen in separate threads yet, so to avoid taking too long to provide initial storage proofs we only pick up one slot at a time
Force-pushed from 15991d8 to 2007867