feat: add retry block #66
Open
Hi, I did some research and found that 134 of the current miners were affected (567 distinct responses): while scoring era5, the validator marked some of their runs as "completed" even though there were lead times still to be scored. The cause is that the validator incorrectly assumes the miner is offline and marks its response in the database with status = "miner_offline" and error_message = "Miner offline during scoring".
I've used this query:
Basically, this status is inserted by _cleanup_offline_miner_from_run, which is called during era5 scoring whenever the miner is registered and _request_fresh_token returns None.
_request_fresh_token is supposed to return token, full_zarr_url and manifest_content_hash, which are the result of a handshake and a kerchunk request in the query_single_miner function. For some reason this intermittently fails and returns None, which can only happen when:
In the miner logs I see no difference between a successful score and a failed one, which makes me think the miner is being incorrectly marked as offline.
Without validator logs I can't investigate this any further. In the meantime I can create a PR that adds retry attempts to this section of code, but I'd still like you to take a closer look at the errors for those miners, as the fix might be more direct than just a retry block.
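For reference, here is a rough sketch of the kind of retry block I have in mind. It is not the validator's actual code: request_with_retries, its parameters, and the async calling convention are assumptions made for illustration, and fetch stands in for whatever call currently wraps _request_fresh_token.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional, Tuple

logger = logging.getLogger(__name__)

# Assumed shape of a successful result:
# (token, full_zarr_url, manifest_content_hash)
TokenResult = Tuple[str, str, str]


async def request_with_retries(
    fetch: Callable[[], Awaitable[Optional[TokenResult]]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> Optional[TokenResult]:
    """Call `fetch` up to `attempts` times and return the first non-None result.

    `fetch` is a hypothetical stand-in for the token request; the real
    function's signature may differ.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = await fetch()
        except Exception:
            # Log the error so the root cause stays visible in validator logs.
            logger.exception("Token request raised on attempt %d/%d", attempt, attempts)
            result = None
        if result is not None:
            return result
        if attempt < attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    # Every attempt failed; only now should the caller treat the miner as offline.
    return None
```

With something like this, _cleanup_offline_miner_from_run would only be reached after every attempt has returned None, so a single transient handshake or kerchunk failure would no longer mark the response as miner_offline.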