TimeoutException at the start of the orchestrator #319
Comments
Production environments that had been operating normally suddenly began to hit similar exceptions, with orchestration startup failing approximately 20~30% of the time. Since no new deployments have been made, I suspect that the state held internally by Netherite has been corrupted. The Event Hubs metrics look normal, so Event Hubs itself appears to be working fine. This may be related to what is reported in the following discussion.
Information
Stack Trace
As additional information, we observed that the probability of the orchestrator failing to start increased step by step until it was failing almost 100% of the time. We have confirmed that changing the We have the problematic blob still intact and can share it with you if you need it for your investigation.
@davidmrdavid or @sebastianburckhardt does this issue look familiar?
Your observations are consistent with the partition state getting corrupted, and one or more partitions becoming unresponsive as a result. If this keeps happening (and almost all partitions are corrupted), it would explain why orchestrators fail almost all the time. We have been trying to find the source of these state corruptions, which have been a problem for a while now. We have added some fixes and mitigations since 1.3.5, so it may be worth trying the latest release (1.4.1). Also, I would be happy to look at traces to try to pinpoint the source of the corruption. If this is running on a hosted plan (e.g. consumption or premium), I can take a look at our internal telemetry if you tell me the app name.
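To make the link between unresponsive partitions and the observed failure rate concrete, here is a rough sketch. It assumes that each orchestration instance ID is placed on exactly one partition by a stable hash, and that a start request fails whenever it lands on an unresponsive partition. The hash function, the `partition_for` and `estimated_failure_rate` helpers, and the partition count of 12 are illustrative assumptions, not Netherite's actual placement code or necessarily this app's configuration.

```python
import hashlib


def partition_for(instance_id: str, partition_count: int) -> int:
    """Map an instance ID to a partition (hypothetical stand-in hash)."""
    digest = hashlib.sha256(instance_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def estimated_failure_rate(unresponsive, partition_count, samples=10000):
    """Fraction of sampled instance IDs that land on an unresponsive partition."""
    failures = sum(
        1
        for i in range(samples)
        if partition_for(f"instance-{i}", partition_count) in unresponsive
    )
    return failures / samples


if __name__ == "__main__":
    # Assumed scenario: 12 partitions, 3 of them stuck in recovery.
    rate = estimated_failure_rate({0, 5, 9}, partition_count=12)
    print(f"estimated start failure rate: {rate:.2%}")
```

Under these assumptions, 3 unresponsive partitions out of 12 give a failure rate close to 25%, which is in the same range as the 20~30% reported above, and the rate climbs toward 100% as more partitions stop recovering.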
@sebastianburckhardt Thanks for the reply. We are going to upgrade to v1.4.1 now and see whether the issue improves. We still have the Azure Functions app (Consumption plan) that had this problem a few days ago and would be happy to share the app name to help with the investigation. However, as this is a production environment, we would prefer to provide details via email or another private channel. Can you please let us know how to do this?
Feel free to send me an e-mail. If you don't know the address, it is on my homepage (and you can find my homepage from my GitHub profile page, I believe).
Looking at the traces, I can confirm that our suspicion was right: several partitions enter a state where recovery fails repeatedly. In each case, the recovery appears to successfully read to the end of the object log but then stops making progress. Unfortunately, I was not able to diagnose why the recovery hangs; all the storage accesses in the traces look normal.
If these are still around, we may be able to use them to debug the recovery. I am considering creating an executable that can be run directly from the command line, which runs the FASTER recovery and can easily be attached to in the debugger.
Thank you for your investigation. As for the file, it did exist, but it is difficult to share because it contains sensitive customer information. Instead, we have emailed you information about a development environment where a similar problem is occurring, and we are considering whether we can investigate the issue there. It would be very nice to have a tool that can revert a hung Task Hub, as it is currently impossible to examine the logs of the process.
Last week we found a bug that manifests exactly like what is reported here. Since the symptoms are a precise match, there is a good chance that our current fix will address this issue.
That's good news. I've been fortunate enough not to have experienced any hangs since updating to the latest version of Netherite, but I'm glad to see that the root cause is being addressed.
Not to tag on here, but we see this bug in 1.4.2, which supposedly has the FASTER upgrades. We are working on our upgrade to 1.5.1, and the release notes look promising for closing more of the bugs that we're seeing. Just wanted to let you know that we see this as of 1.4.2 with FASTER 2.0.23.
Just FYI, we have sent in logs from the 1.5.1 version, where we still receive this error.
I am trying to start an orchestrator function by calling StartNewAsync. As far as Application Insights is concerned, the send to Event Hubs is successful, but the method call blocks and then times out. In older versions this was a rare occurrence, but after updating Netherite to v1.4.0 it now occurs almost 100% of the time.
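For anyone trying to measure how often the start call hangs, one option is to bound it with a client-side timeout so that a hang shows up as a countable failure instead of a blocked request. The sketch below is a generic Python illustration of that pattern, not the Durable Functions API: `start_with_timeout`, `hanging_start`, and `healthy_start` are hypothetical names, and the two stand-in coroutines only simulate a start call that never completes and one that completes normally.

```python
import asyncio


async def start_with_timeout(start_call, timeout_seconds: float):
    """Await start_call(), returning None if it does not finish in time."""
    try:
        return await asyncio.wait_for(start_call(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # The call was cancelled after the deadline; report it as a failure.
        return None


async def hanging_start():
    # Hypothetical stand-in for a start request sent to an unresponsive partition.
    await asyncio.sleep(3600)
    return "instance-id"


async def healthy_start():
    # Hypothetical stand-in for a start request that is acknowledged promptly.
    await asyncio.sleep(0)
    return "instance-id"


if __name__ == "__main__":
    print(asyncio.run(start_with_timeout(healthy_start, 1.0)))
    print(asyncio.run(start_with_timeout(hanging_start, 0.1)))
```

This only makes the hang observable on the caller's side; it does nothing to repair the partition that failed to respond.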
Information
Stack Trace