TimeoutException at the start of the orchestrator #319

Open
shibayan opened this issue Oct 16, 2023 · 12 comments
Labels
P1 Priority 1

Comments

@shibayan

I am trying to start an orchestrator function by calling StartNewAsync. According to Application Insights, the send to Event Hubs succeeds, but the method call blocks and eventually times out.

In older versions, this was a rare occurrence, but after updating the Netherite version to v1.4.0, it now occurs almost 100% of the time.
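
For reference, a minimal sketch of the calling pattern (the function and orchestrator names here are placeholders, not our actual code):

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class StartOrchestrationFunction
{
    [FunctionName("StartOrchestration")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req,
        [DurableClient] IDurableOrchestrationClient client)
    {
        // Per the report above, the creation message reaches Event Hubs successfully,
        // but this await blocks until a TimeoutException is thrown.
        string instanceId = await client.StartNewAsync("MyOrchestrator");

        return new OkObjectResult(instanceId);
    }
}
```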

Information

  • Durable Functions version: v2.10.0
  • Netherite version: v1.4.0
  • Durable Functions Monitor version: v6.3.0

Stack Trace

Microsoft.Azure.WebJobs.Host.FunctionInvocationException:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<ExecuteWithLoggingAsync>d__26.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.39.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:352)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<TryExecuteAsync>d__18.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.39.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:108)
Inner exception System.TimeoutException handled at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Netherite.Client+<CreateTaskOrchestrationAsync>d__37.MoveNext (DurableTask.Netherite, Version=1.0.0.0, Culture=neutral, PublicKeyToken=ef8c4135b1b4225a: /_/src/DurableTask.Netherite/OrchestrationService/Client.cs:469)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Netherite.NetheriteOrchestrationService+<DurableTask-Core-IOrchestrationServiceClient-CreateTaskOrchestrationAsync>d__100.MoveNext (DurableTask.Netherite, Version=1.0.0.0, Culture=neutral, PublicKeyToken=ef8c4135b1b4225a: /_/src/DurableTask.Netherite/OrchestrationService/NetheriteOrchestrationService.cs:594)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Core.TaskHubClient+<InternalCreateOrchestrationInstanceWithRaisedEventAsync>d__27.MoveNext (DurableTask.Core, Version=2.13.0.0, Culture=neutral, PublicKeyToken=d53979610a6e89dd: /_/src/DurableTask.Core/TaskHubClient.cs:614)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient+<Microsoft-Azure-WebJobs-Extensions-DurableTask-IDurableOrchestrationClient-StartNewAsync>d__34`1.MoveNext (Microsoft.Azure.WebJobs.Extensions.DurableTask, Version=2.0.0.0, Culture=neutral, PublicKeyToken=014045d636e89289: D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:210)
@shibayan
Author

Production environments that had been operating normally suddenly began to experience similar exceptions, with orchestration startup failing roughly 20–30% of the time.

Since no new deployments have been made, I suspect that the state Netherite holds internally has been corrupted. The Event Hubs metrics look normal, so Event Hubs itself appears to be working fine.

This may be related to what is reported in the following discussion.

Information

  • Durable Functions version: v2.9.6
  • Netherite version: v1.3.5
  • Durable Functions Monitor version: v6.2.1

Stack Trace

Microsoft.Azure.WebJobs.Host.FunctionInvocationException:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<ExecuteWithLoggingAsync>d__26.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.39.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:352)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<TryExecuteAsync>d__18.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.39.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:108)
Inner exception System.TimeoutException handled at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Netherite.Client+<CreateTaskOrchestrationAsync>d__37.MoveNext (DurableTask.Netherite, Version=1.0.0.0, Culture=neutral, PublicKeyToken=ef8c4135b1b4225a: /_/src/DurableTask.Netherite/OrchestrationService/Client.cs:469)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Netherite.NetheriteOrchestrationService+<DurableTask-Core-IOrchestrationServiceClient-CreateTaskOrchestrationAsync>d__99.MoveNext (DurableTask.Netherite, Version=1.0.0.0, Culture=neutral, PublicKeyToken=ef8c4135b1b4225a: /_/src/DurableTask.Netherite/OrchestrationService/NetheriteOrchestrationService.cs:575)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Core.TaskHubClient+<InternalCreateOrchestrationInstanceWithRaisedEventAsync>d__27.MoveNext (DurableTask.Core, Version=2.13.0.0, Culture=neutral, PublicKeyToken=d53979610a6e89dd: /_/src/DurableTask.Core/TaskHubClient.cs:614)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient+<Microsoft-Azure-WebJobs-Extensions-DurableTask-IDurableOrchestrationClient-StartNewAsync>d__34`1.MoveNext (Microsoft.Azure.WebJobs.Extensions.DurableTask, Version=2.0.0.0, Culture=neutral, PublicKeyToken=014045d636e89289: D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:210)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)

@shibayan
Author

As additional information, we observed that the probability of the orchestrator failing to start increased step by step, until it eventually failed to start almost 100% of the time.

We have confirmed that changing the hubName as a workaround allows the orchestrator to start correctly again. Based on these results, we suspect a consistency problem or corruption in the blobs that Netherite uses to store its state.
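
For reference, the hubName in question is the Durable Task hub name set in host.json; a minimal sketch of the kind of change used as the workaround (the hub name and provider settings shown here are placeholders, not our actual configuration):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MyTaskHubV2",
      "storageProvider": {
        "type": "Netherite"
      }
    }
  }
}
```

Pointing the app at a fresh hub name makes Netherite create new partition state, which is presumably why it works around the suspected corruption.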

We still have the problematic blob intact and can share it with you if you need it for your investigation.

@cgillum
Member

cgillum commented Oct 20, 2023

@davidmrdavid or @sebastianburckhardt does this issue look familiar?

@nytian nytian added the P1 Priority 1 label Oct 24, 2023
@sebastianburckhardt
Member

sebastianburckhardt commented Oct 24, 2023

Your observations are consistent with the partition state getting corrupted, and one or more partitions becoming unresponsive as a result. If this keeps happening (and almost all partitions are corrupted) it would explain why orchestrators fail almost all the time.

We have been trying to find the source of these state corruptions. We have added some fixes and mitigations since 1.3.5 - it may be worth trying the latest release (1.4.1).

Also, I would be happy to look at traces to try to pinpoint the source of the corruption. This has been a problem for a while now. If this is running on a hosted plan (e.g. Consumption or Premium), I can take a look at our internal telemetry if you tell me the app name.

@shibayan
Author

@sebastianburckhardt Thanks for the reply. We are going to upgrade to v1.4.1 now and see whether the issue improves.

We still have an Azure Functions app (Consumption plan) that had this problem a few days ago and would be happy to share its name to help improve things. However, as this is a production environment, we would prefer to provide the details via email or another private channel. Can you please let us know how to do this?

@sebastianburckhardt
Member

Feel free to send me an e-mail. If you don't know the address, it is on my homepage (and you can find my homepage from my GitHub profile page, I believe).

@sebastianburckhardt
Member

Looking at the traces, I can confirm that our suspicion was right: several partitions enter a state where recovery fails repeatedly. In each case, the recovery appears to successfully read to the end of the object log but then stops making progress.

Unfortunately, I was not able to diagnose why the recovery hangs; all the storage accesses in the traces look normal.
At this point I would need some other way to inspect the failing recovery. These are the blobs being read during recovery:

336d6313-f2f9-4a71-8300-a602bc176941/p11/commit-lease
336d6313-f2f9-4a71-8300-a602bc176941/p11/commit-log/commit-log.28
336d6313-f2f9-4a71-8300-a602bc176941/p11/last-checkpoint.json
336d6313-f2f9-4a71-8300-a602bc176941/p11/cpr-checkpoints/a5f4698a-e9f9-47a8-bd9d-2dc2d39e3378/singletons.dat
336d6313-f2f9-4a71-8300-a602bc176941/p11/cpr-checkpoints/a5f4698a-e9f9-47a8-bd9d-2dc2d39e3378/info.dat
336d6313-f2f9-4a71-8300-a602bc176941/p11/index-checkpoints/39dbaaed-b115-484a-bb31-9ce773a0d0dd/info.dat
336d6313-f2f9-4a71-8300-a602bc176941/p11/index-checkpoints/39dbaaed-b115-484a-bb31-9ce773a0d0dd/ht.dat.0
336d6313-f2f9-4a71-8300-a602bc176941/p11/store/store.0
336d6313-f2f9-4a71-8300-a602bc176941/p11/store.obj/store.obj.0

If these are still around, we may be able to use them to debug the recovery.

I am considering creating an executable that runs the FASTER recovery directly from the command line, so that it can easily be attached to in a debugger.
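
As a rough illustration only, such a harness might look like the sketch below, which uses FASTER's public recovery API against a local copy of the partition files. The types, sizes, and paths are placeholders, and Netherite's real recovery uses its own typed store and Azure blob devices rather than local files:

```csharp
using System;
using System.IO;
using FASTER.core;

class RecoveryHarness
{
    // Usage: RecoveryHarness <directory with a local copy of the partition's FASTER files>
    static void Main(string[] args)
    {
        string dir = args.Length > 0 ? args[0] : "./p11";

        // Local log device as a stand-in for Netherite's Azure blob devices.
        using IDevice log = Devices.CreateLogDevice(Path.Combine(dir, "store.log"));

        using var store = new FasterKV<long, long>(
            1L << 20,
            new LogSettings { LogDevice = log },
            new CheckpointSettings { CheckpointDir = dir });

        try
        {
            // Run checkpoint recovery with a debugger attached to see where it stalls.
            store.Recover();
            Console.WriteLine("Recovery completed.");
        }
        catch (Exception e)
        {
            Console.WriteLine($"Recovery failed: {e}");
        }
    }
}
```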

@shibayan
Author

Thank you for your investigation. As for the files, they still exist, but they are difficult to share because they contain sensitive customer information.

Instead, we have emailed you information about a development environment where a similar problem is occurring, and we are considering whether the issue can be investigated there.

It would be very nice to have a tool that can revert a hung Task Hub, as it is currently impossible to examine the logs of the process.

@sebastianburckhardt
Member

Last week we found a bug that manifests exactly like what is reported here.
#343

Since the symptoms are a precise match, there is a good chance that our current fix will address this issue.

@shibayan
Author

That's good news. I've been fortunate enough not to experience any hangs since updating to the latest version of Netherite, and I'm glad to see that the root cause is being addressed.

@ericleigh007

Not to tag on here, but we see this bug in 1.4.2, which supposedly has the FASTER upgrades.

We are working on our upgrade to 1.5.1, and the release notes look promising for closing more of the bugs we're seeing.

Just wanted to let you know that we see this as of 1.4.2 / 2.0.23 of FASTER.

@ericleigh007

Just FYI, we have sent in logs showing that we still receive this error on version 1.5.1.
