`TimeoutException` at the start of the orchestrator #319

shibayan · 2023-10-16T08:43:21Z

I am trying to start an orchestrator function by calling StartNewAsync. According to Application Insights, the send to Event Hubs succeeds, but the method call blocks and eventually times out.

In older versions this was a rare occurrence, but after updating Netherite to v1.4.0 it now occurs almost 100% of the time.

Information

Durable Functions version: v2.10.0
Netherite version: v1.4.0
Durable Functions Monitor version: v6.3.0

Stack Trace

Microsoft.Azure.WebJobs.Host.FunctionInvocationException:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<ExecuteWithLoggingAsync>d__26.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.39.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:352)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<TryExecuteAsync>d__18.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.39.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:108)
Inner exception System.TimeoutException handled at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Netherite.Client+<CreateTaskOrchestrationAsync>d__37.MoveNext (DurableTask.Netherite, Version=1.0.0.0, Culture=neutral, PublicKeyToken=ef8c4135b1b4225a: /_/src/DurableTask.Netherite/OrchestrationService/Client.cs:469)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Netherite.NetheriteOrchestrationService+<DurableTask-Core-IOrchestrationServiceClient-CreateTaskOrchestrationAsync>d__100.MoveNext (DurableTask.Netherite, Version=1.0.0.0, Culture=neutral, PublicKeyToken=ef8c4135b1b4225a: /_/src/DurableTask.Netherite/OrchestrationService/NetheriteOrchestrationService.cs:594)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Core.TaskHubClient+<InternalCreateOrchestrationInstanceWithRaisedEventAsync>d__27.MoveNext (DurableTask.Core, Version=2.13.0.0, Culture=neutral, PublicKeyToken=d53979610a6e89dd: /_/src/DurableTask.Core/TaskHubClient.cs:614)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient+<Microsoft-Azure-WebJobs-Extensions-DurableTask-IDurableOrchestrationClient-StartNewAsync>d__34`1.MoveNext (Microsoft.Azure.WebJobs.Extensions.DurableTask, Version=2.0.0.0, Culture=neutral, PublicKeyToken=014045d636e89289: D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:210)

shibayan · 2023-10-20T03:26:51Z

Production environments that had been operating normally suddenly began to experience similar exceptions, with roughly 20-30% of orchestration starts failing.

Since no new deployments have been made, I suspect that the state held internally by Netherite has been corrupted. The Event Hubs metrics look normal, so Event Hubs itself appears to be working fine.

This may be related to what is reported in the following discussion.

Message queue stuck with processing (hangs) #282

Information

Durable Functions version: v2.9.6
Netherite version: v1.3.5
Durable Functions Monitor version: v6.2.1

Stack Trace

Microsoft.Azure.WebJobs.Host.FunctionInvocationException:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<ExecuteWithLoggingAsync>d__26.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.39.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:352)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<TryExecuteAsync>d__18.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.39.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:108)
Inner exception System.TimeoutException handled at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw:
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Netherite.Client+<CreateTaskOrchestrationAsync>d__37.MoveNext (DurableTask.Netherite, Version=1.0.0.0, Culture=neutral, PublicKeyToken=ef8c4135b1b4225a: /_/src/DurableTask.Netherite/OrchestrationService/Client.cs:469)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Netherite.NetheriteOrchestrationService+<DurableTask-Core-IOrchestrationServiceClient-CreateTaskOrchestrationAsync>d__99.MoveNext (DurableTask.Netherite, Version=1.0.0.0, Culture=neutral, PublicKeyToken=ef8c4135b1b4225a: /_/src/DurableTask.Netherite/OrchestrationService/NetheriteOrchestrationService.cs:575)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at DurableTask.Core.TaskHubClient+<InternalCreateOrchestrationInstanceWithRaisedEventAsync>d__27.MoveNext (DurableTask.Core, Version=2.13.0.0, Culture=neutral, PublicKeyToken=d53979610a6e89dd: /_/src/DurableTask.Core/TaskHubClient.cs:614)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient+<Microsoft-Azure-WebJobs-Extensions-DurableTask-IDurableOrchestrationClient-StartNewAsync>d__34`1.MoveNext (Microsoft.Azure.WebJobs.Extensions.DurableTask, Version=2.0.0.0, Culture=neutral, PublicKeyToken=014045d636e89289: D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:210)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)

shibayan · 2023-10-20T14:48:08Z

As additional information, we observed that the probability of the orchestrator failing to start increased step by step, until it eventually failed to start almost 100% of the time.

We have confirmed that changing the hubName as a workaround allows the orchestrator to start correctly again. Based on these results, we suspect a consistency problem or corruption in the blobs that Netherite uses to store state.

We still have the problematic blob intact and can share it with you if you need it for your investigation.
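
The hubName workaround mentioned above amounts to pointing the app at a fresh task hub. A minimal sketch of that change, assuming the hub name is configured under `extensions.durableTask.hubName` in `host.json`; the helper name `with_new_hub_name` and the sample hub names are hypothetical and not taken from this thread:

```python
import copy
import json


def with_new_hub_name(host_config: dict, new_hub_name: str) -> dict:
    """Return a copy of a host.json document that uses a different task hub name."""
    updated = copy.deepcopy(host_config)
    # Create the nested sections if they are missing, then set hubName.
    durable = updated.setdefault("extensions", {}).setdefault("durableTask", {})
    durable["hubName"] = new_hub_name
    return updated


if __name__ == "__main__":
    # Hypothetical host.json content; only hubName is changed.
    host_config = {
        "version": "2.0",
        "extensions": {
            "durableTask": {
                "hubName": "myhub",
                "storageProvider": {"type": "Netherite"},
            }
        },
    }
    print(json.dumps(with_new_hub_name(host_config, "myhubv2"), indent=2))
```

A new hub name starts from empty state, so this only sidesteps the hung partitions; orchestrations that were in flight in the old hub are not carried over.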

cgillum · 2023-10-20T17:39:14Z

@davidmrdavid or @sebastianburckhardt does this issue look familiar?

sebastianburckhardt · 2023-10-24T21:28:28Z

Your observations are consistent with the partition state getting corrupted, and one or more partitions becoming unresponsive as a result. If this keeps happening (and almost all partitions are corrupted) it would explain why orchestrators fail almost all the time.

We have been trying to find the source of these state corruptions. We have added some fixes and mitigations since 1.3.5 - it may be worth trying the latest release (1.4.1).

Also, I would be happy to look at traces to try to pinpoint the source of the corruption. This has been a problem for a while now. If this is running on a hosted plan (e.g. Consumption or Premium), I can take a look at our internal telemetry if you tell me the app name.
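
The link between unresponsive partitions and the observed failure rates can be illustrated with a little arithmetic. This is only a sketch: it assumes new orchestration instances are spread roughly uniformly across partitions, and the partition count of 12 used in the example is a hypothetical value, not one confirmed in this thread.

```python
def expected_start_failure_rate(unresponsive: int, total: int) -> float:
    """Fraction of orchestration starts expected to time out.

    Assumes each new instance lands on a partition chosen uniformly at
    random, so a start fails exactly when its partition is unresponsive.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= unresponsive <= total:
        raise ValueError("unresponsive must be between 0 and total")
    return unresponsive / total


if __name__ == "__main__":
    total_partitions = 12  # hypothetical partition count
    for broken in (3, 6, 12):
        rate = expected_start_failure_rate(broken, total_partitions)
        print(f"{broken}/{total_partitions} unresponsive -> {rate:.0%} of starts fail")
```

Under these assumptions, a failure rate of 20-30% would correspond to roughly a quarter of the partitions being stuck, and a rate that climbs in steps toward 100% would match partitions becoming unresponsive one after another.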

shibayan · 2023-10-27T17:10:38Z

@sebastianburckhardt Thanks for the reply. We are going to upgrade to v1.4.1 now and see if the issue has improved.

We still have the Azure Functions app (Consumption plan) that had this problem a few days ago and would be happy to share the app name to help with the investigation. However, as this is a production environment, we would prefer to provide details via email or another private channel. Can you please let us know how to do this?

sebastianburckhardt · 2023-10-27T17:56:49Z

Feel free to send me an e-mail. If you don't know the address, it is on my homepage (and you can find my homepage from my github profile page I believe).

sebastianburckhardt · 2023-11-03T21:40:08Z

Looking at the traces, I can confirm that our suspicion was right: several partitions enter a state where recovery fails repeatedly. In each case, the recovery appears to successfully read to the end of the object log but then stops making progress.

Unfortunately, I was not able to diagnose why the recovery hangs; all the storage accesses in the traces look normal.
At this point I would need some other way to inspect the failing recovery. These are the blobs being read during recovery:

336d6313-f2f9-4a71-8300-a602bc176941/p11/commit-lease
336d6313-f2f9-4a71-8300-a602bc176941/p11/commit-log/commit-log.28
336d6313-f2f9-4a71-8300-a602bc176941/p11/last-checkpoint.json
336d6313-f2f9-4a71-8300-a602bc176941/p11/cpr-checkpoints/a5f4698a-e9f9-47a8-bd9d-2dc2d39e3378/singletons.dat
336d6313-f2f9-4a71-8300-a602bc176941/p11/cpr-checkpoints/a5f4698a-e9f9-47a8-bd9d-2dc2d39e3378/info.dat
336d6313-f2f9-4a71-8300-a602bc176941/p11/index-checkpoints/39dbaaed-b115-484a-bb31-9ce773a0d0dd/info.dat
336d6313-f2f9-4a71-8300-a602bc176941/p11/index-checkpoints/39dbaaed-b115-484a-bb31-9ce773a0d0dd/ht.dat.0
336d6313-f2f9-4a71-8300-a602bc176941/p11/store/store.0
336d6313-f2f9-4a71-8300-a602bc176941/p11/store.obj/store.obj.0

If these are still around, we may be able to use them to debug the recovery.
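
Checking whether the blobs listed above are still around could be done by comparing them against a listing of the storage container. A minimal sketch, assuming the paths follow a `<prefix>/<partition>/<relative path>` layout as in the list; the function names are hypothetical, and obtaining the actual container listing is left out:

```python
from collections import defaultdict


def group_by_partition(blob_paths):
    """Group blob paths by their partition segment (e.g. 'p11')."""
    groups = defaultdict(list)
    for path in blob_paths:
        parts = path.split("/")
        # Assumed layout: <prefix>/<partition>/<relative path>
        if len(parts) < 3:
            continue  # skip anything that does not match the layout
        groups[parts[1]].append("/".join(parts[2:]))
    return dict(groups)


def missing_blobs(expected_paths, existing_paths):
    """Return the expected paths that are absent from the listing, in order."""
    existing = set(existing_paths)
    return [p for p in expected_paths if p not in existing]


if __name__ == "__main__":
    prefix = "336d6313-f2f9-4a71-8300-a602bc176941/p11"
    expected = [
        f"{prefix}/commit-lease",
        f"{prefix}/last-checkpoint.json",
        f"{prefix}/store/store.0",
    ]
    # Hypothetical listing in which one blob has already been deleted.
    listing = expected[:2]
    print(group_by_partition(expected))
    print(missing_blobs(expected, listing))
```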

I am considering creating an executable that can be run directly from the command line which runs the FASTER recovery and can be attached to easily in the debugger.

shibayan · 2023-11-22T05:02:31Z

Thank you for your investigation. The files do still exist, but they are difficult to share because they contain sensitive customer information.

Instead, we have emailed you information about a development environment where a similar problem is occurring, and we are considering whether we can investigate the issue here.

It would be very nice to have a tool that can recover a hung Task Hub, as it is currently impossible to examine the logs of the process.

sebastianburckhardt · 2024-01-18T19:35:03Z

Last week we found a bug that manifests exactly like what is reported here.
#343

Since the symptoms are a precise match, there is a good chance that our current fix will address this issue.

shibayan · 2024-01-19T11:57:01Z

That's good news, and while I've been fortunate enough not to have experienced any hangs to this point since updating to the latest version of Netherite, I'm glad to see that the root cause is being addressed.

ericleigh007 · 2024-05-13T14:24:47Z

Not to tag on here, but we see this bug in 1.4.2, which supposedly has the FASTER upgrades.

We are working on our upgrade to 1.5.1 and the release notes look promising to close more bugs that we're seeing.

Just wanted to let you know that we still see this as of Netherite 1.4.2 / FASTER 2.0.23.

ericleigh007 · 2024-05-16T20:05:49Z

Just FYI, we have sent in logs showing that we still receive this error with version 1.5.1.

microsoft-github-policy-service bot added the Needs: Triage 🔍 label Oct 16, 2023

nytian added the P1 Priority 1 label Oct 24, 2023

sebastianburckhardt added needs author response and removed Needs: Triage 🔍 labels Oct 24, 2023

sebastianburckhardt removed the needs author response label Nov 3, 2023
