Activity Function started multiple times before completion #392

Open
marius-stanescu opened this issue May 3, 2024 · 7 comments
Labels
P1 Priority 1 Reliability Durable functions get stuck or don’t run as expected.

Comments

@marius-stanescu

Summary:
We're observing instances where the same Activity Function is being invoked multiple times before it completes.

Additional information:

This Activity Function is started by an Orchestrator that awaits its completion. Although the Activity Function took around 35 seconds to execute when it finally succeeded, the Orchestrator's overall duration was more than 35 minutes.

From the correlated dependency logs, each time the Activity Function is invoked, it makes some progress, but it doesn't fully finish. There are no exceptions or errors during the processing; the Function seems to just "die" all of a sudden, and it is started again at a later time. It finally completed the execution more than 33 minutes after it was first invoked.
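For context, the workflow described above has roughly the following shape (a minimal sketch, assuming the .NET isolated worker model; only the name StartOfDayUpsertActivity comes from the logs, everything else is illustrative):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class StartOfDayFunctions
{
    [Function(nameof(StartOfDayOrchestrator))] // illustrative name
    public static async Task StartOfDayOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // The orchestrator schedules the activity and awaits its completion.
        // Durable Functions executes activities at least once, so if the
        // worker dies before the completion message is persisted, the same
        // activity is re-dispatched later with a new invocation Id -- which
        // matches the repeated "Executing ..." log entries below.
        await context.CallActivityAsync(nameof(StartOfDayUpsertActivity));
    }

    [Function(nameof(StartOfDayUpsertActivity))]
    public static Task StartOfDayUpsertActivity([ActivityTrigger] string? input)
    {
        // ~35 seconds of upsert work when it succeeds.
        return Task.CompletedTask;
    }
}
```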

Logs:
The series of logs from the execution environment that show the recurring invocation of the Activity Function without successful completion:

4/30/2024, 11:19:16.877 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=d9480089-03b2-43c3-bb43-1ae28422bf90)
4/30/2024, 11:20:20.344 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=9635f32b-d150-41f1-98a1-7ac11e254a79)
4/30/2024, 11:24:45.411 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=59e01faf-f031-427c-a5db-4c59f7ea32f2)
4/30/2024, 11:25:36.743 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=cbd95782-2130-4385-a6c9-655e46bbd0e4)
4/30/2024, 11:28:45.263 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=25923d2c-9df4-47a2-824d-5f33c5b57489)
4/30/2024, 11:30:49.465 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=90e1f92f-e204-44f7-967a-884196c05952)
4/30/2024, 11:37:34.184 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=858b82c5-7691-40e4-940b-4ee405e2201c)
4/30/2024, 11:40:32.827 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=0e8b82a6-fc7a-4b92-84e6-10c03711c4c9)
4/30/2024, 11:44:36.547 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=611be298-bdb1-48eb-888e-1192e040d7d9)
4/30/2024, 11:45:51.290 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=68763aed-1b55-40e6-848c-48070bb23520)
4/30/2024, 11:47:25.412 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=4cb06969-3175-464e-8c8a-0c653957c43d)
4/30/2024, 11:48:57.625 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=c070d3ff-b8df-4f1a-a420-7852b16b936e)
4/30/2024, 11:50:28.654 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=83f41053-62a0-46db-b961-f39d40d2e4f5)
4/30/2024, 11:52:12.297 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=49445402-c24e-4066-9922-fca6a2c5e9b5)
4/30/2024, 11:52:45.042 AM
Executed 'StartOfDayUpsertActivity' (Succeeded, Id=49445402-c24e-4066-9922-fca6a2c5e9b5, Duration=34883ms)

At the same time, we observed various types of warnings, such as performance issues, storage conflicts, timeouts, and resource utilization alerts, that might be related to the unexpected behaviour described above. Here are some examples of such warnings:

Error response [8959bebe-ae91-4a2a-addb-8fd3476e538e] 409 The specified container already exists. (02.6s) Server:Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0 x-ms-request-id:5599b6e0-301e-0061-73f0-9aa36d000000 x-ms-client-request-id:8959bebe-ae91-4a2a-addb-8fd3476e538e x-ms-version:2022-11-02 x-ms-error-code:ContainerAlreadyExists Date:Tue, 30 Apr 2024 11:19:14 GMT Content-Length:230 Content-Type:application/xml

Error response [2badf8bd-5897-48be-9fb5-9f2c29b52033] 409 There is already a lease present. (02.0s) Server:Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0 x-ms-request-id:4885e259-c01e-0038-5ff0-9a24ee000000 x-ms-client-request-id:2badf8bd-5897-48be-9fb5-9f2c29b52033 x-ms-version:2022-11-02 x-ms-error-code:LeaseAlreadyPresent Date:Tue, 30 Apr 2024 11:19:15 GMT Content-Length:221 Content-Type:application/xml

Part00 Performance issue detected: storage operation PageBlobClient.DownloadStreamingAsync (ReadFromDevice) took 1.5s on attempt 1, which is excessive; id=7213 position=51179008 length=3584 operationReadRange=[51179008, 51182592]

Part10 Performance issue detected: RenewLeaseAsync took 2.4s, which is excessive; -11.871818500000003s past expiry

[HostMonitor] Host CPU threshold exceeded (87 >= 80)

Possible thread pool starvation detected.

Part01.0173103172 Dropped query 4890768R16 during recovery because it has timed out

Client.cc4e2ab Request cc4e2abR3 for partition 01 timed out

What could be the cause of the Activity Function not being completed, and started over and over again? And how can this be prevented?

@nytian nytian added the P1 Priority 1 label May 7, 2024
@lilyjma lilyjma added Reliability Durable functions get stuck or don’t run as expected. and removed Needs: Triage 🔍 labels May 7, 2024
@marius-stanescu
Author

Hey. Any news about this?

@adamconway
Copy link

adamconway commented Aug 16, 2024

We have noticed a similar issue. In some cases, our Entity function appears to be triggered multiple times without completing. It seems to stop partway through without generating any further error logs or traces.

After approximately 3-15 minutes, the Entity will be signalled again and complete successfully. When this issue is not occurring, the whole process takes only 100-300ms end-to-end.

This is occurring fairly regularly, so if there are any debug logs we can collect to identify the issue, please let me know.

@lilyjma
Member

lilyjma commented Aug 26, 2024

@marius-stanescu and @adamconway - what language are you using by the way?

@sebastianburckhardt - could you take a look? This seems to be happening for both Orchestrations and Entities when using Netherite.

@marius-stanescu
Author

@marius-stanescu and @adamconway - what language are you using by the way?

C#

@sebastianburckhardt
Member

What could be the cause of the Activity Function not being completed, and started over and over again? And how can this be prevented?

Common causes could include infinite loops in the application code, or out-of-memory issues. There could also be process crashes caused by something not related to the activity, but impacting the activity execution.

In all of those cases, it is expected that the activity will be retried again later.
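Because of that at-least-once guarantee, activity code generally needs to tolerate re-execution. A hedged sketch of what that can look like for an upsert-style activity (IDayRepository and DayData are hypothetical, not from this issue):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;

public class StartOfDayActivities
{
    private readonly IDayRepository _repository; // hypothetical dependency

    public StartOfDayActivities(IDayRepository repository) =>
        _repository = repository;

    [Function("StartOfDayUpsertActivity")]
    public async Task Run([ActivityTrigger] DayData data)
    {
        // Upserts keyed on a deterministic value are safe to re-run: a
        // retried invocation overwrites the same record instead of creating
        // duplicates. Non-idempotent side effects (e.g. sending an email)
        // would instead need a de-duplication check, for example keyed on
        // the orchestration instance Id.
        await _repository.UpsertAsync(data.Date.ToString("yyyy-MM-dd"), data);
    }
}
```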

Of the error messages you posted, none of the Netherite-internal ones look out of the ordinary; however, these are worrisome:

  [HostMonitor] Host CPU threshold exceeded (87 >= 80)

  Possible thread pool starvation detected

They could indicate that you are using too much CPU on a worker, which can cause all kinds of issues (including the ones you are describing).

I can look at our internal telemetry if you give me the app name and a time window (it looks like I can infer the time window from what you posted earlier).

@adamconway

Also using C# here. We recently switched to Netherite; this code had previously been running for multiple years on the Azure Storage provider without issue, which is why we suspect the problem lies in Netherite rather than in our application code.

I've provided Function app names/IDs, timings, logs etc. via support ticket 2408280040000227.

We also noticed if we set the minimum scale of our Function Apps to a higher number, it seems to reduce the impact of the issue (less cancelled executions and shorter delays), but the delay is still in the range of 30-600 seconds in some cases.
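For reference, on an Elastic Premium plan the minimum (always-ready) instance count can be raised with the Azure CLI; the plan and resource-group names below are placeholders:

```shell
# Raise the instance floor of the Premium plan so fewer cold workers are
# spun up and recycled under load. Names are placeholders.
az functionapp plan update \
  --name my-premium-plan \
  --resource-group my-rg \
  --min-instances 4
```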

@marius-stanescu
Copy link
Author

The app name is fa-uks-lbg-nft-more.
In the meantime, we moved away from Netherite, as it didn't seem stable enough. Maybe we'll give it another go in the future.
