Activity Function started multiple times before completion #392

Open
marius-stanescu opened this issue May 3, 2024 · 7 comments
Labels
P1 Priority 1 Reliability Durable functions get stuck or don’t run as expected.

Comments

@marius-stanescu

Summary:
We're observing instances where the same Activity Function is being invoked multiple times before it completes.

Additional information:

This Activity Function is started by an Orchestrator that awaits its completion. Although the Activity Function took around 35 seconds to execute when it finally succeeded, the Orchestrator's overall duration was more than 35 minutes.

From the correlated dependency logs, each time the Activity Function is invoked, it makes some progress, but it doesn't fully finish. There are no exceptions or errors during the processing; the Function seems to just "die" all of a sudden, and it is started again at a later time. It finally completed the execution more than 33 minutes after it was first invoked.
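For context, the workflow described above has roughly the following shape (a minimal sketch, assuming the .NET isolated worker model; only the name StartOfDayUpsertActivity comes from the logs, everything else is illustrative):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class StartOfDayFunctions
{
    [Function(nameof(StartOfDayOrchestrator))] // illustrative name
    public static async Task StartOfDayOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // The orchestrator schedules the activity and awaits its completion.
        // Durable Functions executes activities at least once, so if the
        // worker dies before the completion message is persisted, the same
        // activity is re-dispatched later with a new invocation Id -- which
        // matches the repeated "Executing ..." log entries below.
        await context.CallActivityAsync(nameof(StartOfDayUpsertActivity));
    }

    [Function(nameof(StartOfDayUpsertActivity))]
    public static Task StartOfDayUpsertActivity([ActivityTrigger] string? input)
    {
        // ~35 seconds of upsert work when it succeeds.
        return Task.CompletedTask;
    }
}
```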

Logs:
The series of logs from the execution environment that show the recurring invocation of the Activity Function without successful completion:

4/30/2024, 11:19:16.877 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=d9480089-03b2-43c3-bb43-1ae28422bf90)
4/30/2024, 11:20:20.344 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=9635f32b-d150-41f1-98a1-7ac11e254a79)
4/30/2024, 11:24:45.411 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=59e01faf-f031-427c-a5db-4c59f7ea32f2)
4/30/2024, 11:25:36.743 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=cbd95782-2130-4385-a6c9-655e46bbd0e4)
4/30/2024, 11:28:45.263 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=25923d2c-9df4-47a2-824d-5f33c5b57489)
4/30/2024, 11:30:49.465 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=90e1f92f-e204-44f7-967a-884196c05952)
4/30/2024, 11:37:34.184 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=858b82c5-7691-40e4-940b-4ee405e2201c)
4/30/2024, 11:40:32.827 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=0e8b82a6-fc7a-4b92-84e6-10c03711c4c9)
4/30/2024, 11:44:36.547 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=611be298-bdb1-48eb-888e-1192e040d7d9)
4/30/2024, 11:45:51.290 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=68763aed-1b55-40e6-848c-48070bb23520)
4/30/2024, 11:47:25.412 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=4cb06969-3175-464e-8c8a-0c653957c43d)
4/30/2024, 11:48:57.625 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=c070d3ff-b8df-4f1a-a420-7852b16b936e)
4/30/2024, 11:50:28.654 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=83f41053-62a0-46db-b961-f39d40d2e4f5)
4/30/2024, 11:52:12.297 AM
Executing 'StartOfDayUpsertActivity' (Reason='(null)', Id=49445402-c24e-4066-9922-fca6a2c5e9b5)
4/30/2024, 11:52:45.042 AM
Executed 'StartOfDayUpsertActivity' (Succeeded, Id=49445402-c24e-4066-9922-fca6a2c5e9b5, Duration=34883ms)

At the same time, we observed various types of warnings, such as performance issues, storage conflicts, timeouts, and resource utilization alerts, that might be related to the unexpected behaviour described above. Here are some examples of such warnings:

Error response [8959bebe-ae91-4a2a-addb-8fd3476e538e] 409 The specified container already exists. (02.6s) Server:Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0 x-ms-request-id:5599b6e0-301e-0061-73f0-9aa36d000000 x-ms-client-request-id:8959bebe-ae91-4a2a-addb-8fd3476e538e x-ms-version:2022-11-02 x-ms-error-code:ContainerAlreadyExists Date:Tue, 30 Apr 2024 11:19:14 GMT Content-Length:230 Content-Type:application/xml

Error response [2badf8bd-5897-48be-9fb5-9f2c29b52033] 409 There is already a lease present. (02.0s) Server:Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0 x-ms-request-id:4885e259-c01e-0038-5ff0-9a24ee000000 x-ms-client-request-id:2badf8bd-5897-48be-9fb5-9f2c29b52033 x-ms-version:2022-11-02 x-ms-error-code:LeaseAlreadyPresent Date:Tue, 30 Apr 2024 11:19:15 GMT Content-Length:221 Content-Type:application/xml

Part00 Performance issue detected: storage operation PageBlobClient.DownloadStreamingAsync (ReadFromDevice) took 1.5s on attempt 1, which is excessive; id=7213 position=51179008 length=3584 operationReadRange=[51179008, 51182592]

Part10 Performance issue detected: RenewLeaseAsync took 2.4s, which is excessive; -11.871818500000003s past expiry

[HostMonitor] Host CPU threshold exceeded (87 >= 80)

Possible thread pool starvation detected.

Part01.0173103172 Dropped query 4890768R16 during recovery because it has timed out

Client.cc4e2ab Request cc4e2abR3 for partition 01 timed out

What could be the cause of the Activity Function not being completed, and started over and over again? And how can this be prevented?

@nytian nytian added the P1 Priority 1 label May 7, 2024
@lilyjma lilyjma added Reliability Durable functions get stuck or don’t run as expected. and removed Needs: Triage 🔍 labels May 7, 2024
@marius-stanescu
Author

Hey. Any news about this?

@adamconway
Copy link

adamconway commented Aug 16, 2024

We have noticed a similar issue. In some cases, our Entity function appears to be triggered multiple times without completing. It seems to stop partway through without generating any further error logs or traces.

After approximately 3-15 minutes, the Entity will be signalled again and complete successfully. When this issue is not occurring, the whole process takes only 100-300ms end-to-end.

This is occurring fairly regularly, so if there are any debug logs we can collect to identify the issue, please let me know.

@lilyjma
Member

lilyjma commented Aug 26, 2024

@marius-stanescu and @adamconway - what language are you using by the way?

@sebastianburckhardt - could you take a look? This seems to be happening for both Orchestrations and Entities when using Netherite.

@marius-stanescu
Author

@marius-stanescu and @adamconway - what language are you using by the way?

C#

@sebastianburckhardt
Member

What could be the cause of the Activity Function not being completed, and started over and over again? And how can this be prevented?

Common causes could include infinite loops in the application code, or out-of-memory issues. There could also be process crashes caused by something not related to the activity, but impacting the activity execution.

In all of those cases, it is expected that the activity will be retried again later.
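Because of that at-least-once guarantee, activity code generally needs to tolerate re-execution. A hedged sketch of what that can look like for an upsert-style activity (IDayRepository and DayData are hypothetical, not from this issue):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;

public class StartOfDayActivities
{
    private readonly IDayRepository _repository; // hypothetical dependency

    public StartOfDayActivities(IDayRepository repository) =>
        _repository = repository;

    [Function("StartOfDayUpsertActivity")]
    public async Task Run([ActivityTrigger] DayData data)
    {
        // Upserts keyed on a deterministic value are safe to re-run: a
        // retried invocation overwrites the same record instead of creating
        // duplicates. Non-idempotent side effects (e.g. sending an email)
        // would instead need a de-duplication check, for example keyed on
        // the orchestration instance Id.
        await _repository.UpsertAsync(data.Date.ToString("yyyy-MM-dd"), data);
    }
}
```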

Of the error messages you posted, none of the Netherite-internal ones look out of the ordinary; however, these are worrisome:

  [HostMonitor] Host CPU threshold exceeded (87 >= 80)

  Possible thread pool starvation detected

They could indicate that you are using too much CPU on a worker, which can cause all kinds of issues (including the ones you are describing).

I can look at our internal telemetry if you give me the app name and a time window (it looks like I can infer the time window from what you posted earlier).

@adamconway

Also using C# here. We recently switched to Netherite; this code had previously been running for multiple years on the Azure Storage provider without issue, which is why we suspect the problem lies in Netherite rather than in our application code.

I've provided Function app names/IDs, timings, logs etc. via support ticket 2408280040000227.

We also noticed if we set the minimum scale of our Function Apps to a higher number, it seems to reduce the impact of the issue (less cancelled executions and shorter delays), but the delay is still in the range of 30-600 seconds in some cases.
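For reference, on an Elastic Premium plan the minimum (always-ready) instance count can be raised with the Azure CLI; the plan and resource-group names below are placeholders:

```shell
# Raise the instance floor of the Premium plan so fewer cold workers are
# spun up and recycled under load. Names are placeholders.
az functionapp plan update \
  --name my-premium-plan \
  --resource-group my-rg \
  --min-instances 4
```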

@marius-stanescu
Copy link
Author

The app name is fa-uks-lbg-nft-more.
In the meantime, we moved away from Netherite, as it didn't seem stable enough. Maybe we'll give it another go in the future.
