We are experiencing high latency between different DurableTask work items. This happens under low, medium, and high load.
Service Facts:
App
We are running Durable Functions on .NET 8 with the dotnet-isolated worker model. Our function app runs on a Windows App Service plan with the EP1 SKU, with autoscale enabled up to 20 instances.
We have also set the FUNCTIONS_WORKER_PROCESS_COUNT environment variable to 4, as this has worked best for us in the past.
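For illustration, this is how the setting appears among our app settings (portal "Advanced edit" style; only this one entry is shown here):

```json
[
  {
    "name": "FUNCTIONS_WORKER_PROCESS_COUNT",
    "value": "4",
    "slotSetting": false
  }
]
```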
Function
The function app is configured through host.json.
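The part most relevant to Netherite is the storageProvider configuration; a minimal sketch with placeholder values (not our exact file) looks roughly like this:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MyTaskHub",
      "storageProvider": {
        "type": "Netherite",
        "partitionCount": 12,
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection"
      }
    }
  }
}
```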
DurableTask Versions
We tested Microsoft.Azure.Functions.Worker.Extensions.DurableTask 1.2.2 and Microsoft.Azure.Functions.Worker.Extensions.DurableTask.Netherite 3.0.0.
Instance IDs
When the HTTP function starts the run, we set the instance ID to a GUID plus an additional suffix number. This is typically ":0", so in this test all runs had an instance ID of the form <GUID>:0.
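For illustration, the starter looks roughly like this (simplified sketch; the orchestrator name "WorkOrchestration" and the trigger details are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public static class HttpStart
{
    [Function("HttpStart")]
    public static async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
        [DurableClient] DurableTaskClient client)
    {
        // <GUID> plus a suffix number; in this test the suffix is always ":0".
        string instanceId = $"{Guid.NewGuid()}:0";

        // "WorkOrchestration" is a placeholder for our orchestrator name.
        await client.ScheduleNewOrchestrationInstanceAsync(
            "WorkOrchestration",
            input: null,
            new StartOrchestrationOptions(InstanceId: instanceId));

        // Returns immediately with the standard status-query response,
        // without waiting for the orchestration to complete.
        return await client.CreateCheckStatusResponseAsync(req, instanceId);
    }
}
```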
Could this be a Problem in Our Case?
We were concerned that this would result in our instances being placed on the wrong partitions:
This seems OK, but also a little unbalanced.
Test Description
We ran the tests with the tool k6 from our client, calling the HTTP trigger function.
This HTTP trigger starts an orchestration, which loads some "work" information from a Cosmos DB (A1, A2). This information includes the tasks to perform along with their corresponding parameters. In this sample, we had one task configured. Some activity functions then prepare the work (A3, A4, A5).
For the "Do Work" Activity Function (A6), we also start a Durable Timer (T) to wait for a configured Timeout. Here we wait with Task.WhenAny(doWork, timer) as per Documentation.
This Process can be seen in the following diagram:
For the test, we run k6 with 10 virtual users, each of which repeatedly starts the process described above over a 10 s test duration. Since the HTTP function should return immediately after creating the orchestration, we would expect far more runs, but in our case we had approx. 60 runs.
Problem
We are seeing a large number of orchestration runs that take orders of magnitude longer than the majority. We think this is mostly because there is some delay between Netherite work items, but we don't know why these delays are happening.
Negative Test
Consider the following orchestration run:
For Activity 2, you can see that the activity work item starts much earlier (1.5 s) than the "Invoke" of the work is triggered.
You can also see that the "create_orchestration" function runs longer than the orchestration itself, even though that function only starts the orchestration and returns a response.
This in turn makes the HttpStart function run much longer than expected:
Positive Test
For comparison, here is a positive test, where everything runs in ~900 ms:
Additional Metrics during Test
During test execution, the app scales to 7 instances:
While scaling, we also see some high delays, as the partitions are rebalanced across instances.
One thing we also find interesting is the number of "ChangeNotify" calls, which kept occurring even after the test had finished:
In the above-mentioned negative test, we have also seen that it takes 300 ms from the first Netherite event until the first log message from our code:
What we have tried
Higher Premium Plans
We have tried EP2, but for our case we think EP1 should be sufficient, and our issue seems to be something other than raw performance.
Higher Concurrency for Activities or Orchestrators
When running this test with maxConcurrentOrchestratorFunctions set to 20 and maxConcurrentActivityFunctions set to 50, we achieved similar results, but we can retest if necessary.
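For reference, these throttles were set in host.json roughly as follows (only the relevant durableTask fragment is shown):

```json
{
  "extensions": {
    "durableTask": {
      "maxConcurrentOrchestratorFunctions": 20,
      "maxConcurrentActivityFunctions": 50
    }
  }
}
```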
Splitting onto Two Storage Accounts
When running some tests, we noticed latency spikes on the storage account configured in AzureWebJobsStorage.
We therefore pointed AzureWebJobsStorage and WEBSITE_CONTENTAZUREFILECONNECTIONSTRING to two different storage accounts, which did not help and in fact reduced performance.
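Roughly what we configured (portal "Advanced edit" style; the connection strings are placeholders):

```json
[
  {
    "name": "AzureWebJobsStorage",
    "value": "<connection string for storage account 1>",
    "slotSetting": false
  },
  {
    "name": "WEBSITE_CONTENTAZUREFILECONNECTIONSTRING",
    "value": "<connection string for storage account 2>",
    "slotSetting": false
  }
]
```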
Logging to Warning
We also set all our logs to the "Warning" level, expecting this to reduce the stress on the service, but that did not help either.
Questions
1) Can you please help us identify the issue(s)?
2) Does the environment variable FUNCTIONS_WORKER_PROCESS_COUNT negatively impact scaling or partition load balancing?
2.1) Are the maxConcurrentOrchestratorFunctions / maxConcurrentActivityFunctions settings applied per app instance or per host process?
2.2) Are the partitions load-balanced across instances only, or across instances and host processes? With 12 partitions, 3 instances, and 4 host processes per instance, would each host process own one partition?
3) As mentioned before, we also see some high delays during scaling; can this be mitigated somehow (other than having more always-ready instances)?
If you need more information or specific run identifiers, please feel free to contact me or my colleagues @ManiStras and @RafaelBak.
Thanks in Advance,
Markus