We are experiencing high latency between different DurableTask work items. This happens under low, medium, and high load.
Service Facts:
App
We are running Durable Functions on .NET 8 with the dotnet-isolated worker model. Our function app runs on a Windows App Service plan with the EP1 SKU, with autoscale enabled up to 20 instances.
We have also set the FUNCTIONS_WORKER_PROCESS_COUNT environment variable to 4, as this has worked best for us in the past.
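For illustration, this is how the setting appears among our app settings (portal "Advanced edit" style; only this one entry is shown here):

```json
[
  {
    "name": "FUNCTIONS_WORKER_PROCESS_COUNT",
    "value": "4",
    "slotSetting": false
  }
]
```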
Function
The function app is configured through host.json.
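The part most relevant to Netherite is the storageProvider configuration; a minimal sketch with placeholder values (not our exact file) looks roughly like this:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MyTaskHub",
      "storageProvider": {
        "type": "Netherite",
        "partitionCount": 12,
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection"
      }
    }
  }
}
```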
DurableTask Versions
We tested Microsoft.Azure.Functions.Worker.Extensions.DurableTask 1.2.2 and Microsoft.Azure.Functions.Worker.Extensions.DurableTask.Netherite 3.0.0.
Instance IDs
When the HTTP function starts the run, we set the instance ID to a GUID plus an additional suffix number. This is typically ":0", so in this test all runs had an instance ID of the form <GUID>:0.
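For illustration, the starter looks roughly like this (simplified sketch; the orchestrator name "WorkOrchestration" and the trigger details are placeholders):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public static class HttpStart
{
    [Function("HttpStart")]
    public static async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
        [DurableClient] DurableTaskClient client)
    {
        // <GUID> plus a suffix number; in this test the suffix is always ":0".
        string instanceId = $"{Guid.NewGuid()}:0";

        // "WorkOrchestration" is a placeholder for our orchestrator name.
        await client.ScheduleNewOrchestrationInstanceAsync(
            "WorkOrchestration",
            input: null,
            new StartOrchestrationOptions(InstanceId: instanceId));

        // Returns immediately with the standard status-query response,
        // without waiting for the orchestration to complete.
        return await client.CreateCheckStatusResponseAsync(req, instanceId);
    }
}
```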
Could this be a Problem in Our Case?
We were concerned that this would result in our instances being placed on the wrong partitions:
This seems OK, but also a little unbalanced.
Test Description
We ran the tests with the tool k6 from our client, calling the HTTP trigger function.
This HTTP trigger starts an orchestration, which loads some "work" information from a Cosmos DB (A1, A2). This information includes the tasks to perform along with their corresponding parameters. In this sample, we had one task configured. Some activity functions then prepare the work (A3, A4, A5).
For the "Do Work" Activity Function (A6), we also start a Durable Timer (T) to wait for a configured Timeout. Here we wait with Task.WhenAny(doWork, timer) as per Documentation.
This Process can be seen in the following diagram:
For the test, we run k6 with 10 virtual users, each of which repeatedly starts the process described above over a 10 s test duration. Since the HTTP function should return immediately after creating the orchestration, we would expect far more runs, but in our case we had approx. 60 runs.
Problem
We are seeing a large number of orchestration runs that take orders of magnitude longer than the majority. We think this is mostly because there is some delay between Netherite work items, but we don't know why these delays are happening.
Negative Test
Consider the following orchestration run:
For Activity 2, you can see that the activity work item starts much earlier (1.5 s) than the "Invoke" of the work is triggered.
You can also see that the "create_orchestration" function runs longer than the orchestration itself, even though that function only starts the orchestration and returns a response.
This in turn makes the HttpStart function run much longer than expected:
Positive Test
For comparison, here is a positive test, where everything runs in ~900 ms:
Additional Metrics during Test
During test execution, the app scales to 7 instances:
While scaling, we also see some high delays, as the partitions are rebalanced across instances.
One thing we also find interesting is the number of "ChangeNotify" calls, which kept occurring even after the test had finished:
In the above-mentioned negative test, we have also seen that it takes 300 ms from the first Netherite event until the first log message from our code:
What we have tried
Higher Premium Plans
We have tried EP2, but for our case we think EP1 should be sufficient, and our issue seems to be something other than raw performance.
Higher Concurrency for Activities or Orchestrators
When running this test with maxConcurrentOrchestratorFunctions set to 20 and maxConcurrentActivityFunctions set to 50, we achieved similar results, but we can retest if necessary.
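For reference, these throttles were set in host.json roughly as follows (only the relevant durableTask fragment is shown):

```json
{
  "extensions": {
    "durableTask": {
      "maxConcurrentOrchestratorFunctions": 20,
      "maxConcurrentActivityFunctions": 50
    }
  }
}
```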
Splitting onto Two Storage Accounts
When running some tests, we noticed latency spikes on the storage account configured in AzureWebJobsStorage.
We therefore pointed AzureWebJobsStorage and WEBSITE_CONTENTAZUREFILECONNECTIONSTRING to two different storage accounts, which did not help and in fact reduced performance.
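Roughly what we configured (portal "Advanced edit" style; the connection strings are placeholders):

```json
[
  {
    "name": "AzureWebJobsStorage",
    "value": "<connection string for storage account 1>",
    "slotSetting": false
  },
  {
    "name": "WEBSITE_CONTENTAZUREFILECONNECTIONSTRING",
    "value": "<connection string for storage account 2>",
    "slotSetting": false
  }
]
```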
Logging to Warning
We also set all our logs to the "Warning" level, expecting this to reduce the stress on the service, but that did not help either.
Questions
1) Can you please help us identify the issue(s)?
2) Does the environment variable FUNCTIONS_WORKER_PROCESS_COUNT negatively impact scaling or partition load balancing?
2.1) Are the maxConcurrentOrchestratorFunctions / maxConcurrentActivityFunctions settings applied per app instance or per host process?
2.2) Are the partitions load-balanced across instances only, or across instances and host processes? With 12 partitions, 3 instances, and 4 host processes per instance, would each host process own one partition?
3) As mentioned before, we also see some high delays during scaling; can this be mitigated somehow (other than having more always-ready instances)?
If you need more information or specific run identifiers, please feel free to contact me or my colleagues @ManiStras and @RafaelBak.
Thanks in Advance,
Markus