Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random Media Platform Overloaded status #793

Open
rvleeuwen256 opened this issue Dec 12, 2024 · 3 comments
Open

Random Media Platform Overloaded status #793

rvleeuwen256 opened this issue Dec 12, 2024 · 3 comments

Comments

@rvleeuwen256
Copy link

We have built a multi-tenant solution using the Media platform that handles roughly 10000 calls per day with audio and H.264 video. We are very pleased by the stability of the platform. However since the platform has moved from using WCF to HTTP (version 1.28) it reports the Overloaded status seemingly random multiple times per day. This is quite problematic because in this status the platform cannot handle new calls.
Investigation showed that this overload is not related to high CPU or network traffic, because we have seen cases where the platform was overloaded while there was only a single active call. When we run a load-test on our solution we cannot trigger the overload.
We have added additional logging in all callbacks from the MediaPlatform to ensure we handle the callbacks from the MediaPlatform very quickly (<10 ms) and we always return from the callback.
The issue happens both when running on .Net Framework 4.7.2. as well as on .Net 8.

We have enabled the logging of the Media platform and we do see some disturbing error messages. There seems to be a relation between this random overload and the message 'thread considered stuck or inactive'.
Is it possible the development team of the Media platform can have a look at these logs?

Microsoft.Skype.Bots.Media Version 1.29.0.75-preview
Microsoft.Graph.Communications.Calls.Media 1.2.0.10563
Microsoft.Graph 5.54.0
Microsoft.Graph.Core 3.1.11

@adityaramgopal
Copy link

@rvleeuwen256
Could you please share the following info?

  1. What kind of hardware/SKU are you running it on?
  2. What is your bot's application ID?
  3. Can you also share the MediaPlatform logs that you have currently enabled during the time it reports overloaded?

Depending on what we find from the logs you shared we may increase the logging verbosity to debug further at which point you may have to rerun your tests. Hope that's ok.

@ssulzer
Copy link

ssulzer commented Dec 13, 2024

Hi @rvleeuwen256
If you don't want to share the bot application ID, then a recent call ID would also work.

@rvleeuwen256
Copy link
Author

We are indeed careful with providing the AppId of the Bot.
The service is running on a Cloud Service Extended Support on many Standard_D3_v2 machines. We have scaled both horizontally as well as vertically, but this did not make any difference.

An example of a 2 overlapping calls that triggered the Media overload is the following:

Teams Call Id: 34003b80-ddc7-47df-8231-adc02eda3399
Call Chain Id: c41d5539-43e8-437c-8ee7-5fa6042aa21d
Media session Id: 28054290-800c-400a-982b-e9fd972845d7
Start: 2024-12-12 16:34:57.982
Established: 2024-12-12 16:35:01.402
End: 2024-12-12 16:37:01.677

Teams Call Id: 64003a80-bda7-490d-9629-c9acb1346ce5
Call Chain Id: f91809e9-8d66-48da-bda9-6508274b1e18,
Media session Id: e4e2d290-8593-4780-9383-c6c5c608e275
Start: 2024-12-12 16:36:29.249
Established: 16:36:31.586
End: 2024-12-12 16:37:37.276

At 2024-12-12 16:37:21.210: Media platform health changed from 'Normal' to 'HeavilyLoaded'.
At 2024-12-12 16:38:04.206: Media platform health changed from 'HeavilyLoaded' to 'Overloaded'.

Interesting fact is that the overload happens while there was no active call anymore.

The Media platform logs are way too big (even with just Information) to share here on GitHub.
These warnings were found in the Media platform logs after the call was finished.

4800121 2024-12-12 16:37:41.039 Warning [MediaPlatform] [AvMP][AppId:][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.0019ABD0 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.00110097 (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 33 considered stuck or inactive
4800122 2024-12-12 16:37:41.039 Warning [MediaPlatform] [AvMP][AppId:][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.0019ABD1 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.00110098 (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 41 considered stuck or inactive
4800123 2024-12-12 16:37:41.039 Warning [MediaPlatform] [AvMP][AppId:][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.0019ABD2 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.00110099 (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 57 considered stuck or inactive
4800124 2024-12-12 16:37:41.039 Warning [MediaPlatform] [AvMP][AppId:][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.0019ABD3 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.0011009A (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 65 considered stuck or inactive
4802246 2024-12-12 16:38:11.040 Warning [MediaPlatform] [AvMP][AppId:][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.202::12/12/2024-16:38:11.040.0019AC16 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.202::12/12/2024-16:38:11.040.001100B6 (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 62 considered stuck or inactive
4802247 2024-12-12 16:38:11.040 Warning [MediaPlatform] [AvMP][AppId:][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.202::12/12/2024-16:38:11.040.0019AC17 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.202::12/12/2024-16:38:11.040.001100B7 (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 63 considered stuck or inactive

There are no errors or warnings that have the media session id in the logging.
We do see many errors like 'GetAllChannelsQoe: context[4cbf23cc-b118-48a0-a95a-7f41a9423618] Endpoints not found'.

If you need more information please let us know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants